CN114338309A - Method and system for optimizing Volterra equalizer structure based on deep reinforcement learning - Google Patents


Publication number
CN114338309A
Authority
CN
China
Prior art keywords
state
memory length
volterra
agent
equalizer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111572693.8A
Other languages
Chinese (zh)
Other versions
CN114338309B (en)
Inventor
义理林
徐永鑫
黄璐瑶
蒋文清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202111572693.8A priority Critical patent/CN114338309B/en
Publication of CN114338309A publication Critical patent/CN114338309A/en
Application granted granted Critical
Publication of CN114338309B publication Critical patent/CN114338309B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method and a system for optimizing a Volterra equalizer structure based on deep reinforcement learning, comprising the following steps: initializing an Agent, an experience playback pool, and the memory length state of a Volterra equalizer; having the Agent randomly generate actions and updating the memory length state of the Volterra equalizer until the end state is reached, calculating a reward value according to the complexity of the Volterra equalizer and the bit error rate after signal equalization, and storing the transition process as experience into the experience playback pool; sampling experiences from the experience playback pool, training the Agent and performing soft updates; and determining the memory length of each order of the Volterra equalizer according to the converged values. The invention realizes an automatic search method for the optimal structure of different types of Volterra equalizers under a given computing-resource budget; compared with traditional greedy search, it not only further improves the equalization effect but also greatly reduces the complexity of the equalizer.

Description

Method and system for optimizing Volterra equalizer structure based on deep reinforcement learning
Technical Field
The invention relates to the technical field of optical communication, in particular to a method and a system for optimizing a Volterra equalizer structure based on deep reinforcement learning.
Background
The Volterra nonlinear equalizer is widely applied in optical fiber communication systems to mitigate the linear and nonlinear impairments suffered by signals during transmission. In long-haul optical fiber communication systems, nonlinear impairment mainly comes from fiber nonlinearity, while in short-reach optical fiber communication systems it mainly comes from the transceiver devices, such as the nonlinear response of the modulator and the square-law detection of the photodetector. The equalization effect and the implementation complexity are the key metrics for evaluating an equalizer, and to enable real-time hardware implementation, the structure of high-performance, low-complexity Volterra nonlinear equalizers has become a research hotspot.
The equalization effect and complexity of a Volterra nonlinear equalizer depend on its order, the memory length of each order, the presence of feedback, the feedback memory length, and so on; the higher the order and the larger the memory lengths, the higher the complexity. To guarantee the equalization effect in short-reach optical fiber communication systems, the order of the Volterra nonlinear equalizer is usually set to 3, and the memory length of each order is usually determined by manual experience or greedy search. To reduce the complexity of the Volterra nonlinear equalizer, current methods mainly take two approaches: one is pruning, including unstructured pruning realized by setting a pruning threshold or by L1 regularization, and structured pruning realized by setting a correlation distance between Volterra cross-term signals; the other is to design simplified Volterra nonlinear equalizer structures based on device physics and channel characteristics.
However, existing methods still need to determine the memory length of each order, the pruning threshold, or the correlation distance of the Volterra nonlinear equalizer by manual experience or greedy search, which is inefficient, cannot fully exploit the best performance of the Volterra nonlinear equalizer, and makes it difficult to achieve a good trade-off between equalization effect and complexity.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a method and a system for optimizing a Volterra equalizer structure based on deep reinforcement learning.
The method for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning provided by the invention comprises the following steps:
step S1: initializing an intelligent Agent, initializing an experience playback pool, initializing the memory length state of a Volterra equalizer and defining the state transition process;
step S2: starting from the initial memory length state of the Volterra equalizer, having the Agent randomly generate actions and updating the memory length state of the Volterra equalizer until the end state is reached, calculating a reward value according to the complexity of the Volterra equalizer and the bit error rate after signal equalization, storing the transition process as experience into the experience playback pool, and cycling again from the initial state until a specified amount of experience has been generated;
step S3: sampling experiences from the experience playback pool, training the Agent, and performing a soft update on the Agent every preset number of steps;
step S4: having the updated Agent generate deterministic actions from the initial memory length state of the Volterra equalizer until the state transition process ends, calculating the reward value and storing the transition process into the experience playback pool, then repeating step S3 and step S4 until the reward value and the actions output by the Agent converge, and finally determining the memory length of each order of the Volterra equalizer according to the converged values.
Preferably, the step S1 includes:
step S11: defining four neural networks in the Agent: the Actor network $\mu_\theta$, the Critic network $Q_w$, the Target Actor network $\mu_{\theta'}$, and the Target Critic network $Q_{w'}$; initializing the Actor network $\mu_\theta$ and the Critic network $Q_w$ with random parameters $\theta$ and $w$, and initializing the Target Actor network $\mu_{\theta'}$ and the Target Critic network $Q_{w'}$ with random parameters $\theta'$ and $w'$, where $\theta'$ is set equal to $\theta$ and $w'$ is set equal to $w$;
step S12: initializing an experience playback pool that stores experiences in the format $(s_i, a_i, r_i, s_{i+1}, done)$, where $s_i$ represents the current memory length state of the Volterra equalizer; $a_i$ represents the action generated by the Agent according to the current state $s_i$, i.e. the ratio of each order's memory length to its maximum memory length limit; $r_i$ represents the reward obtained by the Agent when facing state $s_i$ and taking action $a_i$; $s_{i+1}$ represents the updated memory length state of the Volterra equalizer after the Agent takes action $a_i$; and $done$ is a flag indicating whether the whole state transition process has ended;
step S13: initializing the memory length state of the Volterra equalizer according to the type of the Volterra equalizer and defining the state transition process.
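By way of illustration only, the following minimal Python sketch shows one possible realization of the experience format and experience playback pool described in steps S11 to S13; the names (`Experience`, `ReplayPool`) and the capacity are assumptions for illustration, not part of the claimed method.

```python
import random
from collections import deque, namedtuple

# One stored transition (s_i, a_i, r_i, s_{i+1}, done), as defined in step S12.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayPool:
    """Fixed-capacity experience playback pool."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append(Experience(state, action, reward, next_state, done))

    def sample(self, n):
        # Uniform random sampling of n stored experiences.
        return random.sample(self.buffer, n)

    def __len__(self):
        return len(self.buffer)
```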
Preferably, the step S2 includes:
step S21: selecting a state transition process according to the type of the Volterra equalizer; starting from the initial state, the Agent generates random actions uniformly distributed on [0, 1] and updates the memory length state of the Volterra equalizer, then continues to generate random actions according to the current state until the memory length state of the Volterra equalizer is updated to the end state;
step S22: calculating a reward value, determining the memory length of each order according to the maximum memory length limit of each order of the Volterra equalizer and the action of the Agent, performing 2-fold cross validation on signal data, and calculating the reward value by using the complexity of the current equalizer and the average error rate after equalization;
step S23: storing the state transition process $(s_i, a_i, r_i, s_{i+1}, done)$ as experience into the experience playback pool;
step S24: the steps S21 to S23 are repeated until a preset number of experiences are generated.
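To illustrate steps S21 to S24, a minimal sketch of the warm-up loop is given below; `env` is an assumed helper implementing the state transition process of step S13 (with `reset()`/`step()` returning the next state, reward and end flag), and `pool` is the playback pool sketched above.

```python
import numpy as np

def warm_up(env, pool, num_episodes):
    """Warm-up phase (step S2): fill the experience playback pool with episodes
    generated by random actions uniformly distributed on [0, 1]."""
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = float(np.random.uniform(0.0, 1.0))   # ratio of a memory length to its limit
            next_state, reward, done = env.step(action)   # reward is computed at the end state
            pool.store(state, action, reward, next_state, done)
            state = next_state
```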
Preferably, the step S3 includes:
sampling N experiences $(s_i, a_i, r_i, s_{i+1}, done)$ from the experience playback pool using the data guide pool technique for Agent training, including the K experiences with the largest reward;

calculating the target Q value

$$y_i = r_i + \gamma\, Q_{w'}(s_{i+1}, \tilde{a}_{i+1}), \qquad \tilde{a}_{i+1} = \mu_{\theta'}(s_{i+1}) + \varepsilon, \qquad \varepsilon \sim \mathrm{clip}\bigl(N(0, \sigma^2), -c, c\bigr)$$

where $\mu_{\theta'}(s_{i+1})$ represents the action output by the Target Actor network $\mu_{\theta'}$ when the state is $s_{i+1}$; $\varepsilon$ is the policy noise, which obeys a Gaussian distribution with mean 0 and variance $\sigma^2$ and is truncated between $-c$ and $c$; $N(0, \sigma^2)$ represents a Gaussian distribution with mean 0 and variance $\sigma^2$; clip represents truncation; $\gamma$ is the discount factor; and $Q_{w'}(s_{i+1}, \tilde{a}_{i+1})$ represents the output of the Target Critic network $Q_{w'}$ when the state is $s_{i+1}$ and the action is $\tilde{a}_{i+1}$;

updating the Critic network $Q_w$ by minimizing the error

$$L = \frac{1}{N}\sum_{i}\bigl(y_i - Q_w(s_i, a_i)\bigr)^2$$

where $Q_w(s_i, a_i)$ represents the output of the Critic network $Q_w$ when the state is $s_i$ and the action is $a_i$;

every d steps, updating the Actor network $\mu_\theta$ by minimizing the error

$$L_a = -\frac{1}{N}\sum_{i} Q_w\bigl(s_i, \mu_\theta(s_i)\bigr)$$

where $\mu_\theta(s_i)$ represents the action output by the Actor network $\mu_\theta$ when the state is $s_i$, and $Q_w(s_i, \mu_\theta(s_i))$ represents the output of the Critic network $Q_w$ when the state is $s_i$ and the action is $\mu_\theta(s_i)$; and soft updating the Target Actor network $\mu_{\theta'}$ and the Target Critic network $Q_{w'}$ as follows:

$$\theta' \leftarrow \tau\theta + (1-\tau)\theta', \qquad w' \leftarrow \tau w + (1-\tau)w'$$

where $\tau$ is a positive number much smaller than 1 that regulates the degree of the soft update.
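A minimal PyTorch sketch of one such training update is shown below. The network interfaces, the hyperparameter values, and the $(1-done)$ factor in the target are illustrative assumptions; batch tensors are assumed to be shaped (batch, dim).

```python
import torch
import torch.nn.functional as F

def train_step(batch, actor, critic, target_actor, target_critic,
               actor_opt, critic_opt, step, gamma=0.99, sigma=0.2, c=0.5, d=2, tau=0.005):
    """One Agent update on a sampled batch, following the formulas above."""
    s, a, r, s_next, done = batch

    with torch.no_grad():
        # Target action with clipped Gaussian policy noise: eps ~ clip(N(0, sigma^2), -c, c)
        eps = torch.clamp(torch.randn_like(a) * sigma, -c, c)
        a_next = torch.clamp(target_actor(s_next) + eps, 0.0, 1.0)
        y = r + gamma * (1.0 - done) * target_critic(s_next, a_next)   # target Q value y_i

    # Critic update: minimise the mean squared error between Q_w(s_i, a_i) and y_i
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    if step % d == 0:
        # Delayed Actor update: minimise -N^{-1} * sum_i Q_w(s_i, mu_theta(s_i))
        actor_loss = -critic(s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # Soft update of the Target Actor and Target Critic networks
        for p, tp in zip(actor.parameters(), target_actor.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
        for p, tp in zip(critic.parameters(), target_critic.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```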
Preferably, the step S4 includes:
the updated Agent starts generating actions from the initial state of the Volterra equalizer and updates the memory length state of the Volterra equalizer, continuing to generate actions according to the current state until the memory length state of the Volterra equalizer is updated to the end state, where exploration noise $\epsilon$ obeying a Gaussian distribution with mean 0 and variance $\sigma^2$ is added to each action generated by the Agent;

the memory length of each order is determined according to the maximum memory length limit of each order of the Volterra equalizer and the Agent's actions, 2-fold cross validation is performed on the signal data, and the reward value is calculated from the complexity of the current equalizer and the average bit error rate after equalization; the state transition process $(s_i, a_i, r_i, s_{i+1}, done)$ is stored as experience into the experience playback pool; step S3 is then executed;

after each update, the variance of the exploration noise $\epsilon$ is attenuated: $\sigma^2 \leftarrow \sigma^2\xi^n$, where $\xi$ is the decay rate and $n$ is the number of updates;

the above operations are repeated until the absolute value of the difference between the current reward value and the previous reward value is less than $\chi_1$ and the absolute value of the difference between the current Agent output action and the previous Agent output action is less than $\chi_2$, at which point the training is judged to have converged, where $\chi_1 \ge 0$ and $\chi_2 \ge 0$ are preset decision thresholds; finally, the memory length of each position of the Volterra equalizer is determined from the converged Agent output actions and the maximum memory length limit of each position, completing the determination of the optimal structure of the Volterra equalizer.
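A sketch of this exploration-and-convergence loop, under the same assumed `env`, `pool` and Agent interfaces as above (the hyperparameter values are illustrative):

```python
import numpy as np

def explore_until_converged(env, agent, pool, sigma2=0.04, xi=0.95,
                            chi1=1e-3, chi2=1e-3, max_rounds=500):
    """Step S4: deterministic rollouts with decaying Gaussian exploration noise,
    stopping once both the reward and the output actions have converged."""
    prev_reward, prev_actions = None, None
    for _ in range(max_rounds):
        state, done, actions = env.reset(), False, []
        while not done:
            noise = np.random.normal(0.0, np.sqrt(sigma2))             # exploration noise ~ N(0, sigma^2)
            action = float(np.clip(agent.act(state) + noise, 0.0, 1.0))
            next_state, reward, done = env.step(action)
            pool.store(state, action, reward, next_state, done)
            actions.append(action)
            state = next_state

        agent.train(pool)        # step S3: sample, update Actor/Critic, soft update targets
        sigma2 *= xi             # attenuate the exploration-noise variance after each update

        actions = np.array(actions)
        if prev_reward is not None and abs(reward - prev_reward) < chi1 \
                and np.all(np.abs(actions - prev_actions) < chi2):
            return actions       # converged memory-length ratios
        prev_reward, prev_actions = reward, actions
    return actions
```

The converged ratios are multiplied by the per-position maximum memory length limits to obtain the final Volterra equalizer structure.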
The system for optimizing the structure of a Volterra equalizer based on deep reinforcement learning provided by the invention comprises:
module M1: initializing an intelligent Agent, initializing an experience playback pool, initializing the memory length state of a Volterra equalizer and defining the state transition process;
module M2: starting from the initial memory length state of the Volterra equalizer, having the Agent randomly generate actions and updating the memory length state of the Volterra equalizer until the end state is reached, calculating a reward value according to the complexity of the Volterra equalizer and the bit error rate after signal equalization, storing the transition process as experience into the experience playback pool, and cycling again from the initial state until a specified amount of experience has been generated;
module M3: sampling experiences from the experience playback pool, training the Agent, and performing a soft update on the Agent every preset number of steps;
module M4: having the updated Agent generate deterministic actions from the initial memory length state of the Volterra equalizer until the state transition process ends, calculating the reward value and storing the transition process into the experience playback pool, then repeating module M3 and module M4 until the reward value and the actions output by the Agent converge, and finally determining the memory length of each order of the Volterra equalizer according to the converged values.
Preferably, the module M1 includes:
module M11: defining four neural networks in the Agent: the Actor network $\mu_\theta$, the Critic network $Q_w$, the Target Actor network $\mu_{\theta'}$, and the Target Critic network $Q_{w'}$; initializing the Actor network $\mu_\theta$ and the Critic network $Q_w$ with random parameters $\theta$ and $w$, and initializing the Target Actor network $\mu_{\theta'}$ and the Target Critic network $Q_{w'}$ with random parameters $\theta'$ and $w'$, where $\theta'$ is set equal to $\theta$ and $w'$ is set equal to $w$;
module M12: initializing an experience playback pool that stores experiences in the format $(s_i, a_i, r_i, s_{i+1}, done)$, where $s_i$ represents the current memory length state of the Volterra equalizer; $a_i$ represents the action generated by the Agent according to the current state $s_i$, i.e. the ratio of each order's memory length to its maximum memory length limit; $r_i$ represents the reward obtained by the Agent when facing state $s_i$ and taking action $a_i$; $s_{i+1}$ represents the updated memory length state of the Volterra equalizer after the Agent takes action $a_i$; and $done$ is a flag indicating whether the whole state transition process has ended;
module M13: initializing the memory length state of the Volterra equalizer according to the type of the Volterra equalizer and defining the state transition process.
Preferably, the module M2 includes:
module M21: selecting a state transition process according to the type of the Volterra equalizer; starting from the initial state, the Agent generates random actions uniformly distributed on [0, 1] and updates the memory length state of the Volterra equalizer, then continues to generate random actions according to the current state until the memory length state of the Volterra equalizer is updated to the end state;
module M22: calculating a reward value, determining the memory length of each order according to the maximum memory length limit of each order of the Volterra equalizer and the action of the Agent, performing 2-fold cross validation on signal data, and calculating the reward value by using the complexity of the current equalizer and the average error rate after equalization;
module M23: storing the state transition process $(s_i, a_i, r_i, s_{i+1}, done)$ as experience into the experience playback pool;
module M24: repeating modules M21 through M23 until a predetermined number of experiences have been generated.
Preferably, the module M3 includes:
sampling N experiences $(s_i, a_i, r_i, s_{i+1}, done)$ from the experience playback pool using the data guide pool technique for Agent training, including the K experiences with the largest reward;

calculating the target Q value

$$y_i = r_i + \gamma\, Q_{w'}(s_{i+1}, \tilde{a}_{i+1}), \qquad \tilde{a}_{i+1} = \mu_{\theta'}(s_{i+1}) + \varepsilon, \qquad \varepsilon \sim \mathrm{clip}\bigl(N(0, \sigma^2), -c, c\bigr)$$

where $\mu_{\theta'}(s_{i+1})$ represents the action output by the Target Actor network $\mu_{\theta'}$ when the state is $s_{i+1}$; $\varepsilon$ is the policy noise, which obeys a Gaussian distribution with mean 0 and variance $\sigma^2$ and is truncated between $-c$ and $c$; $N(0, \sigma^2)$ represents a Gaussian distribution with mean 0 and variance $\sigma^2$; clip represents truncation; $\gamma$ is the discount factor; and $Q_{w'}(s_{i+1}, \tilde{a}_{i+1})$ represents the output of the Target Critic network $Q_{w'}$ when the state is $s_{i+1}$ and the action is $\tilde{a}_{i+1}$;

updating the Critic network $Q_w$ by minimizing the error

$$L = \frac{1}{N}\sum_{i}\bigl(y_i - Q_w(s_i, a_i)\bigr)^2$$

where $Q_w(s_i, a_i)$ represents the output of the Critic network $Q_w$ when the state is $s_i$ and the action is $a_i$;

every d steps, updating the Actor network $\mu_\theta$ by minimizing the error

$$L_a = -\frac{1}{N}\sum_{i} Q_w\bigl(s_i, \mu_\theta(s_i)\bigr)$$

where $\mu_\theta(s_i)$ represents the action output by the Actor network $\mu_\theta$ when the state is $s_i$, and $Q_w(s_i, \mu_\theta(s_i))$ represents the output of the Critic network $Q_w$ when the state is $s_i$ and the action is $\mu_\theta(s_i)$; and soft updating the Target Actor network $\mu_{\theta'}$ and the Target Critic network $Q_{w'}$ as follows:

$$\theta' \leftarrow \tau\theta + (1-\tau)\theta', \qquad w' \leftarrow \tau w + (1-\tau)w'$$

where $\tau$ is a positive number much smaller than 1 that regulates the degree of the soft update.
Preferably, the module M4 includes:
the updated Agent starts generating actions from the initial state of the Volterra equalizer and updates the memory length state of the Volterra equalizer, continuing to generate actions according to the current state until the memory length state of the Volterra equalizer is updated to the end state, where exploration noise $\epsilon$ obeying a Gaussian distribution with mean 0 and variance $\sigma^2$ is added to each action generated by the Agent;

the memory length of each order is determined according to the maximum memory length limit of each order of the Volterra equalizer and the Agent's actions, 2-fold cross validation is performed on the signal data, and the reward value is calculated from the complexity of the current equalizer and the average bit error rate after equalization; the state transition process $(s_i, a_i, r_i, s_{i+1}, done)$ is stored as experience into the experience playback pool; module M3 is then invoked;

after each update, the variance of the exploration noise $\epsilon$ is attenuated: $\sigma^2 \leftarrow \sigma^2\xi^n$, where $\xi$ is the decay rate and $n$ is the number of updates;

the above operations are repeated until the absolute value of the difference between the current reward value and the previous reward value is less than $\chi_1$ and the absolute value of the difference between the current Agent output action and the previous Agent output action is less than $\chi_2$, at which point the training is judged to have converged, where $\chi_1 \ge 0$ and $\chi_2 \ge 0$ are preset decision thresholds; finally, the memory length of each position of the Volterra equalizer is determined from the converged Agent output actions and the maximum memory length limit of each position, completing the determination of the optimal structure of the Volterra equalizer.
Compared with the prior art, the invention has the following beneficial effects:
(1) the invention realizes an automatic search method for the optimal structure of different types of Volterra equalizers under a given computing-resource budget, and the equalization effect is superior to that of traditional greedy search;
(2) different reward value calculation modes are designed for the scenario that purely pursues high performance and for the scenario that considers the trade-off between performance and complexity, making the application more flexible; using the reward calculation mode that considers the performance-complexity trade-off, the optimal structures of the feedforward and feedback Volterra equalizers are searched out, the complexity of the equalizer can be greatly reduced while keeping the loss in equalization effect small, and the pruning quality of the structured-pruning Volterra equalizer can be improved;
(3) policy noise and the data guide pool technique are introduced on top of the basic DDPG framework, improving the stability of the training process and the generalization of the finally searched structure.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is an exemplary diagram of the direct-detection experimental system employed in the present invention;
FIG. 3 is a graph comparing error rates of the present invention and Volterra-FFE based on greedy search when a reward value calculation mode of simply pursuing high performance is adopted;
FIG. 4 is a graph comparing error rates for the present invention and a greedy search based Volterra-DFE when a reward value calculation mode is used that simply pursues high performance;
FIG. 5 is a graph comparing error rates of the present invention and Volterra-FFE based on greedy search when a reward value calculation mode considering compromise of performance and complexity is adopted;
FIG. 6 is a graph of complexity comparison of the present invention with a Volterra-FFE based greedy search when employing a reward value calculation mode that considers a compromise of performance and complexity;
FIG. 7 is a graph comparing bit error rates for the present invention and a Volterra-DFE based on greedy search, using a reward value calculation mode that considers a compromise of performance and complexity;
FIG. 8 is a graph of complexity comparison of the present invention and a greedy search based Volterra-DFE when employing a reward value calculation that considers a performance and complexity tradeoff;
FIG. 9 is a graph comparing bit error rates of the present invention and a greedy-search-based Volterra-Pruning equalizer when a reward value calculation mode considering the compromise of performance and complexity is adopted;
FIG. 10 is a graph comparing complexity of the present invention and a greedy-search-based Volterra-Pruning equalizer when a reward value calculation mode considering the compromise of performance and complexity is adopted.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Example 1:
the invention discloses a method for determining an optimal structure of a Volterra equalizer based on deep reinforcement learning. The method takes a depth deterministic strategy gradient algorithm (DDPG) algorithm in depth reinforcement learning as an Agent, calculates a reward value according to the complexity of a Volterra equalizer and an error rate after signal equalization, optimizes the decision of the Agent, and selects an optimal structure for a feedforward Volterra equalizer, a feedback Volterra equalizer and a third-order structured pruning Volterra equalizer. Referring to fig. 1, the steps of the method of the present invention are as follows:
s1: initialization stage: initializing the Agent; initializing the experience playback pool; initializing the memory length state of the Volterra equalizer and defining the state transition process;
s2: warm-up stage: starting from the initial memory length state of the Volterra equalizer, the Agent randomly generates actions, i.e. the ratio of each order's memory length to its maximum memory length limit; the Volterra equalizer updates its memory length state until the end state is reached, a reward value is calculated according to the complexity of the Volterra equalizer and the bit error rate after signal equalization, and the transition process is stored as experience into the experience playback pool; the cycle then restarts from the initial state until a specified amount of experience has been generated;
s3: Agent training and updating stage: a batch of experiences is sampled from the experience playback pool to train the Agent, and the Agent is soft updated at regular intervals;
s4: exploration and convergence stage: the updated Agent generates deterministic actions starting from the initial memory length state of the Volterra equalizer until the state transition process ends, the reward value is calculated, and the transition process is stored into the experience playback pool; step S3 is then repeated until the reward value and the actions output by the Agent converge; finally, the memory length of each order of the Volterra equalizer is determined according to the converged values.
Further, the specific steps of step S1 are as follows:
s11: define the four neural networks in the Agent: the Actor network $\mu_\theta$, the Critic network $Q_w$, the Target Actor network $\mu_{\theta'}$, and the Target Critic network $Q_{w'}$; initialize the Actor network $\mu_\theta$ and the Critic network $Q_w$ with random parameters $\theta$ and $w$, and initialize the Target Actor network $\mu_{\theta'}$ and the Target Critic network $Q_{w'}$ with random parameters $\theta'$ and $w'$, setting $\theta'$ equal to $\theta$ and $w'$ equal to $w$;
s12: initialize an experience playback pool that stores experiences in the format $(s_i, a_i, r_i, s_{i+1}, done)$, where $s_i$ represents the current memory length state of the Volterra equalizer, $a_i$ represents the action generated by the Agent according to the current state $s_i$, i.e. the ratio of each order's memory length to its maximum memory length limit, $r_i$ represents the reward obtained by the Agent when facing state $s_i$ and taking action $a_i$, $s_{i+1}$ represents the updated memory length state of the Volterra equalizer after the Agent takes action $a_i$, and $done$ is a flag indicating whether the whole state transition process has ended;
s13: initializing the memory length state of the Volterra equalizer according to the type of the Volterra equalizer and defining the state transition process:
(1) for a Volterra equalizer of the M-order feedforward type (Volterra-FFE), the formula is as follows:
$$y(n)=\sum_{i=1}^{M}\;\sum_{-L_i\le k_1\le k_2\le\cdots\le k_i\le L_i} h_i(k_1,k_2,\ldots,k_i)\prod_{j=1}^{i}x(n-k_j)$$

where $x(n)$ is the input signal of the equalizer, $y(n)$ is the equalized output signal, $h_i(k_1,k_2,\ldots,k_i)$, $i=1,2,\ldots,M$ denotes the $i$-th order tap weights of the Volterra-FFE, and $2L_i+1$, $i=1,2,\ldots,M$ denotes the $i$-th order memory length of the Volterra-FFE; the $i$-th memory length state of the Volterra-FFE is defined as $s_i=(id, L_{limit}(id), l_1, l_2, \ldots, l_M)$, where $id$ is the index of the memory length position to be determined, the positions being ordered from the 1st-order memory length through the 2nd-order memory length to the $M$-th order memory length, $L_{limit}(id)$ denotes the maximum memory length limit of the $id$-th position, and $l_{id}$, $id=1,2,\ldots,M$ denotes the ratio of the memory length at the $id$-th position to its maximum memory length limit;

the memory length state of the Volterra-FFE equalizer is initialized to $s_0=(1, L_{limit}(1), 0, 0, \ldots, 0)$;

the state transition process of the Volterra-FFE equalizer is defined as follows:

starting from the initial memory length state $s_0=(1, L_{limit}(1), 0, 0, \ldots, 0)$, the Agent generates action $a_0=l_1$, i.e. the ratio of the memory length at the 1st position to its maximum memory length limit; the reward is then $r_0=0$, the state is updated to $s_1=(2, L_{limit}(2), l_1, 0, \ldots, 0)$, and the state transition end flag is set to $done=\text{False}$, indicating that the whole state transition process has not ended;

from the memory length state $s_i=(i+1, L_{limit}(i+1), l_1, l_2, \ldots, l_i, \ldots, 0)$, the Agent generates action $a_i=l_{i+1}$, i.e. the ratio of the memory length at the $(i+1)$-th position to its maximum memory length limit; the reward is then $r_i=0$, the state is updated to $s_{i+1}=(i+2, L_{limit}(i+2), l_1, l_2, \ldots, l_i, l_{i+1}, \ldots, 0)$, and the state transition end flag is set to $done=\text{False}$, indicating that the whole state transition process has not ended;

and so on, until the state is updated to the end state $s_M=(0, 0, l_1, l_2, \ldots, l_M)$, at which point the state transition end flag is set to $done=\text{True}$, indicating that the whole state transition process has ended; when $done=\text{True}$, let $r_0=\cdots=r_i=\cdots=r_{M-1}=\text{reward}$, where reward denotes the reward value calculated, after the whole state transition process has ended, from the complexity of the current equalizer and the average bit error rate obtained by 2-fold cross validation on the signal data;
(2) for a Volterra equalizer (Volterra-DFE) of the M-order feedback type, the formula is expressed as:
$$y(n)=\sum_{i=1}^{M}\;\sum_{-L_i\le k_1\le k_2\le\cdots\le k_i\le L_i} h_i(k_1,k_2,\ldots,k_i)\prod_{j=1}^{i}x(n-k_j)+\sum_{k_1=1}^{L_{fb}} h_{fb}(k_1)\,\hat{y}(n-k_1)$$

where $x(n)$ is the input signal of the equalizer, $y(n)$ is the equalized output signal, $\hat{y}(n)$ is the output signal after a hard decision on $y(n)$, $h_i(k_1,k_2,\ldots,k_i)$, $i=1,2,\ldots,M$ denotes the $i$-th order tap weights of the Volterra-DFE, $2L_i+1$, $i=1,2,\ldots,M$ denotes the $i$-th order memory length of the Volterra-DFE, $h_{fb}(k_1)$ denotes the feedback tap weights of the Volterra-DFE, and $L_{fb}$ denotes the feedback memory length of the Volterra-DFE; the $i$-th memory length state of the Volterra-DFE is defined as $s_i=(id, L_{limit}(id), l_1, l_2, \ldots, l_M, l_{M+1})$, where $id$ is the index of the memory length position to be determined, the positions being ordered from the 1st-order memory length through the 2nd-order memory length to the $M$-th order memory length and the feedback memory length, $L_{limit}(id)$ denotes the maximum memory length limit of the $id$-th position, and $l_{id}$, $id=1,2,\ldots,M+1$ denotes the ratio of the memory length at the $id$-th position to its maximum memory length limit;

the memory length state of the Volterra-DFE equalizer is initialized to $s_0=(1, L_{limit}(1), 0, 0, \ldots, 0)$;

the state transition process of the Volterra-DFE equalizer is defined as:

starting from the initial memory length state $s_0=(1, L_{limit}(1), 0, 0, \ldots, 0)$, the Agent generates action $a_0=l_1$, i.e. the ratio of the memory length at the 1st position to its maximum memory length limit; the reward is then $r_0=0$, the state is updated to $s_1=(2, L_{limit}(2), l_1, 0, \ldots, 0)$, and the state transition end flag is set to $done=\text{False}$, indicating that the whole state transition process has not ended;

from the memory length state $s_i=(i+1, L_{limit}(i+1), l_1, l_2, \ldots, l_i, \ldots, 0)$, the Agent generates action $a_i=l_{i+1}$, i.e. the ratio of the memory length at the $(i+1)$-th position to its maximum memory length limit; the reward is then $r_i=0$, the state is updated to $s_{i+1}=(i+2, L_{limit}(i+2), l_1, l_2, \ldots, l_i, l_{i+1}, \ldots, 0)$, and the state transition end flag is set to $done=\text{False}$, indicating that the whole state transition process has not ended;

and so on, until the state is updated to the end state $s_{M+1}=(0, 0, l_1, l_2, \ldots, l_{M+1})$, at which point the state transition end flag is set to $done=\text{True}$, indicating that the whole state transition process has ended; when $done=\text{True}$, let $r_0=\cdots=r_i=\cdots=r_M=\text{reward}$, where reward denotes the reward value calculated, after the whole state transition process has ended, from the complexity of the current equalizer and the average bit error rate obtained by 2-fold cross validation on the signal data;
(3) for an M-order structured Pruning type Volterra equalizer (Volterra-Pruning), the formula is as follows:
$$y(n)=\sum_{k_1=-L_1}^{L_1}h_1(k_1)\,x(n-k_1)+\sum_{i=2}^{M}\;\sum_{\substack{-L_i\le k_1\le\cdots\le k_i\le L_i\\ k_i-k_1<L_{id}}} h_i(k_1,\ldots,k_i)\prod_{j=1}^{i}x(n-k_j)$$

where $x(n)$ is the input signal of the equalizer, $y(n)$ is the equalized output signal, $h_i(k_1,k_2,\ldots,k_i)$, $i=1,2,\ldots,M$ denotes the $i$-th order tap weights of the Volterra-Pruning equalizer, $2L_i+1$, $i=1,2,\ldots,M$ denotes its $i$-th order memory length, and $L_{md}$, $m=2,\ldots,M$ denotes the $m$-th order correlation distance of the Volterra-Pruning equalizer, used to regulate the degree of pruning; the $i$-th memory length state of the Volterra-Pruning equalizer is defined as $s_i=(id, L_{limit}(id), l_1, l_2, \ldots, l_{2M-1})$, where $id$ denotes the index of the memory length or correlation distance position to be determined, the positions being ordered as the 1st-order memory length, the 2nd-order memory length, the 2nd-order correlation distance, the 3rd-order memory length, the 3rd-order correlation distance, and so on up to the $M$-th order memory length and the $M$-th order correlation distance, $L_{limit}(id)$ denotes the maximum memory length limit or maximum correlation distance limit of the $id$-th position, and $l_{id}$, $id=1,2,\ldots,2M-1$ denotes the ratio of the memory length or correlation distance at the $id$-th position to its maximum memory length limit or maximum correlation distance limit;

the memory length state of the Volterra-Pruning equalizer is initialized to $s_0=(1, L_{limit}(1), 0, 0, \ldots, 0)$;

the state transition process of the Volterra-Pruning equalizer is defined as follows:

starting from the initial memory length state $s_0=(1, L_{limit}(1), 0, 0, \ldots, 0)$, the Agent generates action $a_0=l_1$, i.e. the ratio of the memory length at the 1st position to its maximum memory length limit; the reward is then $r_0=0$, the state is updated to $s_1=(2, L_{limit}(2), l_1, 0, \ldots, 0)$, and the state transition end flag is set to $done=\text{False}$, indicating that the whole state transition process has not ended;

from the memory length state $s_i=(i+1, L_{limit}(i+1), l_1, l_2, \ldots, l_i, \ldots, 0)$, the Agent generates action $a_i=l_{i+1}$, i.e. the ratio of the memory length or correlation distance at the $(i+1)$-th position to its maximum limit; the reward is then $r_i=0$, the state is updated to $s_{i+1}=(i+2, L_{limit}(i+2), l_1, l_2, \ldots, l_i, l_{i+1}, \ldots, 0)$, and the state transition end flag is set to $done=\text{False}$, indicating that the whole state transition process has not ended;

and so on, until the state is updated to the end state $s_{2M-1}=(0, 0, l_1, l_2, \ldots, l_{2M-1})$, at which point the state transition end flag is set to $done=\text{True}$, indicating that the whole state transition process has ended; when $done=\text{True}$, let $r_0=\cdots=r_i=\cdots=r_{2M-2}=\text{reward}$, where reward denotes the reward value calculated, after the whole state transition process has ended, from the complexity of the current equalizer and the average bit error rate obtained by 2-fold cross validation on the signal data.
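The following Python sketch illustrates, under assumed names, the feedforward (Volterra-FFE) case of the state transition processes defined above; the reward callback is evaluated only at the end state, and the retroactive assignment of that terminal reward to all transitions of the episode is left to the caller.

```python
import numpy as np

class VolterraFFEEnv:
    """State transition process for an M-order Volterra-FFE:
    state = (id, L_limit(id), l_1, ..., l_M), end state = (0, 0, l_1, ..., l_M)."""
    def __init__(self, max_lengths, reward_fn):
        self.max_lengths = list(max_lengths)   # L_limit(1), ..., L_limit(M)
        self.M = len(max_lengths)
        self.reward_fn = reward_fn             # maps the ratios l_1..l_M to a reward value

    def reset(self):
        self.idx = 0
        self.ratios = [0.0] * self.M
        return self._state()

    def _state(self):
        if self.idx >= self.M:                 # end state s_M
            return np.array([0, 0] + self.ratios, dtype=float)
        return np.array([self.idx + 1, self.max_lengths[self.idx]] + self.ratios, dtype=float)

    def step(self, action):
        # The action is the ratio of the current position's memory length to its limit.
        self.ratios[self.idx] = float(np.clip(action, 0.0, 1.0))
        self.idx += 1
        done = self.idx >= self.M
        reward = self.reward_fn(self.ratios) if done else 0.0
        return self._state(), reward, done
```

The Volterra-DFE and Volterra-Pruning processes differ only in the number and ordering of the positions to be determined.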
Further, the specific steps of step S2 are as follows:
s21: first, a state transition process is selected according to the type of the Volterra equalizer; starting from the initial state, the Agent generates random actions uniformly distributed on [0, 1], i.e. the ratio of each memory length to its maximum memory length limit, and updates the memory length state of the Volterra equalizer; the Agent continues to generate random actions according to the current state until the memory length state of the Volterra equalizer is updated to the end state;
s22: the reward value is calculated: the memory length of each order is determined from the maximum memory length limit of each order of the Volterra equalizer and the Agent's actions, 2-fold cross validation is performed on the signal data, and the reward value is calculated from the average bit error rate after equalization and the complexity of the equalizer; for the scenario that purely pursues high performance, i.e. a low bit error rate (BER), the reward value is $reward = 100\times(1-BER_{valid})$; for the scenario that considers the trade-off between performance and complexity, the reward value is $reward = -100\times BER_{valid}\times\log(Vol\_MACs)$, where $BER_{valid}$ is the average bit error rate obtained by the Volterra equalizer under 2-fold cross validation and $Vol\_MACs$ is the number of multipliers required by the Volterra equalizer;
s23: the state transition process $(s_i, a_i, r_i, s_{i+1}, done)$ is then stored as experience into the experience playback pool;
s24: finally, the above operations are repeated until the specified number of experiences has been generated.
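A minimal sketch of the reward calculation in step s22, under the assumption that the logarithm is taken to base 10 (the base is not specified above):

```python
import numpy as np

def reward_value(ber_valid, vol_macs, mode="tradeoff"):
    """ber_valid: average BER from 2-fold cross validation;
    vol_macs: number of multipliers required by the current Volterra equalizer."""
    if mode == "performance":                       # purely pursuing a low BER
        return 100.0 * (1.0 - ber_valid)
    return -100.0 * ber_valid * np.log10(vol_macs)  # performance/complexity trade-off
```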
Further, the specific method of step S3 is as follows:
sampling N experiences $(s_i, a_i, r_i, s_{i+1}, done)$ from the experience playback pool using the data guide pool technique for Agent training, including the K experiences with the largest reward;

calculating the target Q value

$$y_i = r_i + \gamma\, Q_{w'}(s_{i+1}, \tilde{a}_{i+1}), \qquad \tilde{a}_{i+1} = \mu_{\theta'}(s_{i+1}) + \varepsilon, \qquad \varepsilon \sim \mathrm{clip}\bigl(N(0, \sigma^2), -c, c\bigr)$$

where $\mu_{\theta'}(s_{i+1})$ represents the action output by the Target Actor network $\mu_{\theta'}$ when the state is $s_{i+1}$; $\varepsilon$ is the policy noise, which obeys a Gaussian distribution with mean 0 and variance $\sigma^2$ and is truncated between $-c$ and $c$; $N(0, \sigma^2)$ represents a Gaussian distribution with mean 0 and variance $\sigma^2$; clip represents truncation; $\gamma$ is the discount factor; and $Q_{w'}(s_{i+1}, \tilde{a}_{i+1})$ is the output of the Target Critic network $Q_{w'}$ when the state is $s_{i+1}$ and the action is $\tilde{a}_{i+1}$;

updating the Critic network $Q_w$ by minimizing the error

$$L = \frac{1}{N}\sum_{i}\bigl(y_i - Q_w(s_i, a_i)\bigr)^2$$

where $Q_w(s_i, a_i)$ denotes the output of the Critic network $Q_w$ when the state is $s_i$ and the action is $a_i$;

every d steps, updating the Actor network $\mu_\theta$ by minimizing the error

$$L_a = -\frac{1}{N}\sum_{i} Q_w\bigl(s_i, \mu_\theta(s_i)\bigr)$$

where $\mu_\theta(s_i)$ denotes the action output by the Actor network $\mu_\theta$ when the state is $s_i$, and $Q_w(s_i, \mu_\theta(s_i))$ denotes the output of the Critic network $Q_w$ when the state is $s_i$ and the action is $\mu_\theta(s_i)$; and soft updating the Target Actor network $\mu_{\theta'}$ and the Target Critic network $Q_{w'}$ as follows:

$$\theta' \leftarrow \tau\theta + (1-\tau)\theta', \qquad w' \leftarrow \tau w + (1-\tau)w'$$

where $\tau$ is a positive number much smaller than 1 that regulates the degree of the soft update.
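The data guide pool sampling used above can be sketched as follows, assuming the `ReplayPool` structure illustrated earlier (for simplicity, possible overlap between the K best experiences and the random draw is ignored):

```python
import random

def guided_sample(pool, n, k):
    """Return n experiences: the k highest-reward experiences stored so far,
    plus n - k experiences drawn uniformly at random."""
    ranked = sorted(pool.buffer, key=lambda e: e.reward, reverse=True)
    return ranked[:k] + random.sample(list(pool.buffer), n - k)
```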
Further, the specific method of step S4 is as follows:
the updated Agent starts generating actions from the initial state of the Volterra equalizer, i.e. the ratio of each memory length to its maximum memory length limit, and updates the memory length state of the Volterra equalizer; the Agent continues to generate actions according to the current state until the memory length state of the Volterra equalizer is updated to the end state, and exploration noise $\epsilon$ obeying a Gaussian distribution with mean 0 and variance $\sigma^2$ is added to each action generated by the Agent;

the memory length of each order is determined from the maximum memory length limit of each order of the Volterra equalizer and the Agent's actions, 2-fold cross validation is performed on the signal data, and the reward value is calculated from the average bit error rate after equalization and the complexity of the equalizer; the state transition process $(s_i, a_i, r_i, s_{i+1}, done)$ is stored as experience into the experience playback pool; step S3 is then executed;

after each update, the variance of the exploration noise $\epsilon$ is attenuated: $\sigma^2 \leftarrow \sigma^2\xi^n$, where $\xi$ is the decay rate and $n$ is the number of updates;

the above operations are repeated until the absolute value of the difference between the current reward value and the previous reward value is less than $\chi_1$ and the absolute value of the difference between the current Agent output action and the previous Agent output action is less than $\chi_2$, at which point the training is judged to have converged, where $\chi_1\ge 0$ and $\chi_2\ge 0$ are preset decision thresholds; finally, the memory length of each position of the Volterra equalizer is determined from the converged Agent output actions and the maximum memory length limit of each position, completing the determination of the optimal structure of the Volterra equalizer.
Example 2:
example 2 is a preferred example of example 1.
The invention takes the optimization of a third-order Volterra equalizer as an example to experimentally demonstrate its effectiveness. A C-band direct-detection system is considered: at the transmitter, an arbitrary waveform generator (AWG) generates a 50 Gbps PAM4 signal, which is amplified by an electrical amplifier (EA) and then modulated by a C-band, 10 GHz-class Mach-Zehnder modulator (MZM); meanwhile, the AWG loads a 100 Mbps NRZ signal onto a directly modulated laser (DML) to broaden the central carrier and suppress the power-dependent stimulated Brillouin scattering (SBS) effect. The signal is amplified by an erbium-doped fiber amplifier (EDFA), transmitted over 20 km of standard single-mode fiber (SSMF), and received by a 30 GHz-class avalanche photodetector (APD), after which offline digital signal processing (DSP) is performed, including synchronization, resampling, equalization, decision, decoding and so on; the overall 3 dB bandwidth of the system is about 6 GHz, and the specific system configuration is shown in fig. 2.
The APD received power is fixed at -15 dBm, the fiber launch power is varied from 8 dBm to 20 dBm, and at each launch power 3 sets of 98304 symbols generated with different random number seeds are transmitted.
The computing resource limit of the equalizer is set to 1000 MACs, where MACs denotes the number of multipliers; the input signal of the equalizer is $x(n)$, the equalized output signal is $y(n)$, and the output signal obtained after a symbol decision on $y(n)$ is $\hat{y}(n)$.
The maximum memory length limit of each order of the different types of 3-order Volterra equalizers is calculated as follows:
(1) for a third-order feedforward type Volterra equalizer (Volterra-FFE), the formula is:
$$y(n)=\sum_{i=-L_1}^{L_1}h_1(i)\,x(n-i)+\sum_{-L_2\le i\le j\le L_2}h_2(i,j)\,x(n-i)\,x(n-j)+\sum_{-L_3\le i\le j\le k\le L_3}h_3(i,j,k)\,x(n-i)\,x(n-j)\,x(n-k)$$

wherein $h_1(i)$, $h_2(i,j)$, $h_3(i,j,k)$ represent the 1st-, 2nd- and 3rd-order tap weights of the Volterra-FFE, respectively, and $2L_1+1$, $2L_2+1$, $2L_3+1$ represent the 1st-, 2nd- and 3rd-order memory lengths of the Volterra-FFE, respectively; the numbers of 1st-, 2nd- and 3rd-order taps of the Volterra-FFE are $2L_1+1$, $(2L_2+1)(2L_2+2)/2$ and $(2L_3+1)(2L_3+2)(2L_3+3)/6$, respectively, and the numbers of multipliers required are $2L_1+1$, $(2L_2+1)(2L_2+2)$ and $(2L_3+1)(2L_3+2)(2L_3+3)/2$, respectively;

Let the number of currently remaining multipliers be $rest\_MACs$. The 1st-order maximum memory length limit $L_{limit}(1)$ is calculated as follows:

$$2L_{limit}(1)+1\le rest\_MACs \;\Rightarrow\; L_{limit}(1)=\left\lfloor\frac{rest\_MACs-1}{2}\right\rfloor$$

where $\lfloor\cdot\rfloor$ represents rounding down;

The 2nd-order maximum memory length limit $L_{limit}(2)$ is calculated as follows:

$$(2L_{limit}(2)+1)(2L_{limit}(2)+2)\le rest\_MACs \;\Rightarrow\; L_{limit}(2)=\left\lfloor\frac{\sqrt{4\,rest\_MACs+1}-3}{4}\right\rfloor$$

The 3rd-order maximum memory length limit $L_{limit}(3)$ is calculated as follows:

$$(2L_{limit}(3)+1)(2L_{limit}(3)+2)(2L_{limit}(3)+3)/2\le rest\_MACs$$

which simplifies to:

$$4(L_{limit}(3))^3+12(L_{limit}(3))^2+11L_{limit}(3)+3-rest\_MACs\le 0$$

Solving the above inequality by Shengjin's formula: let $a=4$, $b=12$, $c=11$, $d=3-rest\_MACs$, $A=b^2-3ac$, $B=bc-9ad$, $C=c^2-3bd$; then

$$L_{limit}(3)=\left\lfloor\frac{-b-\left(\sqrt[3]{Y_1}+\sqrt[3]{Y_2}\right)}{3a}\right\rfloor$$

where

$$Y_1=Ab+\frac{3a\left(-B+\sqrt{B^2-4AC}\right)}{2},\qquad Y_2=Ab+\frac{3a\left(-B-\sqrt{B^2-4AC}\right)}{2}$$
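The three limits above can equivalently be obtained by a simple integer search over the multiplier budgets, as sketched below (the function name is illustrative); with the 1000-MACs budget used later in this embodiment it returns (499, 15, 5).

```python
def ffe_max_memory_limits(rest_macs):
    """Largest L satisfying the per-order multiplier budgets of a 3rd-order Volterra-FFE:
    order 1: 2L + 1 <= rest_MACs
    order 2: (2L + 1)(2L + 2) <= rest_MACs
    order 3: (2L + 1)(2L + 2)(2L + 3) / 2 <= rest_MACs"""
    def largest(cost):
        L = 0
        while cost(L + 1) <= rest_macs:
            L += 1
        return L
    L1 = largest(lambda L: 2 * L + 1)
    L2 = largest(lambda L: (2 * L + 1) * (2 * L + 2))
    L3 = largest(lambda L: (2 * L + 1) * (2 * L + 2) * (2 * L + 3) // 2)
    return L1, L2, L3
```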
(2) for a third order feedback type Volterra equalizer (Volterra-DFE), the formula is:
$$y(n)=\sum_{i=-L_1}^{L_1}h_1(i)\,x(n-i)+\sum_{-L_2\le i\le j\le L_2}h_2(i,j)\,x(n-i)\,x(n-j)+\sum_{-L_3\le i\le j\le k\le L_3}h_3(i,j,k)\,x(n-i)\,x(n-j)\,x(n-k)+\sum_{i=1}^{L_{fb}}h_{fb}(i)\,\hat{y}(n-i)$$

wherein $h_1(i)$, $h_2(i,j)$, $h_3(i,j,k)$, $h_{fb}(i)$ represent the 1st-, 2nd-, 3rd-order and feedback tap weights of the Volterra-DFE, respectively, and $2L_1+1$, $2L_2+1$, $2L_3+1$, $L_{fb}$ represent the 1st-, 2nd-, 3rd-order and feedback memory lengths of the Volterra-DFE, respectively; the numbers of 1st-, 2nd-, 3rd-order and feedback taps of the Volterra-DFE are $2L_1+1$, $(2L_2+1)(2L_2+2)/2$, $(2L_3+1)(2L_3+2)(2L_3+3)/6$ and $L_{fb}$, respectively, and the numbers of multipliers required are $2L_1+1$, $(2L_2+1)(2L_2+2)$, $(2L_3+1)(2L_3+2)(2L_3+3)/2$ and $L_{fb}$, respectively;

With reference to the Volterra-FFE, let the number of currently remaining multipliers be $rest\_MACs$; the 1st-order maximum memory length limit of the Volterra-DFE is calculated as:

$$L_{limit}(1)=\left\lfloor\frac{rest\_MACs-1}{2}\right\rfloor$$

the 2nd-order maximum memory length limit is calculated as:

$$L_{limit}(2)=\left\lfloor\frac{\sqrt{4\,rest\_MACs+1}-3}{4}\right\rfloor$$

the 3rd-order maximum memory length limit is calculated as:

$$L_{limit}(3)=\left\lfloor\frac{-b-\left(\sqrt[3]{Y_1}+\sqrt[3]{Y_2}\right)}{3a}\right\rfloor,\qquad Y_{1,2}=Ab+\frac{3a\left(-B\pm\sqrt{B^2-4AC}\right)}{2}$$

with $a$, $b$, $c$, $d$, $A$, $B$, $C$ defined as for the Volterra-FFE;

the feedback maximum memory length limit is calculated as:

$$L_{limit}(4)=rest\_MACs$$
(3) for a third-order structured pruning type Volterra equalizer (Volterra-Pruning), the formula is expressed as follows:

$$y(n)=\sum_{i=-L_1}^{L_1}h_1(i)\,x(n-i)+\sum_{\substack{-L_2\le i\le j\le L_2\\ j-i<L_{2d}}}h_2(i,j)\,x(n-i)\,x(n-j)+\sum_{\substack{-L_3\le i\le j\le k\le L_3\\ k-i<L_{3d}}}h_3(i,j,k)\,x(n-i)\,x(n-j)\,x(n-k)$$

wherein $h_1(i)$, $h_2(i,j)$, $h_3(i,j,k)$ represent the 1st-, 2nd- and 3rd-order tap weights of the Volterra-Pruning equalizer, respectively, $2L_1+1$, $2L_2+1$, $2L_3+1$ represent its 1st-, 2nd- and 3rd-order memory lengths, respectively, and $L_{2d}$, $L_{3d}$ represent the 2nd- and 3rd-order correlation distances of the Volterra-Pruning equalizer, used to regulate the degree of pruning;

With reference to the Volterra-FFE, let the number of currently remaining multipliers be $rest\_MACs$; the 1st-order maximum memory length limit of the Volterra-Pruning equalizer is calculated as:

$$L_{limit}(1)=\left\lfloor\frac{rest\_MACs-1}{2}\right\rfloor$$

the 2nd-order maximum memory length limit is calculated as:

$$L_{limit}(2)=\left\lfloor\frac{\sqrt{4\,rest\_MACs+1}-3}{4}\right\rfloor$$

the maximum limit of the 2nd-order correlation distance is calculated as:

$$L_{limit}(3)=2L_2+1$$

the 3rd-order maximum memory length limit is calculated as:

$$L_{limit}(4)=\left\lfloor\frac{-b-\left(\sqrt[3]{Y_1}+\sqrt[3]{Y_2}\right)}{3a}\right\rfloor,\qquad Y_{1,2}=Ab+\frac{3a\left(-B\pm\sqrt{B^2-4AC}\right)}{2}$$

with $a$, $b$, $c$, $d$, $A$, $B$, $C$ defined as for the Volterra-FFE;

the maximum limit of the 3rd-order correlation distance is calculated as:

$$L_{limit}(5)=2L_3+1$$
the invention adopts a depth deterministic strategy gradient (DDPG) algorithm in the deep reinforcement learning as an Agent to determine the memory length or the related distance of each order of the Volterra equalizer step by step. The DDPG algorithm belongs to an Actor-critical framework, and an Actor part comprises an Actor network muθAnd corresponding Target Actor network
Figure BDA0003423734980000173
The Critic part comprises a Critic network QwAnd corresponding Target critical network
Figure BDA0003423734980000174
The Actor part is responsible for making decisions of as high value as possible in the current environmental state, and the Critic part is responsible for making as accurate an estimate as possible of the value of the action taken in the current environmental state.
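A minimal PyTorch sketch of one possible Actor/Critic network pair is given below; the layer sizes and activations are assumptions for illustration (the patent does not specify the network architecture), and the Sigmoid output keeps the action, i.e. the memory-length ratio, in [0, 1].

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu_theta: maps a memory length state to an action (ratio) in [0, 1]."""
    def __init__(self, state_dim, action_dim=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Sigmoid())

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Q_w: estimates the value of a (state, action) pair."""
    def __init__(self, state_dim, action_dim=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Target networks start as copies of the online networks (theta' = theta, w' = w):
# target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
```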
Specifically, the Agent receives the current memory length state of the Volterra equalizer as input and outputs an action, namely the ratio of the current order's memory length to its maximum memory length limit; the Volterra equalizer then updates its memory length state and calculates the reward value; this process is repeated, and the Agent continuously optimizes its policy according to the rewards. For the different types of Volterra equalizers, the initial memory length state and the state transition process are defined as follows:
(1) for a third order feedforward type Volterra equalizer (Volterra-FFE):
The memory length state of the Volterra-FFE is defined as $(id, L_{limit}(id), l_1, l_2, l_3)$, where $id$ denotes the index of the memory length position to be determined and $L_{limit}(id)$ denotes the maximum memory length limit of the $id$-th position; $l_1$, $l_2$, $l_3$ respectively denote the ratios of the 1st-, 2nd- and 3rd-order memory lengths to their maximum memory length limits; the initial memory length state of the Volterra-FFE equalizer is $s_0=(1, L_{limit}(1), 0, 0, 0)$; starting from the initial state, the memory lengths of the positions are determined one by one in the order 1st order, 2nd order, 3rd order, and the whole state transition process is:

$$s_0=(1,L_{limit}(1),0,0,0)\;\xrightarrow{a_0=l_1,\;r_0}\;s_1=(2,L_{limit}(2),l_1,0,0),\quad done=\text{False}$$

$$s_1=(2,L_{limit}(2),l_1,0,0)\;\xrightarrow{a_1=l_2,\;r_1}\;s_2=(3,L_{limit}(3),l_1,l_2,0),\quad done=\text{False}$$

$$s_2=(3,L_{limit}(3),l_1,l_2,0)\;\xrightarrow{a_2=l_3,\;r_2}\;s_3=(0,0,l_1,l_2,l_3),\quad done=\text{True}$$

done is the flag indicating whether the whole state transition process has ended; when $done=\text{True}$, let $r_0=r_1=r_2=\text{reward}$, where reward denotes the reward value calculated, after the whole state transition process has ended, from the complexity of the current equalizer and the average bit error rate obtained by 2-fold cross validation on the signal data; in the above, $s_0$ denotes the initial state, $s_3$ denotes the end state, $s_1$ and $s_2$ denote intermediate states, $a_0$, $a_1$, $a_2$ denote the actions output by the Agent, and $r_0$, $r_1$, $r_2$ denote the rewards obtained during the state transitions;
(2) for a third order feedback type Volterra equalizer (Volterra-DFE):
The memory length state of the Volterra-DFE is defined as $(id, L_{limit}(id), l_1, l_2, l_3, l_{fb})$, where $id$ denotes the index of the memory length position to be determined and $L_{limit}(id)$ denotes the maximum memory length limit of the $id$-th position; $l_1$, $l_2$, $l_3$, $l_{fb}$ denote the ratios of the 1st-, 2nd-, 3rd-order and feedback memory lengths to their maximum memory length limits; the initial memory length state of the Volterra-DFE equalizer is $s_0=(1, L_{limit}(1), 0, 0, 0, 0)$; starting from the initial state, the memory lengths of the positions are determined one by one in the order 1st order, 2nd order, 3rd order, feedback, and the whole state transition process is:

$$s_0=(1,L_{limit}(1),0,0,0,0)\;\xrightarrow{a_0=l_1,\;r_0}\;s_1=(2,L_{limit}(2),l_1,0,0,0),\quad done=\text{False}$$

$$s_1=(2,L_{limit}(2),l_1,0,0,0)\;\xrightarrow{a_1=l_2,\;r_1}\;s_2=(3,L_{limit}(3),l_1,l_2,0,0),\quad done=\text{False}$$

$$s_2=(3,L_{limit}(3),l_1,l_2,0,0)\;\xrightarrow{a_2=l_3,\;r_2}\;s_3=(4,L_{limit}(4),l_1,l_2,l_3,0),\quad done=\text{False}$$

$$s_3=(4,L_{limit}(4),l_1,l_2,l_3,0)\;\xrightarrow{a_3=l_{fb},\;r_3}\;s_4=(0,0,l_1,l_2,l_3,l_{fb}),\quad done=\text{True}$$

when $done=\text{True}$, let $r_0=r_1=r_2=r_3=\text{reward}$, where reward denotes the reward value calculated, after the whole state transition process has ended, from the complexity of the current equalizer and the average bit error rate obtained by 2-fold cross validation on the signal data; in the above, $s_0$ denotes the initial state, $s_4$ denotes the end state, $s_1$, $s_2$, $s_3$ denote intermediate states, $a_0$, $a_1$, $a_2$, $a_3$ denote the actions output by the Agent, and $r_0$, $r_1$, $r_2$, $r_3$ denote the rewards obtained during the state transitions;
(3) for a third-order structured pruning type Volterra equalizer (Volterra-Pruning):
The memory length state of the Volterra-Pruning equalizer is defined as $(id, L_{limit}(id), l_1, l_2, l_{2d}, l_3, l_{3d})$, where $id$ denotes the index of the memory length or correlation distance position to be determined and $L_{limit}(id)$ denotes the maximum memory length limit or maximum correlation distance limit of the $id$-th position; $l_1$, $l_2$, $l_{2d}$, $l_3$, $l_{3d}$ denote the ratios of the 1st-order memory length, the 2nd-order memory length, the 2nd-order correlation distance, the 3rd-order memory length and the 3rd-order correlation distance to their maximum limits; the initial memory length state of the Volterra-Pruning equalizer is $s_0=(1, L_{limit}(1), 0, 0, 0, 0, 0)$; starting from the initial state, the memory lengths and correlation distances of the positions are determined one by one in the order 1st-order memory length, 2nd-order memory length, 2nd-order correlation distance, 3rd-order memory length, 3rd-order correlation distance, and the whole state transition process is:

$$s_0=(1,L_{limit}(1),0,0,0,0,0)\;\xrightarrow{a_0=l_1,\;r_0}\;s_1=(2,L_{limit}(2),l_1,0,0,0,0),\quad done=\text{False}$$

$$s_1\;\xrightarrow{a_1=l_2,\;r_1}\;s_2=(3,L_{limit}(3),l_1,l_2,0,0,0),\quad done=\text{False}$$

$$s_2\;\xrightarrow{a_2=l_{2d},\;r_2}\;s_3=(4,L_{limit}(4),l_1,l_2,l_{2d},0,0),\quad done=\text{False}$$

$$s_3\;\xrightarrow{a_3=l_3,\;r_3}\;s_4=(5,L_{limit}(5),l_1,l_2,l_{2d},l_3,0),\quad done=\text{False}$$

$$s_4\;\xrightarrow{a_4=l_{3d},\;r_4}\;s_5=(0,0,l_1,l_2,l_{2d},l_3,l_{3d}),\quad done=\text{True}$$

when $done=\text{True}$, let $r_0=r_1=r_2=r_3=r_4=\text{reward}$, where reward denotes the reward value calculated, after the whole state transition process has ended, from the complexity of the current equalizer and the average bit error rate obtained by 2-fold cross validation on the signal data; in the above, $s_0$ denotes the initial state, $s_5$ denotes the end state, $s_1$ to $s_4$ denote intermediate states, $a_0$ to $a_4$ denote the actions output by the Agent, and $r_0$ to $r_4$ denote the rewards obtained during the state transitions.
After the whole state transition process has ended, the memory length of each order is calculated from the Agent's actions, and 2-fold cross validation is performed with the Volterra equalizer, where the length of each fold of signal data is 16384. The learning rate for training the Volterra-FFE and Volterra-Pruning equalizers is set to 0.005 with 20 training epochs; the learning rate for training the Volterra-DFE is set to 0.001 with 20 training epochs. A reward value is then calculated from the average validation bit error rate $BER_{valid}$. The invention designs two reward value calculation modes to suit different application scenarios: for the scenario that purely pursues high performance (i.e. a low BER), the reward value is $reward = 100\times(1-BER_{valid})$; for the scenario that considers the trade-off between performance and complexity, such as the Volterra-Pruning equalizer, the reward value is $reward = -100\times BER_{valid}\times\log(Vol\_MACs)$, where $Vol\_MACs$ is the number of multipliers required by the Volterra equalizer.
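The 2-fold cross validation that produces $BER_{valid}$ can be sketched as follows; `train_and_test` is an assumed helper that fits the Volterra equalizer on one fold and returns the BER measured on the other.

```python
def two_fold_ber(train_and_test, rx, tx):
    """Split the data into two folds of 16384 symbols each, train on one fold,
    test on the other, and average the two bit error rates."""
    half = len(rx) // 2                                   # 16384 symbols per fold
    ber_a = train_and_test(rx[:half], tx[:half], rx[half:], tx[half:])
    ber_b = train_and_test(rx[half:], tx[half:], rx[:half], tx[:half])
    return 0.5 * (ber_a + ber_b)
```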
In order to improve the generalization of the DDPG algorithm's decisions and the stability of training, the invention uses policy noise and the data guide pool technique in the Agent training process. Policy noise smooths the policy expectation by adding noise perturbations to the output action of the Target Actor network; the data guide pool technique guides the Agent towards better decisions by regularly replaying the experiences with the highest reward values to the Agent during training. Accordingly, the steps for determining the optimal structure of the Volterra equalizer based on the DDPG algorithm are as follows:
Step-1: initialization stage: define the four neural networks in the Agent: the Actor network $\mu_\theta$, the Critic network $Q_w$, the Target Actor network $\mu_{\theta'}$, and the Target Critic network $Q_{w'}$; initialize the Actor network $\mu_\theta$ and the Critic network $Q_w$ in the Agent with random parameters $\theta$ and $w$, and initialize the Target Actor network $\mu_{\theta'}$ and the Target Critic network $Q_{w'}$ with random parameters $\theta'$ and $w'$, setting $\theta'$ equal to $\theta$ and $w'$ equal to $w$; initialize the experience playback pool;
Step-2: warm-up stage: starting from the initial state, the Agent generates random actions uniformly distributed on [0, 1], i.e. the ratio of each memory length to its maximum memory length limit, and updates the memory length state of the Volterra equalizer until the end state is reached; the memory length of each order is determined from the maximum memory length limit of each order of the Volterra equalizer and the Agent's actions, 2-fold cross validation is performed on the signal data, the reward value is calculated from the complexity of the equalizer and the average validation bit error rate after equalization, and the state transition process $(s_i, a_i, r_i, s_{i+1}, done)$ is then stored as experience into the experience playback pool; the above operations are repeated until the specified amount of experience has been generated;
Step-3: Agent training and updating: sample N experiences (s_i, a_i, r_i, s_{i+1}, done) from the experience playback pool for Agent training, including the K experiences with the largest reward.

Calculate the target Q value

y_i = r_i + γ·(1 − done_i)·Q_w′(s_{i+1}, μ_θ′(s_{i+1}) + ε), ε ~ clip(N(0, σ²), −c, c),

where μ_θ′(s_{i+1}) denotes the action output by the Target Actor network μ_θ′ when facing state s_{i+1}; ε is the strategy noise, obeying a Gaussian distribution with mean 0 and variance σ² truncated between −c and c; γ is the discount factor; and Q_w′(s_{i+1}, μ_θ′(s_{i+1}) + ε) is the output of the Target Critic network Q_w′ when receiving state s_{i+1} and the perturbed action as inputs.

Update the Critic network Q_w by minimizing the error L = N⁻¹·Σ_i (y_i − Q_w(s_i, a_i))². Every d steps, update the Actor network μ_θ through the policy gradient ∇_θJ = N⁻¹·Σ_i ∇_a Q_w(s_i, a)|_{a=μ_θ(s_i)}·∇_θ μ_θ(s_i), and soft-update the Target Actor network μ_θ′ and the Target Critic network Q_w′ according to

θ′ ← τθ + (1 − τ)θ′, w′ ← τw + (1 − τ)w′,

where τ is a positive number much smaller than 1 that regulates the degree of the soft update.
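The following PyTorch-style sketch illustrates Step-3 under the same assumptions as the earlier sketches (network classes, replay pool as a list); all hyperparameter values are placeholders rather than the patent's settings.

```python
import random
import torch
import torch.nn.functional as F

gamma, tau, sigma, c, d = 0.99, 0.005, 0.2, 0.5, 2   # placeholder hyperparameters
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def soft_update(target, source):
    # theta' <- tau * theta + (1 - tau) * theta'
    for tp, p in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)

def train_step(step, batch_size=32, k_best=4):
    # Sample N experiences, always including the K highest-reward ones
    # (the "data guide pool" idea described above).
    best = sorted(replay_pool, key=lambda e: e[2], reverse=True)[:k_best]
    batch = best + random.sample(replay_pool, batch_size - k_best)
    s, a, r, s2, done = map(
        lambda col: torch.tensor(col, dtype=torch.float32),
        zip(*[(e[0], [e[1]], [e[2]], e[3], [float(e[4])]) for e in batch]))

    with torch.no_grad():
        eps = torch.clamp(torch.randn_like(a) * sigma, -c, c)   # strategy noise
        a2 = torch.clamp(target_actor(s2) + eps, 0.0, 1.0)
        y = r + gamma * (1.0 - done) * target_critic(s2, a2)    # target Q value

    critic_loss = F.mse_loss(critic(s, a), y)       # L = N^-1 * sum_i (y_i - Q_w(s_i, a_i))^2
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    if step % d == 0:
        actor_loss = -critic(s, actor(s)).mean()    # deterministic policy gradient step
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        soft_update(target_actor, actor)            # soft updates of the target networks
        soft_update(target_critic, critic)
```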
Step-4: after the Agent is updated, it generates deterministic actions starting from the initial state of the Volterra equalizer and updates the memory length state of the Volterra equalizer to the end state; each action generated by the Agent is perturbed by search noise e drawn from a Gaussian distribution with mean 0 and variance σ²; the reward value is calculated from the average validation bit error rate of the 2-fold cross validation of the Volterra equalizer; the state transitions (s_i, a_i, r_i, s_{i+1}, done) in this process are stored into the experience playback pool, and then Step-3 is executed; after each update, the variance of the search noise e is attenuated as σ² ← σ²·ξ^n, where ξ is the attenuation rate and n is the number of updates.
The above operations are repeated until the absolute value of the difference between the current reward value and the previous reward value is less than χ1 and the absolute value of the difference between the current Agent output action and the previous Agent output action is less than χ2, at which point the training is judged to have converged, where χ1 ≥ 0 and χ2 ≥ 0 are set decision thresholds; finally, the memory length of each order of the Volterra equalizer is determined from the convergence value of the Agent output action and the maximum memory length limit of each order, completing the determination of the optimal structure of the Volterra equalizer.
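A compact sketch of the Step-4 loop with search-noise decay and the convergence test follows, reusing the run_episode and train_step helpers from the earlier sketches; the threshold and decay values are illustrative only.

```python
import torch

sigma2, xi = 0.04, 0.98        # initial search-noise variance and decay rate (assumed)
chi1, chi2 = 1e-3, 1e-3        # convergence thresholds on reward and action (assumed)
prev_reward, prev_actions = None, None

for n in range(1, 1001):
    # Deterministic Agent action plus decaying Gaussian search noise.
    policy = lambda s: actor(torch.tensor(s, dtype=torch.float32)).item()
    episode = run_episode(policy=policy, noise_sigma=sigma2 ** 0.5)
    replay_pool.extend(episode)
    train_step(step=n)
    sigma2 *= xi ** n          # variance attenuation: sigma^2 <- sigma^2 * xi^n

    reward = episode[-1][2]
    actions = [t[1] for t in episode]
    if prev_reward is not None and abs(reward - prev_reward) < chi1 \
            and max(abs(a - b) for a, b in zip(actions, prev_actions)) < chi2:
        break                  # reward value and output actions have converged
    prev_reward, prev_actions = reward, actions

# Determine each order's memory length from the converged actions and limits.
mem_lengths = [max(1, round(a * m)) for a, m in zip(actions, MAX_MEM)]
```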
After the structure of the Volterra equalizer is determined, 3 groups of signal data generated with different random number seeds are selected; the Volterra equalizer is trained on the first 32768 samples of each group and tested on the remaining 65536 samples; the average test bit error rate is then taken as the performance evaluation index and the number of consumed multipliers as the complexity evaluation index.
Fig. 3 and Fig. 4 respectively show bit error rate comparisons between the invention and the greedy-search-based Volterra-FFE and Volterra-DFE when the reward mode that simply pursues high performance is adopted. As seen in Fig. 3, when the transmission power is 14dBm or higher, the bit error rate of the Volterra-FFE structure searched by the invention improves to a certain extent over the greedy-search structure, and the improvement is most pronounced at 16dBm and 18dBm, where the bit error rate was already relatively good. As seen in Fig. 4, the bit error rate of the Volterra-DFE structure searched by the invention improves significantly over the greedy-search structure; in particular, at a transmission power of 16dBm the two curves differ greatly. At this power the 1st-order, 2nd-order, 3rd-order and feedback memory lengths found by greedy search are [117, 29, 1, 7], consuming 997 multipliers, whereas those found by the invention are [105, 15, 7, 11], consuming 608 multipliers. Because greedy search determines the optimal memory length order by order, extreme situations can occur: most of the computing resources may be greedily allocated to the 1st and 2nd orders, leaving only a small selection space for the 3rd-order and feedback memory lengths and resulting in a poorly performing structure. The invention, based on the DDPG algorithm, samples over the whole structure space and continuously optimizes according to the reward value to learn an optimal structure.
Fig. 5 and Fig. 6 respectively show the bit error rate comparison and the complexity comparison between the invention and the greedy-search-based Volterra-FFE when the reward mode that trades off performance against complexity is adopted. 'Volterra-FFE-TradeOff' in the legend means that the trade-off reward mode was selected when using the invention to search the optimal Volterra-FFE structure. As seen in Fig. 5, the Volterra-FFE-TradeOff structure searched by the invention suffers only a small bit error rate penalty compared with both the Volterra-FFE structure searched by the invention and the greedy-search Volterra-FFE structure. In terms of complexity, Fig. 6 shows that the Volterra-FFE-TradeOff structure searched by the invention is much smaller than the greedy-search result. The invention can therefore greatly reduce the complexity of the Volterra-FFE at the cost of only a small equalization performance loss, achieving a good trade-off between equalization performance and complexity for the Volterra-FFE.
Fig. 7 and Fig. 8 respectively show the bit error rate comparison and the complexity comparison between the invention and the greedy-search-based Volterra-DFE when the trade-off reward mode is adopted; 'Volterra-DFE-TradeOff' in the legend means that the trade-off reward mode was selected when using the invention to search the optimal Volterra-DFE structure. As seen in Fig. 7, the bit error rate of the Volterra-DFE-TradeOff structure searched by the invention is close to that of the Volterra-DFE structure searched by the invention and is generally better than that of the greedy-search-based Volterra-DFE. In terms of complexity, Fig. 8 shows two points. First, compared with the Volterra-DFE structure searched by the invention, the Volterra-DFE-TradeOff structure achieves a significant complexity reduction at nearly the same bit error rate; for example, at a transmission power of 16dBm the two structures have the same bit error rate, yet the complexity of the Volterra-DFE-TradeOff structure is reduced by more than half. Second, compared with the greedy-search-based Volterra-DFE, the Volterra-DFE-TradeOff structure is generally superior in bit error rate, while at some transmission powers, such as 8dBm, 16dBm and 18dBm, its complexity is far lower. The invention therefore achieves a good trade-off between equalization performance and complexity for the Volterra-DFE, and shows excellent structural generalization and training stability.
Fig. 9 and Fig. 10 respectively show the bit error rate comparison and the complexity comparison between the invention and the greedy-search-based Volterra-Pruning when the trade-off reward mode is adopted; 'greedy search Volterra-Pruning' in the legend means that the 2nd-order and 3rd-order correlation distances in Volterra-Pruning are determined by greedy search on the basis of the optimal Volterra-FFE structure searched by the invention. As seen in Fig. 9, the bit error rates of the Volterra-FFE-TradeOff structure searched by the invention, the greedy-search Volterra-Pruning structure and the Volterra-Pruning structure searched by the invention are very close. In terms of complexity, Fig. 10 shows that the Volterra-Pruning structure searched by the invention is further reduced compared with the greedy-search Volterra-Pruning structure. The invention can therefore improve the pruning quality of Volterra-Pruning and obtain a more compact equalizer structure.
Those skilled in the art will appreciate that, in addition to being implemented purely as computer-readable program code, the systems, apparatus and modules provided by the present invention can achieve the same procedures entirely by logically programming the method steps, so that the systems, apparatus and modules are realized in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the systems, apparatus and modules provided by the present invention may be regarded as hardware components, and the modules they contain for implementing various programs may also be regarded as structures within the hardware components; modules for performing various functions may likewise be regarded both as software programs for performing the methods and as structures within the hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A method for optimizing a Volterra equalizer structure based on deep reinforcement learning is characterized by comprising the following steps:
step S1: initializing an Agent of an intelligent Agent, initializing an experience playback pool, initializing a memory length state of a Volterra equalizer and defining a state transition process;
step S2: starting from the initial memory length state of the Volterra equalizer, randomly generating action for the Agent, updating the memory length state of the Volterra equalizer until the memory length state reaches the end state, calculating a reward value according to the complexity of the Volterra equalizer and the error rate after signal equalization, storing the transfer process as experience into an experience playback pool, and circulating from the initial state again until a specified amount of experience is generated;
step S3: sampling experiences from an experience playback pool, training the agents, and then performing soft updating on the agents every preset step number;
step S4: generating deterministic actions by the updated Agent from the initial memory length state of the Volterra equalizer until the state transition process is finished, calculating a reward value and storing the transition process into the experience playback pool, then repeating step S3 and step S4 until the reward value and the actions output by the Agent converge, and finally determining each order of memory length of the Volterra equalizer according to the convergence value.
2. The method for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning of claim 1, wherein the step S1 comprises:
step S11: defining the four neural networks in the Agent: the Actor network μ_θ, the Critic network Q_w, the Target Actor network μ_θ′ and the Target Critic network Q_w′; initializing the Actor network μ_θ and the Critic network Q_w with random parameters θ and w; initializing the Target Actor network μ_θ′ and the Target Critic network Q_w′ with parameters θ′ and w′, wherein θ′ is set equal to θ and w′ is set equal to w;
step S12: initializing the experience playback pool, which stores experiences in the format (s_i, a_i, r_i, s_{i+1}, done), wherein s_i represents the current memory length state of the Volterra equalizer; a_i represents the action generated by the Agent according to the current state s_i, namely the proportion of the memory length of the current step within its maximum memory length limit; r_i represents the reward obtained by the Agent when taking action a_i in state s_i; s_{i+1} represents the memory length state of the Volterra equalizer after the Agent takes action a_i; and done is a flag indicating whether the whole state transition process is finished;
step S13: initializing the memory length state of the Volterra equalizer according to the type of the Volterra equalizer and defining the state transition process.
3. The method for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning of claim 2, wherein the step S2 includes:
step S21: selecting a state transition process according to the type of the Volterra equalizer, starting by the Agent from an initial state, generating random actions uniformly distributed according to [0, 1], updating the memory length state of the Volterra equalizer, and continuing to generate the random actions according to the current state until the memory length state of the Volterra equalizer is updated to an end state;
step S22: calculating a reward value, determining the memory length of each order according to the maximum memory length limit of each order of the Volterra equalizer and the action of the Agent, performing 2-fold cross validation on signal data, and calculating the reward value by using the complexity of the current equalizer and the average error rate after equalization;
step S23: storing the state transitions (s_i, a_i, r_i, s_{i+1}, done) in this process as experiences into the experience playback pool;
step S24: the steps S21 to S23 are repeated until a preset number of experiences are generated.
4. The method for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning of claim 1, wherein the step S3 comprises:
sampling N experiences (s_i, a_i, r_i, s_{i+1}, done) from the experience playback pool by using the data guide pool technique for Agent training, including the K experiences with the largest reward;

calculating the target Q value y_i = r_i + γ·(1 − done_i)·Q_w′(s_{i+1}, μ_θ′(s_{i+1}) + ε), ε ~ clip(N(0, σ²), −c, c), wherein μ_θ′(s_{i+1}) represents the action output by the Target Actor network μ_θ′ when facing state s_{i+1}; ε is the strategy noise, obeying a Gaussian distribution with mean 0 and variance σ² and truncated between −c and c; N(0, σ²) represents a Gaussian distribution with mean 0 and variance σ²; clip represents truncation; γ is the discount factor; and Q_w′(s_{i+1}, μ_θ′(s_{i+1}) + ε) represents the output of the Target Critic network Q_w′ when facing state s_{i+1} and the perturbed action;

updating the Critic network Q_w by minimizing the error L = N⁻¹·Σ_i (y_i − Q_w(s_i, a_i))², wherein Q_w(s_i, a_i) represents the output of the Critic network Q_w when facing state s_i and action a_i;

every d steps, updating the Actor network μ_θ by minimizing the error L_a = −N⁻¹·Σ_i Q_w(s_i, μ_θ(s_i)), wherein μ_θ(s_i) represents the action output by the Actor network μ_θ when facing state s_i, and Q_w(s_i, μ_θ(s_i)) represents the output of the Critic network Q_w when facing state s_i and action μ_θ(s_i); and performing soft updates of the Target Actor network μ_θ′ and the Target Critic network Q_w′ according to θ′ ← τθ + (1 − τ)θ′ and w′ ← τw + (1 − τ)w′, wherein τ is a positive number much smaller than 1 and regulates the degree of the soft update.
5. The method for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning of claim 1, wherein the step S4 comprises:
the updated Agent generates actions starting from the initial state of the Volterra equalizer and updates the memory length state of the Volterra equalizer, continuing to generate actions according to the current state until the memory length state of the Volterra equalizer is updated to the end state, wherein each action generated by the Agent is added with search noise e obeying a Gaussian distribution with mean 0 and variance σ²;

determining the memory length of each order according to the maximum memory length limit of each order of the Volterra equalizer and the actions of the Agent, performing 2-fold cross validation on the signal data, and calculating the reward value from the complexity of the current equalizer and the average bit error rate after equalization; storing the state transitions (s_i, a_i, r_i, s_{i+1}, done) in this process as experiences into the experience playback pool; and then executing step S3;

after each update, attenuating the variance of the search noise e as σ² ← σ²·ξ^n, wherein ξ is the attenuation rate and n is the number of updates;

repeating the above operations until the absolute value of the difference between the current reward value and the previous reward value is less than χ1 and the absolute value of the difference between the current Agent output action and the previous Agent output action is less than χ2, at which point the training result is judged to have converged, wherein χ1 ≥ 0 and χ2 ≥ 0 are set decision thresholds; and finally determining the memory length of each order of the Volterra equalizer according to the convergence value of the Agent output action and the maximum memory length limit of each order, thereby completing the determination of the optimal structure of the Volterra equalizer.
6. A system for optimizing a Volterra equalizer structure based on deep reinforcement learning, comprising:
module M1: initializing an Agent of an intelligent Agent, initializing an experience playback pool, initializing a memory length state of a Volterra equalizer and defining a state transition process;
module M2: starting from the initial memory length state of the Volterra equalizer, randomly generating action for the Agent, updating the memory length state of the Volterra equalizer until the memory length state reaches the end state, calculating a reward value according to the complexity of the Volterra equalizer and the error rate after signal equalization, storing the transfer process as experience into an experience playback pool, and circulating from the initial state again until a specified amount of experience is generated;
module M3: sampling experiences from an experience playback pool, training the agents, and then performing soft updating on the agents every preset step number;
module M4: generating deterministic actions by the updated Agent from the initial memory length state of the Volterra equalizer until the state transition process is finished, calculating a reward value and storing the transition process into the experience playback pool, then repeating module M3 and module M4 until the reward value and the actions output by the Agent converge, and finally determining each order of memory length of the Volterra equalizer according to the convergence value.
7. The system for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning of claim 6, wherein the module M1 comprises:
module M11: defining the four neural networks in the Agent: the Actor network μ_θ, the Critic network Q_w, the Target Actor network μ_θ′ and the Target Critic network Q_w′; initializing the Actor network μ_θ and the Critic network Q_w with random parameters θ and w; initializing the Target Actor network μ_θ′ and the Target Critic network Q_w′ with parameters θ′ and w′, wherein θ′ is set equal to θ and w′ is set equal to w;
module M12: initializing the experience playback pool, which stores experiences in the format (s_i, a_i, r_i, s_{i+1}, done), wherein s_i represents the current memory length state of the Volterra equalizer; a_i represents the action generated by the Agent according to the current state s_i, namely the proportion of the memory length of the current step within its maximum memory length limit; r_i represents the reward obtained by the Agent when taking action a_i in state s_i; s_{i+1} represents the memory length state of the Volterra equalizer after the Agent takes action a_i; and done is a flag indicating whether the whole state transition process is finished;
module M13: initializing the memory length state of the Volterra equalizer according to the type of the Volterra equalizer and defining the state transition process.
8. The system for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning of claim 7, wherein the module M2 comprises:
module M21: selecting a state transition process according to the type of the Volterra equalizer, starting by the Agent from an initial state, generating random actions uniformly distributed according to [0, 1], updating the memory length state of the Volterra equalizer, and continuing to generate the random actions according to the current state until the memory length state of the Volterra equalizer is updated to an end state;
module M22: calculating a reward value, determining the memory length of each order according to the maximum memory length limit of each order of the Volterra equalizer and the action of the Agent, performing 2-fold cross validation on signal data, and calculating the reward value by using the complexity of the current equalizer and the average error rate after equalization;
module M23: storing the state transitions (s_i, a_i, r_i, s_{i+1}, done) in this process as experiences into the experience playback pool;
module M24: repeating the modules M21 to M23 until a predetermined number of experiences are generated.
9. The system for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning of claim 6, wherein the module M3 comprises:
sampling N experiences (s_i, a_i, r_i, s_{i+1}, done) from the experience playback pool by using the data guide pool technique for Agent training, including the K experiences with the largest reward;

calculating the target Q value y_i = r_i + γ·(1 − done_i)·Q_w′(s_{i+1}, μ_θ′(s_{i+1}) + ε), ε ~ clip(N(0, σ²), −c, c), wherein μ_θ′(s_{i+1}) represents the action output by the Target Actor network μ_θ′ when facing state s_{i+1}; ε is the strategy noise, obeying a Gaussian distribution with mean 0 and variance σ² and truncated between −c and c; N(0, σ²) represents a Gaussian distribution with mean 0 and variance σ²; clip represents truncation; γ is the discount factor; and Q_w′(s_{i+1}, μ_θ′(s_{i+1}) + ε) represents the output of the Target Critic network Q_w′ when facing state s_{i+1} and the perturbed action;

updating the Critic network Q_w by minimizing the error L = N⁻¹·Σ_i (y_i − Q_w(s_i, a_i))², wherein Q_w(s_i, a_i) represents the output of the Critic network Q_w when facing state s_i and action a_i;

every d steps, updating the Actor network μ_θ by minimizing the error L_a = −N⁻¹·Σ_i Q_w(s_i, μ_θ(s_i)), wherein μ_θ(s_i) represents the action output by the Actor network μ_θ when facing state s_i, and Q_w(s_i, μ_θ(s_i)) represents the output of the Critic network Q_w when facing state s_i and action μ_θ(s_i); and performing soft updates of the Target Actor network μ_θ′ and the Target Critic network Q_w′ according to θ′ ← τθ + (1 − τ)θ′ and w′ ← τw + (1 − τ)w′, wherein τ is a positive number much smaller than 1 and regulates the degree of the soft update.
10. The system for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning of claim 6, wherein the module M4 comprises:
the updated Agent generates actions starting from the initial state of the Volterra equalizer and updates the memory length state of the Volterra equalizer, continuing to generate actions according to the current state until the memory length state of the Volterra equalizer is updated to the end state, wherein each action generated by the Agent is added with search noise e obeying a Gaussian distribution with mean 0 and variance σ²;

determining the memory length of each order according to the maximum memory length limit of each order of the Volterra equalizer and the actions of the Agent, performing 2-fold cross validation on the signal data, and calculating the reward value from the complexity of the current equalizer and the average bit error rate after equalization; storing the state transitions (s_i, a_i, r_i, s_{i+1}, done) in this process as experiences into the experience playback pool; and then invoking module M3;

after each update, attenuating the variance of the search noise e as σ² ← σ²·ξ^n, wherein ξ is the attenuation rate and n is the number of updates;

repeating the above operations until the absolute value of the difference between the current reward value and the previous reward value is less than χ1 and the absolute value of the difference between the current Agent output action and the previous Agent output action is less than χ2, at which point the training result is judged to have converged, wherein χ1 ≥ 0 and χ2 ≥ 0 are set decision thresholds; and finally determining the memory length of each order of the Volterra equalizer according to the convergence value of the Agent output action and the maximum memory length limit of each order, thereby completing the determination of the optimal structure of the Volterra equalizer.
CN202111572693.8A 2021-12-21 2021-12-21 Method and system for optimizing Volterra equalizer structure based on deep reinforcement learning Active CN114338309B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111572693.8A CN114338309B (en) 2021-12-21 2021-12-21 Method and system for optimizing Volterra equalizer structure based on deep reinforcement learning


Publications (2)

Publication Number Publication Date
CN114338309A true CN114338309A (en) 2022-04-12
CN114338309B CN114338309B (en) 2023-07-25

Family

ID=81055145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111572693.8A Active CN114338309B (en) 2021-12-21 2021-12-21 Method and system for optimizing Volterra equalizer structure based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114338309B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019191099A1 (en) * 2018-03-26 2019-10-03 Zte Corporation Non-linear adaptive neural network equalizer in optical communication
US20200104714A1 (en) * 2018-10-01 2020-04-02 Electronics And Telecommunications Research Institute System and method for deep reinforcement learning using clustered experience replay memory
CN111461321A (en) * 2020-03-12 2020-07-28 南京理工大学 Improved deep reinforcement learning method and system based on Double DQN
CN112437020A (en) * 2020-10-30 2021-03-02 天津大学 Data center network load balancing method based on deep reinforcement learning


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115208721A (en) * 2022-06-23 2022-10-18 上海交通大学 Volterra-like neural network equalizer construction method and system
CN115208721B (en) * 2022-06-23 2024-01-23 上海交通大学 Volterra-like neural network equalizer construction method and system

Also Published As

Publication number Publication date
CN114338309B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
Häger et al. Physics-based deep learning for fiber-optic communication systems
JP7181303B2 (en) Optical Fiber Nonlinearity Compensation Using Neural Network
Deligiannidis et al. Performance and complexity analysis of bi-directional recurrent neural network models versus volterra nonlinear equalizers in digital coherent systems
Ming et al. Ultralow complexity long short-term memory network for fiber nonlinearity mitigation in coherent optical communication systems
Neskorniuk et al. End-to-end deep learning of long-haul coherent optical fiber communications via regular perturbation model
US20220239371A1 (en) Methods, devices, apparatuses, and medium for optical communication
CN113938188B (en) Construction method and application of optical signal-to-noise ratio monitoring model
CN114338309A (en) Method and system for optimizing Volterra equalizer structure based on deep reinforcement learning
CN114499723A (en) Optical fiber channel rapid modeling method based on Fourier neural operator
He et al. Fourier neural operator for accurate optical fiber modeling with low complexity
Shahkarami et al. Attention-based neural network equalization in fiber-optic communications
Neskorniuk et al. Memory-aware end-to-end learning of channel distortions in optical coherent communications
Kamiyama et al. Neural network nonlinear equalizer in long-distance coherent optical transmission systems
CN107231194A (en) Variable step equalization scheme based on convergence state in indoor visible light communication system
Barbosa et al. On a scalable path for multimode MIMO-DSP
Wang et al. Low-complexity nonlinear equalizer based on artificial neural network for 112 Gbit/s PAM-4 transmission using DML
Kuehl et al. Optimized bandwidth variable transponder configuration in elastic optical networks using reinforcement learning
Irukulapati et al. Tighter lower bounds on mutual information for fiber-optic channels
CN117097409A (en) Nonlinear compensation method, system, medium and equipment for optical communication
CN115208721B (en) Volterra-like neural network equalizer construction method and system
Zhao et al. Accurate nonlinear model beyond nonlinear noise power estimation
CN111988089B (en) Signal compensation method and system for optical fiber communication system
Tomczyk et al. Relaxing dispersion pre-distorsion constraints of receiver-based power profile estimators
CN107911322A (en) A kind of decision feedback equalization algorithm of low complex degree
CN112532314B (en) Method and device for predicting transmission performance of optical network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant