KR102234049B1

KR102234049B1 - Receiver, system and method for adaptive modulation based on reinforcement learning

Info

Publication number: KR102234049B1
Application number: KR1020190178557A
Authority: KR
Inventors: 김동인; 김진영; 이동구; 선영규; 김수현; 심이삭
Original assignee: 성균관대학교산학협력단
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2021-04-01
Also published as: KR102234049B9

Abstract

According to the present invention, a receiver to which a reinforcement learning-based adaptive modulation/demodulation scheme is applied comprises: a receiving unit that receives a pilot signal from a transmitter and estimates the state of a channel; a modulation/demodulation selection unit that selects a modulation/demodulation scheme according to the state of the channel based on reinforcement learning; and a feedback unit that transmits information on the selected modulation/demodulation scheme to the transmitter, wherein the modulation/demodulation selection unit performs reinforcement learning by adjusting threshold values of the signal-to-noise ratio of the channel and giving a reward in the direction of maximizing the spectral efficiency value.

Description

Receiver, system and method for adaptive modulation and demodulation based on reinforcement learning {RECEIVER, SYSTEM AND METHOD FOR ADAPTIVE MODULATION BASED ON REINFORCEMENT LEARNING}

The present invention relates to a receiver, a system, and a method for adaptive modulation and demodulation based on reinforcement learning, and more particularly to a receiver, a system, and a method that determine a modulation/demodulation scheme based on reinforcement learning according to the state of a communication channel.

Conventional adaptive modulation/demodulation systems use inefficient criteria for allocating modulation/demodulation schemes. Earlier studies have approached the problem with various optimization techniques, but the optimal solution is difficult to obtain in closed form, so numerical analysis and empirical methods predominate.

In the field of wireless communication, artificial intelligence techniques are increasingly applied to automated selection tasks. Supervised learning, however, requires a vast amount of data to train a model, and insufficient data degrades how well the model learns.

In contrast, reinforcement learning needs no separate external data: the reinforcement learning model acts, receives a reward for each action, and learns in the direction that maximizes the reward.

Meanwhile, conventional reinforcement learning-based adaptive modulation/demodulation techniques suffer from algorithmic complexity and the resulting difficulty of deriving an optimal solution (see Non-Patent Document 1).

In addition, conventional deep Q-network learning techniques suffer from algorithmic instability and divergence (see Non-Patent Document 2).

(Non-Patent Document 1) J. P. Leite, P. H. P. de Carvalho and R. D. Vieira, "A flexible framework based on reinforcement learning for adaptive modulation and coding in OFDM systems," Proc. 2012 IEEE Wireless Communications and Networking Conference (WCNC), pp. 809-814, Shanghai, China, Apr. 2012.
(Non-Patent Document 2) V. Mnih, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, Feb. 2015.
(Non-Patent Document 3) R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA, USA: MIT Press, 2018.
(Non-Patent Document 4) D. P. Kingma and J. L. Ba, "ADAM: A method for stochastic optimization," Proc. 3rd International Conference for Learning Representations, pp. 1-41, San Diego, USA, 2015.
(Non-Patent Document 5) S. T. Chung and A. J. Goldsmith, "Degrees of freedom in adaptive modulation: A unified view," IEEE Trans. on Communications, vol. 49, no. 9, pp. 1561-1571, Sept. 2001.
(Non-Patent Document 6) X. Song and J. Cheng, "Quantifying the accuracy of high SNR BER approximation of MPSK in fading channels," Proc. 2014 27th Biennial Symposium on Communications (QBSC), pp. 105-108, Kingston, ON, Canada, June 2014.
(Non-Patent Document 7) M. K. Simon and M.-S. Alouini, Digital Communication over Fading Channels: A Unified Approach to Performance Analysis, Wiley-Interscience, 2000.

To solve the problems described above, an object of the present invention is to provide a receiver, a system, and a method that maximize transmission efficiency by using reinforcement learning to establish the criteria for selecting a modulation/demodulation scheme suited to the channel state and by allocating the optimal modulation/demodulation scheme.

Other objects of the present invention that are not stated explicitly may be additionally considered within the scope that can be readily inferred from the following detailed description and its effects.

According to one aspect of the present embodiment for achieving the above object, a receiver to which a reinforcement learning-based adaptive modulation/demodulation scheme is applied includes: a reception unit that receives a pilot signal from a transmitter and estimates the state of a channel; a modulation/demodulation selection unit that selects a modulation/demodulation scheme according to the state of the channel based on reinforcement learning; and a feedback unit that transmits information on the selected modulation/demodulation scheme to the transmitter, wherein the modulation/demodulation selection unit performs reinforcement learning by adjusting threshold values of the signal-to-noise ratio of the channel and giving a reward in the direction of maximizing the spectral efficiency value.

The reinforcement learning is performed by a deep Q-network algorithm, in which the state is set to the threshold values that allocate the modulation schemes and the action is set to an adjustment that raises or lowers a threshold value.

At the start of every trial, when the bit error probability exceeds a preset bit error probability reference value, the modulation/demodulation selection unit retrieves the threshold values corresponding to the optimal spectral efficiency value from a state-value buffer and starts the trial.

According to another aspect of the present embodiment, a reinforcement learning-based modulation/demodulation method performed by a receiver includes: receiving a pilot signal from a transmitter; estimating the state of a channel; selecting a modulation/demodulation scheme according to the state of the channel based on reinforcement learning; and transmitting information on the selected modulation/demodulation scheme to the transmitter, wherein the reinforcement learning adjusts threshold values of the signal-to-noise ratio of the channel and gives a reward in the direction of maximizing the spectral efficiency value.

According to another aspect of the present embodiment, a reinforcement learning-based adaptive modulation/demodulation system includes: a transmitter that transmits a pilot signal; and a receiver that receives the pilot signal to estimate the state of a channel, selects a modulation/demodulation scheme according to the state of the channel based on reinforcement learning, and transmits information on the selected modulation/demodulation scheme to the transmitter, wherein the receiver performs reinforcement learning by adjusting threshold values of the signal-to-noise ratio of the channel and giving a reward in the direction of maximizing the spectral efficiency value.

As described above, according to the embodiments of the present invention, transmission efficiency can be maximized by using reinforcement learning to establish the criteria for selecting a modulation/demodulation scheme suited to the channel state and by allocating the optimal modulation/demodulation scheme.

In addition, according to the embodiments of the present invention, threshold values with the optimal spectral efficiency value can be derived within the range of the probability density distribution of the channel's signal-to-noise ratio and under a given bit error probability constraint.

In addition, according to the embodiments of the present invention, the threshold values converge quickly to the optimal modulation/demodulation scheme thresholds.

Even effects that are not explicitly mentioned here, namely the effects described in the following specification that are expected from the technical features of the present invention and their provisional effects, are treated as if they were described in the specification of the present invention.

FIG. 1 is a block diagram schematically showing the configuration of an adaptive modulation/demodulation system based on reinforcement learning according to embodiments of the present invention.
FIG. 2 is a flowchart illustrating the modulation/demodulation method of an adaptive modulation/demodulation system based on reinforcement learning according to embodiments of the present invention.
FIG. 3 is a block diagram schematically showing the configuration of the reinforcement learning-based modulation/demodulation selection unit of a receiver according to embodiments of the present invention.
FIG. 4 is a flowchart illustrating the modulation/demodulation method of the reinforcement learning-based modulation/demodulation selection unit of a receiver according to embodiments of the present invention.
FIG. 5 is a graph showing a simulation result of a modulation system based on reinforcement learning according to embodiments of the present invention.
FIG. 6 is a graph showing a simulation result of a modulation system based on reinforcement learning according to embodiments of the present invention in which the optimal threshold values are not loaded.
FIG. 7 is a graph showing a simulation result of a modulation system based on reinforcement learning according to embodiments of the present invention in which the optimal threshold values are loaded.

The present invention may be modified in various ways and may have various embodiments, so specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope. In describing each drawing, similar reference numerals are used for similar elements.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In the present application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

Hereinafter, an adaptive modulation/demodulation system based on reinforcement learning is described with reference to the accompanying drawings.

FIG. 1 is a block diagram schematically showing the configuration of an adaptive modulation/demodulation system based on reinforcement learning according to embodiments of the present invention.

The adaptive modulation/demodulation system based on reinforcement learning includes a transmitter 100 and a receiver 200.

The transmitter 100 includes a transmission unit 110 and a modulation unit 120, and the receiver 200 may include a reception unit 210, a channel detection unit 220, a modulation/demodulation selection unit 230, a feedback unit 240, and a demodulation unit 250.

The communication system may further include other components required for wireless communication, such as an antenna that radiates electromagnetic waves and a filter or amplifier required for signal processing.

The transmission unit 110 transmits a pilot signal.

The modulation unit 120 modulates a signal based on the modulation/demodulation scheme information received from the feedback unit 240, which is described later.

The reception unit 210 receives the pilot signal from the transmitter 100.

The channel detection unit 220 estimates the state of the channel based on the pilot signal. The estimated channel is derived in the form of a probability density function.

Various wireless communication channel models can be applied to the channel of an embodiment of the present invention; for example, a Rayleigh fading channel model can be applied.

The modulation/demodulation selection unit 230 selects a modulation/demodulation scheme according to the state of the channel based on reinforcement learning. The modulation/demodulation selection unit 230 performs reinforcement learning by adjusting threshold values of the signal-to-noise ratio of the channel and giving a reward in the direction of maximizing the spectral efficiency value. The reinforcement learning is performed by a deep Q-network algorithm, in which the state is set to the threshold values that allocate the modulation schemes and the action is set to an adjustment that raises or lowers a threshold value.

Preferably, the threshold values are reference values that divide the probability density function of the channel's signal-to-noise ratio into a plurality of modulation/demodulation scheme allocation regions.

At the start of every trial, the modulation/demodulation selection unit 230 retrieves the threshold values corresponding to the previously stored optimal spectral efficiency value and starts the trial.

The modulation/demodulation selection unit 230 calculates the bit error probability for the signal-to-noise ratio of the channel, and when the bit error probability exceeds a preset bit error probability reference value, it retrieves the threshold values corresponding to the optimal spectral efficiency value from the state-value buffer and starts the trial.

At the start of every trial, and also whenever the bit error probability exceeds the preset reference value, the modulation/demodulation selection unit 230 retrieves the threshold values with the highest spectral efficiency value among the results of the trials performed so far and starts the trial from them, so the reinforcement learning model converges faster.
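The reload behavior described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the class `StateValueBuffer`, the function `starting_thresholds`, and the flag `trial_start` are hypothetical names, and `BER_LIMIT` uses the 10^-3 reference value that the description gives only as an example.

```python
BER_LIMIT = 1e-3  # example reference value; the description says it is not limiting


class StateValueBuffer:
    """Keeps the thresholds that achieved the best spectral efficiency so far."""

    def __init__(self):
        self.best_se = float("-inf")
        self.best_thresholds = None

    def update(self, se, thresholds):
        # Store the thresholds only if they beat the best value seen so far.
        if se > self.best_se:
            self.best_se = se
            self.best_thresholds = tuple(thresholds)


def starting_thresholds(current, ber, buffer, trial_start):
    """Reload the stored optimum at a trial start or when the BER limit is exceeded."""
    reload = trial_start or ber > BER_LIMIT
    if reload and buffer.best_thresholds is not None:
        return buffer.best_thresholds
    # Nothing stored yet, or no reload needed: keep the current thresholds.
    return tuple(current)
```

Restarting from the best known thresholds keeps the search near a good region instead of letting it drift after a constraint violation, which is the convergence benefit the text claims.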

The feedback unit 240 transmits information on the selected modulation/demodulation scheme to the transmitter 100. If no signal acknowledging that the transmitter 100 has successfully received the scheme information is received, the feedback unit 240 may keep retransmitting the information on the selected modulation/demodulation scheme.

The demodulation unit 250 demodulates the signal received from the transmitter 100 according to the selected modulation/demodulation scheme and extracts the data.

FIG. 2 is a flowchart illustrating the modulation/demodulation method of an adaptive modulation/demodulation system based on reinforcement learning according to embodiments of the present invention.

The transmitter transmits a pilot signal for estimating the channel state.

The receiver receives the pilot signal from the transmitter (S210).

The receiver estimates the state of the channel (S220). Channel estimation using a pilot signal is well-known prior art, so a detailed description is omitted.

The receiver selects a modulation/demodulation scheme according to the state of the channel based on reinforcement learning (S230). The selection method is described later with reference to FIGS. 3 to 7.

The receiver transmits information on the selected modulation/demodulation scheme to the transmitter (S240).

The transmitter receives the modulation/demodulation scheme information, modulates the data according to that scheme, and transmits it (S250).

The receiver receives the modulated data and demodulates it with the demodulation scheme that corresponds to the modulation scheme (S260).

FIG. 3 is a block diagram schematically showing the configuration of the reinforcement learning-based modulation/demodulation selection unit of a receiver according to embodiments of the present invention.

The modulation/demodulation selection unit may include a state-value buffer 211, an environment model 212, a replay memory 213, a main network 214, a target network 215, and an optimization model 216.

The state-value buffer 211 stores, for each trial, the optimal spectral efficiency value and the modulation/demodulation allocation threshold values in the signal-to-noise ratio domain.

The environment model 212 handles the state, the action, and the reward.

In the environment model 212, the state is the set of threshold values that allocate the modulation schemes, the action is an adjustment that raises or lowers a threshold value, and the reward is given in the direction of maximizing the spectral efficiency value. The environment model 212 also imposes a penalty when the thresholds are in an invalid state.

S={s1, s2, s3} is the state, that is, the modulation/demodulation scheme allocation threshold values in the wireless communication environment. The threshold values are expressed in dB. A modulation/demodulation scheme is allocated to each region into which the thresholds partition the signal-to-noise ratio range of the channel.

In the reinforcement learning model according to an embodiment of the present invention, the signal-to-noise ratio range may be set from 0 dB to 25 dB and three threshold values may be used, so that four modulation/demodulation schemes are adopted. The four schemes may be QPSK, 8PSK, 16PSK, and 32PSK. For example, QPSK may be allocated between 0 dB and s1 dB, and 8PSK between s1 dB and s2 dB.
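The allocation of schemes to signal-to-noise ratio regions can be sketched as a threshold lookup. The sketch assumes that the four schemes are assigned to the four regions in increasing modulation order and that a value exactly on a threshold belongs to the higher region; the function name `select_modulation` and the sample threshold values are hypothetical.

```python
from bisect import bisect_right

# Schemes in order of increasing bits per symbol, as listed in the description.
SCHEMES = ["QPSK", "8PSK", "16PSK", "32PSK"]


def select_modulation(snr_db, thresholds_db):
    """Return the scheme whose SNR region contains snr_db.

    thresholds_db is the state (s1, s2, s3) in dB, sorted in ascending order.
    """
    # bisect_right counts how many thresholds are <= snr_db,
    # which is the index of the region the SNR falls in.
    return SCHEMES[bisect_right(thresholds_db, snr_db)]


thresholds = [8.0, 14.0, 20.0]  # hypothetical values for s1, s2, s3
scheme = select_modulation(5.0, thresholds)  # falls in the region below s1
```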

A={a1, a2, ..., a6} is the set of actions that the reinforcement learning model can take. Taken in order, each pair of actions corresponds to one of the three threshold values. For example, if the reinforcement learning model takes action a1, the value of s1 is lowered; if it takes action a2, the value of s1 is raised. The remaining actions act on the other threshold values in the same way.
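The pairing of actions with thresholds can be sketched as follows. The step size `STEP_DB` is a hypothetical value, since the description does not state how far one action moves a threshold, and the function name `apply_action` is likewise an assumption; actions are indexed 0 to 5 for a1 to a6.

```python
STEP_DB = 0.5  # hypothetical step size in dB; not specified in the description


def apply_action(state, action):
    """Return the thresholds after taking the action with index 0..5.

    Actions come in pairs per threshold: the even index lowers it and
    the odd index raises it (a1 lowers s1, a2 raises s1, and so on).
    """
    idx, raises = divmod(action, 2)
    delta = STEP_DB if raises else -STEP_DB
    new_state = list(state)
    new_state[idx] += delta
    return tuple(new_state)
```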

R={r1, r2} is the set of rewards that the reinforcement learning model can receive.

When performance improves, a reward is given through r1; when the thresholds move into an invalid state, a penalty is given through r2. The penalty is a negative reward. To effectively prevent moves into an invalid state, the magnitude of the penalty may be set larger than that of the reward. For example, the magnitudes may be set so that r2=1.25*r1.

The state in which the thresholds are invalid may be defined by three conditions. The first condition is the case in which, in the deep Q-network model, a threshold becomes less than or equal to the threshold before it, so that a higher threshold crosses into a lower one. In other words, the first condition is met when a higher modulation/demodulation threshold value intrudes on a lower threshold value.

The second condition is the case in which a threshold falls outside the preset signal-to-noise ratio range. For example, the range may be set to 0 dB to 25 dB, in which case the second condition is met when a threshold lies outside 0 dB to 25 dB.

The third condition is the case in which neighboring threshold values are closer together than a preset spacing. For example, the third condition may be met when the spacing between neighboring threshold values is less than 3 dB.
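The three invalidity conditions and the reward rule can be sketched together. The 0 dB to 25 dB range, the 3 dB spacing, and the 1.25 ratio come from the examples in the description; the reward magnitude `R1`, the zero reward for a valid move that does not improve performance, and the function names are assumptions.

```python
SNR_MIN_DB, SNR_MAX_DB = 0.0, 25.0  # example SNR range from the description
MIN_GAP_DB = 3.0                    # example minimum spacing from the description
R1 = 1.0                            # hypothetical reward magnitude
R2 = -1.25 * R1                     # penalty: negative, 1.25 times larger in magnitude


def is_valid(state):
    """Return True if the thresholds violate none of the three conditions."""
    # Condition 2: every threshold must stay inside the SNR range.
    if any(s < SNR_MIN_DB or s > SNR_MAX_DB for s in state):
        return False
    # Conditions 1 and 3: neighbors must stay ordered and at least MIN_GAP_DB
    # apart (a crossing gives a gap <= 0, so the spacing test covers it).
    return all(hi - lo >= MIN_GAP_DB for lo, hi in zip(state, state[1:]))


def reward(new_state, old_se, new_se):
    """Penalize invalid thresholds, reward an improved spectral efficiency."""
    if not is_valid(new_state):
        return R2
    # Assumption: a valid move without improvement earns nothing.
    return R1 if new_se > old_se else 0.0
```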

The replay memory 213 stores the information s_t, a_t, r_t, and s_{t+1}. The subscript t denotes a discrete time step, and each element takes the value for that time step as described in the environment model.

The main network 214 and the target network 215 are built as artificial neural networks. An artificial neural network is a network of connected layers and is a model that learns weights and biases.

The main network 214 interacts with the environment, stores the result of each action and the next state in the replay memory 213, and draws random samples from it at every trial period specified by the system.
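A replay memory of this kind is commonly implemented as a bounded buffer with uniform random sampling. The sketch below is an assumption in that spirit: the class name `ReplayMemory`, the fixed `capacity`, and the drop-oldest policy are not stated in the description.

```python
import random
from collections import deque


class ReplayMemory:
    """Bounded store of (s_t, a_t, r_t, s_next) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random mini-batch without replacement, capped at what is stored.
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)
```

Sampling at random rather than in arrival order breaks the correlation between consecutive transitions, which is the usual reason a deep Q-network trains from a replay memory.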

The optimization model 216 defines a loss function based on the outputs of the main network 214 and the target network 215. Before the loss function is defined, the Q function is defined. The Q function is expressed as in Equation 1 (see Non-Patent Document 3).

[Equation 1]

Q^* is the optimal Q-function value, d is the discount factor, and π is the action-selection policy of the reinforcement learning model. That is, Q^* represents the optimal value of the cumulative reward that the reinforcement learning model can receive. The loss function defined from the Q function is expressed as in Equation 2.

[Equation 2]

Equation 2 uses the Q-function value of the main network, which depends on the parameters of the main network, and the Q-function value of the target network, which depends on the parameters of the target network. The Adam optimizer of Non-Patent Document 4 can be applied as the optimization algorithm.
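The equations themselves are not reproduced in this text. As a hedged sketch only, the standard deep Q-network forms found in Non-Patent Documents 2 and 3 are consistent with the definitions of Q^*, d, and π given above. The symbols θ and θ^- for the main and target network parameters, and the expectation over transitions drawn from the replay memory, are assumptions rather than notation taken from the source.

```latex
% Optimal Q function: best expected discounted cumulative reward (assumed form)
Q^{*}(s,a) = \max_{\pi}\, \mathbb{E}\!\left[\, r_t + d\, r_{t+1} + d^{2} r_{t+2} + \cdots \;\middle|\; s_t = s,\ a_t = a,\ \pi \right]

% Loss: squared gap between the target-network estimate and the main-network estimate (assumed form)
L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1})}\!\left[ \left( r_t + d \max_{a'} Q(s_{t+1}, a'; \theta^{-}) - Q(s_t, a_t; \theta) \right)^{2} \right]
```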

At every trial period specified by the system, the replay memory 213 performs random mini-batch sampling into the main network 214, and the parameters are copied from the main network 214 to the target network 215.

FIG. 4 is a flowchart illustrating the modulation/demodulation method of the reinforcement learning-based modulation/demodulation selection unit of a receiver according to embodiments of the present invention.

In step S410, the communication system initializes the parameters.

In step S420, the reinforcement learning model checks whether this is the first trial. If it is the first trial, the process proceeds to step S430, retrieves from the state-value buffer the optimal threshold values among the existing trials, and then proceeds to step S440.

If the current trial is not the first trial in step S420, the process proceeds to step S440 and repeats the boundary value adjustment procedure a specified number of times (M).

In the next step S450, the spectral efficiency and the bit error probability are calculated.

The spectral efficiency is expressed as Equation 3 (see Non-Patent Document 5).

[Equation 3]

D/B denotes the average data rate per unit of reference bandwidth, m is the number of bits per symbol, γ is the instantaneous signal-to-noise ratio, and pc(γ) is the probability density function of the signal-to-noise ratio of the satellite communication channel. The bit error probability can be calculated using Equations 4 and 5 (see Non-Patent Documents 6 and 7).

[Equation 4]

Pe denotes the bit error probability, and Pe(γ) denotes the bit error probability in an AWGN (Additive White Gaussian Noise) channel.

[Equation 5]

Equation 5 gives the bit error probability of the MPSK modulation/demodulation scheme in an AWGN channel. By substituting Equation 5 into Equation 1, the bit error probability of each modulation/demodulation region can be calculated.
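The quantities defined above can be illustrated numerically. The sketch below is not the patent's Equations 3 to 5, which are not reproduced in this text. It assumes a Rayleigh-fading SNR distribution in place of the satellite channel's pc(γ), assumes one PSK order per region with each region extending from one boundary to the next, and uses the common textbook approximation for Gray-coded MPSK bit error probability in AWGN. All function names are hypothetical.

```python
import math


def db_to_linear(x_db):
    return 10.0 ** (x_db / 10.0)


def q_function(x):
    # Gaussian tail probability Q(x).
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def mpsk_ber_awgn(snr, m_bits):
    """Approximate bit error probability of 2^m-PSK in AWGN at symbol SNR `snr` (linear)."""
    if m_bits == 1:
        return q_function(math.sqrt(2.0 * snr))
    order = 2 ** m_bits
    return (2.0 / m_bits) * q_function(math.sqrt(2.0 * snr) * math.sin(math.pi / order))


def rayleigh_snr_pdf(snr, mean_snr):
    # Assumed stand-in for pc(γ): exponential SNR distribution of Rayleigh fading.
    return math.exp(-snr / mean_snr) / mean_snr


def spectral_efficiency(boundaries_db, bits, mean_snr_db):
    """Sum of m_i * P(SNR in region i); the last region extends to infinity."""
    mean = db_to_linear(mean_snr_db)
    edges = [db_to_linear(b) for b in boundaries_db] + [math.inf]
    total = 0.0
    for m, lo, hi in zip(bits, edges[:-1], edges[1:]):
        total += m * (math.exp(-lo / mean) - math.exp(-hi / mean))
    return total


def ber_integral(lo_db, hi_db, m_bits, mean_snr_db, steps=2000):
    """Midpoint-rule integral of Pe(γ) * pc(γ) over one region."""
    lo, hi = db_to_linear(lo_db), db_to_linear(hi_db)
    mean = db_to_linear(mean_snr_db)
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        snr = lo + (k + 0.5) * width
        total += mpsk_ber_awgn(snr, m_bits) * rayleigh_snr_pdf(snr, mean) * width
    return total
```

For example, `spectral_efficiency([0, 8, 17], [1, 2, 3], 15)` evaluates three regions assigned BPSK, QPSK, and 8PSK at an assumed mean SNR of 15 dB; moving any boundary changes both this value and the per-region result of `ber_integral`.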

In step S460, the reinforcement learning model adjusts the boundary value based on the environment model. After the adjustment, in step S470 the bit error probability is calculated for the signal-to-noise ratio of the channel. In step S480, it is checked whether the calculated bit error probability exceeds a preset bit error probability reference value. In one embodiment of the present invention, the reference value is set to 10^-3. However, this is only an example and the invention is not limited thereto.

If the check in step S480 shows that the bit error probability for the channel exceeds the preset reference value, the process proceeds to step S490 and loads the optimal boundary value from the results of the previous trials, and in step S500 a penalty is given to the boundary value in question.

If the check in step S480 shows that the bit error probability for the channel does not exceed the preset reference value, the process proceeds to step S510 to check whether the spectral efficiency has improved. If step S510 determines that the spectral efficiency has improved, the process proceeds to step S530 and gives a reward to the boundary value in question; if step S510 determines that it has not improved, the process proceeds to step S520 and gives a penalty to the boundary value in question. In step S540, the boundary value, the action value, the cumulative reward, and the changed boundary value are stored in the replay memory, and the process proceeds to step S550 to store the optimal boundary value of the trial.
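The branching of steps S480 to S530 can be summarized in a short sketch. The function name `evaluate_adjustment`, the reward and penalty magnitudes of +1 and -1, and the comparison against the previous spectral efficiency are assumptions made for illustration; the text specifies only the direction of each decision and the example reference value of 10^-3.

```python
def evaluate_adjustment(ber, spectral_eff, previous_spectral_eff,
                        ber_limit=1e-3, reward=1.0, penalty=-1.0):
    """Return (reward_value, reload_best) for one boundary adjustment.

    reload_best is True when the stored optimal boundary value should be
    loaded again (steps S490 and S500).
    """
    if ber > ber_limit:
        # S480 -> S490/S500: constraint violated, reload and penalize.
        return penalty, True
    if spectral_eff > previous_spectral_eff:
        # S510 -> S530: spectral efficiency improved, give a reward.
        return reward, False
    # S510 -> S520: no improvement, give a penalty.
    return penalty, False
```

Keeping the constraint check ahead of the efficiency comparison mirrors the order of the flowchart: a boundary value that violates the bit error criterion is penalized regardless of its spectral efficiency.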

In the next step S560, it is checked whether the trial count has reached a multiple of C; if so, the process proceeds to step S570 to perform optimization and network parameter adjustment. The reinforcement learning model may set C to 10, so that optimization and network parameter adjustment are performed every 10 trials. The optimization and network parameter adjustment (S570) is carried out by the optimization model, and the adjusted parameters of the main network are copied to the target network.

If C trials have not been performed in step S560, the process proceeds to step S580.

In step S580, it is checked whether the trial has ended. If adjustments remain, the trial has not ended, so the process returns to step S440, checks the number of adjustments, and continues (S590). If the trial has ended in step S580, the process returns to step S420.

FIG. 5 is a graph showing simulation results of the reinforcement-learning-based modulation system of the receiver according to embodiments of the present invention when the optimal boundary value is not loaded.

Referring to FIG. 5, the maximum spectral efficiency value in each trial is shown. The maximum spectral efficiency per trial (episode) improves by 0.8649 bps/Hz for initial states selected at random from the following ranges: 0 to 8 dB, 8 to 17 dB, and 17 to 25 dB.

FIG. 6 is a graph showing simulation results of the reinforcement-learning-based modulation system according to embodiments of the present invention when the optimal boundary value is not loaded, and FIG. 7 is a graph showing simulation results of the same system when the optimal boundary value is loaded.

Referring to FIG. 6, when the optimal boundary value of the existing trials is not loaded at the start of a trial or when the bit error probability criterion is exceeded, performance does not improve stably and the amount of improvement is also small.

Referring to FIG. 7, the reinforcement learning model complies with the preset bit error probability criterion. In addition, a comparison of FIG. 7 with FIG. 6 confirms that performance improves stably.

Although the components of the transmitter and the receiver are shown separately in FIG. 1, a plurality of components may be combined with one another and implemented as at least one module. The components are connected to a communication path that links the software or hardware modules inside the device, and they operate in close cooperation with one another. These components communicate using one or more communication buses or signal lines.

The transmitter and the receiver may be implemented in a logic circuit by hardware, firmware, software, or a combination thereof, and may also be implemented using a general-purpose or special-purpose computer. The device may be implemented using a hardwired device, a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), or the like. The device may also be implemented as a System on Chip (SoC) including one or more processors and controllers.

The transmitter and the receiver may be mounted, in the form of software, hardware, or a combination thereof, on a computing device provided with hardware elements. A computing device may refer to any of various devices that include all or some of the following: a communication device such as a communication modem for communicating with various devices or communication networks, a memory that stores data for executing a program, and a microprocessor that executes the program to perform computations and issue commands.

Although FIGS. 2 and 4 describe the processes as being executed sequentially, this is merely illustrative. Those skilled in the art may apply various modifications and variations, such as changing the order described in FIGS. 2 and 4, executing one or more processes in parallel, or adding other processes, without departing from the essential characteristics of the embodiments of the present invention.

The operations according to an embodiment of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. A computer-readable medium refers to any medium that participates in providing instructions to a processor for execution. The computer-readable medium may include program instructions, data files, data structures, or a combination thereof; examples include magnetic media, optical recording media, and memory. The computer program may also be distributed over networked computer systems so that computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing this embodiment can be readily inferred by programmers skilled in the art to which this embodiment belongs.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments described in the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of the present invention.

100: transmitter
200: receiver
210: receiving unit
220: channel detection unit
230: modulation/demodulation selection unit
240: feedback unit
250: demodulation unit

Claims

A receiver to which a reinforcement-learning-based adaptive modulation/demodulation scheme is applied, comprising:
a receiving unit that receives a pilot signal from a transmitter and estimates the state of a channel;
a modulation/demodulation selection unit that selects a modulation/demodulation scheme according to the state of the channel based on reinforcement learning; and
a feedback unit that transmits information on the selected modulation/demodulation scheme to the transmitter,
wherein, when the signal-to-noise ratio of the channel is partitioned according to the number of modulation/demodulation schemes, the modulation/demodulation selection unit uses the signal-to-noise ratio that forms the boundary between the channels as a boundary value, and performs reinforcement learning by adjusting the boundary value and giving a reward in the direction that maximizes the spectral efficiency value, and
wherein the reinforcement learning is learning by a deep Q-network algorithm in which the state is set to the boundary value and the action is set to an adjustment that raises or lowers the boundary value.

delete

The receiver of claim 1, wherein the boundary value is a reference value for dividing the signal-to-noise ratio probability density distribution function of the channel into a plurality of modulation/demodulation scheme allocation regions.

The receiver of claim 1,
wherein the modulation/demodulation selection unit includes a state-value buffer that stores an optimal spectral efficiency value for a boundary value, and
wherein the modulation/demodulation selection unit, at the start of each trial, loads the boundary value corresponding to the optimal spectral efficiency value from the state-value buffer and starts the trial.

The receiver of claim 1,
wherein the modulation/demodulation selection unit gives a penalty when the boundary value is in a state that is not valid.

The receiver of claim 5,
wherein the state in which the boundary value is not valid is at least one of: a state in which a next state is less than or equal to a previous state so that a higher boundary violates a lower boundary; a state in which the boundary value exceeds a preset signal-to-noise ratio range; and a state in which the interval between neighboring boundary values is narrower than a preset interval.

The receiver of claim 1,
wherein the modulation/demodulation selection unit includes a state-value buffer that stores an optimal spectral efficiency value for a boundary value, and
wherein the modulation/demodulation selection unit calculates a bit error probability for the signal-to-noise ratio of the channel and, when the bit error probability exceeds a preset bit error probability reference value, loads the boundary value corresponding to the optimal spectral efficiency value from the state-value buffer and starts a trial.

A reinforcement-learning-based adaptive modulation/demodulation method performed by a receiver, the method comprising:
receiving a pilot signal from a transmitter;
estimating the state of a channel from the received pilot signal;
selecting a modulation/demodulation scheme according to the state of the channel based on reinforcement learning; and
transmitting information on the selected modulation/demodulation scheme to the transmitter,
wherein the reinforcement learning is learning by a deep Q-network algorithm in which the state is set to a boundary value, the action is set to an adjustment that raises or lowers the boundary value, and a reward is given in the direction that maximizes the spectral efficiency value by adjusting the boundary value, and
wherein the boundary value is the signal-to-noise ratio that forms the boundary between the channels when the signal-to-noise ratio of the channel is partitioned according to the number of modulation/demodulation schemes.

delete

The method of claim 8,
wherein the selecting of the modulation/demodulation scheme comprises,
at the start of each trial, loading a previously stored boundary value corresponding to the optimal spectral efficiency value and starting the trial.

The method of claim 8,
wherein the selecting of the modulation/demodulation scheme comprises
calculating a bit error probability for the signal-to-noise ratio of the channel and, when the bit error probability exceeds a preset bit error probability reference value, loading a boundary value corresponding to the optimal spectral efficiency value previously stored in a buffer and starting a trial.

A reinforcement-learning-based adaptive modulation/demodulation system, comprising:
a transmitter that transmits a pilot signal; and
a receiver that receives the pilot signal, estimates the state of a channel, selects a modulation/demodulation scheme according to the state of the channel based on reinforcement learning, and transmits information on the selected modulation/demodulation scheme to the transmitter,
wherein, when the signal-to-noise ratio of the channel is partitioned according to the number of modulation/demodulation schemes, the receiver uses the signal-to-noise ratio that forms the boundary between the channels as a boundary value, and performs reinforcement learning by adjusting the boundary value and giving a reward in the direction that maximizes the spectral efficiency value, and
wherein the reinforcement learning is learning by a deep Q-network algorithm in which the state is set to the boundary value and the action is set to an adjustment that raises or lowers the boundary value.

The system of claim 12,
wherein the transmitter receives the information on the modulation/demodulation scheme, modulates data according to the modulation/demodulation scheme, and transmits the modulated data.

The system of claim 13, wherein the receiver receives the data and demodulates the received data according to the modulation and demodulation method.

delete