Disclosure of Invention
The invention aims to overcome the defects of the prior art and to address the practical difficulties of mobile communication network planning, such as complex planning conditions, urgent timelines, uncertain locations and limited equipment, by providing an intelligent planning method for a mobile communication network based on deep reinforcement learning.
In order to achieve the purpose, the invention adopts the following technical scheme:
An intelligent planning method for a mobile communication network based on deep reinforcement learning comprises the following steps:
S1, preprocessing resource elements: abstracting and mapping the erection region, the guarantee nodes and the guaranteed users of the mobile communication network, and establishing a simulation model of the resource elements of the mobile communication network;
S1.1, preprocessing the erection region of the mobile communication network;
S1.2, preprocessing the guarantee nodes of the mobile communication network;
S1.3, preprocessing the guaranteed users of the mobile communication network.
S2, planning rule preprocessing: abstracting and mapping the guarantee relationship and the planning state of the mobile communication network, fusing the resource element simulation model of step S1, and establishing an overall simulation model of the mobile communication network planning;
S2.1, preprocessing the connection relations of the mobile communication network;
S2.2, preprocessing the planning state of the mobile communication network.
S3, generating training samples: establishing a network planning simulation according to the overall simulation model of step S2, running the simulation with a Monte Carlo tree search method based on the upper confidence bound algorithm (UCT), and generating training samples that form a training sample set for deep reinforcement learning;
S3.1, establishing the network planning simulation according to the overall simulation model of step S2, with the positions of the guaranteed users generated randomly during initial training;
S3.2, performing simulated deployment with the search algorithm for the generated guaranteed user positions;
S3.3, repeating the simulated deployment with the search method to obtain samples and an evaluation set that meet the conditions.
S4, model training: based on a deep reinforcement learning algorithm using a recurrent neural network, training the overall simulation model of step S2 with the training samples of step S3, comparing and screening the training results of each round, feeding the obtained planning space strategy and real-time planning satisfaction back to step S3, and optimizing the search results of the Monte Carlo tree search algorithm based on the upper confidence bound algorithm (UCT) to obtain optimized training samples;
S4.1, initializing and using three categories of elements to describe the planning situation;
S4.2, the recurrent neural network adopts a shared fully convolutional network to construct the filters, with its tail divided into two branches, planning strategy and planning satisfaction;
S4.3, feeding the results of step S4.2 back to step S3.2 and refining the search process;
S4.4, defining the local strategy evaluation;
S4.5, combining the output of the recurrent neural network, updating all search processes to search for the deployment action with the maximum value;
S4.6, according to the procedure of step S4.5, executing the search process for each situation with consideration of the time used and effective results, and determining a new site-selection strategy.
S5, model generation: inputting the obtained optimized training samples into the training network of step S4, constructing a joint loss function according to the training targets, guiding the sample search and training with the joint loss function, and generating the mobile communication network planning model;
S5.1, constructing the joint loss function according to the training targets;
S5.2, comparing the model after training with the model before training, and judging the result according to the simulation model rules;
S5.3, training based on steps S4.1 and S4.2 to obtain the mobile communication network planning model.
The intelligent planning method for the mobile communication network based on deep reinforcement learning according to the invention has the following advantages:
1. By adopting the Monte Carlo tree search method based on the upper confidence bound algorithm (UCT) together with a recurrent neural network that is simple in structure yet practical and effective, the hardware computing power requirement and the processing time are greatly reduced, and the network planning problem of a mobile network can be solved quickly;
2. By training the intelligent planning model with a deep reinforcement learning algorithm, the planning model overcomes the defect of being applicable to only a single scenario and can adapt to scenarios with different regions, different guarantee equipment and different guaranteed users.
Detailed Description
The following further describes embodiments of the present invention with reference to the drawings. It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Referring to the attached fig. 1, a schematic flow chart of an embodiment of the intelligent planning method for the mobile communication network based on deep reinforcement learning of the present invention is shown, which specifically includes the following steps:
S1, preprocessing resource elements: abstracting and mapping the erection region, the guarantee nodes and the guaranteed users of the mobile communication network, and establishing a simulation model of the resource elements of the mobile communication network;
S1.1, preprocessing the erection region of the mobile communication network, abstracting the erection region by analogy to a chessboard. The region size is set to N² km²; the lower-left corner coordinate of the topographic map of the erection region is taken as the zero-point coordinate, a certain divisor of N is taken as the unit length, and the erection region is divided transversely and longitudinally, with each intersection point taken as a positioning point, to obtain a node position matrix. In this patent the erection region is preset as a square region of equal length and width, giving an N×N node position matrix, which can be continuously expanded and subdivided in multiples;
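To make the grid abstraction concrete, the following Python sketch (illustrative only; the patent gives no code, and the helper name, the uniform unit length and the example sizes are assumptions) builds such a node position matrix for a square erection region:

```python
import numpy as np

def build_node_position_matrix(side_km: float, unit_km: float):
    """Discretize a square erection region into a grid of positioning points.

    side_km: side length of the square region (area side_km**2 km^2).
    unit_km: unit length used to divide the region transversely and longitudinally
             (assumed to divide side_km evenly, as in the text).
    Returns an (n, n, 2) array of (x, y) coordinates with the lower-left corner at (0, 0).
    """
    n = int(side_km / unit_km) + 1          # intersections per side
    offsets = np.arange(n) * unit_km        # distances from the zero-point coordinate
    grid_x, grid_y = np.meshgrid(offsets, offsets, indexing="ij")
    return np.stack([grid_x, grid_y], axis=-1)

# Example: a 100 km x 100 km region divided with a 5 km unit length.
positions = build_node_position_matrix(side_km=100.0, unit_km=5.0)
print(positions.shape)  # (21, 21, 2)
```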
S1.2, preprocessing the communication platforms/devices of the mobile communication network, namely the guarantee nodes (such as mobile communication vehicles, mobile radio stations, mobile stations and the like), presetting P types of guarantee nodes, where the communication distance R and the number of links L of each guarantee node are determined by the specific equipment model. In this patent the guarantee nodes are mainly divided into two categories, the primary node P1 and the secondary node P2, modeled in order of guarantee priority B: the priority of the primary node is set to B1 and that of the secondary node to B2. The communication guarantee range of the primary node is a circle centered on the node deployment position with the single-hop microwave communication distance R1 km as its radius, and its number of links is set to L1; the communication guarantee range of the secondary node consists of circles centered on the node deployment position with the single-hop microwave communication distance R1 km and the single-hop shortwave communication distance R2 km as radii, with the number of microwave links set to L2 and the number of shortwave links set to L'2;
S1.3, preprocessing the guaranteed users of the mobile communication network (such as military units at different echelons and individual soldiers), presetting Q types of guaranteed users, where the communication distance R and the number of links L of each guaranteed user node are determined by the specific equipment model. In this patent the guaranteed users are mainly divided into three categories, the primary user Q1, the secondary user Q2 and the subordinate user Q3, modeled in order of guarantee priority A: the priority of the primary user is set to A1, that of the secondary user to A2 and that of the subordinate user to A3. The single-hop microwave communication distance of the primary user is R1 km with U1 links; the single-hop microwave communication distance of the secondary user is R1 km with U2 links; and the single-hop shortwave communication distance of the subordinate user is R2 km with U'3 links.
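As a hedged illustration of how the preprocessed resource elements of S1.2 and S1.3 might be represented in software (the class names, field names and example parameter values below are assumptions, not values from the patent):

```python
from dataclasses import dataclass

@dataclass
class GuaranteeNode:
    """A communication platform (e.g. mobile communication vehicle or radio station)."""
    kind: str                    # "P1" (primary) or "P2" (secondary)
    priority: int                # guarantee priority B1 or B2
    microwave_range_km: float    # single-hop microwave distance R1
    shortwave_range_km: float    # single-hop shortwave distance R2 (0 if unsupported)
    microwave_links: int         # number of microwave links L
    shortwave_links: int         # number of shortwave links L'

@dataclass
class GuaranteedUser:
    """A guaranteed user (e.g. a unit of a given echelon)."""
    kind: str                    # "Q1", "Q2" or "Q3"
    priority: int                # guarantee priority A1, A2 or A3
    comm_range_km: float         # single-hop communication distance R
    links: int                   # number of links U

# Illustrative parameter values only; real values depend on the specific equipment models.
primary_node = GuaranteeNode("P1", priority=1, microwave_range_km=30.0,
                             shortwave_range_km=0.0, microwave_links=8, shortwave_links=0)
secondary_node = GuaranteeNode("P2", priority=2, microwave_range_km=30.0,
                               shortwave_range_km=100.0, microwave_links=4, shortwave_links=2)
primary_user = GuaranteedUser("Q1", priority=1, comm_range_km=30.0, links=2)
```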
In this step the resource elements of the mobile communication network are abstracted and mapped, providing support for the subsequent rule definition and overall modeling of the mobile communication network.
S2, planning rule preprocessing: the planning rules of the mobile communication network are preprocessed. The guarantee relationship and the planning state of the mobile communication network are abstracted and mapped and fused with the resource element simulation model of step S1 to establish an overall simulation model of the mobile communication network planning;
S2.1, preprocessing the connection relations of the mobile communication network, associating guarantee nodes with other guarantee nodes and with guaranteed users;
S2.1.1, the guarantee nodes are associated with the guaranteed users according to the priority association A → B, which determines the guarantee relationship. In this patent A1 corresponds to B1, and A2 and A3 correspond to B2; that is, the primary node P1 guarantees the primary user Q1, and the secondary node P2 guarantees the secondary user Q2 and the subordinate user Q3. Each user needs at least one corresponding guarantee node connected to it;
S2.1.2, the connection relationship between guarantee nodes is determined. In this patent all primary nodes must form a connected graph, and each secondary node P2 must be connected to at least one primary node P1;
S2.1.3, all connections must satisfy the communication types specified in step S1, i.e. only links of the same communication type can be connected;
S2.1.4, all connections must satisfy the numbers of links specified in step S1, i.e. the number of connections of a node cannot exceed its specified number of links L;
S2.1.5, all connections must satisfy the communication distances specified in step S1, i.e. two nodes can be connected only if the distance between them is less than the maximum communication distance R of the communication devices used;
S2.1.6, the minimum requirement on the topology of the whole mobile communication network is that it forms a minimum spanning tree;
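The connection rules S2.1.1–S2.1.6 can be checked mechanically. The following Python sketch is a minimal, assumed representation (node dictionaries with position, communication types, range and link limit are illustrative) of the distance, type, link-count and primary-node connectivity checks:

```python
import math
from collections import defaultdict, deque

def can_connect(a, b):
    """Pairwise rules: a shared communication type (S2.1.3) and both within range (S2.1.5)."""
    if not set(a["types"]) & set(b["types"]):
        return False
    dist = math.dist(a["pos"], b["pos"])
    return dist < min(a["range_km"], b["range_km"])

def validate_plan(nodes, links):
    """nodes: {name: {...}}, links: list of (name_a, name_b). Returns True if the plan is legal."""
    degree = defaultdict(int)
    for a, b in links:
        if not can_connect(nodes[a], nodes[b]):
            return False
        degree[a] += 1
        degree[b] += 1
    # S2.1.4: the number of connections of a node cannot exceed its link limit L.
    if any(degree[n] > nodes[n]["max_links"] for n in nodes):
        return False
    # S2.1.2: all primary nodes must lie in one connected component (breadth-first search).
    adj = defaultdict(set)
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    primaries = [n for n, v in nodes.items() if v.get("primary")]
    if primaries:
        seen, queue = {primaries[0]}, deque([primaries[0]])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        if not all(p in seen for p in primaries):
            return False
    return True
```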
S2.2, preprocessing the planning state of the mobile communication network: a network situation s is established according to the guarantee nodes, the guaranteed users, the erection region and the network planning rules. The network situation s contains all the information of the mobile communication network, i.e. s = (P, Q, A, B, R, L, …), and its main plane describes the planning position of each node: a planned position is occupied by the corresponding node symbol, and an unplanned position is marked as 0.
S2.2.1, the initial situation of the network situation s is marked as s_0; it mainly describes the planning positions of all guaranteed user nodes, i.e. the positions of the guaranteed personnel in the erection region model are determined directly from their actual task requirements, and the positions of the guaranteed user nodes are represented in the matrix by their corresponding symbols from the guaranteed user set.
S2.2.2, the planning of the subsequent guarantee nodes is regarded as a typical Markov process; that is, the deployment of each guarantee node can be regarded as an action response a_i to the current network situation s_(i-1) (where i ∈ [1, K], and K is the total number of guarantee nodes, in this patent the sum of the primary and secondary nodes), which determines the site of one guarantee node;
S2.2.3, when all guarantee nodes have been sited and the plan meets the requirements, or when all guarantee nodes have been deployed, the situation is marked as the final state, and the network situation at the end of the planning is obtained.
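A minimal sketch of the situation encoding and one Markov deployment step, assuming an object-typed matrix with node symbols as entries (the function names and the symbol strings are illustrative only):

```python
import numpy as np

def initial_situation(n, user_positions):
    """s_0: an n x n situation plane; guaranteed-user cells carry their symbol, others are 0."""
    s = np.zeros((n, n), dtype=object)
    for (row, col), symbol in user_positions.items():
        s[row, col] = symbol          # e.g. "Q1", "Q2", "Q3"
    return s

def apply_deployment(situation, action, symbol):
    """One Markov step: deploy guarantee node `symbol` at grid cell `action` = (row, col)."""
    row, col = action
    assert situation[row, col] == 0, "position already occupied"
    nxt = situation.copy()
    nxt[row, col] = symbol            # e.g. "P1" or "P2"
    return nxt

s0 = initial_situation(5, {(1, 1): "Q1", (3, 4): "Q2"})
s1 = apply_deployment(s0, (2, 2), "P1")
```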
In this step, on the basis of step S1, the planning rules of the mobile communication network are abstracted and mapped and the overall mobile communication network simulation model is established, providing support for the subsequent deep reinforcement learning planning strategy.
S3, generating training samples: a network planning simulation is established according to the overall simulation model of step S2, the simulation is run with a Monte Carlo tree search method based on the upper confidence bound algorithm (UCT), and training samples are generated to form a training sample set for deep reinforcement learning;
S3.1, the network planning simulation is established according to the overall simulation model of step S2, and the positions of the guaranteed users are generated randomly during initial training;
S3.2, for the generated guaranteed user positions, simulated deployment is performed with the Monte Carlo tree search algorithm based on the upper confidence bound algorithm (UCT);
S3.2.1, the simulated deployment is initialized from an initial situation s_0, which is the root node of the search tree; for each action (s, a) of the search tree under a given situation, E(s, a) is initialized as the comprehensive action evaluation of each possible position of the guarantee node under that situation.
S3.2.2, before the neural network is introduced, the initial E(s, a) scores under all situations are equal and are set to r_0. The search proceeds by random traversal until all guarantee nodes have been deployed, i.e. the final state is reached; the result is then judged according to steps S1 and S2, and the action evaluation r of the deployment action a_i taken in each corresponding situation s_(i-1) is computed according to whether the final result meets the conditions: if it does, r = r_0 + r'; if not, r = r_0 - r'. The evaluations are normalized to obtain the evaluation set.
S3.3, the simulated deployment is repeated with the search method to obtain samples and an evaluation set that meet the conditions.
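As a hedged sketch of the sample-generation loop of S3.2–S3.3 before the neural network exists, the snippet below accumulates (situation, action) pairs with the evaluation r = r_0 ± r' of S3.2.2; all callables passed in (legal_actions, apply_action, is_valid_final_state) are placeholders for the simulation model of step S2, and the function and argument names are assumptions:

```python
import random

def generate_samples(initial_situation, legal_actions, apply_action, is_valid_final_state,
                     num_nodes, episodes, r0=0.0, r_delta=1.0):
    """Random-traversal simulated deployment (the pre-network stage of S3.2).

    legal_actions(s) -> list of candidate positions; apply_action(s, a) -> next situation;
    is_valid_final_state(s) -> True if the finished plan satisfies the rules of S1/S2.
    Returns a list of ((situation, action), evaluation r) pairs.
    """
    samples = []
    for _ in range(episodes):
        s = initial_situation
        trajectory = []
        for _ in range(num_nodes):                # deploy every guarantee node
            a = random.choice(legal_actions(s))   # random traversal before the network exists
            trajectory.append((s, a))
            s = apply_action(s, a)
        r = r0 + r_delta if is_valid_final_state(s) else r0 - r_delta
        samples.extend(((si, ai), r) for si, ai in trajectory)
    return samples
```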
S4, model training: based on a recurrent neural network, the overall simulation model of step S2 is trained with the training samples of step S3; the training results of each round are compared and screened, the obtained planning space strategy and real-time planning satisfaction are fed back to step S3, and the search results of the Monte Carlo tree search algorithm based on the upper confidence bound algorithm (UCT) are optimized to obtain optimized training samples;
S4.1, the planning situation is initialized and described with 6 planes in three categories, namely three planes for the guaranteed users Q, two planes for the guarantee nodes P and one plane for the erection region;
S4.2, the recurrent neural network first adopts 4 shared fully convolutional layers, constructing 32, 64, 128 and 256 filters of size 3 × 3 respectively, each followed by a ReLU function; the tail is divided into two branches, planning strategy and planning satisfaction. The strategy branch uses 4 dimension-reduction filters of size 1 × 1 and one fully connected layer with a softmax function to output the selection probability P of each node in the planning space; the satisfaction branch uses 2 dimension-reduction filters of size 1 × 1 and a fully connected layer with a tanh function to output a satisfaction score C in the range [0, 1], namely:
f_θ(s) = (P, C)
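The branch structure of S4.2 could be realized, for example, with the following PyTorch sketch; the padding, the fully connected layer sizes, the input grid size and everything beyond what the text states about the 6-plane input of S4.1 are assumptions, so this is an illustrative approximation rather than the patented implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlanningNet(nn.Module):
    """Shared fully convolutional trunk with a planning-strategy head and a satisfaction head."""

    def __init__(self, board_size: int, in_planes: int = 6):
        super().__init__()
        self.board_size = board_size
        # 4 shared 3x3 convolution layers with 32, 64, 128, 256 filters (ReLU after each).
        self.conv1 = nn.Conv2d(in_planes, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        # Strategy branch: 4 1x1 dimension-reduction filters + fully connected layer + softmax.
        self.policy_conv = nn.Conv2d(256, 4, kernel_size=1)
        self.policy_fc = nn.Linear(4 * board_size * board_size, board_size * board_size)
        # Satisfaction branch: 2 1x1 filters + fully connected layer + tanh-activated score.
        self.value_conv = nn.Conv2d(256, 2, kernel_size=1)
        self.value_fc = nn.Linear(2 * board_size * board_size, 1)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))
        p = F.relu(self.policy_conv(x)).flatten(1)
        p = F.softmax(self.policy_fc(p), dim=1)   # selection probability P over the planning space
        v = F.relu(self.value_conv(x)).flatten(1)
        c = torch.tanh(self.value_fc(v))          # satisfaction score C (rescaling to [0, 1] would be needed to match the text)
        return p, c

# f_theta(s) = (P, C) for one 6-plane situation on an assumed 21 x 21 grid.
net = PlanningNet(board_size=21)
P, C = net(torch.zeros(1, 6, 21, 21))
```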
S4.3, the planning strategy probability P and the satisfaction score C obtained in S4.2 are returned to S3.2 to refine the expansion process of the UCT tree search, and each action record is updated to (s, a) = (E(s, a), N(s, a), E_v(s, a), P(s, a));
S4.3.1, N(s, a) is the number of visits to the next node (child node) selected from the current situation;
S4.3.2, E_v(s, a) is the average action evaluation of the child node, which is updated in combination with the output of the neural network.
S4.4, the local strategy evaluation E_l(s, a) is defined: E_l(s, a) equals the parallel UCT search breadth constant U_puct (initialized to 3), multiplied by the strategy probability P(s, a) output by the recurrent neural network and by the square root of the parent node visit count N(s, b), divided by 1 plus the visit count N(s, a) of the child node, i.e.:
E_l(s, a) = U_puct · P(s, a) · √(Σ_b N(s, b)) / (1 + N(s, a))
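Stated as code, the local strategy evaluation of S4.4 (as reconstructed above) could look like this minimal sketch; the function and argument names are illustrative:

```python
import math

def local_strategy_evaluation(prior_p, parent_visits, child_visits, u_puct=3.0):
    """E_l(s, a) = U_puct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a)).

    prior_p: strategy probability P(s, a) from the neural network.
    parent_visits: total visit count of the parent node, sum over b of N(s, b).
    child_visits: visit count N(s, a) of the candidate child node.
    u_puct: parallel search breadth constant, initialized to 3 in the text.
    """
    return u_puct * prior_p * math.sqrt(parent_visits) / (1 + child_visits)

# During selection (step S4.5) the action maximizing E_v(s, a) + E_l(s, a) is taken.
```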
S4.5, combining the output of the recurrent neural network, all UCT search tree processes are updated to search, under a situation s_(i-1), for the deployment action a_i that maximizes E_v(s, a) + E_l(s, a). After a certain number of cycles of alternating search tree and neural network training, the search process of one UCT search tree is as follows:
S4.5.1, for the initial situation s_0 of the current guaranteed users, the deployment action a_1 with the maximum current E_v(s_0, a_1) + E_l(s_0, a_1) is selected and deployed;
S4.5.2, S4.5.1 is repeated until a situation s_i is reached for which no E_v + E_l values have been evaluated and no action can be selected; at this point the current situation s_i is fed into the neural network f_θ(s) for evaluation, yielding f_θ(s_i) = (P_i, C_i);
S4.5.3, the visit count of the current node is updated: N(s_i, a_(i+1)) = N(s_i, a_(i+1)) + 1;
S4.5.4, P_i is used to take the next deployment action a_(i+1), and S4.5.2 and S4.5.3 are repeated until the final state is reached;
S4.5.5, the search result of the whole tree is returned, the visit counts of each traversed node are updated according to S4.5.3, and the satisfaction scores of all child nodes are updated backwards from the leaf node, the satisfaction score being 0 if the conditions are not met and 1 if they are met;
S4.5.6, the average action evaluation of each node is calculated as in S4.3.2.
S4.6, following the whole flow of S4.5, for each situation s_i, and taking both the time used and effective results into consideration, the search tree process is run 800 times; finally a new site-selection strategy M, parameterized by a search constant τ, is determined from the collected actual action set {a_n} of the search tree. Here τ controls the randomness of the site selection: the larger τ is, the stronger the randomness. Because successive site selections are correlated to a certain extent, τ is set to decrease continuously as the site selection proceeds, finally stabilizing at 0.4.
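The text does not give the explicit form of M, only the role of τ; the sketch below therefore assumes the conventional visit-count exponentiation M(a) ∝ N(s, a)^(1/τ), which matches the described behaviour (larger τ gives more randomness), together with a simple annealing of τ down to 0.4. The names and the annealing schedule are assumptions:

```python
import numpy as np

def site_selection_strategy(visit_counts, tau):
    """Turn root visit counts N(s, a) from the repeated UCT searches into a selection distribution M.

    Assumed standard form: M(a) proportional to N(s, a)**(1 / tau); larger tau -> more randomness.
    """
    counts = np.asarray(visit_counts, dtype=float)
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()

def anneal_tau(step, total_steps, tau_start=1.0, tau_end=0.4):
    """Decrease tau continuously as site selection proceeds, stabilizing at 0.4 (per the text)."""
    frac = min(step / max(total_steps - 1, 1), 1.0)
    return tau_start + frac * (tau_end - tau_start)

probs = site_selection_strategy([120, 400, 280], tau=anneal_tau(step=0, total_steps=10))
```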
S5, model generation: the obtained optimized training samples are input into the training network of step S4, a joint loss function is constructed according to the training targets, the sample search and training are guided by the joint loss function, and the mobile communication network planning model is generated;
S5.1, a joint loss function Loss is constructed according to the training targets: minimize the error between the satisfaction C predicted by the neural network and the planning satisfaction C' found by the upper confidence bound search, make the strategy probability P output by the neural network as close as possible to the branch probability π obtained by the UCT tree search, and add a regularization term g||θ|| to prevent overfitting, giving the joint loss function Loss:
Loss = (C' - C)² - π^T·log P + g||θ||
where g||θ|| is the L2 norm term over the neural network variables;
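A hedged PyTorch sketch of this joint loss (the small epsilon inside the logarithm and the value of the regularization coefficient g are assumptions; the text writes g||θ||, implemented here as a squared-L2 weight penalty, as is conventional):

```python
import torch

def joint_loss(pred_c, search_c, pred_p, search_pi, parameters, g=1e-4):
    """Loss = (C' - C)^2 - pi^T log P + g * ||theta||^2 (L2 regularization term).

    pred_c: satisfaction C predicted by the network; search_c: satisfaction C' from the UCT search.
    pred_p: strategy probabilities P from the network; search_pi: branch probabilities pi from the search.
    parameters: iterable of network parameters theta; g: regularization coefficient (assumed value).
    """
    value_term = (search_c - pred_c).pow(2).mean()
    policy_term = -(search_pi * torch.log(pred_p + 1e-8)).sum(dim=1).mean()
    reg_term = g * sum(p.pow(2).sum() for p in parameters)
    return value_term + policy_term + reg_term
```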
S5.2, after every 50 training batches the obtained model is compared with the previous model, and the result is judged according to the simulation model rules: the model whose plan conforms to the guarantee rules wins; if neither conforms, the previous model parameters are kept; if both conform, the judgment is made according to the number of guarantee nodes used, and the model using fewer nodes is kept;
S5.3, continuing the training based on steps S4.1 and S4.2 to obtain the network planning model of the mobile communication network.
Referring to fig. 2, a block diagram of the structure of the present invention is shown, which specifically includes:
resource element preprocessing module 100: abstracting and mapping an erection region, a guarantee node and a guaranteed user of the mobile communication network, and establishing a simulation model of resource elements of the mobile communication network, which specifically comprises the following steps:
erection region preprocessing unit 101: preprocessing the erection region of the mobile communication network;
the safeguard node preprocessing unit 102: preprocessing guarantee nodes of a mobile communication network;
secured user preprocessing unit 103: preprocessing guaranteed users of the mobile communication network;
planning rule preprocessing module 200: abstracting and mapping the guarantee relationship and the planning state of the mobile communication network, fusing a resource element simulation model of the resource element preprocessing module 100, and establishing an overall simulation model of the mobile communication network planning, which specifically comprises the following steps:
the connection relationship preprocessing unit 201: preprocessing the connection relation of the mobile communication network;
the plan state preprocessing unit 202: preprocessing the planning state of the mobile communication network;
the training sample generation module 300: establishing network planning simulation according to the overall simulation model of the planning rule preprocessing module 200, and adopting a search method to run simulation, generating training samples and forming a training sample set for deep reinforcement learning, wherein the method specifically comprises the following steps:
network planning simulation setup unit 301: establishing network planning simulation according to an overall simulation model of the planning rule preprocessing module 200, and randomly generating the position of a guaranteed user during initial training;
the simulation deployment unit 302: performing simulated deployment by using a search algorithm according to the generated guaranteed user position;
the sample and evaluation set generation unit 303: repeatedly simulating deployment by using a search method to obtain a sample and an evaluation set which meet conditions;
model training module 400: based on the recurrent neural network, the training samples of the training sample generation module 300 are used to train the overall simulation model of the planning rule preprocessing module 200; the training results of each round are compared and screened, the obtained planning space strategy and real-time planning satisfaction are fed back to the training sample generation module 300, and the search results of the search algorithm are optimized to obtain optimized training samples, which specifically comprises:
planning situation initialization unit 401: initializing and using three major categories of elements to describe the planning situation;
filter construction unit 402: the recurrent neural network adopts a shared fully convolutional network to construct the filters, with the tail of the network divided into two branches, planning strategy and planning satisfaction;
search process refinement unit 403: feeding back the results of the filter construction unit 402 to the simulation deployment unit 302, and refining the search process;
local policy evaluation definition unit 404: defining local strategy evaluation;
search process update unit 405: combining the output of the recurrent neural network and updating all search processes to search for the deployment action with the maximum value;
new site-selection strategy determination unit 406: according to the flow of the search process update unit 405, executing the search process for each situation with consideration of the time used and effective results, and determining a new site-selection strategy;
the model generation module 500: inputting the obtained optimized training sample into a training network of the model training module 400, constructing a joint loss function according to a training target, searching and training the sample according to joint loss function instructions, and generating a mobile communication network planning model, which specifically comprises the following steps:
joint loss function construction unit 501: constructing a joint loss function according to the training target;
the result evaluation unit 502: comparing the model after training with the model before training, and judging the result according to the simulation model rule;
the model generation unit 503: training based on a planning situation initialization unit 401 and a filter construction unit 402 to obtain a mobile communication network planning model;
the network planning module 600: applying the trained network planning model, inputting the parameters of the erection region, the guarantee nodes and the guaranteed users to obtain the planning parameters of the mobile communication network, which specifically comprises:
network planning element input section 601: inputting parameters of an erection region, a guarantee node and a guaranteed user;
model operation section 602: calling the trained network planning model for operation;
network planning parameter generation unit 603: the model generates network planning parameters.