CN114828045A - Network optimization method and device, electronic equipment and computer readable storage medium


Info

Publication number
CN114828045A
Authority
CN
China
Prior art keywords: network, cell, neural network, optimization, target
Prior art date
Legal status: Pending
Application number
CN202210383749.3A
Other languages
Chinese (zh)
Inventor
李文文
王希栋
鹿岩
宋勇
叶晓舟
欧阳晔
Current Assignee
Asiainfo Technologies China Inc
Original Assignee
Asiainfo Technologies China Inc
Priority date
Filing date
Publication date
Application filed by Asiainfo Technologies China Inc filed Critical Asiainfo Technologies China Inc
Priority to CN202210383749.3A
Publication of CN114828045A


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W24/00: Supervisory, monitoring or testing arrangements
    • H04W24/02: Arrangements for optimising operational condition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

The embodiment of the application provides a network optimization method, a network optimization device, electronic equipment and a computer-readable storage medium, and relates to the field of wireless communication. The method comprises the following steps: acquiring current network parameter data of a target optimization cell; then, based on the network parameter data and the network optimization index of the target optimization cell, determining according to a preset judgment strategy whether to generate a network optimization strategy through a simulated neural network or through a deep reinforcement learning model, wherein the simulated neural network is generated by training a predetermined neural network on a pre-acquired network parameter tuning experience data set of the target optimization cell, and the deep reinforcement learning model is generated based on the simulated neural network; and then, optimizing the network parameters of the target optimization cell according to the network optimization strategy. The method performs network optimization by fusing the simulated neural network and the deep reinforcement learning model, and provides a solution to the perception and decision problem of dynamic optimization of a mobile communication network.

Description

Network optimization method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of mobile communications technologies, and in particular, to a network optimization method and apparatus, an electronic device, and a computer storage medium.
Background
Various expert methods have been developed in the field of mobile communications to optimize wireless networks for various goals and to address various performance issues. These expert methods include methods for Radio Resource Management (RRM) and Self-Organizing Networks (SON). Network optimization is a global optimization process: when a parameter of a certain cell is adjusted, the network state of its neighboring cells is likely to be affected, so network optimization needs to consider not only the local cell but also its neighboring cells globally.
At present, existing solutions for optimizing a mobile communication network mainly include: (1) network self-optimization is carried out in a mode of manually formulating a rule table, various network states are inquired, and corresponding adjustment is carried out according to the rule table formulated by an expert; however, the rule table of this scheme does not cover all possible scenarios, and the optimization of the rule table across different devices is not uniform. (2) Performing network optimization based on an iterative optimization model, traversing various parameters of a base station antenna, testing and analyzing acquired network data information, performing iterative optimization on different combinations of acquired sample data and antenna parameters by taking the network performance maximization of peripheral areas as a target, and finally obtaining an optimal parameter combination to realize an optimization target; however, the optimization model of this solution cannot cover all possible situations, and if an unknown state is encountered, the optimization cannot be performed, resulting in failure of optimization.
Disclosure of Invention
The embodiment of the application provides a network optimization method, a network optimization device, electronic equipment and a computer storage medium, which fuse a simulated neural network and a deep reinforcement learning model to perform network optimization and provide a solution to the perception and decision problem of dynamic optimization of a mobile communication network. The specific technical scheme is as follows:
according to an aspect of an embodiment of the present application, there is provided a network optimization method, including:
acquiring current network parameter data of a target optimization cell;
determining a simulated neural network or a deep reinforcement learning model according to a preset decision strategy to generate a network optimization strategy based on network parameter data and network optimization indexes of a target optimization cell, wherein the simulated neural network is generated by training a preset neural network based on a pre-acquired network parameter tuning experience data set of the target optimization cell, and the deep reinforcement learning model is generated based on the simulated neural network;
and optimizing the network parameters of the target optimization cell according to the network optimization strategy.
In one possible implementation, training a predetermined neural network based on a pre-obtained network parameter tuning experience data set of a target optimization cell to generate a simulated neural network includes:
screening a training sample data set from the network parameter tuning experience data set, wherein each training sample in the training sample data set comprises state data and action data, the state data is the cell state S_i of the target optimization cell, and the action data is the action a_i performed by the target optimization cell to transition from the cell state S_i to the next cell state S_{i+1}, i being a natural number;
and taking the state data of each training sample in the training sample data set as input data of a predetermined neural network, and training the predetermined neural network through a simulated learning algorithm until an error between the output data of the predetermined neural network and the action data corresponding to the state data of each training sample meets a predetermined condition.
In one possible implementation manner, each network parameter tuning experience data in the network parameter tuning experience data set is quintuple data, and the quintuple data includes a cell identifier of the target optimization cell, a current cell state of the target optimization cell, a next cell state of the current cell state of the target optimization cell, an action performed by the target optimization cell to transition from the current state to the next state, and a reward value obtained after the action is performed;
the method comprises the following steps of screening a training sample data set from a network parameter tuning experience data set, wherein the training sample data set comprises the following steps:
and screening out a training sample data set from the network parameter tuning experience data set according to the current cell state of the target optimization cell, the next cell state of the current cell state of the target optimization cell and the action executed by the target optimization cell to transfer from the current cell state to the next cell state.
In one possible implementation, the deep reinforcement learning model includes a first neural network and a second neural network; wherein, the deep reinforcement learning model is generated based on the simulated neural network, and the deep reinforcement learning model comprises the following steps:
constructing a first neural network and a second neural network according to the structure of the simulated neural network, and copying parameters of the simulated neural network to the first neural network and the second neural network;
and determining a target parameter value theta of the first neural network by performing loop iteration on the first neural network based on the acquired current cell state S of the target optimization cell, and updating the parameter value theta' of the second neural network by copying the target parameter value theta to the second neural network at a preset interval in the process of loop iteration.
In a possible implementation manner, determining a target parameter of a first neural network by performing loop iteration on the first neural network based on the obtained current cell state S of the target optimization cell includes:
determining an action a executed by the target optimization cell to be transferred from the current cell state S to the next cell state S' through a first neural network, and determining a Q estimation value according to the action a;
determining a Q actual value through the second neural network;
determining a loss function of the first neural network according to a difference between the Q actual value and the Q estimated value;
and solving the corresponding parameter value when the loss function is minimized, and determining the parameter value as a target parameter value theta.
In a possible implementation manner, determining to generate a network optimization strategy by simulating a neural network or a deep reinforcement learning model according to a preset decision strategy based on network parameter data and a network optimization index of a target optimization cell, includes:
and directly determining to generate a network optimization strategy through the simulated neural network based on the network parameter data and the network optimization index of the target optimization cell, and determining to generate the network optimization strategy through a deep reinforcement learning model when the network optimization strategy generated through the simulated neural network cannot meet the requirement.
In a possible implementation manner, after optimizing the network parameters of the target optimization cell according to the network optimization strategy, the method further includes:
and acquiring the optimized network parameters of the target optimization cell, and updating the optimized network parameters to the network parameter adjustment experience data set of the target optimization cell.
According to another aspect of the embodiments of the present application, there is provided a network optimization apparatus, including:
the acquisition module is used for acquiring the current network parameter data of the target optimization cell;
the processing module is used for determining to generate a network optimization strategy through a simulated neural network or a deep reinforcement learning model according to a preset judgment strategy based on the network parameter data and the network optimization index of the target optimization cell, wherein the simulated neural network is generated by training a preset neural network based on a pre-acquired network parameter optimization empirical data set of the target optimization cell, and the deep reinforcement learning model is generated based on the simulated neural network;
and the optimization module is used for optimizing the network parameters of the target optimization cell according to the network optimization strategy.
In a possible implementation manner, the processing module is further configured to train a predetermined neural network based on a pre-acquired network parameter tuning experience data set of the target optimization cell to generate a simulated neural network, where the processing module is specifically configured to:
screening a training sample data set from the network parameter tuning experience data set, wherein each training sample in the training sample data set comprises state data and action data, the state data is the cell state S_i of the target optimization cell, and the action data is the action a_i performed by the target optimization cell to transition from the cell state S_i to the next cell state S_{i+1}, i being a natural number;
and taking the state data of each training sample in the training sample data set as input data of a predetermined neural network, and training the predetermined neural network through a simulated learning algorithm until an error between the output data of the predetermined neural network and the action data corresponding to the state data of each training sample meets a predetermined condition.
In one possible implementation manner, each network parameter tuning experience data in the network parameter tuning experience data set is quintuple data, and the quintuple data includes a cell identifier of the target optimization cell, a current cell state of the target optimization cell, a next cell state of the current cell state of the target optimization cell, an action performed by the target optimization cell to transition from the current state to the next state, and an award value obtained after the action is performed;
when the training sample data set is screened from the network parameter tuning experience data set, the processing module is used for:
and screening out a training sample data set from the network parameter tuning experience data set according to the current cell state of the target optimization cell, the next cell state of the current cell state of the target optimization cell and the action executed by the target optimization cell to transfer from the current cell state to the next cell state.
In one possible implementation, the deep reinforcement learning model includes a first neural network and a second neural network; wherein the processing module, when generating the deep reinforcement learning model based on the simulated neural network, is configured to:
constructing a first neural network and a second neural network according to the structure of the simulated neural network, and copying parameters of the simulated neural network to the first neural network and the second neural network;
and determining a target parameter value theta of the first neural network by performing loop iteration on the first neural network based on the acquired current cell state S of the target optimization cell, and updating the parameter value theta' of the second neural network by copying the target parameter value theta to the second neural network at a preset interval in the process of loop iteration.
In a possible implementation manner, the processing module, when determining the target parameter of the first neural network by performing loop iteration on the first neural network based on the obtained current cell state S of the target optimization cell, is configured to:
determining an action a executed by the target optimization cell to be transferred from the current cell state S to the next cell state S' through a first neural network, and determining a Q estimation value according to the action a;
determining a Q actual value through the second neural network;
determining a loss function of the first neural network according to a difference between the Q actual value and the Q estimated value;
and solving the corresponding parameter value when the loss function is minimized, and determining the parameter value as a target parameter value theta.
In one possible implementation, the processing module is further configured to:
and directly determining to generate a network optimization strategy through the simulated neural network based on the network parameter data and the network optimization index of the target optimization cell, and determining to generate the network optimization strategy through a deep reinforcement learning model when the network optimization strategy generated through the simulated neural network cannot meet the requirement.
In one possible implementation manner, the apparatus further includes an update module, where the update module is configured to:
and acquiring the optimized network parameters of the target optimization cell, and updating the optimized network parameters to the network parameter adjustment experience data set of the target optimization cell.
According to another aspect of an embodiment of the present application, there is provided an electronic apparatus including: a memory, a processor and a computer program stored on the memory, the processor executing the computer program to implement the steps of the network optimization method described above.
According to yet another aspect of embodiments of the present application, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the network optimization method described above.
According to an aspect of embodiments of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the network optimization method described above.
The network optimization method provided by the embodiment of the application can not only directly perform optimization adjustment on network parameters based on the simulated neural network, so that the existing network parameter tuning experience can be fully utilized, but also perform optimization adjustment on the network parameters through a deep reinforcement learning model constructed on the basis of the simulated neural network, so that the advantage that the deep reinforcement learning model can quickly converge is fully utilized and the adverse effect on the network caused by the exploration process is effectively avoided; the simulated neural network and the deep reinforcement learning model are fused, deep learning and reinforcement learning are combined, the advantages of the two are made complementary, and a solution is provided for the perception and decision problem of dynamic optimization of the mobile communication network.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic diagram of a mobile communication network according to an embodiment of the present application;
fig. 2 is a schematic diagram of a cellular network according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a reinforcement learning process provided in an embodiment of the present application;
fig. 4 is a schematic flowchart of a network optimization method according to an embodiment of the present application;
fig. 5 is a schematic diagram of a basic principle of network optimization provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a network dynamic optimization apparatus fusing a simulated neural network and a deep reinforcement learning model according to an embodiment of the present disclosure;
fig. 7 is a schematic process diagram of network optimization provided in an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating an implementation process of a deep reinforcement learning model according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a network optimization apparatus according to an embodiment of the present application;
fig. 10 is a structural schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below in conjunction with the drawings in the present application. It should be understood that the embodiments set forth below in connection with the drawings are exemplary descriptions for explaining technical solutions of the embodiments of the present application, and do not limit the technical solutions of the embodiments of the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the terms "comprises" and/or "comprising," when used in this specification in connection with embodiments of the present application, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items defined by the term, e.g., "A and/or B" indicates an implementation as "A", an implementation as "B", or an implementation as "A and B".
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The related art and terms related to the present application will be described and explained as follows:
fig. 1 shows a mobile communication network comprising a base station having a coverage area (or cell), a plurality of mobile devices and a backhaul network. As shown in fig. 1, the base station establishes uplink and downlink connections with the mobile devices, which are used to carry data for bi-directional transmission between the mobile devices and the base station, and the data carried by the uplink and downlink connections may include data transmitted between the mobile devices as well as data transmitted to or from the remote end through the backhaul network.
The term "base station" refers to a device or any component or collection of components for providing wireless access to a network, e.g., an enhanced base station, macrocell, femtocell, Wi-Fi access point, etc., that can provide wireless access according to one or more wireless communication protocols, e.g., long term evolution, LTE-a, high speed packet access, HSPA, etc
The term "Mobile device" refers to a device or collection of devices, such as User Equipment (UE), Mobile Station (STA), etc., that is capable of establishing a wireless connection with a base Station.
A cellular network is a typical mobile communication network and may include base stations forming a plurality of interconnected cells, through the arrangement of which the cellular network provides wireless communication coverage over a large geographic area and enables wireless communication devices to communicate with other wireless communication devices at any geographic location in the network. Fig. 2 shows a cellular network of a plurality of base stations providing cell coverage, each cell representing a geographical area, the areas covered by the cells may not overlap or may partially overlap each other. Each cell has a hexagonal shape, and each base station may be located at the center of the cell or at a corner of the cell hexagon and cover 3 adjacent hexagonal cells (or three-sector cells). Cellular networks may have a particular layout or topology that includes the relative distances between base stations and the angular directions in which their antennas are relative to each other. The cellular network shown in fig. 2 is only a simple example and may vary in particular in different application implementation scenarios.
Network optimization techniques may be applied to a cellular network to adjust one or more parameters or performance indicators of the cellular network to improve the performance of the cellular network. For example, cellular network optimization may include optimization for coverage, capacity, interference reduction, etc., or adjustment and optimization for other key performance indicators, KPIs. The cellular network parameters may include antenna electrical tilt, antenna azimuth, antenna mechanical tilt, and transmit power, among others.
In a cellular network, neighboring cells may interact with each other, such that a change in the settings of a base station associated with one cell may affect the network performance, e.g., coverage and capacity, of the neighboring cells. Changing the parameter settings of one base station to increase the coverage and capacity of its cell may cause interference to its neighboring cells and may reduce the coverage and capacity of these neighboring cells, as well as the coverage and capacity of the entire network. Increasing the number of cells in a cellular network can result in an exponential increase in the number of neighboring-cell relationships and potential interference. The cellular network is thus influenced not only by each individual cell but also by the relationships between the cells, and the optimization of the cellular network needs to take these influencing factors into account.
Simulation Learning (imitation learning) is a method of learning from expert examples, i.e., enabling an agent (e.g., a robot) to make intelligent decisions like a human expert. Using these expert examples to teach an agent to make intelligent decisions is the primary problem that imitation learning addresses. In addition, a virtual world can be constructed through simulation learning, so that the agent is allowed to freely try and learn in the virtual world. Taking network optimization as an example, the expert example refers to a state-action sequence acquired by an optimal or suboptimal strategy obtained by analyzing existing network parameter tuning experience data: a state is the cell state (cell parameters), and an action is an adjustment of a parameter of the base station associated with the cell. The network parameter tuning experience data is input to a neural network; after the neural network is trained, inputting a certain cell state makes it output the empirically optimal base station tuning action.
Fig. 3 shows a Reinforcement Learning (RL) process, in which an agent dynamically interacts with an environment: after the agent senses the environment state information, it selects an action according to the reward that taking the action may bring, then observes the reaction of the environment, and the process is repeated until the agent converges to a certain steady-state target.
Reinforcement learning contains 5 elements consisting of 2 subjects and 3 kinds of interactive data: the 2 subjects are the agent and the environment, and the 3 kinds of interactive data are states, actions, and rewards. Wherein,
Agent: the decision maker, which interacts with the environment by performing actions, obtaining state values, and receiving rewards.
Environment: everything outside the agent itself.
State: all data that can affect the agent's next action is considered the agent state.
Action: the means by which the agent interacts with the environment; it gives the agent some control over the environment, i.e., the ability to change the reward earned in the future.
Reward: a scalar feedback that measures how well the agent is doing in a given state.
Deep Reinforcement Learning (DRL) is a new algorithm that combines deep learning and reinforcement learning to achieve end-to-end learning from perception to action. In short, the input information is perceived like a human being, and then the action is directly output through a deep neural network. Deep reinforcement learning has the potential to enable an agent to achieve truly fully autonomous learning of one or more skills.
The dynamic optimization of a mobile communication network is a perception and decision problem: deep learning has strong perception capability but lacks certain decision-making capability, while reinforcement learning has decision-making capability but cannot handle the perception problem of a complex state space and action space. The network optimization method fusing simulation learning and deep reinforcement learning provided by the embodiment of the application combines deep learning and reinforcement learning, makes their advantages complementary, and provides a solution to the perception and decision problem of dynamic optimization of a mobile communication network.
The technical solutions of the embodiments of the present application and the technical effects produced by the technical solutions of the present application will be described below through descriptions of several exemplary embodiments. It should be noted that the following embodiments may be referred to, referred to or combined with each other, and the description of the same terms, similar features, similar implementation steps and the like in different embodiments is not repeated.
Fig. 4 is a schematic flowchart of a network optimization method provided in an embodiment of the present application, and as shown in fig. 4, the method includes: step S410, acquiring current network parameter data of a target optimization cell; step S420, determining to generate a network optimization strategy through a simulated neural network or a deep reinforcement learning model according to a preset judgment strategy based on the network parameter data and the network optimization index of the target optimization cell, wherein the simulated neural network is generated by training a preset neural network based on a pre-acquired network parameter optimization empirical data set of the target optimization cell, and the deep reinforcement learning model is generated based on the simulated neural network; and step S430, optimizing the network parameters of the target optimization cell according to the network optimization strategy.
In the embodiment of the present application, the current network parameter data of the target optimization cell may be periodically collected and analyzed by interacting with the target optimization cell, taking a predetermined time interval as the period. The network parameter data includes, but is not limited to, cell state data, network environment data, and the like. The cell state data (or cell state) includes, but is not limited to, cell traffic, cell resource utilization, cell reference signal reception quality, cell access success rate, and the like; the network environment data (or network environment) includes, but is not limited to, base station performance indicator data and configuration data (e.g., azimuth, transmit power), and the like.
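For illustration, the collected data can be organized as simple records. The following is a minimal sketch in Python; the field names, types, and units are assumptions for illustration and are not taken from the patent text.

from dataclasses import dataclass

@dataclass
class CellState:
    # hypothetical cell state fields collected each period
    traffic_gb: float              # cell traffic
    resource_utilization: float    # cell resource utilization, in [0, 1]
    rsrq_db: float                 # cell reference signal reception quality
    access_success_rate: float     # cell access success rate, in [0, 1]

@dataclass
class NetworkEnvironment:
    # hypothetical base station performance and configuration data
    azimuth_deg: float             # antenna azimuth
    downtilt_deg: float            # antenna downtilt
    tx_power_dbm: float            # transmit power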
After the current network parameter data of the target optimization cell is obtained, based on the network parameter data and the network optimization index of the target optimization cell, it is determined according to the preset judgment strategy whether to generate the network optimization strategy of the target optimization cell through the simulated neural network or through the deep reinforcement learning model, so as to optimize the network parameters of the target optimization cell. In practical application, the current network parameter data of the target optimization cell can be used as input data and fed into the simulated neural network or the deep reinforcement learning model, and the output data of the simulated neural network or the deep reinforcement learning model is the network optimization strategy (also referred to as the network optimization action) of the target optimization cell.
The network parameter tuning empirical data set is used to refer to expert empirical data, historical data, and other data that may be used to help select actions when training a predetermined neural network. The network parameter tuning empirical data set may be used to train a predetermined neural network to optimize the wireless network. The use of the network parameter tuning empirical data set may help to speed up the training of the predetermined neural network.
In practical application, a network tuning parameter data source can be periodically and offline collected through interaction with a target optimization cell set, and the collected network tuning parameter data source is analyzed to generate a network parameter tuning experience data set; then, on the basis of the network parameter tuning experience data set, training a predetermined neural network to generate a simulated neural network for carrying out optimization adjustment on the network parameters of the target optimization cell.
After the simulated neural network is generated, a deep reinforcement learning model can be constructed on the basis of the simulated neural network, and the deep reinforcement learning model can be quickly converged to a certain stable state after being interacted with a real network of the target optimization cell for a limited time, so that the network parameters of the target optimization cell can be optimized and adjusted.
The network optimization method provided by the embodiment of the application can not only directly perform optimization adjustment on network parameters based on the simulated neural network, so that the existing network parameter tuning experience can be fully utilized, but also perform optimization adjustment on the network parameters through a deep reinforcement learning model constructed on the basis of the simulated neural network, so that the advantage that the deep reinforcement learning model can quickly converge is fully utilized and the adverse effect on the network caused by the exploration process is effectively avoided; the simulated neural network and the deep reinforcement learning model are fused, deep learning and reinforcement learning are combined, the advantages of the two are made complementary, and a solution is provided for the perception and decision problem of dynamic optimization of the mobile communication network.
The basic principle of the embodiment of the present application is shown in fig. 5. Before the embodiments of the present application are described in detail, rewards of states and actions that may be involved in the simulation neural network and the deep reinforcement learning model of the embodiments of the present application are specifically defined in a network optimization scenario as follows:
the states include statistics of base station performance indexes and information obtained from measurement reports provided by the UE, such as cell traffic, cell resource utilization, cell reference signal received quality RSRQ, cell access success rate, etc. A suitable subset of cell indicators may be selected as the cell state space based on a specific network optimization objective.
The action is the action performed to make the cell transition from the current state S_t to the next state S_{t+1}. Cell actions include adjusting parameters of the base station associated with the cell, such as: cell transmit power, antenna azimuth, antenna downtilt angle, cell access threshold, cell handover threshold, etc. A suitable subset of cell parameters may be selected as the cell action space based on a specific network optimization objective.
The reward is calculated by a cost function designed from the network optimization objective. When there is more than one optimization objective, each objective can be taken into account in the reward function, and the importance of each objective is reflected by setting weights. For example, when optimizing the network for coverage and interference, the cost function can be designed as: Fc = (w × N1 + (1 - w) × N2) / N, where w is a weight coefficient in the range [0, 1], N1 is the number of MRs received from UEs whose RSRP for the cell is equal to or greater than the RSRP threshold, N2 is the number of MRs received from UEs whose RSRQ for the cell is equal to or greater than the RSRQ threshold, and N = N1 + N2. The term w × N1 reflects the weight of coverage in the cost function and the term (1 - w) × N2 reflects the weight of interference in the cost function.
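As a worked illustration of this cost function, the following sketch computes Fc = (w × N1 + (1 - w) × N2) / N from a list of measurement reports; the function name, report fields, and threshold values are assumptions for illustration only.

def coverage_interference_reward(measurement_reports, w=0.5,
                                 rsrp_threshold=-110.0, rsrq_threshold=-12.0):
    # N1: MRs whose cell RSRP is at or above the RSRP threshold (coverage term)
    n1 = sum(1 for mr in measurement_reports if mr["rsrp"] >= rsrp_threshold)
    # N2: MRs whose cell RSRQ is at or above the RSRQ threshold (interference term)
    n2 = sum(1 for mr in measurement_reports if mr["rsrq"] >= rsrq_threshold)
    n = n1 + n2
    if n == 0:
        return 0.0
    # Fc = (w * N1 + (1 - w) * N2) / N, with w in [0, 1]
    return (w * n1 + (1 - w) * n2) / n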
In a possible implementation manner of the embodiment of the present application, in a process of determining, according to a preset decision strategy, to generate a network optimization strategy by simulating a neural network or a deep reinforcement learning model based on network parameter data and a network optimization index of a target optimization cell, the following processing may be performed: and directly determining to generate a network optimization strategy through the simulated neural network based on the network parameter data and the network optimization index of the target optimization cell, and determining to generate the network optimization strategy through a deep reinforcement learning model when the network optimization strategy generated through the simulated neural network cannot meet the requirement.
In one example, whether the network optimization strategy is generated by the simulated neural network or by the deep reinforcement learning model can be selected intelligently according to the change of the network optimization index of the target optimization cell. In practical application, based on the strategy principles of experience first and little exploration, applying the simulated neural network can be considered preferentially, so that the existing network parameter tuning experience data set is fully utilized; in this case, the current network parameter data of the target optimization cell may be input to the simulated neural network as input data, and the output of the simulated neural network is the network optimization strategy (or network optimization action) of the target optimization cell. When the network optimization effect generated by the simulated neural network cannot meet the optimization requirement, the deep reinforcement learning model is triggered to generate the network optimization strategy of the next period; in this case, the current network parameter data of the target optimization cell may be input to the deep reinforcement learning model as input data, and the output of the deep reinforcement learning model is the network optimization strategy (or network optimization action) of the target optimization cell.
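A minimal sketch of this experience-first, explore-little selection logic follows; the improvement threshold, the KPI history format, and the act() methods are assumptions for illustration, not details from the patent.

def select_strategy(network_parameter_data, imitation_net, drl_agent,
                    kpi_history, min_kpi_gain=0.05):
    # Prefer the simulated (imitation) neural network; fall back to the deep
    # reinforcement learning model when the KPI gain of the previous
    # imitation-based period does not meet the optimization requirement.
    if len(kpi_history) < 2 or kpi_history[-1] - kpi_history[-2] >= min_kpi_gain:
        return imitation_net.act(network_parameter_data)
    return drl_agent.act(network_parameter_data)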
In a possible implementation manner of the embodiment of the present application, after the network parameters of the target optimization cell are optimized according to the network optimization strategy, the optimized network parameters of the target optimization cell may also be obtained, and the optimized network parameters are updated to the network parameter tuning experience data set of the target optimization cell.
In practical application, in the process of optimizing the target optimization cell, the optimized network parameters of the target optimization cell may be periodically obtained with a predetermined time interval as a period, and the optimized network parameters are updated to the network parameter optimization experience data set of the target optimization cell. For example, the method interacts with a target optimization cell, periodically collects and analyzes network parameters such as optimized communication base station performance index data and configuration data, and uses the network parameters as incremental updating of a network parameter tuning experience data set, that is, the network parameters optimized by the target optimization cell are continuously updated into the network parameter tuning experience data set to incrementally update the existing network parameter tuning experience data, so that the memory storage of the simulated neural network and the deep reinforcement learning model is incrementally updated, and the iterative process of interactive learning of the simulated neural network and the deep reinforcement learning model and the cell network environment is further accelerated.
In other words, as the network parameter tuning experience data source is continuously analyzed, the network parameter tuning experience data set is incrementally updated. Based on the incrementally updated data set, the storage memory of the simulated neural network and the deep reinforcement learning model can be incrementally updated, so that on one hand a newly built agent (such as a deep reinforcement learning model) can obtain the full network parameter tuning experience, and on the other hand an established agent can learn the incrementally updated network parameter tuning experience, which further accelerates the agent learning process and reduces the number of iterations of interactive learning with the environment.
In a possible implementation manner of the embodiment of the present application, the following processing may be performed in the process of training a predetermined neural network based on the pre-obtained network parameter tuning experience data set of the target optimization cell to generate the simulated neural network: firstly, a training sample data set is screened from the network parameter tuning experience data set, and each training sample in the training sample data set comprises state data and action data, the state data being the cell state S_i of the target optimization cell and the action data being the action a_i performed by the target optimization cell to transition from the cell state S_i to the next cell state S_{i+1}, i being a natural number; and then, the state data of each training sample in the training sample data set is taken as input data of the predetermined neural network, and the predetermined neural network is trained through the simulated learning algorithm until the error between the output data of the predetermined neural network and the action data corresponding to the state data of each training sample meets a predetermined condition.
Each network parameter tuning experience data in the network parameter tuning experience data set is quintuple data, and the quintuple data comprises a cell identifier of a target optimization cell, a current cell state of the target optimization cell, a next cell state of the current cell state of the target optimization cell, an action executed by the target optimization cell to shift from the current state to the next state, and a reward value obtained after the action is executed.
In one example, the quintuple data may be written with the following mathematical identifiers:
(I_c, S_t, S_{t+1}, a_t, r_t)
where I_c denotes the cell identity, S_t denotes the current state of the cell, S_{t+1} denotes the next state of the cell, a_t denotes the action performed to make the cell transition from the current state S_t to the next state S_{t+1}, and r_t denotes the reward obtained after taking the action a_t.
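For illustration, one such quintuple record could be held in a small named structure; the field names and sample values below are assumptions, not from the patent.

from collections import namedtuple

# one network parameter tuning experience record (I_c, S_t, S_t+1, a_t, r_t)
Experience = namedtuple("Experience",
                        ["cell_id", "state", "next_state", "action", "reward"])

record = Experience(
    cell_id="cell_0231",
    state=(0.82, -13.5, 0.91),       # e.g. utilization, RSRQ, access success rate
    next_state=(0.61, -11.2, 0.97),  # state after the tuning action took effect
    action=3,                        # index into the discrete tuning-action space
    reward=0.73,                     # reward obtained after taking the action
)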
Specifically, the process of screening out the training sample data set from the network parameter tuning experience data set may be: and screening out a training sample data set from the network parameter tuning experience data set according to the current cell state of the target optimization cell, the next cell state of the current cell state of the target optimization cell and the action executed by the target optimization cell to transfer from the current cell state to the next cell state.
The process of generating the simulated neural network can be divided into two main stages, wherein the first stage is to construct a learning data set (namely a training sample data set) of the simulated neural network; the second phase is modeling that mimics the neural network, i.e., by training a predetermined neural network to mimic a learning training sample data set, to generate a network optimization strategy. Wherein:
before a training sample data set simulating a neural network is constructed, a network parameter tuning experience data source of a cell can be acquired offline in advance, and a network parameter tuning experience data set can be obtained by analyzing and extracting the network parameter tuning experience data source.
In the process of establishing the learning data set for the simulated neural network, because the network parameter tuning experience data set has been obtained, and the existing network parameter tuning experience data set usually gives the action a_i performed to transfer from a current cell state S_i with poor performance to a next cell state S_{i+1} whose performance is up to standard (i.e., the next cell state of the current cell state S_i), with i a natural number (i.e., 0 or a positive integer), a set of sample data A = {(S_i, a_i)} mapping a cell state to an empirically optimal action can be extracted from the network parameter tuning experience data set for training the simulated neural network. Here (S_i, a_i) represents one training sample, S_i is the input X of the simulated neural network and can be regarded as the state data comprised by the training sample, and a_i is the output y of the simulated neural network and can be regarded as the action data comprised by the training sample.
Then, the state data (e.g., S_i) of each training sample in the training sample data set (e.g., {(S_i, a_i)}) is used as input data of the predetermined neural network, and the predetermined neural network is trained through the simulated learning algorithm until the error between the output data (e.g., a_i) of the predetermined neural network and the action data corresponding to the state data of each training sample satisfies a predetermined condition, so as to obtain a trained simulated neural network. For example, in modeling the simulated neural network, the network parameter tuning experience data set can be learned by using the predetermined neural network, where the input of the predetermined neural network is the cell state of the problem cell (i.e., the target optimization cell), the output of the predetermined neural network is the optimal or suboptimal empirical action (i.e., the error between the output data of the predetermined neural network and the action data corresponding to the state data of each training sample satisfies the predetermined condition), and the optimal or suboptimal empirical action can be executed to bring the network performance of the problem cell up to standard.
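A minimal supervised training sketch for this stage is given below, using PyTorch as an assumed framework; the layer sizes, loss, and stopping threshold are illustrative assumptions, not details from the patent.

import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 8, 12          # assumed sizes of the state and action spaces

# predetermined neural network: maps a cell state vector to scores over actions
imitation_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, NUM_ACTIONS),
)

def train_imitation(states, expert_actions, epochs=200, lr=1e-3, target_loss=0.05):
    # states: FloatTensor [N, STATE_DIM]; expert_actions: LongTensor [N] of a_i indices
    optimizer = torch.optim.Adam(imitation_net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(imitation_net(states), expert_actions)
        loss.backward()
        optimizer.step()
        if loss.item() < target_loss:   # error meets the predetermined condition
            break
    return imitation_net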
According to the method, the simulated neural network is created by applying the simulated learning algorithm to learn the network parameter tuning experience data set, generating a network optimization strategy that maps a cell state to the optimal empirical action; the simulated neural network serves as the basis for the subsequent construction of the deep reinforcement learning model.
In one possible implementation of the embodiment of the present application, the deep reinforcement learning model includes a first neural network and a second neural network; in the process of generating the deep reinforcement learning model based on the simulated neural network, the following processing can be executed: firstly, constructing a first neural network and a second neural network according to the structure of the simulated neural network, and copying parameters of the simulated neural network to the first neural network and the second neural network; next, based on the obtained current cell state S of the target optimization cell, a target parameter value θ of the first neural network is determined by performing loop iteration on the first neural network, and the parameter value θ' of the second neural network is updated by copying the target parameter value θ to the second neural network at predetermined intervals during the loop iteration.
In a possible implementation manner, in the process of determining the target parameter of the first neural network by performing loop iteration on the first neural network based on the obtained current cell state S of the target optimization cell, the following processing may be performed: firstly, determining the action a executed by the target optimization cell to be transferred from the current cell state S to the next cell state S' through the first neural network, and determining a Q estimation value according to the action a; then, determining a Q actual value through the second neural network;
determining a loss function of the first neural network according to a difference between the Q actual value and the Q estimated value; and then solving the corresponding parameter value when the loss function is minimized, and determining the parameter value as a target parameter value theta.
In practical applications, the embodiment of the present application uses DQN (Deep Q-Network) to generate the deep reinforcement learning model. DQN combines Q-learning with deep learning: it uses a deep neural network to represent the value function and predict the Q value, and learns an optimal action path by continuously updating the neural network. The Q-learning algorithm is a value-based reinforcement learning algorithm, Q is an abbreviation of quality, and the Q function Q(state, action) represents the quality of performing the action in that state, that is, how much Q value (return) can be obtained.
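For reference, the tabular Q-learning update that DQN generalizes can be sketched as follows; DQN replaces the Q table with the deep neural network described next. The learning rate and discount factor values are illustrative assumptions.

from collections import defaultdict

Q = defaultdict(lambda: defaultdict(float))   # Q[state][action] -> estimated value

def q_learning_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    best_next = max(Q[s_next].values(), default=0.0)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])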
DQN contains 2 neural networks, eval_net (i.e., the first neural network described above) and target_net (i.e., the second neural network described above), with the same structure but different parameters. At the beginning, the parameters of the eval_net neural network and the target_net neural network are the same. The network whose parameters are relatively fixed is target_net, which is used to obtain the Q target value or actual value (Q_target), while the other, constantly updated, network is used to obtain the Q evaluation or estimation value (Q_eval); after every N actions, the parameters of the eval_net neural network are synchronized to the target_net neural network, which is why target_net is called relatively fixed. The reason for this design is that the difference between Q_target and Q_eval is the loss function, and DQN is trained to reduce the difference between Q_target and Q_eval, so the target needs to be relatively fixed for convergence. In other words, the eval_net neural network is used to predict the Q estimation value, and its parameters are updated in real time by learning the network parameter data (including states, actions, rewards, etc.) obtained from interacting with the environment; the target_net neural network is used to predict the Q actual value, and its parameter updates are periodically copied from the eval_net neural network.
The process of generating the deep reinforcement learning model comprises initialization of the deep reinforcement learning model and loop iteration of the deep reinforcement learning model, wherein:
the initialization of the deep reinforcement learning model comprises the following steps:
a) 2 neural networks, eval_net and target_net, which have the same structure as the simulated neural network, are constructed, with parameters θ and θ' respectively.
b) The parameters of the simulated neural network are copied to the eval_net neural network and the target_net neural network.
c) The hyper-parameters are determined: learning rate α, discount factor γ, target_net parameter update period, memory storage size, etc.
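An initialization sketch corresponding to steps a) to c) follows, continuing the earlier PyTorch sketch (both networks copy the structure and parameters of the simulated neural network); the hyper-parameter values are illustrative assumptions.

import copy

eval_net = copy.deepcopy(imitation_net)     # first neural network, parameters θ
target_net = copy.deepcopy(imitation_net)   # second neural network, parameters θ'
target_net.load_state_dict(eval_net.state_dict())

ALPHA = 1e-3                 # learning rate α
GAMMA = 0.9                  # discount factor γ
TARGET_UPDATE_PERIOD = 50    # copy θ to θ' every N steps
MEMORY_SIZE = 10_000         # memory storage size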
The loop iteration of the deep reinforcement learning model comprises the following steps:
a) acquiring a current cell state s;
b) depending on the current cell state s, an action a is selected; the action may be selected by eval_net, where a = argmax_a Q(s, a; θ), i.e., the action a performed to make the target optimization cell transition from the current cell state s to the next cell state s' is determined by the first neural network eval_net;
c) performing an action a, acting on the cell environment (for example, performing the action a to optimize the network parameters of the target optimization cell), and transitioning the environment state to a state s '(for example, transitioning the target optimization cell from the current cell state s to the next cell state s'), and feeding back a reward r;
d) storing the data (s, a, r, s') generated by the interaction process into a memory storage;
e) calculating the Q estimation value, where the Q estimation value is Q(s, a) corresponding to the action a in step b);
f) calculating the Q actual value, which is given by the second neural network target_net: Q actual = r + γ max_{a'} Q(s', a'; θ');
g) calculating the Q difference, where Q difference = (Q actual - Q estimate), and using the Q difference to define the loss function Loss of the first neural network eval_net;
h) solving for the parameter θ of the first neural network eval_net that minimizes the loss function Loss;
i) updating the parameter θ' of target_net by copying the parameters of the first neural network eval_net to the second neural network target_net every N steps;
and repeating the steps a) to i) until the deep reinforcement learning model converges.
Through the above description, it can be seen that the network optimization method in the embodiment of the present application mainly includes 5 functional modules, as shown in fig. 6; each functional module is specifically as follows:
1) the network optimization experience analysis module: interacting with each target optimization cell, periodically and offline collecting a network parameter tuning data source, analyzing the existing network parameter tuning data source, and generating a network parameter tuning experience data set;
2) the simulation learning module: establishing a simulated neural network, applying a simulated learning algorithm, adjusting and optimizing the experience data set by the learning network parameters, and generating the simulated neural network of strategy mapping from the cell state to the optimal experience action;
3) the deep reinforcement learning module: constructing a deep reinforcement learning model (also called a deep reinforcement learning agent) by copying the structure and parameter values of the simulated neural network, so that the deep reinforcement learning agent obtains the existing network parameter tuning experience and can quickly converge to a certain stable state after a limited amount of interaction with the real network environment; the network parameters can then be optimized and adjusted based on the deep reinforcement learning agent;
4) the network data real-time analysis module: and interacting with each target optimization cell, periodically acquiring and analyzing network parameters such as performance index data and configuration data of the communication base station, and updating input data (namely network parameter data) such as network environment, cell state and the like required by a cell simulation learning module and a deep reinforcement learning module for reasoning a network optimization strategy in the next period.
5) the network optimization algorithm selection module: according to the change of the network optimization index of the target optimization cell, intelligently selecting the simulated neural network or the deep reinforcement learning model to generate the network optimization strategy. The simulated neural network is preferentially selected to generate the network parameter optimization strategy, which is issued to the corresponding network optimization execution equipment; if the network optimization index is not obviously improved (namely the network optimization strategy generated by the simulated neural network cannot meet the requirement), the deep reinforcement learning model is selected instead, and after a limited number of interactions between the deep reinforcement learning model and the target optimization cell, a network parameter optimization strategy is generated and issued to the corresponding network optimization execution equipment.
A specific execution flow of the network optimization method according to the embodiment of the present application may be as shown in fig. 7, and includes the following processing steps:
1) The off-line data acquisition module comprises three parts: the network parameter tuning data source, the network parameter tuning data set, and the incremental updating of the network parameter tuning data. The network parameter tuning data source is acquired off-line, and the network parameter tuning experience data source is analyzed and extracted to obtain a network parameter tuning data set (namely the network parameter tuning experience data set). Each element in the data set is a 5-tuple, and each 5-tuple comprises: a cell identity identifying the cell (e.g., the cell identity of a target optimized cell), the current state of the cell, the next state of the cell, the action that transitions the cell from the current state to the next state, and the reward value obtained after the action is taken. The mathematical notation is as follows:
(I_c, S_t, S_{t+1}, a_t, r_t)
wherein I_c denotes the cell identity, S_t denotes the current state of the cell, S_{t+1} denotes the next state of the cell, a_t denotes the action performed to transfer the cell from the current state S_t to the next state S_{t+1}, and r_t denotes the reward obtained after taking action a_t.
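As an illustrative aside, one possible container for such a 5-tuple record might look as follows; the field names and types are assumptions, not part of the embodiment.

```python
# Illustrative container for one network-parameter-tuning experience record
# (I_c, S_t, S_{t+1}, a_t, r_t); field names are assumptions.
from dataclasses import dataclass
from typing import Sequence

@dataclass
class TuningExperience:
    cell_id: str                 # I_c   : identity of the target optimized cell
    state: Sequence[float]       # S_t   : current cell state
    next_state: Sequence[float]  # S_t+1 : next cell state after the action
    action: int                  # a_t   : action that moved S_t to S_t+1
    reward: float                # r_t   : reward obtained after taking a_t
```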
It should be noted that the incremental updating of the network parameter tuning data is the content described in step 9) below.
2) The target cell state and network environment module analyzes the environment data in real time, namely, the cell state data, the network environment data and the like of the target optimization cell are analyzed in real time, and the current network parameter data of the target optimization cell are obtained.
3) The network optimization algorithm selection module selects the simulated neural network or the deep reinforcement learning model to generate the optimization action (namely the network optimization strategy) according to the decision strategy. The simulated neural network is considered first, so that the existing network parameter tuning experience data set is fully utilized; when the optimization effect of the simulated neural network cannot meet the optimization requirement, the deep reinforcement learning model is triggered to generate the network optimization action (namely the network optimization strategy) for the next period. Because the deep reinforcement learning model is fused with the simulated neural network, it can converge quickly and avoids adverse effects on the network during exploration.
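As an illustration of the decision strategy described in step 3), a minimal selection rule might look as follows; the KPI-gain threshold and the function name are assumptions, since the embodiment does not fix a concrete criterion.

```python
# Illustrative decision strategy: prefer the simulated (imitation) neural network,
# fall back to the deep reinforcement learning agent when the KPI gain is not significant.
MIN_KPI_GAIN = 0.05   # assumed threshold for "obvious improvement"

def choose_policy_source(kpi_before: float, kpi_after_imitation: float) -> str:
    """Return which model should generate the next-period optimization action."""
    gain = (kpi_after_imitation - kpi_before) / max(abs(kpi_before), 1e-9)
    if gain >= MIN_KPI_GAIN:
        return "imitation_network"        # experience-based strategy is sufficient
    return "deep_reinforcement_learning"  # trigger limited interaction with the live network
```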
The simulation learning module comprises three parts: the simulation learning data set, the construction of the simulated neural network, and the reasoning application of the simulated neural network, which correspond to steps 4) to 6) below, respectively:
4) The simulation learning module constructs the learning data set (i.e. the training sample data set) of the simulated neural network. Because the existing network parameter tuning experience data set directly gives the action a_t performed to transfer a cell from a poorly performing current state S_t to a next state S_{t+1} whose performance reaches the standard, a training sample set A = {(S_i, a_i)} mapping states to empirically optimal actions can be extracted from the network parameter tuning experience data set, where (S_i, a_i) represents one sample, S_i is the input x of the simulated neural network, and a_i is its output y.
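A minimal sketch of the extraction in step 4), assuming the TuningExperience record sketched earlier and using a reward threshold as a stand-in for "performance up to standard"; both the function name and the threshold are assumptions.

```python
# Illustrative extraction of the imitation-learning sample set A = {(S_i, a_i)}.
def build_imitation_dataset(experiences, reward_threshold: float = 0.0):
    samples = []
    for exp in experiences:
        # keep transitions whose next state reached the performance target,
        # so the recorded action is treated as the empirically optimal one
        if exp.reward >= reward_threshold:
            samples.append((exp.state, exp.action))   # (S_i, a_i): input x, label y
    return samples
```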
5) Constructing the simulated neural network: the simulated neural network is modeled by learning the network parameter tuning data set; its input is the state of a problem cell, its output is an optimal or suboptimal empirical action, and executing that action brings the network performance of the problem cell up to the standard.
6) Reasoning application of the simulated neural network: the simulated neural network is created by applying an imitation learning algorithm to learn the network parameter tuning experience data set, and generates a strategy mapping from the cell state to the optimal experience action; it also serves as the basis for constructing the subsequent deep reinforcement learning agent.
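A minimal sketch of steps 5) and 6), assuming behaviour cloning with a small fully connected network; the layer sizes, epoch count and learning rate are assumptions.

```python
# Illustrative simulated (imitation) network: maps a problem-cell state to an
# empirically optimal action; sizes and training settings are assumptions.
import torch
import torch.nn as nn

def train_imitation_net(samples, state_dim, n_actions, epochs=50, lr=1e-3):
    net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    x = torch.tensor([s for s, _ in samples], dtype=torch.float32)
    y = torch.tensor([a for _, a in samples], dtype=torch.long)
    for _ in range(epochs):
        loss = nn.functional.cross_entropy(net(x), y)   # behaviour-cloning loss
        opt.zero_grad(); loss.backward(); opt.step()
    return net

def infer_action(net, state):
    # reasoning application: policy mapping from cell state to best experience action
    with torch.no_grad():
        return int(net(torch.tensor(state, dtype=torch.float32)).argmax())
```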
7) -8): detailed description of deep reinforcement learning model, wherein, the embodiments of the present application use dqn (deep Q network) to learn deep reinforcement learning agent. The DQN comprises 2 neural networks eval _ net (namely a first neural network) and a neural network target _ net (namely a second neural network) which have the same structure and different parameters, wherein the eval _ net neural network is used for predicting the Q estimation value, and the parameters of the eval _ net are updated in real time by learning data (states, actions and rewards) interacted with the environment; the target _ net neural network is used for predicting the Q actual value, and the parameter update of the target _ net is copied from eval _ net periodically.
The specific process of interaction between the deep reinforcement learning model and the cell environment, the simulated neural network and the network parameter tuning experience data set is shown in fig. 8, and the specific description of the deep reinforcement learning model in steps 7) -8) is as follows:
7) the initialization of the deep reinforcement learning model comprises the following steps:
a) constructing 2 neural networks, eval_net and target_net, which have the same structure as the simulated neural network, with parameters θ and θ' respectively.
b) copying the parameters of the simulated neural network to the eval_net neural network.
c) determining the hyper-parameters: the learning rate α, the discount factor γ, the target_net parameter update period, the memory storage size, etc.
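A minimal sketch of the initialization in steps a) to c), assuming PyTorch modules and the imitation network sketched earlier; the hyper-parameter values are placeholders rather than values given by the embodiment.

```python
# Illustrative initialization: warm-start eval_net from the simulated neural network,
# then synchronize target_net; hyper-parameter values are assumptions.
import copy
from collections import deque

def init_dqn(imitation_net):
    eval_net = copy.deepcopy(imitation_net)     # steps a)-b): same structure, theta <- imitation parameters
    target_net = copy.deepcopy(eval_net)        # step a): same structure, theta' = theta initially
    hyper = dict(learning_rate=1e-3,            # alpha
                 discount=0.9,                  # gamma
                 target_update_period=100,      # N steps between theta' <- theta copies
                 memory_size=10_000)
    memory = deque(maxlen=hyper["memory_size"])  # step c): memory storage
    return eval_net, target_net, memory, hyper
```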
8) The loop iteration of the deep reinforcement learning model comprises the following steps:
a) acquiring a current cell state s;
b) selecting an action a according to the current cell state s, where a may be selected by eval_net as a = argmax_a Q(s, a; θ), i.e. the action a performed to make the target optimized cell transition from the current cell state s to the next cell state s' is determined by the first neural network eval_net;
c) performing an action a, acting on the cell environment (for example, performing the action a to optimize the network parameters of the target optimization cell), and transitioning the environment state to a state s '(for example, transitioning the target optimization cell from the current cell state s to the next cell state s'), and feeding back a reward r;
d) storing the data (s, a, r, s') generated by the interaction process into a memory storage;
e) calculating a Q estimation value, wherein the Q estimation value is Q(s, a; θ) corresponding to the action a selected in step b);
f) calculating a Q reality value, which is given by the second neural network target_net, wherein the Q reality value is r + γ·max_{a'} Q(s', a'; θ');
g) calculating a Q difference, wherein the Q difference is (Q reality value − Q estimation value), and using the Q difference to define the loss function Loss of the first neural network eval_net;
h) solving the parameter θ of the first neural network eval_net that minimizes the loss function Loss;
i) updating the parameter θ' of the second neural network target_net by copying the parameters of the first neural network eval_net to target_net every N steps;
and repeating the steps a) to i) until the deep reinforcement learning model converges.
9) Incremental updating of the network parameter tuning experience: as the network parameter tuning experience data source is continuously analyzed, incremental updates of the network parameter tuning experience data set are generated; based on the generated incremental data set, the training data of the simulated neural network and the memory storage of the deep reinforcement learning model can be incrementally updated.
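A minimal sketch of step 9), assuming the record type and data structures sketched earlier; the function name and the reward threshold are again assumptions.

```python
# Illustrative incremental update of the tuning-experience data driving both models.
def apply_incremental_update(new_records, imitation_samples, replay_memory,
                             reward_threshold: float = 0.0):
    for exp in new_records:
        # extend the memory storage of the deep reinforcement learning model
        replay_memory.append((exp.state, exp.action, exp.reward, exp.next_state))
        # extend the imitation-learning dataset with transitions that met the target
        if exp.reward >= reward_threshold:
            imitation_samples.append((exp.state, exp.action))
    return imitation_samples, replay_memory
```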
As described above, the simulated neural network obtained through imitation learning in the embodiment of the present application can make full use of the existing network parameter tuning experience data set; the deep reinforcement learning model built on the simulated neural network can converge quickly and avoids adverse effects on the network during exploration; and the continuous incremental updating of the network parameter tuning experience data set drives incremental learning of both the simulated neural network and the deep reinforcement learning model, which further accelerates the learning process of the deep reinforcement learning model and reduces the number of interactive learning iterations with the network environment. In addition, the trained neural network may help to reduce the number of iterations required to adjust the cell parameters of the wireless network according to the optimization objective, and may also help to achieve one-time optimization of the wireless network, i.e., the network parameters may only need to be adjusted once (rather than iteratively) to achieve the optimization objective.
An embodiment of the present application provides a network optimization apparatus; as shown in fig. 9, the network optimization apparatus 900 may include: an acquisition module 901, a processing module 902, and an optimization module 903, wherein,
an obtaining module 901, configured to obtain current network parameter data of a target optimization cell;
the processing module 902 is configured to determine, based on the network parameter data and the network optimization index of the target optimization cell, to generate a network optimization strategy through a simulated neural network or a deep reinforcement learning model according to a preset decision strategy, where the simulated neural network is generated by training a predetermined neural network based on a pre-acquired network parameter tuning experience data set of the target optimization cell, and the deep reinforcement learning model is generated based on the simulated neural network;
and an optimizing module 903, configured to optimize a network parameter of the target optimized cell according to the network optimization policy.
In a possible implementation manner, the processing module is further configured to train a predetermined neural network based on a pre-acquired network parameter tuning experience data set of the target optimization cell to generate the simulated neural network, where the processing module is specifically configured to:
screening a training sample data set from the network parameter tuning experience data set, wherein each training sample in the training sample data set comprises state data and action data, the state data is the cell state S_i of the target optimization cell, the action data is the action a_i performed to make the target optimization cell transition from the cell state S_i to the next cell state S_{i+1}, and i is a natural number;
and taking the state data of each training sample in the training sample data set as the input data of the predetermined neural network, and training the predetermined neural network through an imitation learning algorithm until the error between the output data of the predetermined neural network and the action data corresponding to the state data of each training sample meets a predetermined condition.
In one possible implementation manner, each network parameter tuning experience data in the network parameter tuning experience data set is quintuple data, and the quintuple data includes a cell identifier of the target optimization cell, a current cell state of the target optimization cell, a next cell state of the current cell state of the target optimization cell, an action performed by the target optimization cell to transition from the current state to the next state, and a reward value obtained after the action is performed;
when the training sample data set is screened from the network parameter tuning experience data set, the processing module is used for:
and screening a training sample data set from the network parameter tuning experience data set according to the current cell state of the target optimization cell, the next cell state of the current cell state of the target optimization cell and the action executed by the target optimization cell to transfer from the current cell state to the next cell state.
In one possible implementation, the deep reinforcement learning model includes a first neural network and a second neural network; wherein the processing module, when generating the deep reinforcement learning model based on the simulated neural network, is configured to:
constructing a first neural network and a second neural network according to the structure of the simulated neural network, and copying parameters of the simulated neural network to the first neural network and the second neural network;
and determining a target parameter value theta of the first neural network by performing loop iteration on the first neural network based on the acquired current cell state S of the target optimization cell, and updating the parameter value theta' of the second neural network by copying the target parameter value theta to the second neural network at a preset interval in the process of loop iteration.
In a possible implementation manner, the processing module, when determining the target parameter of the first neural network by performing loop iteration on the first neural network based on the obtained current cell state S of the target optimization cell, is configured to:
determining an action a executed by the target optimization cell to be transferred from the current cell state S to the next cell state S' through a first neural network, and determining a Q estimation value according to the action a;
determining a Q reality value through the second neural network;
determining a loss function of the first neural network according to the difference between the Q reality value and the Q estimation value;
and solving the corresponding parameter value when the loss function is minimized, and determining the parameter value as a target parameter value theta.
In one possible implementation, the processing module is further configured to:
and directly determining to generate a network optimization strategy through the simulated neural network based on the network parameter data and the network optimization index of the target optimization cell, and determining to generate the network optimization strategy through a deep reinforcement learning model when the network optimization strategy generated through the simulated neural network cannot meet the requirement.
In one possible implementation manner, the apparatus further includes an update module, where the update module is configured to:
and acquiring the optimized network parameters of the target optimization cell, and updating the optimized network parameters to the network parameter adjustment experience data set of the target optimization cell.
The network optimization device in the embodiment of the present application may execute the network optimization method shown in the above embodiment of the present application, and the implementation principle is similar, the actions executed by each module in the device in each embodiment of the present application correspond to the steps in the method in each embodiment of the present application, and for the detailed functional description of each module of the device, reference may be specifically made to the description in the corresponding method shown in the foregoing, and details are not repeated here.
The network optimization device of the embodiment of the present application can not only directly optimize and adjust the network parameters based on the simulated neural network, so that the existing network parameter tuning experience is fully utilized, but can also optimize and adjust the network parameters through the deep reinforcement learning model constructed on the basis of the simulated neural network, so that the fast convergence of the deep reinforcement learning model is fully utilized and the adverse effect of the exploration process on the network is effectively avoided; by fusing the simulated neural network and the deep reinforcement learning model, deep learning and reinforcement learning are combined and their advantages complement each other, providing a solution for the perception decision of dynamic optimization of the mobile communication network.
In an embodiment of the present application, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory, where the processor executes the computer program to implement the steps of the above network optimization method. Compared with the prior art, the following can be implemented: acquiring current network parameter data of a target optimization cell; then, based on the network parameter data and the network optimization index of the target optimization cell, determining, according to a preset decision strategy, to generate a network optimization strategy through a simulated neural network or a deep reinforcement learning model, wherein the simulated neural network is generated by training a predetermined neural network based on a pre-acquired network parameter tuning experience data set of the target optimization cell, and the deep reinforcement learning model is generated based on the simulated neural network; and then optimizing the network parameters of the target optimization cell according to the network optimization strategy. The electronic device can not only directly optimize and adjust the network parameters based on the simulated neural network, so as to make full use of the existing network parameter tuning experience, but can also optimize and adjust the network parameters through the deep reinforcement learning model constructed on the basis of the simulated neural network, so as to make full use of the fast convergence of the deep reinforcement learning model and effectively avoid the adverse effect of the exploration process on the network; by fusing the simulated neural network and the deep reinforcement learning model, deep learning and reinforcement learning are combined and their advantages complement each other, providing a solution for the perception decision of dynamic optimization of the mobile communication network.
In an alternative embodiment, an electronic device is provided, as shown in fig. 10, the electronic device 4000 shown in fig. 10 comprising: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, and the transceiver 4004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 10, but that does not indicate only one bus or one type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium, other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be Read by a computer, without limitation.
The memory 4003 is used for storing computer programs for executing the embodiments of the present application, and is controlled by the processor 4001 to execute. The processor 4001 is used to execute computer programs stored in the memory 4003 to implement the steps shown in the foregoing method embodiments.
Embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, and when being executed by a processor, the computer program may implement the steps and corresponding contents of the foregoing method embodiments.
Embodiments of the present application further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the steps and corresponding contents of the foregoing method embodiments can be implemented.
It should be understood that, although each operation step is indicated by an arrow in the flowchart of the embodiment of the present application, the implementation order of the steps is not limited to the order indicated by the arrow. In some implementation scenarios of the embodiments of the present application, the implementation steps in the flowcharts may be performed in other sequences as desired, unless explicitly stated otherwise herein. In addition, some or all of the steps in each flowchart may include multiple sub-steps or multiple stages based on an actual implementation scenario. Some or all of these sub-steps or stages may be performed at the same time, or each of these sub-steps or stages may be performed at different times, respectively. In a scenario where execution times are different, an execution sequence of the sub-steps or the phases may be flexibly configured according to requirements, which is not limited in the embodiment of the present application.
The foregoing is only an optional implementation manner of a part of implementation scenarios in this application, and it should be noted that, for those skilled in the art, other similar implementation means based on the technical idea of this application are also within the protection scope of the embodiments of this application without departing from the technical idea of this application.

Claims (11)

1. A method for network optimization, comprising:
acquiring current network parameter data of a target optimization cell;
determining to generate a network optimization strategy through a simulated neural network or a deep reinforcement learning model according to a preset decision strategy based on the network parameter data and the network optimization index of the target optimization cell, wherein the simulated neural network is generated by training a preset neural network based on a pre-acquired network parameter tuning experience data set of the target optimization cell, and the deep reinforcement learning model is generated based on the simulated neural network;
and optimizing the network parameters of the target optimization cell according to the network optimization strategy.
2. The method of claim 1, wherein training a predetermined neural network based on a pre-acquired network parameter tuning experience data set of the target optimization cell to generate the simulated neural network comprises:
screening a training sample data set from the network parameter tuning experience data set, wherein each training sample in the training sample data set comprises state data and action data, the state data is the cell state S_i of the target optimization cell, the action data is the action a_i performed to make the target optimization cell transition from the cell state S_i to the next cell state S_{i+1}, and the i is a natural number;
and taking the state data of each training sample in the training sample data set as the input data of the predetermined neural network, and training the predetermined neural network through an imitation learning algorithm until the error between the output data of the predetermined neural network and the action data corresponding to the state data of each training sample meets a predetermined condition.
3. The method of claim 2, wherein each network parameter tuning experience data in the network parameter tuning experience data set is a quintuple data, and the quintuple data comprises a cell identifier of the target optimization cell, a current cell state of the target optimization cell, a next cell state of the current cell state of the target optimization cell, an action performed to transition the target optimization cell from the current state to the next state, and a reward value obtained after performing the action;
the screening of the training sample data set from the network parameter tuning experience data set comprises:
and screening a training sample data set from the network parameter tuning experience data set according to the current cell state of the target optimization cell, the next cell state of the current cell state of the target optimization cell and the action executed by enabling the target optimization cell to be transferred from the current cell state to the next cell state.
4. The method of any one of claims 1-3, wherein the deep reinforcement learning model comprises a first neural network and a second neural network; wherein generating the deep reinforcement learning model based on the mimic neural network comprises:
constructing the first neural network and the second neural network according to the structure of the simulated neural network, and copying the parameters of the simulated neural network to the first neural network and the second neural network;
determining a target parameter value θ of the first neural network by performing loop iteration on the first neural network based on the acquired current cell state S of the target optimization cell, and updating the parameter value θ' of the second neural network by copying the target parameter value θ to the second neural network at predetermined intervals during the loop iteration.
5. The method according to claim 4, wherein the determining target parameters of the first neural network by performing loop iteration on the first neural network based on the obtained current cell state S of the target optimization cell comprises:
determining, by the first neural network, an action a performed to transition the target optimized cell from the current cell state S to the next cell state S', and determining a Q estimation value according to the action a;
determining, by the second neural network, a Q reality value;
determining a loss function of the first neural network based on the difference between the Q reality value and the Q estimation value;
and solving the corresponding parameter value when the loss function is minimized, and determining the parameter value as the target parameter value theta.
6. The method according to any one of claims 1 to 5, wherein the determining, based on the network parameter data and the network optimization index of the target optimization cell, a network optimization strategy generated by simulating a neural network or a deep reinforcement learning model according to a preset decision strategy comprises:
and directly determining to generate a network optimization strategy through the simulated neural network based on the network parameter data and the network optimization index of the target optimization cell, and determining to generate the network optimization strategy through the deep reinforcement learning model when the network optimization strategy generated by the simulated neural network cannot meet the requirement.
7. The method according to any of claims 1-5, further comprising, after said optimizing network parameters of said target optimized cell according to said network optimization strategy:
and acquiring the optimized network parameters of the target optimization cell, and updating the optimized network parameters to the network parameter tuning experience data set of the target optimization cell.
8. A network optimization apparatus, comprising:
the acquisition module is used for acquiring the current network parameter data of the target optimization cell;
the processing module is used for determining to generate a network optimization strategy through a simulated neural network or a deep reinforcement learning model according to a preset judgment strategy based on the network parameter data and the network optimization index of the target optimization cell, wherein the simulated neural network is generated by training a preset neural network based on a pre-acquired network parameter tuning experience data set of the target optimization cell, and the deep reinforcement learning model is generated based on the simulated neural network;
and the optimization module is used for optimizing the network parameters of the target optimization cell according to the network optimization strategy.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the steps of the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
11. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1-7 when executed by a processor.
CN202210383749.3A 2022-04-12 2022-04-12 Network optimization method and device, electronic equipment and computer readable storage medium Pending CN114828045A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210383749.3A CN114828045A (en) 2022-04-12 2022-04-12 Network optimization method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210383749.3A CN114828045A (en) 2022-04-12 2022-04-12 Network optimization method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114828045A true CN114828045A (en) 2022-07-29

Family

ID=82534932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210383749.3A Pending CN114828045A (en) 2022-04-12 2022-04-12 Network optimization method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114828045A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115314963A (en) * 2022-08-05 2022-11-08 厦门大学 Mobile switching optimization method and device based on intelligent controller of wireless access network
CN117241295A (en) * 2023-10-08 2023-12-15 青岛中企英才集团有限公司 Wireless communication network performance optimization method, device and storage medium
CN117241295B (en) * 2023-10-08 2024-04-19 江西山水光电科技股份有限公司 Wireless communication network performance optimization method, device and storage medium

Similar Documents

Publication Publication Date Title
EP3635505B1 (en) System and method for deep learning and wireless network optimization using deep learning
US10375585B2 (en) System and method for deep learning and wireless network optimization using deep learning
US20230171008A1 (en) Systems and methods for wireless signal configuration by a neural network
CN110786046A (en) Optimizing cellular networks using deep learning
US20210219384A1 (en) Procedure for optimization of self-organizing network
Orhan et al. Connection management xAPP for O-RAN RIC: A graph neural network and reinforcement learning approach
Moysen et al. Conflict resolution in mobile networks: a self-coordination framework based on non-dominated solutions and machine learning for data analytics [application notes]
Wang et al. Distributed reinforcement learning for age of information minimization in real-time IoT systems
Donevski et al. Federated learning with a drone orchestrator: Path planning for minimized staleness
Yang et al. Deep reinforcement learning based wireless network optimization: A comparative study
CN113498071A (en) Method, apparatus and program for predicting future quality of service of a wireless communication link
Moysen et al. On the potential of ensemble regression techniques for future mobile network planning
Fernandes et al. Comparison of artificial intelligence and semi-empirical methodologies for estimation of coverage in mobile networks
Lin et al. Fueling the next quantum leap in cellular networks: Embracing AI in 5G evolution towards 6G
Tashan et al. Advanced mobility robustness optimization models in future mobile networks based on machine learning solutions
CN114828045A (en) Network optimization method and device, electronic equipment and computer readable storage medium
Farooq et al. A data-driven self-optimization solution for inter-frequency mobility parameters in emerging networks
Daher et al. Q-learning for policy based SON management in wireless access networks
CN116506863A (en) Decision optimization method, decision optimization device, electronic equipment and readable storage medium
Daher et al. Linear ucb for online SON management
Diouf et al. Channel quality prediction in 5G LTE small cell mobile network using deep learning
Lei A study of wireless communications with reinforcement learning
Wei Multi-Agent Deep Reinforcement Learning Assisted Pre-connect Handover Management
CN112913274B (en) Procedure for optimization of ad hoc networks
Feriani Meta multi-objective reinforcement learning for communication load balancing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination