CN113393495B - High-altitude parabolic track identification method based on reinforcement learning - Google Patents

High-altitude parabolic track identification method based on reinforcement learning

Info

Publication number
CN113393495B
CN113393495B (application CN202110685692.8A; published as CN113393495A)
Authority
CN
China
Prior art keywords
altitude parabolic
image
model
reinforcement learning
altitude
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110685692.8A
Other languages
Chinese (zh)
Other versions
CN113393495A (en)
Inventor
郭洪飞
马向东
曾云辉
陈柄赞
何智慧
任亚平
张锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University filed Critical Jinan University
Priority to CN202110685692.8A priority Critical patent/CN113393495B/en
Publication of CN113393495A publication Critical patent/CN113393495A/en
Application granted granted Critical
Publication of CN113393495B publication Critical patent/CN113393495B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/02Affine transformations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/277Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20024Filtering details
    • G06T2207/20032Median filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20172Image enhancement details
    • G06T2207/20182Noise reduction or smoothing in the temporal domain; Spatio-temporal filtering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a high-altitude parabolic track identification method based on reinforcement learning. The method comprises the following steps: acquiring a high-altitude parabolic track image of a monitored window area through an image sensor; preprocessing the high-altitude parabolic track image to obtain preprocessed image information; judging whether the image sensor is shielded or not according to the preprocessed image information; when the image sensor is judged not to be shielded, inputting the preprocessed image information into a processor, acquiring a pre-training target model after reinforcement learning by the processor, and performing high-altitude parabolic recognition on the preprocessed image information through the pre-training target model to obtain high-altitude parabolic recognition result information; and the processor stores the high-altitude parabolic recognition result information into a data storage unit, a cloud server and a storage so as to train and update the pre-training target model. According to the method, the high-altitude parabolic track is identified through the reinforcement learning model, and the identification accuracy is improved.

Description

High-altitude parabolic track identification method based on reinforcement learning
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a high-altitude parabolic track identification method based on reinforcement learning.
Background
With further economic development, urban populations are concentrating, and production and living environments are filled with uncertainties and risks. High-altitude throwing of objects has been called "the pain hanging over cities": once it occurs it cannot be controlled or stopped immediately, it can spread within a very short time, and it causes great damage to public safety. In recent years, civil and criminal cases concerning high-altitude throwing have been increasing, and media in many places have reported injuries caused by thrown objects one after another, prompting public calls for strict regulation of such behavior to ensure people's "overhead safety". Against this background, the Supreme People's Court issued the "Opinions on Properly Adjudicating High-Altitude Throwing and Falling-Object Cases in Accordance with the Law", under which a person who endangers public safety by throwing objects from height may be convicted and punished for the crime of endangering public safety by dangerous means even if no actual damage results.
The typical problem setting for traditional reinforcement learning is the Markov Decision Process (MDP). An MDP contains a set of states S and a set of actions A. State transitions are governed by the transition probability P, the reward R and a discount factor γ. The transition probability P relates state transitions to rewards, and both depend only on the state and action of the previous time step. Reinforcement learning defines an environment in which an Agent (a software or hardware system) takes actions to maximize its reward; the basis of the Agent's optimization is the Bellman equation, a method widely used to solve practical optimization problems. When all reachable states are tractable and can be stored in computer RAM (random access memory), standard reinforcement learning handles the environment well. However, when the number of states in the environment exceeds the capacity of modern computers, the standard approach becomes much less effective. Moreover, in real environments the Agent must cope with continuous states, continuous variables and continuous control (actions). The standard, tabular reinforcement learning Q-table is therefore replaced by a deep neural network, i.e., a Q-network, which maps environment states to Agent actions; the network architecture, the choice of hyper-parameters and the learning of the Q-network weights are all completed in the training phase. DQN (Deep Q-Network) allows the Agent to explore unstructured environments and accumulate knowledge so that, over time, it can imitate human behavior. The present method uses the DQN algorithm to handle the continuous-state (non-discrete), continuous-variable and continuous-control problems in the high-altitude parabolic trajectory recognition system, as the sketch below illustrates.
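The following is a minimal sketch, not the patent's exact network, of the idea of replacing a Q-table with a Q-network that maps an environment state to one value per action. The layer sizes and the flattened 84 × 84 input are illustrative assumptions.

```python
# Minimal Q-network sketch: state in, one value estimate per action out.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        # Hypothetical layer sizes, chosen for illustration only.
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),  # one Q-value per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q_net = QNetwork(state_dim=84 * 84, num_actions=4)
state = torch.rand(1, 84 * 84)        # a flattened grayscale frame
action = q_net(state).argmax(dim=1)   # greedy action from the Q-values
```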
At present, there are already high-altitude parabolic trajectory prediction patents on the market, for example: "High-altitude parabolic detection method, device and storage medium" (publication number CN111931599A) and "High-altitude parabolic radar-vision fusion monitoring and early-warning system" (application number CN201922207460.2). The former computes the motion state of an object through image-processing algorithms to realize prediction, while the latter monitors the high-altitude parabolic trajectory with a radar system. Few approaches on the market, therefore, analyze and predict the high-altitude parabolic trajectory from the perspective of an intelligent prediction algorithm.
Disclosure of Invention
The invention aims to provide a high-altitude parabolic track identification method based on reinforcement learning so as to accurately identify a high-altitude parabolic track.
In order to achieve the purpose, the invention is realized by the following technical scheme:
a high-altitude parabolic track identification method based on reinforcement learning comprises the following steps:
s1, acquiring a high-altitude parabolic track image of the monitored window area through an image sensor;
s2, preprocessing the high-altitude parabolic track image to obtain preprocessed image information;
s3, judging whether the image sensor is blocked according to the preprocessed image information;
s4, when the image sensor is judged not to be shielded, the preprocessed image information is input to a processor, the processor obtains a pre-training target model after reinforcement learning, and high-altitude parabolic recognition is carried out on the preprocessed image information through the pre-training target model to obtain high-altitude parabolic recognition result information;
and S5, the processor stores the high altitude parabolic recognition result information into a data storage unit, a cloud server and a storage to train and update the pre-training target model.
Optionally, the S2 includes:
s2.1, converting the high-altitude parabolic image collected by the image sensor into a low-dimensional gray image;
s2.2, carrying out affine transformation on the gray level image;
s2.3, carrying out noise elimination on the gray image after affine transformation in a spatial filtering and time domain filtering mode;
and S2.4, acquiring a target detection frame of the moving object in each frame of image after noise elimination by adopting a background difference and inter-frame difference fusion method, and predicting the target detection frame of the moving object in the next frame of image according to the target detection frame in the previous frame of image through Kalman filtering to obtain the preprocessed image information.
Optionally, the S3 includes:
s3.1, acquiring pixel values and distribution characteristics in the preprocessed image information;
s3.2, judging whether the image sensor is shielded or not according to the size and the distribution characteristics of the pixel values in the preprocessed image information;
and S3.3, when the image sensor is judged to be shielded, storing the preprocessed image information into the cloud server and the storage.
Optionally, after the S2 and before the S3, the method further comprises: and storing the preprocessed image information into the cloud server and a storage.
Optionally, the step of obtaining the pre-trained target model in S4 includes:
s4.1, initializing an action model and a target model before pre-training;
s4.2, establishing a simulation environment, and transmitting the optimal action parameters to the simulation environment by the action model;
s4.3, the simulation environment simulates according to the optimal action parameters to obtain simulated action parameters, and stores the simulated action parameters to the data storage unit;
s4.4, the action model acquires the simulated action parameters from the data storage unit so as to train and update the action model;
and S4.5, copying the latest simulated action parameters to the target model after the action model is trained for C times to train and update the target model to obtain the pre-trained target model, wherein C is an integer greater than or equal to 2.
Optionally, the optimal action parameters in S4.2 include: the high-altitude parabolic track image, the high-altitude parabolic predicted track and the target model parameters.
Optionally, the simulated operation parameters in S4.3 include: the high altitude parabolic track image of the current state, the current high altitude parabolic predicted track, the current reward obtaining and the high altitude parabolic track image of the next state.
Optionally, the step of establishing a simulation environment in S4.2 includes:
s4.2.1, acquiring physical characteristics, dynamic characteristics and surrounding environment characteristics of a high-altitude parabolic moving object;
s4.2.2, analyzing the physical characteristics, dynamic characteristics and surrounding environment characteristics of the moving object of the high-altitude parabola according to the air resistance and wind speed variables of the environment of the high-altitude parabola to establish the simulation environment.
Optionally, the action model and the target model continuously obtain high-altitude parabolic track prediction error information in an updating process, so as to change a prediction strategy according to the error information and an error value of an adjacent frame high-altitude parabolic track image.
Optionally, the S5 further includes: and comparing the high-altitude parabolic recognition result information with an actual high-altitude parabolic track to obtain actual prediction error information, and feeding back the actual prediction error information to the data storage unit.
The invention has at least one of the following beneficial effects:
the method starts from the angle of an intelligent prediction algorithm and the idea of predicting the high-altitude parabolic track, and the high-altitude parabolic track is recognized through a reinforcement learning model, so that the recognition accuracy rate is improved. In the high-altitude parabolic track recognition method based on reinforcement learning, the processor acquires a pre-training target model after reinforcement learning, so that high-altitude parabolic recognition is performed on pre-processing image information through the pre-training target model, the pre-training target model does not need to train a data set labeled manually and can improve high-altitude parabolic track prediction accuracy, and the processor stores high-altitude parabolic recognition result information into a data storage unit, a cloud server and a storage, so that the pre-training target model is trained and updated, the high-altitude parabolic track prediction accuracy can be further improved, the data storage unit can improve the data utilization rate, samples participating in network training can meet the requirement of independent and same distribution, and the training stability is improved.
Furthermore, in the reinforcement-learning-based high-altitude parabolic track recognition method provided by the invention, the latest simulated action parameters are copied to the target model after every C updates of the action model to train and update it, ensuring the stability of model training. The action model and the target model also continuously acquire high-altitude parabolic track prediction error information during updating and change the prediction strategy according to this error information and the error values of adjacent frames of the track image, which effectively improves the accuracy of trajectory prediction.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
Fig. 1 is a flowchart of a high-altitude parabolic trajectory recognition method based on reinforcement learning according to this embodiment;
fig. 2 is a specific working schematic diagram of the high-altitude parabolic trajectory identification method based on reinforcement learning according to the present embodiment;
fig. 3 is a schematic diagram of a reinforcement learning model architecture provided in this embodiment.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The invention relates to a high-altitude parabolic track identification method based on reinforcement learning. The method is applied to a high-altitude parabolic track recognition system based on reinforcement learning. The system mainly comprises a simulation environment module, a data storage unit, an action model, a target model, a DQN error function module, an image acquisition module, a preprocessing module, an image storage module, a shielding prediction module, a cloud server and a memory module.
The high-altitude parabolic trajectory recognition method based on reinforcement learning of the present embodiment is described below with reference to the drawings.
Referring to fig. 1, the high-altitude parabolic trajectory identification method based on reinforcement learning according to the present embodiment includes the following steps:
and S1, acquiring a high-altitude parabolic track image of the monitored window area through an image sensor.
Specifically, an image sensor is installed at a suitable position relative to the monitored window to collect its image information. To reduce monitoring blind spots as much as possible, a plurality of image sensors at different angles are arranged for the same window, reducing the probability that a malicious thrower evades the cameras during the throw.
And S2, preprocessing the high-altitude parabolic track image to obtain preprocessed image information.
Wherein the S2 includes:
s2.1, converting the high-altitude parabolic image collected by the image sensor into a low-dimensional gray image;
s2.2, carrying out affine transformation on the gray level image;
s2.3, carrying out noise elimination on the gray image after affine transformation in a spatial filtering and time domain filtering mode;
and S2.4, acquiring a target detection frame of the moving object in each frame of image after noise elimination by adopting a background difference and inter-frame difference fusion method, and predicting the target detection frame of the moving object in the next frame of image according to the target detection frame in the previous frame of image through Kalman filtering to obtain the preprocessed image information.
Specifically, referring to FIG. 2, the collected data information is passed to the preprocessing module. The color image collected by the image sensor is converted into a low-dimensional gray-scale image; the converted image retains the main information while reducing the data-processing burden. Affine transformation of the image is then performed (scaling, stretching, rotation and translation) to produce image information suitable for prediction with the training model. Salt-and-pepper noise and Gaussian noise are eliminated through spatial filtering and time-domain filtering, and the target detection frame of the moving object in each frame is then obtained through the fusion of background difference and inter-frame difference. The background difference method better preserves the whole foreground of the target, while the frame difference method has high detection sensitivity; retaining the background-difference foreground within a window around the frame-difference foreground therefore detects the target detection frame of the moving object more completely. Finally, Kalman filtering predicts the target frame of the moving object in the next frame from the target detection frame in the previous frame, yielding the preprocessed image information. A hedged sketch of this pipeline follows.
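The sketch below illustrates, with OpenCV, the fusion of background difference and inter-frame difference plus a constant-velocity Kalman prediction. The thresholds, the simple AND-fusion of the two masks, and the Kalman noise covariances are assumptions for illustration, not values from the patent.

```python
# Sketch: background difference + frame difference fusion, Kalman prediction.
import cv2
import numpy as np

back_sub = cv2.createBackgroundSubtractorMOG2()   # background difference

def detect_moving_object(prev_gray, cur_gray):
    bg_mask = back_sub.apply(cur_gray)             # background-difference mask
    frame_diff = cv2.absdiff(cur_gray, prev_gray)  # inter-frame difference
    _, fd_mask = cv2.threshold(frame_diff, 25, 255, cv2.THRESH_BINARY)
    fused = cv2.bitwise_and(bg_mask, fd_mask)      # simplified fusion of both
    contours, _ = cv2.findContours(fused, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    return cv2.boundingRect(max(contours, key=cv2.contourArea))

# Constant-velocity Kalman filter over the box centre: state (x, y, vx, vy).
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                                [0, 0, 1, 0], [0, 0, 0, 1]], np.float32)
kf.measurementMatrix = np.eye(2, 4, dtype=np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2   # assumed noise
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
kf.errorCovPost = np.eye(4, dtype=np.float32)

def predict_next_centre(cx, cy):
    kf.correct(np.array([[cx], [cy]], np.float32))
    return kf.predict()[:2]   # predicted centre in the next frame
```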
As one example, information collected by the image sensor is passed to a pre-processing module, which performs image pre-processing on the collected image information. The native size of the collected image is 210 × 160, with 128 colors per pixel, which is converted to a grayscale image of 84 × 84 dimensions. The transformed image still retains the main information while reducing the burden of data processing.
It should be noted that, since the trajectory of the high-altitude parabola is continuous, the Agent can only obtain one frame of information from the environment at each moment, and such a static image can hardly represent the dynamic motion of the thrown object. To this end, the recognition algorithm collects the most recent N frames and combines them as the input to the model, as in the sketch below. With state information collected over a period of time, the reinforcement learning model can learn more accurate action values.
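A minimal sketch of the preprocessing and frame stacking just described: the 210 × 160 color frame becomes an 84 × 84 grayscale image, and the most recent N frames are stacked as the model input. N = 4 is an assumed value; the patent only says "the first N frames".

```python
# Sketch: grayscale downscaling and N-frame stacking for the model input.
from collections import deque

import cv2
import numpy as np

N = 4                                   # assumed stack depth
frame_stack = deque(maxlen=N)

def preprocess(bgr_frame: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
    return cv2.resize(gray, (84, 84))   # low-dimensional grayscale image

def observe(bgr_frame: np.ndarray) -> np.ndarray:
    frame_stack.append(preprocess(bgr_frame))
    while len(frame_stack) < N:         # pad with the first frame at start-up
        frame_stack.append(frame_stack[-1])
    return np.stack(frame_stack)        # shape (N, 84, 84): the model input
```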
And S3, judging whether the image sensor is blocked according to the preprocessed image information.
Wherein the S3 includes:
s3.1, acquiring pixel values and distribution characteristics in the preprocessed image information;
s3.2, judging whether the image sensor is shielded or not according to the size and the distribution characteristics of the pixel values in the preprocessed image information;
and S3.3, when the image sensor is judged to be shielded, storing the preprocessed image information into the cloud server and the storage.
Specifically, referring to FIG. 2, after the trajectory information of the high-altitude parabola has been processed by the preprocessing module, occlusion prediction is required: whether the image sensor is shielded is judged from the size and distribution characteristics of the pixel values of the preprocessed image information (a hedged illustration follows), and the prediction result can be transmitted to the cloud server and the storage. The preprocessed image information is also transmitted to the image storage module, providing historical basis and experience for subsequent similar recognition tasks.
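One plausible reading of this judgment is sketched below: a shielded lens yields a nearly uniform, typically dark frame, i.e., low pixel-value spread and a concentrated distribution. Both thresholds are assumptions, not values from the patent.

```python
# Sketch: occlusion judgment from pixel-value size and distribution.
import numpy as np

def sensor_occluded(gray: np.ndarray,
                    var_thresh: float = 50.0,    # assumed spread threshold
                    mean_thresh: float = 40.0    # assumed darkness threshold
                    ) -> bool:
    # Low variance means the pixel values are concentrated; a low mean
    # suggests the concentration is dark, as with a covered lens.
    return gray.var() < var_thresh and gray.mean() < mean_thresh
```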
After the S2 and before the S3, the method further comprises: and storing the preprocessed image information into the cloud server and a storage.
And S4, when the image sensor is judged not to be shielded, inputting the preprocessed image information into a processor, acquiring a pre-training target model after reinforcement learning by the processor, and performing high-altitude parabolic recognition on the preprocessed image information through the pre-training target model to obtain high-altitude parabolic recognition result information.
Specifically, when it is judged that the image sensor is not shielded, the preprocessed image information obtained by the preprocessing module is transmitted to the processor for data processing; the main task of predicting and recognizing the trajectory of the falling object is realized by exchanging and processing data with the established pre-trained target model. The actually occurring track is judged and predicted through the prediction experience already embodied in the pre-trained target model.
The step of obtaining the pre-training target model in S4 includes:
s4.1, initializing an action model and a target model before pre-training;
s4.2, establishing a simulation environment, and transmitting the optimal action parameters to the simulation environment by the action model;
s4.3, the simulation environment simulates according to the optimal action parameters to obtain simulated action parameters, and stores the simulated action parameters to the data storage unit;
s4.4, the action model acquires the simulated action parameters from the data storage unit so as to train and update the action model;
and S4.5, copying the latest simulated action parameters to the target model after the action model is trained for C times to train and update the target model to obtain the pre-trained target model, wherein C is an integer greater than or equal to 2.
Specifically, when initializing the action model and the target model before pre-training, the parameters to be optimized need to be extracted from the model. Here s denotes the high-altitude parabolic track image to be identified, a denotes the high-altitude parabolic predicted track, r denotes the accuracy of the prediction result, i.e., the obtained reward, t denotes the t-th time step, G denotes the cumulative reward, γ denotes the decay (discount) factor of the reward, and k indexes the steps over which the cumulative reward is accumulated. The value function Q is defined as follows:
G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1},  Q(s, a) = E[G_t | s_t = s, a_t = a]  (1)
state: S_t = f(H_t),  A_t = h(S_t)  (2)
The loss function is defined as follows, where θ denotes the model parameters:
L(θ) = E[(TargetQ − Q(s, a; θ))²]  (3)
The objective function is:
TargetQ = r + γ max_{a'} Q(s', a'; θ⁻)  (4)
The target model computes the target value according to the objective function as follows:
y_j = r_{j+1} + γ max_{a'} Q(s_{j+1}, a'; θ⁻)  (5)
where θ⁻ denotes the parameters of the Target Network and j denotes the state index. Expanding the formula further gives:
y_j = r_{j+1} + γ Q(s_{j+1}, argmax_{a'} Q(s_{j+1}, a'; θ⁻); θ⁻)  (6)
The updating method is:
θ ← θ + α (y_j − Q(s_j, a_j; θ)) ∇_θ Q(s_j, a_j; θ)  (7)
During subsequent training, the target model continuously interacts with the action model and feeds back to it:
a_t = argmax_a Q(φ(s_t), a; θ)  (8)
The main structure of the model is a Q-network Q(s; θ) whose output is a vector of length |A|; each value in the vector represents the value estimate of the corresponding action. Thus only one forward pass is needed to obtain the values of all actions, and the evaluation time is the same no matter how many actions there are. A PyTorch sketch of the target computation defined by equations (3)-(7) follows.
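The sketch below implements the loss of equation (3) with the target of equation (6), assuming `q_net` (the action model) and `target_net` (the target model) are Q-networks such as the earlier sketch; the discount value `GAMMA = 0.99` is an assumption. The gradient step of equation (7) is what `loss.backward()` plus an optimizer performs.

```python
# Sketch of equations (3)-(7): DQN target and loss in PyTorch.
import torch
import torch.nn.functional as F

GAMMA = 0.99  # decay factor γ (assumed value)

def dqn_loss(q_net, target_net, s, a, r, s_next):
    # Q(s, a; θ) for the actions actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Equation (6): select and evaluate the next action with θ⁻.
        a_star = target_net(s_next).argmax(dim=1, keepdim=True)
        target_q = r + GAMMA * target_net(s_next).gather(1, a_star).squeeze(1)
    # Equation (3): squared error between target and estimate.
    return F.mse_loss(q_sa, target_q)
```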
The step of establishing the simulation environment in S4.2 includes:
s4.2.1, acquiring physical characteristics, dynamic characteristics and surrounding environment characteristics of a high-altitude parabolic moving object;
s4.2.2, analyzing the physical characteristics, dynamic characteristics and surrounding environment characteristics of the moving object of the high-altitude parabola according to the air resistance and wind speed variables of the environment of the high-altitude parabola to establish the simulation environment.
Specifically, a virtual environment can be constructed from the real environmental characteristics at the time of the high-altitude throw, providing material for model training. A motion trajectory model is established mainly from the physical characteristics of the object's motion in high-altitude throwing, together with its dynamic characteristics and the surrounding environment, and the simulation environment is built taking variables such as air resistance and wind speed into account (a simplified simulator sketch follows). In this way the simulation fits the real high-altitude parabolic scene as closely as possible and provides the most accurate material for training the model.
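A simplified trajectory simulator under the stated variables (air resistance and wind speed) might look like the sketch below; the linear-drag model and all constants are assumptions for illustration, not the patent's simulation.

```python
# Sketch: falling-object trajectory with linear air drag and horizontal wind.
def simulate_fall(h0, mass=0.5, drag_coef=0.05, wind_vx=1.0,
                  dt=1 / 30, g=9.81):
    """Yield (x, y) positions of an object dropped from height h0 (metres)."""
    x, y, vx, vy = 0.0, h0, 0.0, 0.0
    while y > 0:
        ax = -drag_coef * (vx - wind_vx) / mass   # drag relative to the air
        ay = -g - drag_coef * vy / mass           # gravity plus vertical drag
        vx, vy = vx + ax * dt, vy + ay * dt
        x, y = x + vx * dt, y + vy * dt
        yield x, max(y, 0.0)

trajectory = list(simulate_fall(h0=30.0))  # one simulated parabolic track
```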
When the action model interacts with the simulation environment, the action model transmits the optimal action argmaxQ (s, a, theta) to the simulation environment. Wherein s is a high-altitude parabolic track image, a is a high-altitude parabolic predicted track, and theta is a target model parameter.
And S4.3, the simulation environment simulates according to the optimal action parameters to obtain simulated action parameters, and stores the simulated action parameters to the data storage unit.
The simulated action parameters in S4.3 include: the high altitude parabolic track image of the current state, the current high altitude parabolic predicted track, the current reward obtaining and the high altitude parabolic track image of the next state.
Specifically, the simulation environment transmits the current state s to the action model, and stores the current state s, the current action a, the currently obtained reward r, and the next state s' in the data storage unit.
It should be noted that during training the recognition algorithm can make decisions starting from a random scene. If decisions always started from a fixed scene, the Agent would always decide on the same frames, which is clearly not conducive to exploring more frames for learning. To enhance exploration without degrading the model, the Agent performs random actions for a short period at the beginning, so that different scene samples are obtained to the greatest extent.
And S4.4, the action model acquires the simulated action parameters from the data storage unit so as to train and update the action model.
Specifically, the action model acquires (s, a, r, s') data from the data storage unit and updates the model. The data storage unit stores sample data information, simulation prediction information and results; it is set to store one million samples, so samples over a long period can be kept (see the sketch below). When the value function is trained, a certain number of samples are taken out and training is carried out according to the information they record. In general, the data storage unit covers both the collection of samples and the sampling of samples: collected samples are stored in the structure in chronological order, and when the data storage unit is full, new samples overwrite the oldest ones. The action model acquires information from the data storage unit, realizing information transfer with, and adaptive updating from, the simulation environment.
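A minimal replay buffer matching this description: time-ordered storage, overwrite-oldest when full, uniform random sampling. The one-million capacity follows the text; the batch size of 32 is an assumption.

```python
# Sketch: the data storage unit as a uniform-sampling replay buffer.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest samples overwritten

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))  # stored in chronological order

    def sample(self, batch_size: int = 32):
        # Uniform random sampling across many interaction sequences.
        return random.sample(self.buffer, batch_size)
```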
And S4.5, copying the latest simulated action parameters to the target model after the action model is trained for C times to train and update the target model to obtain the pre-trained target model, wherein C is an integer greater than or equal to 2.
Specifically, the action model copies its parameters to the target model every C updates. If only the latest samples were used each time, the algorithm would amount to online learning; instead, the data storage unit uniformly and randomly samples a batch from the cache, because sequences obtained by interaction are correlated in the time dimension. The learned value function should represent the expectation of long-term return under the current state and action; however, the sequence obtained in a single interaction represents only one sampled trajectory under that state and action, not all possible trajectories, so the estimate differs from the expectation. This gap accumulates as the interaction lengthens, and the model becomes prone to large fluctuations. After uniform sampling is adopted, the samples of each training step usually come from multiple interaction sequences, which greatly reduces the fluctuation of any single sequence and substantially stabilizes training. Meanwhile, one sample can be used in training multiple times, improving sample utilization. Therefore, copying the model parameters to the target model every C updates reduces data instability and improves data utilization; a training-loop excerpt follows.
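The excerpt below, building on the earlier sketches (`dqn_loss`, `ReplayBuffer`), shows the parameter copy every C updates. C = 1000 is an illustrative value, and the stored transitions are assumed to already be tensors.

```python
# Sketch: one training step with target-model synchronization every C updates.
import torch

C = 1000  # copy period (assumed value)

def train_step(step, q_net, target_net, optimizer, buffer):
    # Unpack a uniformly sampled batch (tensor transitions assumed).
    s, a, r, s_next = map(torch.stack, zip(*buffer.sample()))
    loss = dqn_loss(q_net, target_net, s, a, r, s_next)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % C == 0:  # copy action-model parameters to the target model
        target_net.load_state_dict(q_net.state_dict())
```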
Optionally, the action model and the target model continuously obtain high-altitude parabolic track prediction error information in an updating process, so as to change a prediction strategy according to the error information and an error value of an adjacent frame high-altitude parabolic track image.
Specifically, the action model and the target model receive information from the DQN error module during the update process: they continuously draw its error information, change their optimization strategies according to that error value and the error value of adjacent frame pictures, and update further so as to improve prediction accuracy. The DQN error module also stores the reward information r in the data storage unit, providing data support for subsequent random, repeated training and target updating.
As an example, the obtaining step of the pre-training target model may specifically include:
At the beginning of training, the action model and the target model use the same parameters. During training, the action model is responsible for interacting with the simulation environment to obtain interaction samples. In the learning process, the target value obtained by Q-Learning is computed by the target model and compared with the estimate of the action model to update the action model. Each time training completes a certain number of iterations, the parameters of the action model are synchronized to the target model, and the next stage of learning proceeds. By using a target model, the model that computes the target value is held fixed for a period of time, which mitigates the volatility of training.
The S5 further includes: and comparing the high-altitude parabolic recognition result information with an actual high-altitude parabolic track to obtain actual prediction error information, and feeding back the actual prediction error information to the data storage unit.
Specifically, the system can judge whether an error is generated between the high-altitude parabolic recognition result information and the actual high-altitude parabolic track, and feeds back a comparison result to the data storage unit, so that an actual prediction effect is provided for a subsequent model training module, and further optimization and upgrading of the deep reinforcement learning system are promoted.
In order to make the construction process of the reinforcement learning model in the invention clear to those skilled in the art, the construction of the reinforcement learning model is described in detail below.
FIG. 3 is a schematic diagram of the reinforcement learning model architecture. The reinforcement learning model in this embodiment is a Q-network Q(s; θ) whose output is a vector of length |A|; each value in the vector represents the value estimate of the corresponding action.
The main body of the model adopts a four-layer convolutional neural network structure, where s denotes the high-altitude parabolic track image to be identified, a denotes the predicted high-altitude parabolic track, and r denotes the accuracy of the prediction result. The four layers are as follows:
The first convolutional layer outputs 32 channels, after which a ReLU nonlinearity is applied. The second convolutional layer outputs 64 channels, after which a ReLU nonlinearity is applied. The third convolutional layer outputs 64 channels, after which a ReLU nonlinearity is applied. The fourth layer is a fully connected layer with output dimension 512, after which a ReLU nonlinearity is applied. A final fully connected layer then produces the value estimates of the corresponding actions; a sketch follows, in which the kernel sizes and strides (not specified here) are assumptions.
The reinforcement learning model in this embodiment adopts an ε-greedy strategy: at first, actions are generated entirely at random (probability 100%), and this probability decays continuously as training proceeds, eventually reaching 10%; that is, the current optimal action is then executed with 90% probability. In this way the strategy shifts gradually from exploration-dominated to exploitation-dominated, combining the two well. A sketch of the schedule follows.
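The sketch below implements the decay from 1.0 to 0.1 just described; the linear schedule and its one-million-step horizon are assumptions.

```python
# Sketch: ε-greedy action selection with a linearly decaying ε.
import random

def epsilon(step, eps_start=1.0, eps_end=0.1, decay_steps=1_000_000):
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)  # 1.0 -> 0.1

def select_action(q_net, state, step, num_actions):
    if random.random() < epsilon(step):
        return random.randrange(num_actions)      # explore: random action
    return int(q_net(state).argmax(dim=1))        # exploit: optimal action
```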
It should be noted that if, during simulation training of the reinforcement learning model, testing always started from the same scene, the Agent would always make decisions on the same frames, which is clearly not conducive to exploring more frames for learning. To enhance exploration without degrading the model, the Agent can be set to perform random actions for a short period from the start, so that different scene samples are obtained to the greatest extent.
When processing frame images collected by the image sensor, the pictures of adjacent frames are highly similar, so the same action can generally be taken for very similar pictures. The algorithm therefore skips the judgment of a certain number of frames, reducing its space-time complexity and avoiding repeated processing of redundant data.
Meanwhile, because the variance of the reward value is large, the score must be compressed into a range the model handles well so that the reinforcement learning model can better fit the long-term return; the return obtained in each round is compressed to between -1 and 1 (see the combined sketch below).
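The two steps combine naturally, as in the sketch below; the skip of 4 frames and the `env.step` interface are assumptions for illustration.

```python
# Sketch: frame skipping plus reward compression into [-1, 1].
SKIP = 4  # assumed number of skipped frame judgments

def step_with_skip(env, action):
    total_reward = 0.0
    for _ in range(SKIP):                     # repeat the action, skip judgments
        obs, reward, done = env.step(action)  # hypothetical environment interface
        total_reward += reward
        if done:
            break
    clipped = max(-1.0, min(1.0, total_reward))  # compress return to [-1, 1]
    return obs, clipped, done
```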
In summary, as shown in fig. 2, the main steps of the high-altitude parabolic trajectory identification method based on reinforcement learning can be divided into two stages:
in the model training phase: initializing an action model and a target model; the action model interacts with the simulation environment, the action model transmits the optimal action argmaxQ (s, a, theta) to the simulation environment, the simulation environment transmits the current state s to the action model, and stores the current state s, the current action a, the currently obtained reward r and the next state s' in the data storage unit; the action model acquires (s, a, r, s') data from the data storage unit and updates the model; copying the model parameters to the target model for each C times of updating of the action model; the action model and the target model receive information from the DQN error module in the updating process; the DQN error module stores the reward information r to the data storage unit.
In the model application phase: the image acquisition module acquires image information in a real scene and transmits it to the preprocessing module; the preprocessing module performs image cropping and median filtering to extract the key areas of the images; after preprocessing, the image is stored in the cloud server and the storage through the image storage module; the preprocessed image is transmitted to the shielding prediction module, which judges whether the camera is shielded in practical application and transmits the result to the cloud server and the storage; and the processor judges the preprocessed image information with the model trained in the model training phase, predicts the high-altitude parabolic trajectory, and transmits the relevant results to the data storage unit to further train and update the target model.
In the high-altitude parabolic track recognition method based on reinforcement learning, the processor acquires a pre-trained target model after reinforcement learning and performs high-altitude parabolic recognition on the preprocessed image information through it; the pre-trained target model needs no manually labeled training data set and improves the accuracy of trajectory prediction. The processor stores the recognition result information in the data storage unit, the cloud server and the storage so that the pre-trained target model is trained and updated, which further improves prediction accuracy; the data storage unit also improves data utilization, lets the samples participating in network training satisfy the independent-and-identically-distributed requirement, and improves training stability.
Furthermore, in the reinforcement-learning-based high-altitude parabolic track recognition method provided by the invention, the latest simulated action parameters are copied to the target model after every C updates of the action model to train and update it, ensuring the stability of model training. The action model and the target model also continuously acquire high-altitude parabolic track prediction error information during updating and change the prediction strategy according to this error information and the error values of adjacent frames of the track image, which effectively improves the accuracy of trajectory prediction.
It should be noted that the logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (8)

1. A high-altitude parabolic track identification method based on reinforcement learning is characterized by comprising the following steps:
s1, acquiring a high-altitude parabolic track image of the monitored window area through an image sensor;
s2, preprocessing the high-altitude parabolic track image to obtain preprocessed image information;
s3, judging whether the image sensor is blocked according to the preprocessed image information;
s4, when the image sensor is judged not to be shielded, the preprocessed image information is input to a processor, the processor obtains a pre-training target model after reinforcement learning, and high-altitude parabolic recognition is carried out on the preprocessed image information through the pre-training target model to obtain high-altitude parabolic recognition result information;
s5, the processor stores the high altitude parabolic recognition result information into a data storage unit, a cloud server and a storage to train and update the pre-training target model;
the step of obtaining the pre-training target model in S4 includes:
s4.1, initializing an action model and a target model before pre-training;
s4.2, establishing a simulation environment, and transmitting the optimal action parameters to the simulation environment by the action model;
s4.3, the simulation environment simulates according to the optimal action parameters to obtain simulated action parameters, and stores the simulated action parameters to the data storage unit;
s4.4, the action model acquires the simulated action parameters from the data storage unit so as to train and update the action model;
s4.5, copying the latest simulated action parameters to the target model after the action model is trained for C times to train and update the target model to obtain the pre-trained target model, wherein C is an integer greater than or equal to 2;
the step of establishing the simulation environment in S4.2 includes:
s4.2.1, acquiring physical characteristics, dynamic characteristics and surrounding environment characteristics of a high-altitude parabolic moving object;
s4.2.2, analyzing the physical characteristics, dynamic characteristics and surrounding environment characteristics of the moving object of the high-altitude parabola according to the air resistance and wind speed variables of the environment of the high-altitude parabola to establish the simulation environment.
2. The reinforcement learning-based high-altitude parabolic trajectory recognition method according to claim 1, wherein the S2 includes:
s2.1, converting the high-altitude parabolic image collected by the image sensor into a low-dimensional gray image;
s2.2, carrying out affine transformation on the gray level image;
s2.3, carrying out noise elimination on the gray image after affine transformation in a spatial filtering and time domain filtering mode;
and S2.4, acquiring a target detection frame of the moving object in each frame of image after noise elimination by adopting a background difference and inter-frame difference fusion method, and predicting the target detection frame of the moving object in the next frame of image according to the target detection frame in the previous frame of image through Kalman filtering to obtain the preprocessed image information.
3. The reinforcement learning-based high-altitude parabolic trajectory recognition method according to claim 1, wherein the S3 includes:
s3.1, acquiring pixel values and distribution characteristics in the preprocessed image information;
s3.2, judging whether the image sensor is shielded or not according to the size and the distribution characteristics of the pixel values in the preprocessed image information;
and S3.3, when the image sensor is judged to be shielded, storing the preprocessed image information into the cloud server and the storage.
4. The reinforcement learning-based high-altitude parabolic trajectory recognition method according to claim 1, wherein after the S2 and before the S3, the method further comprises: and storing the preprocessed image information into the cloud server and a storage.
5. The reinforcement learning-based high-altitude parabolic track recognition method according to claim 1, wherein the optimal action parameters in S4.2 include: the high-altitude parabolic track image, the high-altitude parabolic predicted track and the target model parameters.
6. The reinforcement learning-based high-altitude parabolic track recognition method according to claim 1, wherein the simulated action parameters in S4.3 include: the high altitude parabolic track image of the current state, the current high altitude parabolic predicted track, the current reward obtaining and the high altitude parabolic track image of the next state.
7. The reinforcement learning-based high-altitude parabolic track recognition method as claimed in claim 1, wherein the action model and the target model continuously obtain high-altitude parabolic track prediction error information in an updating process, so as to change a prediction strategy according to the error information and an error value of an adjacent frame high-altitude parabolic track image.
8. The reinforcement learning-based high-altitude parabolic trajectory recognition method according to claim 1, wherein the S5 further includes: and comparing the high-altitude parabolic recognition result information with an actual high-altitude parabolic track to obtain actual prediction error information, and feeding back the actual prediction error information to the data storage unit.
CN202110685692.8A 2021-06-21 2021-06-21 High-altitude parabolic track identification method based on reinforcement learning Active CN113393495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110685692.8A CN113393495B (en) 2021-06-21 2021-06-21 High-altitude parabolic track identification method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110685692.8A CN113393495B (en) 2021-06-21 2021-06-21 High-altitude parabolic track identification method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN113393495A CN113393495A (en) 2021-09-14
CN113393495B true CN113393495B (en) 2022-02-01

Family

ID=77623201

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110685692.8A Active CN113393495B (en) 2021-06-21 2021-06-21 High-altitude parabolic track identification method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN113393495B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116597340B (en) * 2023-04-12 2023-10-10 深圳市明源云科技有限公司 High altitude parabolic position prediction method, electronic device and readable storage medium
CN116977931A (en) * 2023-07-31 2023-10-31 深圳市星河智善科技有限公司 High-altitude parabolic identification method based on deep learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10175697B1 (en) * 2017-12-21 2019-01-08 Luminar Technologies, Inc. Object identification and labeling tool for training autonomous vehicle controllers
CN110084414A (en) * 2019-04-18 2019-08-02 成都蓉奥科技有限公司 A kind of blank pipe anti-collision method based on the study of K secondary control deeply
CN111415389A (en) * 2020-03-18 2020-07-14 清华大学 Label-free six-dimensional object posture prediction method and device based on reinforcement learning
CN111618847A (en) * 2020-04-22 2020-09-04 南通大学 Mechanical arm autonomous grabbing method based on deep reinforcement learning and dynamic motion elements
CN112257557A (en) * 2020-10-20 2021-01-22 中国电子科技集团公司第五十八研究所 High-altitude parabolic detection and identification method and system based on machine vision
CN112269390A (en) * 2020-10-15 2021-01-26 北京理工大学 Small celestial body surface fixed-point attachment trajectory planning method considering bounce
CN112818599A (en) * 2021-01-29 2021-05-18 四川大学 Air control method based on reinforcement learning and four-dimensional track

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9679258B2 (en) * 2013-10-08 2017-06-13 Google Inc. Methods and apparatus for reinforcement learning
US10204097B2 (en) * 2016-08-16 2019-02-12 Microsoft Technology Licensing, Llc Efficient dialogue policy learning
US11295174B2 (en) * 2018-11-05 2022-04-05 Royal Bank Of Canada Opponent modeling with asynchronous methods in deep RL
KR20200080396A (en) * 2018-12-18 2020-07-07 삼성전자주식회사 Autonomous driving method and apparatus thereof
CN109521774B (en) * 2018-12-27 2023-04-07 南京芊玥机器人科技有限公司 Spraying robot track optimization method based on reinforcement learning
CN110458281B (en) * 2019-08-02 2021-09-03 中科新松有限公司 Method and system for predicting deep reinforcement learning rotation speed of table tennis robot
CN111263332A (en) * 2020-03-02 2020-06-09 湖北工业大学 Unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10175697B1 (en) * 2017-12-21 2019-01-08 Luminar Technologies, Inc. Object identification and labeling tool for training autonomous vehicle controllers
CN110084414A (en) * 2019-04-18 2019-08-02 成都蓉奥科技有限公司 A kind of blank pipe anti-collision method based on the study of K secondary control deeply
CN111415389A (en) * 2020-03-18 2020-07-14 清华大学 Label-free six-dimensional object posture prediction method and device based on reinforcement learning
CN111618847A (en) * 2020-04-22 2020-09-04 南通大学 Mechanical arm autonomous grabbing method based on deep reinforcement learning and dynamic motion elements
CN112269390A (en) * 2020-10-15 2021-01-26 北京理工大学 Small celestial body surface fixed-point attachment trajectory planning method considering bounce
CN112257557A (en) * 2020-10-20 2021-01-22 中国电子科技集团公司第五十八研究所 High-altitude parabolic detection and identification method and system based on machine vision
CN112818599A (en) * 2021-01-29 2021-05-18 四川大学 Air control method based on reinforcement learning and four-dimensional track

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Playing Atari with Deep Reinforcement Learning";Volodymyr Mnih et al;《arXiv》;20131219;全文 *
"基于深度强化学习的机械臂抓捕控制研究";黄伟伟;《中国优秀硕士学位论文全文数据库 工程科技Ⅱ辑》;20210115;全文 *
"深度强化学习综述";刘全等;《计算机学报》;20180131;第41卷(第1期);全文 *

Also Published As

Publication number Publication date
CN113393495A (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN113392935B (en) Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
JP6877630B2 (en) How and system to detect actions
CN109344725B (en) Multi-pedestrian online tracking method based on space-time attention mechanism
CN108222749B (en) Intelligent automatic door control method based on image analysis
CN113393495B (en) High-altitude parabolic track identification method based on reinforcement learning
CN111178183B (en) Face detection method and related device
Leibfried et al. A deep learning approach for joint video frame and reward prediction in atari games
US20140143183A1 (en) Hierarchical model for human activity recognition
Gao et al. Object tracking using firefly algorithm
CN112037263B (en) Surgical tool tracking system based on convolutional neural network and long-term and short-term memory network
CN110413838A (en) A kind of unsupervised video frequency abstract model and its method for building up
CN114241511B (en) Weak supervision pedestrian detection method, system, medium, equipment and processing terminal
CN110009060A (en) A kind of robustness long-term follow method based on correlation filtering and target detection
CN110287829A (en) A kind of video face identification method of combination depth Q study and attention model
CN112184767A (en) Method, device, equipment and storage medium for tracking moving object track
CN111626198A (en) Pedestrian motion detection method based on Body Pix in automatic driving scene
CN109544584B (en) Method and system for realizing inspection image stabilization precision measurement
CN108898221B (en) Joint learning method of characteristics and strategies based on state characteristics and subsequent characteristics
CN111160170B (en) Self-learning human behavior recognition and anomaly detection method
CN111833375B (en) Method and system for tracking animal group track
CN112418149A (en) Abnormal behavior detection method based on deep convolutional neural network
CN113033582B (en) Model training method, feature extraction method and device
KR102563346B1 (en) System for monitoring of structural and method ithereof
CN115331162A (en) Cross-scale infrared pedestrian detection method, system, medium, equipment and terminal
CN114913098A (en) Image processing hyper-parameter optimization method, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant