CN116644666A - Virtual assembly path planning guiding method based on strategy gradient optimization algorithm - Google Patents

Virtual assembly path planning guiding method based on strategy gradient optimization algorithm Download PDF

Info

Publication number
CN116644666A
Authority
CN
China
Prior art keywords
action
decision model
part intelligent
function
agent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310650791.1A
Other languages
Chinese (zh)
Inventor
吴学毅
李吉浩
李景一
王璞漳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202310650791.1A priority Critical patent/CN116644666A/en
Publication of CN116644666A publication Critical patent/CN116644666A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q 10/047 Optimisation of routes or paths, e.g. travelling salesman problem
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Medical Informatics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Development Economics (AREA)
  • Geometry (AREA)
  • Game Theory and Decision Science (AREA)
  • Computer Hardware Design (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a virtual assembly path planning and guiding method based on a strategy gradient optimization algorithm, which comprises the following steps: step 1, building a 3D model of the part in 3ds MAX, converting it into an FBX file, and importing the FBX file into a Unity 3D reinforcement learning training scene; step 2, creating an experimental environment with the ML_agents module in the Unity 3D reinforcement learning training scene; step 3, establishing a part agent decision model; step 4, optimizing the part agent decision model; and step 5, planning the assembly path of the part with the optimized part agent decision model. The method applies deep reinforcement learning based on strategy gradient optimization to the virtual assembly path planning and guiding process, greatly improving assembly efficiency and assembly quality and providing a new approach to virtual assembly path planning.

Description

Virtual assembly path planning guiding method based on strategy gradient optimization algorithm
Technical Field
The invention belongs to the technical field of virtual assembly, and relates to a virtual assembly path planning and guiding method based on a strategy gradient optimization algorithm.
Background
Virtual assembly is an important tool for modern industrial design, product verification and user training. In immersive virtual assembly systems, realistic cues such as collision force feedback and assembly positioning constraints are lacking, and trainees easily lose their sense of movement direction in the three-dimensional virtual environment. As a result, completing virtual assembly of complex parts and varied assembly structures becomes a heavy burden: trainees readily develop cognitive fatigue during training, and assembly efficiency and assembly quality are greatly reduced.
Disclosure of Invention
The invention aims to provide a virtual assembly path planning and guiding method based on a strategy gradient optimization algorithm, which solves the problem that the prior art provides little effective guidance for part assembly, so that trainees readily develop cognitive fatigue during training and their cognitive burden is increased.
The technical scheme adopted by the invention is a virtual assembly path planning and guiding method based on the strategy gradient optimization algorithm, implemented according to the following steps:
step 1, building a 3D model of the part in 3ds MAX, converting it into an FBX file, and importing the FBX file into a Unity 3D reinforcement learning training scene;
step 2, creating an experimental environment with the ML_agents module in the Unity 3D reinforcement learning training scene;
step 3, establishing a part agent decision model;
step 4, optimizing the part agent decision model;
and step 5, planning the assembly path of the part with the optimized part agent decision model.
The method has the advantage that deep reinforcement learning based on strategy gradient optimization is applied to the virtual assembly path planning and guiding process, which effectively relieves the cognitive burden caused by the lack of movement direction cues and by the complexity of the assembly process in immersive virtual assembly systems. The strategy gradient optimization algorithm turns the part to be assembled into a part agent with autonomous decision-making capability; the part agent makes decisions according to the specific environment state and carries out path planning, and the resulting path guides the assembly. Assembly efficiency and assembly quality are thus greatly improved, the method is more efficient and convenient than traditional intelligent algorithms, and it provides a new approach to virtual assembly path planning.
Drawings
FIG. 1 is a flow chart of a part assembly path plan for the method of the present invention;
FIG. 2 is a workflow diagram of the ML_agents module employed in the method of the present invention;
FIG. 3 is a visual illustration of an assembly path planning training in the method of the present invention;
FIG. 4 is a schematic diagram of the path planning and guiding process obtained by generalizing the trained primary rotor part agent decision model to other parts in embodiment 1 of the method of the present invention;
FIG. 5 is a schematic diagram of the path planning and guiding process obtained by generalizing the trained three-stage blade part agent decision model to other parts in embodiment 2 of the method of the present invention;
FIG. 6 is a schematic diagram of the path planning and guiding process obtained by generalizing the trained transmission shaft part agent decision model to other parts in embodiment 3 of the method of the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and specific embodiments.
The invention provides a part assembly path guiding and prompting means based on gradient strategy optimization to solve the problem of part assembly path guidance in traditional assembly teaching tasks. The principle is to train parts with a strategy gradient optimization algorithm in a reinforcement learning environment so that they can make intelligent path planning decisions. The main idea is to regard the part to be assembled as a part agent and to train it with an action reward and punishment function, an action function and a scene reset function until the part agent path planning decision model converges. The converged decision model enables the part agent to approach the assembled part no matter where that part is located, an assembly guidance path is generated in the process, and the cognitive burden of compressor assembly training is reduced.
Referring to fig. 1, the virtual assembly path planning and guiding method of the present invention is implemented as follows:
Step 1, build a 3D model of the part in 3ds MAX, convert it into an FBX file, and import the FBX file into a Unity 3D reinforcement learning training scene.
Step 2, create an experimental environment with the ML_agents module in the Unity 3D reinforcement learning training scene. As shown in fig. 2, part agents (Agents) can be created, trained and optimized in the experimental environment, and the training scene is initialized.
Referring to fig. 3, the part to be assembled is randomly generated in the Unity 3D reinforcement learning training scene, and the part agent selects the action to execute according to the probability distribution of the output actions so as to approach the part to be assembled; no greedy strategy is adopted for action selection here.
The reinforcement learning training scene involves the interaction process between the part agent and the environment and mainly comprises four elements: the strategy, the reward, the value function and the environment model, which are described as follows:
2.1) The strategy is the behavior mode of the part agent at a specific moment, namely the mapping from the current environment state to the action the part agent takes. According to the action transfer function, the part agent takes different actions and receives the corresponding rewards, generating an action trajectory τ from the current state to the target state, expressed as:
\tau = \{ s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T \}    (1)
where T is the current time of the environment, s_1, s_2, ..., s_T is the environment state sequence, a_1, a_2, ..., a_T is the action sequence, and r_1, r_2, ..., r_T is the sequence of rewards obtained by each action; after an action a acts on the current state, the state transitions according to the state transition function;
2.2) The reward and punishment return is the value fed back to the part agent by the environment; it tells the part agent which actions should be taken and which should not. Each training round produces a reward and punishment return for the complete action trajectory, expressed as:
R(\tau) = \sum_{t=1}^{T} r_t    (2)
where R(τ) is the action reward sum of one training round, t is the current time of the reinforcement learning training scene, and r_t is the reward value obtained by the action at the current moment;
2.3) The value function is the sum of the expected returns obtained by the part agent running onward from a certain state and characterizes the long-term trend. Its expression is:
\bar{R}_{\theta} = \sum_{\tau} R(\tau) P(\tau \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n})    (3)
where θ is a neural network parameter and R̄_θ is the cumulative sum of the probability of each trajectory occurring, P(τ|θ), multiplied by the action reward sum R(τ) of that trajectory, which is approximately equal to the average of the reward sums of N sampled trajectories; τ^n denotes the n-th trajectory;
2.4) The environment model is the external world that interacts with the part agent. The environment is the scene in which the part agent is located and includes the observable states, the executable actions and the dynamic interaction between the part agent and the environment; given a state and an action, it predicts the next state and the return caused by that action.
Step 3, establish the part agent decision model. The decision model takes environment observation information as input and makes action decisions according to the CollectObservation() function, executes the decided action with the AgentAction() function, rewards or punishes each action with the AddReward() function, and calls the AgentReset() function to initialize the scene when a training round ends. The specific process is as follows:
3.1) The part agent decision model adopts a stochastic strategy, so the conditional probability of each action in a given state must be computed. Observation data, such as the position and velocity of the part agent in its local coordinate system, are input through the CollectObservation() function in the training environment, the predicted return values of the candidate actions are computed by the part agent decision model, and a suitable action is then selected according to the probability distribution, expressed as:
a \sim \pi(s, a, \theta) = p[a \mid s, \theta]    (4)
The strategy π determines the selection of the action a, and the selected action a determines the transition probability of the state s. The decision method is a mapping from states to actions: the current action of the part agent is determined through the AgentAction() function, and the strategy gradient optimization algorithm is used to adjust the parameters θ and find the optimal strategy;
3.2) Since the reward brought by an action is unknown at this stage, the model is still exploring and cannot yet judge how good an action is; the judgment has to be made by the reward and punishment function AddReward() (an illustrative sketch is given after item b) below):
a) If the reward value of the action is positive, or the subsequent state it causes is of high value, the action is regarded as a good action and the part agent has moved closer to the assembled part;
b) If the reward value of the action is negative, or the subsequent state it causes is of low value, the part agent has moved farther away from the assembled part or the action has taken it out of the scene range; the action is regarded as a poor action, and the training scene is initialized so that path planning training starts again.
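As an illustration of the judgment in a) and b) (the concrete reward values actually used by the invention are those listed in Examples 1 to 3), a minimal C# sketch of such a reward and punishment function is given below. It assumes the legacy Unity ML-Agents 0.x Agent API, and the field names assembledPart and sceneRange as well as the reward magnitudes are hypothetical placeholders:

```csharp
using UnityEngine;
using MLAgents;   // legacy ML-Agents namespace (assumption; later releases use Unity.MLAgents)

// Illustrative sketch of the good/poor action judgment in 3.2 a) and b).
public class RewardJudgmentSketch : Agent
{
    public Transform assembledPart;   // target part (hypothetical field)
    public float sceneRange = 4f;     // placeholder scene boundary
    private float previousDistance = float.MaxValue;

    // Called once per action to judge whether it moved the agent closer.
    protected void JudgeAction()
    {
        float distance = Vector3.Distance(transform.localPosition,
                                          assembledPart.localPosition);

        if (distance < previousDistance)
            AddReward(0.01f);          // a) good action: the agent moved closer
        else
            AddReward(-0.01f);         // b) poor action: the agent moved away

        bool outOfRange = Mathf.Abs(transform.localPosition.x) > sceneRange ||
                          Mathf.Abs(transform.localPosition.z) > sceneRange;
        if (outOfRange)
        {
            AddReward(-0.01f);         // b) poor action: the agent left the scene
            Done();                    // triggers AgentReset() to re-initialize the scene
        }

        previousDistance = distance;
    }
}
```

Judging the change in distance rather than the absolute position keeps the feedback dense during the exploration stage described above.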
3.3) The scene is initialized and the subsequent action selection is made using the scene initialization function AgentReset().
The part agent decision model avoids, as far as possible, the sparse rewards caused by poor actions and preferentially selects actions with high rewards. When the part agent finally reaches the assembled part through a sequence of better actions, path planning has succeeded and the scene is initialized; the part agent decision model then updates its parameters by gradient ascent, and the action sequence is stored in the experience pool;
The part agent decision model lowers the expected probability of the actions taken on the previous failed assembly path, thereby reducing the probability of negatively rewarded action trajectories; if the assembly succeeds, the probability of the positively rewarded action trajectory is increased. In essence, a series of actions is sampled, the state value of the sequence at time t is computed, and this value is used to update the strategy. The update expression is:
\theta \leftarrow \theta + \alpha \nabla \bar{R}_{\theta}    (5)
where α is the learning rate and ∇R̄_θ is the gradient variation;
Finally, the gradient of the parameter function of the part agent decision model is computed as:
\nabla \bar{R}_{\theta} = E_{\pi}\left[ \nabla \log \pi(a \mid s, \theta) \, R(\tau) \right]    (6)
where E_π denotes the expectation of the gradient change of the objective function and ∇ log π(a|s, θ) R(τ) is the gradient variation of the objective function; the optimal θ is obtained by gradient ascent, yielding the part agent decision model.
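For readability, the standard log-derivative step that links the value function in (3) to the gradient expressions in (6) and (7) can be written out as follows; this is textbook policy-gradient material supplied as background rather than text quoted from the original description:

```latex
\nabla \bar{R}_{\theta}
  = \sum_{\tau} R(\tau)\, \nabla P(\tau \mid \theta)
  = \sum_{\tau} R(\tau)\, P(\tau \mid \theta)\, \nabla \log P(\tau \mid \theta)
  \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n}) \sum_{t=1}^{T_{n}} \nabla \log p\!\left(a_{t}^{n} \mid s_{t}^{n}, \theta\right)
```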
Step 4, optimize the part agent decision model.
After training, the system generates a neural network model file, namely the decision model. When the part agent's path planning succeeds, the system triggers scene initialization; this does not yet mean that the part agent decision model is optimal, and training must continue in the scene to search for better actions and obtain higher rewards. When the total reward obtained in training no longer increases over a number of experimental rounds, the part agent decision model has converged and reached the optimized state.
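In practice the ML-Agents trainer already reports the cumulative reward of each training round, but purely as an illustration of the convergence criterion in step 4 (the total reward no longer increasing over a number of rounds), a hypothetical Unity-side monitor could look like the sketch below; the window size and improvement threshold are arbitrary placeholders, and this component is not part of ML-Agents or of the patented method:

```csharp
using System.Collections.Generic;
using System.Linq;
using UnityEngine;

// Illustrative convergence monitor: averages the episode reward over a
// sliding window and flags a plateau when the average stops improving.
public class RewardPlateauMonitor : MonoBehaviour
{
    public int windowSize = 50;          // episodes per averaging window (assumption)
    public float minImprovement = 0.01f; // required gain between windows (assumption)

    private readonly Queue<float> recent = new Queue<float>();
    private float bestAverage = float.NegativeInfinity;

    // Call once per finished episode with that episode's total reward.
    public bool RecordEpisode(float episodeReward)
    {
        recent.Enqueue(episodeReward);
        if (recent.Count < windowSize) return false;

        float average = recent.Average();
        recent.Clear();

        bool plateaued = average - bestAverage < minImprovement;
        bestAverage = Mathf.Max(bestAverage, average);
        Debug.Log($"Mean reward over the last {windowSize} episodes: {average:F3}");
        return plateaued;
    }
}
```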
Step 5, use the optimized part agent decision model to plan the assembly path of the part.
The optimized part agent decision model is generalized to other parts: the current state of another part agent is input into the part agent decision model, which outputs the probability of each action and guides the part agent to act according to those probabilities. The continuous actions generate a path, and this path is the optimal assembly path. At this point, the line segment generated by the Unity 3D built-in Trail Renderer component is used as the visual prompt for assembly path guidance (a minimal code sketch of this visual cue is given at the end of this step).
The part agent decision model obtained after multiple rounds of training and optimization covers a continuous state space; the direction of change of the objective function is then controlled by adjusting the parameters of the part agent decision model, so that the optimal strategy is realized. The final gradient is:
\nabla \bar{R}_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \nabla \log p\left( a_t^{n} \mid s_t^{n}, \theta \right) R\left( \tau^{n} \right)    (7)
Equation (7) selects actions directly from the observation information and backpropagates, directly strengthening or weakening the likelihood of selecting an action according to the feedback reward: a good action increases its probability of being selected next time, and a poor action decreases it. Through continuous actions, the optimal assembly path is finally planned.
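As noted in step 5, the visual guidance cue is the line segment produced by Unity 3D's built-in Trail Renderer component. A minimal sketch of how such a trail might be configured on the part agent and cleared between rounds is shown below; the component and its properties are Unity's, while the pathWidth value and the ResetPath() helper are assumptions:

```csharp
using UnityEngine;

// Minimal sketch: the trail left behind the moving part agent serves as the
// visual assembly-path guidance cue and is cleared when the scene is reset.
[RequireComponent(typeof(TrailRenderer))]
public class AssemblyPathTrail : MonoBehaviour
{
    public float pathWidth = 0.05f;   // hypothetical line width

    private TrailRenderer trail;

    void Awake()
    {
        trail = GetComponent<TrailRenderer>();
        trail.time = Mathf.Infinity;   // keep the whole guidance path visible
        trail.startWidth = pathWidth;
        trail.endWidth = pathWidth;
    }

    // Call when a new path-planning round starts.
    public void ResetPath()
    {
        trail.Clear();
    }
}
```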
Example 1
The implementation process of the method of the invention is described taking the assembly path planning of the primary rotor and secondary rotor of the compressor in an aero-engine as the object. Referring to fig. 1, an environment for training the primary rotor part agent is built, the primary rotor part agent is trained in it to obtain a primary rotor part agent action decision model, and this decision model is applied to the path planning process of primary rotor part assembly in the environment. The specific steps are as follows:
Step 1, build 3D models of the primary rotor and secondary rotor parts in 3ds MAX, convert them into FBX files, and import the FBX files into a Unity 3D reinforcement learning training scene.
Step 2, referring to fig. 2, use the ML_agents module in the Unity 3D reinforcement learning training scene; referring to fig. 4, part agents (Agents) can be created, trained and optimized in the experimental environment. The ML_agents module is used as follows:
a) Create a virtual environment and define in it the interaction rules between the primary rotor part agent and the environment.
b) Define the neural network structure, strategy function, value function and so on of the primary rotor part agent.
c) During the interaction between the primary rotor part agent and the environment, the ML-agents module automatically collects data such as the states, actions and rewards of the primary rotor part agent and stores them in the experience replay pool.
d) Start the training process with Unity 3D; the ML-agents module automatically reads the data in the experience replay pool and trains the strategy and value functions of the primary rotor part agent by strategy gradient ascent.
e) After training, the ML-agents module automatically stores the trained primary rotor part agent decision model, which can be generalized to other primary rotor parts so that those parts also have the capability of autonomous path planning.
Step 3, set the environment observation function CollectObservation(). The secondary rotor is set as the assembled part in the environment, and the 9 observation inputs collected by CollectObservation() are the local coordinates of the secondary rotor part, the local coordinates of the primary rotor part agent, and the X-axis, Y-axis and Z-axis velocity components of the primary rotor part agent; the observed information is input into the primary rotor part agent decision model for action decision.
3.1) Set the action function AgentAction(), which is called continuously by the training algorithm during training to update the action of the primary rotor part agent. The action space trained in this step consists of the velocity components of the primary rotor part agent along the X, Y and Z axes, i.e. three continuous values; through continuous interaction, the action function AgentAction() gradually drives the primary rotor part agent towards the secondary rotor, realizing the autonomous learning and strategy optimization of the primary rotor part agent.
3.2) Set the reward and punishment function AddReward(), which evaluates the quality of each action during the movement of the part agent: if an action is good, the likelihood of selecting it next time is strengthened; otherwise it is weakened. The reward and punishment function AddReward() is designed as follows:
when the distance between the primary rotor part agent and the secondary rotor exceeds 4, AddReward(-0.01f) is executed, giving a negative reward of -0.01;
when the distance between the primary rotor part agent and the secondary rotor is smaller than 3, AddReward(0.01f) is executed, giving a positive reward of 0.01;
when the distance between the primary rotor part agent and the secondary rotor is smaller than 2, AddReward(0.07f) is executed, giving a positive reward of 0.07;
when the distance between the primary rotor part agent and the secondary rotor is smaller than or equal to 1.42, AddReward(1f) is executed, giving a positive reward of 1; the assembly path planning guidance is regarded as successful, the current training round is ended with the Done() method, and the scene is initialized with the function AgentReset().
3.3) The primary rotor part agent is moved back to a new position by the scene reset function AgentReset().
In addition, if the primary rotor part agent goes out of range, the scene reset function AgentReset() is triggered and the primary rotor part agent is initialized: its initial velocity is reset to zero, its initial angular velocity is reset to zero, its initial local coordinates are set to (0, 0, 0), and the position of the secondary rotor is refreshed randomly within a 4 x 4 range. (A consolidated code sketch of the observation, action, reward and reset callbacks of this example is given below.)
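The callbacks described in step 3 of this example can be pictured with the consolidated C# sketch below. It is a minimal illustration under stated assumptions rather than the patented implementation: it assumes the legacy Unity ML-Agents 0.x Agent API, in which the CollectObservation() function referred to in the description corresponds to the CollectObservations() override (names and signatures differ in later ML-Agents releases), and the field names secondaryRotor, moveSpeed and spawnRange are hypothetical:

```csharp
using UnityEngine;
using MLAgents;   // legacy ML-Agents namespace (assumption; later releases use Unity.MLAgents)

// Minimal sketch of the primary rotor part agent described in Example 1.
public class PrimaryRotorAgentSketch : Agent
{
    public Transform secondaryRotor;   // assembled (target) part
    public float moveSpeed = 1f;       // hypothetical velocity scale
    public float spawnRange = 4f;      // secondary rotor is refreshed within a 4 x 4 area

    private Rigidbody rb;

    void Start()
    {
        rb = GetComponent<Rigidbody>();
    }

    // 9 observations: target local coordinates, agent local coordinates, agent velocity.
    public override void CollectObservations()
    {
        AddVectorObs(secondaryRotor.localPosition);   // 3 values
        AddVectorObs(transform.localPosition);        // 3 values
        AddVectorObs(rb.velocity);                    // 3 values
    }

    // Three continuous actions: velocity components along the X, Y and Z axes.
    public override void AgentAction(float[] vectorAction, string textAction)
    {
        rb.velocity = new Vector3(
            Mathf.Clamp(vectorAction[0], -1f, 1f),
            Mathf.Clamp(vectorAction[1], -1f, 1f),
            Mathf.Clamp(vectorAction[2], -1f, 1f)) * moveSpeed;

        float distance = Vector3.Distance(transform.localPosition,
                                          secondaryRotor.localPosition);

        // Distance-based reward design of 3.2).
        if (distance <= 1.42f)
        {
            AddReward(1f);    // assembly position reached
            Done();           // end the round; AgentReset() re-initializes the scene
        }
        else if (distance < 2f) AddReward(0.07f);
        else if (distance < 3f) AddReward(0.01f);
        else if (distance > 4f) AddReward(-0.01f);
    }

    // Scene reset of 3.3): zero the agent's motion and refresh the target position.
    public override void AgentReset()
    {
        rb.velocity = Vector3.zero;
        rb.angularVelocity = Vector3.zero;
        transform.localPosition = Vector3.zero;   // (0, 0, 0)

        float half = spawnRange * 0.5f;           // 4 x 4 area centred on the origin (assumption)
        secondaryRotor.localPosition = new Vector3(
            Random.Range(-half, half), 0f, Random.Range(-half, half));
    }
}
```

Mounting a script of this kind on the primary rotor, with the secondary rotor assigned as the target, reproduces the observation, action, reward and reset behaviour listed above.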
Step 4, after training, the system generates the primary rotor part agent decision model, which contains the neural network model and parameter information. It can process the information perceived by the part agent and generate the corresponding actions. The primary rotor part agent decision model can be mounted on other parts, and those parts then make action decisions through the neural network model inside the primary rotor part agent decision model.
Step 5, after steps 1 to 4 are completed, the trained decision model is applied to other parts for path planning in order to verify its generalization. Other primary rotor and secondary rotor parts are introduced for assembly path planning, the trained decision model is added to the other primary rotors so that they become part agents with decision capability, and the end-point target of each primary rotor part's path planning is set to the corresponding secondary rotor. Counting the total time and number of successes of the primary rotor part agents over 100 path plannings gives an average time of 1.21 s per path planning and a success rate of 89%, which proves that the generalization capability of the decision model is effective.
Example 2
The implementation process of the invention is illustrated with the assembly of the three-stage blade and the three-stage rotor. The three-stage blade is taken as the part agent, a training environment is built and the agent is trained in it to obtain a three-stage blade part agent action decision model, and this decision model is applied to the path planning process of part assembly in the environment. The specific steps are as follows:
Step 1, build 3D models of the three-stage blade and three-stage rotor parts in 3ds MAX, convert them into FBX files, and import the FBX files into a Unity 3D reinforcement learning training scene.
Step 2, referring to fig. 2, use the ML_agents module in the Unity 3D reinforcement learning training scene. Referring to fig. 5, the three-stage blade part agent can be created, trained and optimized in the experimental environment. The ML_agents module is used as follows:
a) Create a virtual environment and define in it the interaction rules between the three-stage blade part agent and the environment.
b) Define the neural network structure, strategy function, value function and so on of the three-stage blade part agent.
c) During the interaction between the three-stage blade part agent and the environment, the ML-agents module automatically collects data such as the states, actions and rewards of the three-stage blade part agent and stores them in the experience replay pool.
d) Start the training process with Unity 3D; the ML-agents module automatically reads the data in the experience replay pool and trains the strategy and value functions of the part agent by strategy gradient ascent.
e) After training is completed, the ML-agents module automatically stores the trained three-stage blade part agent decision model, which can be generalized to other parts so that those parts also have the capability of autonomous path planning.
Step 3, set the environment observation function CollectObservation(). The assembled part and the part agent are set in the environment, and the nine observation inputs collected by CollectObservation() are the coordinates of the three-stage rotor part, the local coordinates of the three-stage blade part agent, and the X-axis, Y-axis and Z-axis velocity components of the three-stage blade part agent; the observed information is input into the three-stage blade part agent decision model for action decision.
3.1) Set the action function AgentAction(), which is called continuously by the training algorithm during training to update the action of the three-stage blade part agent. The action space trained in this step consists of the velocity components of the three-stage blade part agent along the X, Y and Z axes, i.e. three continuous values; through continuous interaction, the action function AgentAction() gradually drives the three-stage blade part agent towards the three-stage rotor, realizing the autonomous learning and strategy optimization of the three-stage blade part agent.
3.2) Set the reward and punishment function AddReward(), which evaluates the quality of each action during the movement of the three-stage blade part agent: if an action is good, the likelihood of selecting it next time is strengthened; otherwise it is weakened. The reward and punishment function AddReward() is designed as follows:
when the distance between the three-stage blade part agent and the three-stage rotor exceeds 4, AddReward(-0.01f) is executed, giving a negative reward of -0.01;
when the distance between the three-stage blade part agent and the three-stage rotor is smaller than 3, AddReward(0.01f) is executed, giving a positive reward of 0.01;
when the distance between the three-stage blade part agent and the three-stage rotor is smaller than 2, AddReward(0.07f) is executed, giving a positive reward of 0.07;
when the distance between the three-stage blade part agent and the three-stage rotor is smaller than or equal to 1.42, AddReward(1f) is executed, giving a positive reward of 1; the assembly path planning guidance is regarded as successful, the current training round is ended with the Done() method, and the scene is initialized with the function AgentReset().
3.3) The three-stage blade part agent is moved back to a new position by the scene reset function AgentReset().
In addition, if the three-stage blade part agent goes out of range, the scene reset function AgentReset() is triggered and the three-stage blade part agent is initialized: its initial velocity is reset to zero, its initial angular velocity is reset to zero, its initial local coordinates are set to (0, 0, 0), and the position of the three-stage rotor is refreshed randomly within a 4 x 4 range.
Step 4, after training, the system generates the three-stage blade part agent decision model, which contains the neural network model and parameter information. The three-stage blade part agent decision model can process the information perceived by the three-stage blade part agent and generate the corresponding actions; it can be mounted on other three-stage blade parts, and those blades then make action decisions through the neural network model inside the part agent decision model.
Step 5, after steps 1 to 4 are completed, the trained decision model is applied to the remaining three-stage blade parts for path planning in order to verify its generalization. Assembly path planning of the remaining three-stage blades and the three-stage rotor is introduced, the trained decision model is added to the other three-stage blades so that they become part agents with decision capability, and the end-point target of the three-stage blade path planning is set to the three-stage rotor. Counting the total time and number of successes of the three-stage blade part agents over 100 path plannings gives an average time of 1.35 s per path planning and a success rate of 85%, which proves that the generalization capability of the decision model is effective.
Example 3
The implementation process of the invention is illustrated with the assembly of the transmission shaft and the front shaft. The transmission shaft is taken as the part agent, a training environment is built and the agent is trained in it to obtain a transmission shaft part agent action decision model, and this decision model is applied to the path planning process of transmission shaft assembly. The specific steps are as follows:
Step 1, build 3D models of the transmission shaft and the front shaft in 3ds MAX, convert them into FBX files, and import the FBX files into a Unity 3D reinforcement learning training scene.
Step 2, referring to fig. 2, use the ML_agents module in the Unity 3D reinforcement learning training scene. Referring to fig. 6, the transmission shaft part agent can be created, trained and optimized in the experimental environment. The ML_agents module is used as follows:
a) Create a virtual environment and define in it the interaction rules between the transmission shaft part agent and the environment.
b) Define the neural network structure, strategy function, value function and so on of the transmission shaft part agent.
c) During the interaction between the transmission shaft part agent and the environment, the ML-agents module automatically collects data such as the states, actions and rewards of the transmission shaft part agent and stores them in the experience replay pool.
d) Start the training process with Unity 3D; the ML-agents module automatically reads the data in the experience replay pool and trains the strategy and value functions of the part agent by strategy gradient ascent.
e) After training is completed, the ML-agents module automatically stores the trained transmission shaft part agent decision model, which can be generalized to other parts so that those parts also have the capability of autonomous path planning.
Step 3, set the environment observation function CollectObservation(). The front shaft is set as the assembled part in the environment together with the transmission shaft part agent, and the 9 observation inputs collected by CollectObservation() are the local coordinates of the front shaft part, the local coordinates of the transmission shaft part agent, and the X-axis, Y-axis and Z-axis velocity components of the transmission shaft part agent; the observed information is input into the transmission shaft part agent decision model for action decision.
3.1) Set the action function AgentAction(), which is called continuously by the training algorithm during training to update the action of the transmission shaft part agent. The action space trained in this step consists of the velocity components of the transmission shaft part agent along the X, Y and Z axes, i.e. three continuous values; through continuous interaction, the action function AgentAction() gradually drives the transmission shaft part agent towards the coordinate position of the front shaft part, realizing the autonomous learning and strategy optimization of the part agent.
3.2) Set the reward and punishment function AddReward(), which evaluates the quality of each action during the movement of the transmission shaft part agent: if an action is good, the likelihood of selecting it next time is strengthened; otherwise it is weakened. The reward and punishment function AddReward() is designed as follows:
when the distance between the transmission shaft part agent and the front shaft exceeds 4, AddReward(-0.01f) is executed, giving a negative reward of -0.01;
when the distance between the transmission shaft part agent and the front shaft is smaller than 3, AddReward(0.01f) is executed, giving a positive reward of 0.01;
when the distance between the transmission shaft part agent and the front shaft is smaller than 2, AddReward(0.07f) is executed, giving a positive reward of 0.07;
when the distance between the transmission shaft part agent and the front shaft is smaller than or equal to 1.42, AddReward(1f) is executed, giving a positive reward of 1; the assembly path planning guidance is regarded as successful, the current training round is ended with the Done() method, and the scene is initialized with the function AgentReset().
3.3) The transmission shaft part agent is moved back to a new position by the scene reset function AgentReset().
In addition, if the transmission shaft part agent goes out of range, the scene reset function AgentReset() is triggered and the transmission shaft part agent is initialized: its initial velocity is reset to zero, its initial angular velocity is reset to zero, its initial local coordinates are set to (0, 0, 0), and the position of the front shaft part is refreshed randomly within a 4 x 4 range.
Step 4, after training, the system generates the transmission shaft part agent decision model, which contains the neural network model and parameter information. It can process the information perceived by the transmission shaft part agent and generate the corresponding actions; the transmission shaft part agent decision model can be mounted on a part agent, which then makes action decisions through the neural network model inside the transmission shaft part agent decision model.
Step 5, after steps 1 to 4 are completed, the trained decision model is applied to other transmission shaft parts for path planning in order to verify the generalization of the transmission shaft part agent decision model. Assembly path planning of the remaining transmission shafts and the front shaft is introduced, the trained transmission shaft part agent decision model is added to the remaining transmission shaft parts so that they become part agents with decision capability, and the end-point target of the transmission shaft part path planning is set to the front shaft. Counting the total time and number of successes of the remaining transmission shaft part agents over 100 path plannings gives an average time of 2.01 s per path planning and a success rate of 76%, which proves that the generalization capability of the decision model is effective.
To illustrate the practical effect of the method of the invention, an assembly comparison experiment was set up to verify the path guidance capability of the trained decision model. An assembly environment was built in the scene: a 6 x 6 square table carrying three groups of compressor parts. The purpose of the experiment is to have students who are encountering virtual assembly for the first time perform part assembly, and to verify the effective guidance capability of the decision model trained by the invention by comparing two guidance modes. Two comparison groups were set up, each with two participants: one group was guided through the assembly process by text prompts, and the other by the intelligent part path planning guidance. The number of completed assemblies was used as the evaluation index, and the comparison results are shown in Table 1 below.
Table 1. Comparison of guidance efficiency between text prompts and path planning guidance

Guidance mode               Time used/s   Parts completed   Average time per part/s
Text prompt guidance 1      201           6                 33.5
Text prompt guidance 2      142           3                 47.3
Path planning guidance 1    96            7                 13.7
Path planning guidance 2    94            5                 18.8
Analysis of the results in Table 1 shows that the first user guided by text prompts assembled 6 parts in 201 s, averaging 33.5 s per part, and the second user guided by text prompts assembled 3 parts in 142 s, averaging 47.3 s per part; the two users guided by the intelligent path completed their assemblies in 96 s and 94 s, averaging 13.7 s and 18.8 s per part respectively. The comparison shows that the assembly path planning guidance provided by the strategy gradient optimization algorithm is far better than the text guidance of traditional virtual assembly, demonstrating the effectiveness of the method of the invention.
In summary, the invention belongs to the field of virtual assembly and relates to a process of planning assembly paths for compressor part assembly using the strategy gradient optimization algorithm of deep reinforcement learning. The part to be assembled is regarded as a part agent, and a neural network is used to fit the action decision function: the environment supplies the current state as the input of the neural network, the network outputs the probability of each candidate action, and the part agent selects its next action according to these probabilities. When a neural network is used to approximate the value function, the current state is likewise taken as input and the value of that state is output, so that the part agent can evaluate the current state and make better decisions. The network parameters are updated through the backpropagation algorithm, and the strategy function and value function of the part agent are thus continuously optimized, guiding the part agent to execute the correct actions and plan the assembly path. In this process the part agent is trained with the strategy gradient optimization algorithm so that it can choose to perform the correct actions and generate an assembly guidance path.

Claims (5)

1. The virtual assembly path planning and guiding method based on the strategy gradient optimization algorithm is characterized by comprising the following steps of:
step 1, building a 3D model of the part in 3ds MAX, converting it into an FBX file, and importing the FBX file into a Unity 3D reinforcement learning training scene;
step 2, creating an experimental environment with the ML_agents module in the Unity 3D reinforcement learning training scene;
step 3, establishing a part agent decision model;
step 4, optimizing the part agent decision model;
and step 5, planning the assembly path of the part with the optimized part agent decision model.
2. The virtual assembly path planning guiding method based on the strategy gradient optimization algorithm according to claim 1, wherein in step 2, the specific process is as follows:
randomly generating the part to be assembled in the Unity 3D reinforcement learning training scene, wherein the part agent selects the action to execute according to the probability distribution of the output actions so as to approach the part to be assembled, and no greedy strategy is adopted for action selection;
the reinforcement learning training scene involves the interaction process between the part agent and the environment and mainly comprises four elements: the strategy, the reward, the value function and the environment model, which are described as follows:
2.1) the strategy is the behavior mode of the part agent at a specific moment, namely the mapping from the current environment state to the action the part agent takes; according to the action transfer function, the part agent takes different actions and receives the corresponding rewards, generating an action trajectory τ from the current state to the target state, expressed as:
\tau = \{ s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T \}    (1)
where T is the current time of the environment, s_1, s_2, ..., s_T is the environment state sequence, a_1, a_2, ..., a_T is the action sequence, and r_1, r_2, ..., r_T is the sequence of rewards obtained by each action; after an action a acts on the current state, the state transitions according to the state transition function;
2.2) the reward and punishment return is the value fed back to the part agent by the environment; it tells the part agent which actions should be taken and which should not; each training round produces a reward and punishment return for the complete action trajectory, expressed as:
R(\tau) = \sum_{t=1}^{T} r_t    (2)
where R(τ) is the action reward sum of one training round, t is the current time of the reinforcement learning training scene, and r_t is the reward value obtained by the action at the current moment;
2.3) the value function is the sum of the expected returns obtained by the part agent running onward from a certain state and characterizes the long-term trend; its expression is:
\bar{R}_{\theta} = \sum_{\tau} R(\tau) P(\tau \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n})    (3)
where θ is a neural network parameter and R̄_θ is the cumulative sum of the probability of each trajectory occurring, P(τ|θ), multiplied by the action reward sum R(τ) of that trajectory, which is approximately equal to the average of the reward sums of N sampled trajectories; τ^n denotes the n-th trajectory;
2.4) the environment model is the external world that interacts with the part agent; the environment is the scene in which the part agent is located and includes the observable states, the executable actions and the dynamic interaction between the part agent and the environment; given a state and an action, it predicts the next state and the return caused by that action.
3. The virtual assembly path planning guiding method based on the strategy gradient optimization algorithm according to claim 1, wherein in step 3, the specific process is as follows:
the decision model takes environment observation information as input and makes action decisions according to the CollectObservation() function, executes the decided action with the AgentAction() function, rewards or punishes each action with the AddReward() function, and calls the AgentReset() function to initialize the scene when a training round ends; the specific process is as follows:
3.1) the part agent decision model adopts a stochastic strategy, so the conditional probability of each action in a given state must be computed; observation data, such as the position and velocity of the part agent in its local coordinate system, are input through the CollectObservation() function in the training environment, the predicted return values of the candidate actions are computed by the part agent decision model, and a suitable action is then selected according to the probability distribution, expressed as:
a \sim \pi(s, a, \theta) = p[a \mid s, \theta]    (4)
the strategy π determines the selection of the action a, and the selected action a determines the transition probability of the state s; the decision method is a mapping from states to actions: the current action of the part agent is determined through the AgentAction() function, and the strategy gradient optimization algorithm is used to adjust the parameters θ and find the optimal strategy;
3.2) since the reward brought by an action is unknown at this stage, the model is still exploring and cannot yet judge how good an action is; the judgment has to be made by the reward and punishment function AddReward():
a) if the reward value of the action is positive, or the subsequent state it causes is of high value, the action is regarded as a good action and the part agent has moved closer to the assembled part;
b) if the reward value of the action is negative, or the subsequent state it causes is of low value, the part agent has moved farther away from the assembled part or has gone out of the scene range; the action is regarded as a poor action, and the training scene is initialized so that path planning training starts again;
3.3) the scene is initialized and the subsequent action selection is made using the scene initialization function AgentReset();
the part agent decision model avoids, as far as possible, the sparse rewards caused by poor actions and preferentially selects actions with high rewards; when the part agent finally reaches the assembled part through a sequence of better actions, path planning has succeeded and the scene is initialized; the part agent decision model then updates its parameters by gradient ascent, and the action sequence is stored in the experience pool;
the part agent decision model lowers the expected probability of the actions taken on the previous failed assembly path, thereby reducing the probability of negatively rewarded action trajectories; if the assembly succeeds, the probability of the positively rewarded action trajectory is increased; in essence, a series of actions is sampled, the state value of the sequence at time t is computed, and this value is used to update the strategy, the update expression being:
\theta \leftarrow \theta + \alpha \nabla \bar{R}_{\theta}    (5)
where α is the learning rate and ∇R̄_θ is the gradient variation;
finally, the gradient of the parameter function of the part agent decision model is computed as:
\nabla \bar{R}_{\theta} = E_{\pi}\left[ \nabla \log \pi(a \mid s, \theta) \, R(\tau) \right]    (6)
where E_π denotes the expectation of the gradient change of the objective function and ∇ log π(a|s, θ) R(τ) is the gradient variation of the objective function; the optimal θ is obtained by gradient ascent, yielding the part agent decision model.
4. The virtual assembly path planning guiding method based on the strategy gradient optimization algorithm according to claim 1, wherein in step 4, the specific process is as follows:
after training, the system generates a neural network model file, namely the decision model; when the part agent's path planning succeeds, the system triggers scene initialization; this does not yet mean that the part agent decision model is optimal, and training must continue in the scene to search for better actions and obtain higher rewards; when the total reward obtained in training no longer increases over a number of experimental rounds, the part agent decision model has converged and reached the optimized state.
5. The virtual assembly path planning guiding method based on the strategy gradient optimization algorithm according to claim 1, wherein in step 5, the specific process is as follows:
the optimized part agent decision model is generalized to other parts: the current state of another part agent is input into the part agent decision model, which outputs the probability of each action and guides the part agent to act according to those probabilities; the continuous actions generate a path, and this path is the optimal assembly path; at this point, the line segment generated by the Unity 3D built-in Trail Renderer component is used as the visual prompt for assembly path guidance;
the part agent decision model obtained after multiple rounds of training and optimization covers a continuous state space; the direction of change of the objective function is then controlled by adjusting the parameters of the part agent decision model, so that the optimal strategy is realized; the final gradient is:
\nabla \bar{R}_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \nabla \log p\left( a_t^{n} \mid s_t^{n}, \theta \right) R\left( \tau^{n} \right)    (7)
equation (7) selects actions directly from the observation information and backpropagates, directly strengthening or weakening the likelihood of selecting an action according to the feedback reward: a good action increases its probability of being selected next time, and a poor action decreases it; through continuous actions, the optimal assembly path is finally planned.
CN202310650791.1A 2023-06-02 2023-06-02 Virtual assembly path planning guiding method based on strategy gradient optimization algorithm Pending CN116644666A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310650791.1A CN116644666A (en) 2023-06-02 2023-06-02 Virtual assembly path planning guiding method based on strategy gradient optimization algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310650791.1A CN116644666A (en) 2023-06-02 2023-06-02 Virtual assembly path planning guiding method based on strategy gradient optimization algorithm

Publications (1)

Publication Number Publication Date
CN116644666A true CN116644666A (en) 2023-08-25

Family

ID=87622739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310650791.1A Pending CN116644666A (en) 2023-06-02 2023-06-02 Virtual assembly path planning guiding method based on strategy gradient optimization algorithm

Country Status (1)

Country Link
CN (1) CN116644666A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117475115A (en) * 2023-11-11 2024-01-30 华中师范大学 Path guiding system in virtual-real fusion environment and working method thereof
CN117808180A (en) * 2023-12-27 2024-04-02 北京科技大学 Path planning method, application and device based on knowledge and data combination

Similar Documents

Publication Publication Date Title
CN116644666A (en) Virtual assembly path planning guiding method based on strategy gradient optimization algorithm
CN113110592B (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN112325897B (en) Path planning method based on heuristic deep reinforcement learning
CN111580544B (en) Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
CN112433525A (en) Mobile robot navigation method based on simulation learning and deep reinforcement learning
CN109978012A (en) It is a kind of based on combine the improvement Bayes of feedback against intensified learning method
CN113377099A (en) Robot pursuit game method based on deep reinforcement learning
CN113657433A (en) Multi-mode prediction method for vehicle track
CN114489144A (en) Unmanned aerial vehicle autonomous maneuver decision method and device and unmanned aerial vehicle
CN114861368B (en) Construction method of railway longitudinal section design learning model based on near-end strategy
Liessner et al. Explainable Reinforcement Learning for Longitudinal Control.
CN116776929A (en) Multi-agent task decision method based on PF-MADDPG
CN115933717A (en) Unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning
Ustun et al. Adaptive synthetic characters for military training
Berthling-Hansen et al. Automating behaviour tree generation for simulating troop movements (poster)
Pinto et al. Learning partial policies to speedup MDP tree search via reduction to IID learning
CN116227622A (en) Multi-agent landmark coverage method and system based on deep reinforcement learning
CN116430891A (en) Deep reinforcement learning method oriented to multi-agent path planning environment
Lei et al. Kb-tree: Learnable and continuous monte-carlo tree search for autonomous driving planning
Gorman et al. Towards integrated imitation of strategic planning and motion modeling in interactive computer games
CN111723941B (en) Rule generation method and device, electronic equipment and storage medium
CN114757092A (en) System and method for training multi-agent cooperative communication strategy based on teammate perception
Chavali et al. Modelling a Reinforcement Learning Agent For Mountain Car Problem Using Q–Learning With Tabular Discretization
Yu Deep Q-learning on lunar lander game
Sharma et al. Car Racing Game AI Using Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination