WO2023206771A1 - Environment modeling method, device and electronic device based on a decision flow diagram - Google Patents

Environment modeling method, device and electronic device based on a decision flow diagram

Info

Publication number
WO2023206771A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
node
business
decision
environment
Prior art date
Application number
PCT/CN2022/101444
Other languages
English (en)
French (fr)
Inventor
秦熔均
朱焕焕
高耸屹
Original Assignee
南栖仙策(南京)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南栖仙策(南京)科技有限公司
Priority to EP22891179.8A (patent EP4290351A1, en)
Publication of WO2023206771A1 (zh)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/0486 Drag-and-drop
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/451 Execution arrangements for user interfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/092 Reinforcement learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/067 Enterprise or organisation modelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00 Details relating to CAD techniques
    • G06F2111/10 Numerical modelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Definitions

  • The embodiments of the present application relate to computer technology, for example, to an environment modeling method, device and electronic device based on a decision flow diagram.
  • In reinforcement learning, an agent learns by trial and error: the rewards it obtains by interacting with the environment guide its behavior, and the goal is for the agent to obtain the maximum reward.
  • Embodiments of the present application provide an environment modeling method, device and electronic device based on a decision flow diagram, so that virtual environment models in different business scenarios can be constructed more conveniently based on the decision flow diagram, thereby meeting the personalized needs of users.
  • An environment modeling method based on a decision flow diagram, including:
  • a target decision flow graph corresponding to the target business scenario is constructed, wherein the business nodes in the target decision flow graph include: at least one environment state node and at least one decision agent node, and the at least one environment state node includes a current environment state sub-node, an environment state transition sub-node and a next environment state sub-node;
  • An environment modeling device based on a decision flow diagram, including:
  • a target business feature acquisition module, configured to acquire the target business features in the target business scenario to be modeled and the characteristic information of the target business features;
  • a target decision flow graph construction module, configured to construct a target decision flow graph corresponding to the target business scenario based on the target business characteristics, wherein the business nodes in the target decision flow graph include: at least one environment state node and at least one decision-making agent node, and the at least one environment state node includes a current environment state sub-node, an environment state transition sub-node and a next environment state sub-node;
  • a target calculation graph construction module, configured to build the target calculation graph based on the business characteristics bound to each business node in the target decision flow graph and the data flow information between multiple business nodes;
  • a target virtual environment model determination module, configured to perform environment modeling based on the target calculation graph and the characteristic information of the target business characteristics, and determine the target virtual environment model corresponding to the target business scenario.
  • An electronic device, including:
  • the memory stores a computer program that can be executed by the at least one processor, and the computer program is executed by the at least one processor, so that the at least one processor can execute the method described in any embodiment of the present application.
  • A computer-readable storage medium, including: a computer program.
  • When the computer program is executed by a processor, it can implement the method described in any embodiment of the present application.
  • Figure 1 is a flow chart of an environment modeling method based on a decision flow diagram provided in Embodiment 1 of the present application;
  • Figure 2 is an example of a decision flow diagram involved in Embodiment 1 of the present application;
  • Figure 3 is a flow chart of an environment modeling method based on a decision flow diagram provided in Embodiment 2 of the present application;
  • Figure 4 is a schematic structural diagram of an environment modeling device based on a decision flow diagram provided in Embodiment 3 of the present application;
  • Figure 5 is a schematic structural diagram of an electronic device that implements an environment modeling method based on a decision flow diagram according to an embodiment of the present application.
  • Figure 1 is a flow chart of an environment modeling method based on a decision flow diagram provided in Embodiment 1 of the present application. This embodiment can be applied to the situation of environment modeling for any business scenario.
  • The method may be executed by an environment modeling device based on a decision flow graph; the device may be implemented in hardware and/or software and may be configured in an electronic device. As shown in Figure 1, the method includes the following steps:
  • The target business scenario can be any business scenario with decision-making requirements.
  • The target business scenario in this embodiment may be an open, uncertain business scenario with blurred boundaries.
  • For example, the target business scenario may be an item search scenario: after the user enters search content and issues a search request, the recommended item information and the display order of the recommended items are determined based on the search request. The order in which recommended items are displayed is very important and directly affects the user's purchasing behavior. Therefore, to apply reinforcement learning to the recommendation order without interfering with users' normal use, it is necessary to build an item search environment that is close to the real one.
  • As another example, the target business scenario can be a picking and dispatching scenario: orders are assigned to pickers so as to determine the order allocation method with the shortest picking time.
  • The target business characteristics can be all business characteristics collected in the target business scenario, and can be represented by business parameter identifiers.
  • The characteristic information of a target business feature may refer to the specific data of that feature, that is, the specific business parameter value.
  • The target business characteristics may include business environment characteristics and business decision characteristics, where the business environment characteristics may include environmental parameter information before decision-making and environmental parameter information after decision-making.
  • Business decision characteristics can be decision parameter information obtained by interacting with the environment according to a preset decision-making method, that is, the multiple actions executed by the agent.
  • The preset decision-making method may be a decision-making method in related technologies. For example, in an item search scenario, the preset decision-making method may be to sort according to item sales and/or item evaluation scores. In the picking and dispatching scenario, the preset decision-making method can be to allocate orders according to the shortest path.
  • This embodiment can perform feature processing on the target business characteristics and their characteristic information in the target business scenario to obtain characteristic information in the form of time series.
  • The characteristic information in the form of time series is: {trajectory 1: state at time 1, decision action 1, decision result 1, state at time 2, decision action 2, ..., state at termination time N}, {trajectory 2: state at time 1, decision action 1, decision result 1, state at time 2, decision action 2, ..., state at termination time N}, etc.
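  • As an illustration only, the time-series form described above could be held in memory as follows. This is a minimal sketch in Python; the `Step` and `Trajectory` names and the item-search field values are hypothetical and do not come from the application.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Step:
    """One time point: the observed state, the decision action, and its result."""
    state: Dict[str, Any]
    action: Dict[str, Any]
    result: Dict[str, Any]


@dataclass
class Trajectory:
    """Steps at times 1..N-1 followed by the state at termination time N."""
    steps: List[Step] = field(default_factory=list)
    final_state: Dict[str, Any] = field(default_factory=dict)

    def flatten(self) -> List[Dict[str, Any]]:
        """Return state 1, action 1, result 1, state 2, ..., state N in order."""
        sequence: List[Dict[str, Any]] = []
        for step in self.steps:
            sequence.extend([step.state, step.action, step.result])
        sequence.append(self.final_state)
        return sequence


# Hypothetical item-search data: two decision steps, then a terminal state.
trajectory_1 = Trajectory(
    steps=[
        Step({"query": "shoes"}, {"ranking": [3, 1, 2]}, {"clicked": 3}),
        Step({"query": "boots"}, {"ranking": [2, 3, 1]}, {"clicked": 2}),
    ],
    final_state={"query": None},
)
```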
  • The business nodes in the target decision flow diagram include at least one environment state node and at least one decision agent node. The at least one environment state node includes a current environment state sub-node, an environment state transition sub-node and a next environment state sub-node.
  • The target decision flow graph can be a directed acyclic graph, which can be used to represent the decision-making relationship between different business characteristics at each time point.
  • The input and output of the data flow in the target decision flow graph cannot form a cycle; that is, the structure of the target decision flow graph conforms to the structure of a directed acyclic graph.
  • Each business node in the target decision flow graph represents the decision-making process used to calculate the parameters of the node, and the connections between business nodes represent the data flow direction.
  • The environment state node is a combination node.
  • The environment state node includes the current environment state sub-node, the environment state transition sub-node and the next environment state sub-node.
  • The environment state node includes the current environment state sub-node and the next environment state sub-node.
  • The current environment state sub-node is the environmental observation value at the starting point in the complete business interaction environment.
  • The environment state transition sub-node is the process of calculating the environment state at the next moment based on the current environment state and the agent's actions.
  • The next environment state sub-node is the environment state observation value that can be used as the starting point of the next round of business interaction after a complete business interaction is completed.
  • The decision-making agent node can be the key subject node for decision-making in the target business scenario, and is used to decide on the actions to be performed in different environmental states.
  • For example, in a racing game, the game scene is the environment, the racing car is the decision-making agent, the position of the racing car is the state, the operation of the racing car is the action, how to operate the racing car is the decision, and the game score is the reward.
  • The target decision flow graph constructed in this embodiment may include at least one environment state node and at least one decision agent node, and their numbers may be determined based on the actual situation of the target business scenario.
  • The current environment state sub-node supports the output of data flow.
  • The environment state transition sub-node supports the input of data flow and outputs it to the next environment state sub-node.
  • The decision-making agent node supports both the input and the output of data flow, so that the environment state node and the decision-making agent node can be used to describe the data flow and decision-making process in the target business scenario more accurately.
  • The business nodes in the target decision flow graph may also include: at least one environmental agent node and/or at least one static variable node.
  • The environmental agent node may refer to other subject nodes with decision-making capabilities in the target business scenario, and is used to assist in deciding on the actions to be performed in different environmental states.
  • Static variable nodes can refer to fixed business characteristics in the target business scenario that participate in and affect the business environment and decision-making, so that the decision-making process can be characterized more accurately.
  • The environmental agent node in this embodiment supports both the input and the output of data flow; the static variable node supports only the output of data flow, not the input.
  • The target decision flow graph constructed in this embodiment may also include at least one environmental agent node and at least one static variable node, and their numbers may be determined based on the actual situation of the target business scenario.
  • Figure 2 shows an example of a decision flow graph.
  • The decision flow graph may include one environment state node (for example, including the current environment state sub-node, the environment state transition sub-node and the next environment state sub-node), two decision-making agent nodes, one environment agent node and one static variable node; the data flow between the nodes is shown in Figure 2.
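  • The node types and data flow rules described above can be sketched as a small validation routine. This is a minimal sketch, assuming Python 3.9+ and the standard-library `graphlib` module; the node names and edges are hypothetical (the actual data flow of Figure 2 is not reproduced here), and only the rules stated in the text are enforced.

```python
from enum import Enum
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set, Tuple


class NodeKind(Enum):
    CURRENT_STATE = "current_state"        # output only
    STATE_TRANSITION = "state_transition"  # takes input, feeds the next state
    NEXT_STATE = "next_state"
    DECISION_AGENT = "decision_agent"      # input and output
    ENV_AGENT = "environment_agent"        # input and output
    STATIC_VARIABLE = "static_variable"    # output only


# Node kinds that only emit data flow and never receive it.
OUTPUT_ONLY = {NodeKind.CURRENT_STATE, NodeKind.STATIC_VARIABLE}


def validate(nodes: Dict[str, NodeKind], edges: List[Tuple[str, str]]) -> List[str]:
    """Check the input/output rules and acyclicity; return a topological order."""
    predecessors: Dict[str, Set[str]] = {name: set() for name in nodes}
    for src, dst in edges:
        if nodes[dst] in OUTPUT_ONLY:
            raise ValueError(f"{dst} does not accept incoming data flow")
        if nodes[src] is NodeKind.STATE_TRANSITION and nodes[dst] is not NodeKind.NEXT_STATE:
            raise ValueError(f"{src} may only feed the next environment state")
        predecessors[dst].add(src)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        raise ValueError("data flow forms a cycle") from exc


# Hypothetical graph with the same node counts as the Figure 2 example.
NODES = {
    "state": NodeKind.CURRENT_STATE,
    "static": NodeKind.STATIC_VARIABLE,
    "agent_1": NodeKind.DECISION_AGENT,
    "agent_2": NodeKind.DECISION_AGENT,
    "env_agent": NodeKind.ENV_AGENT,
    "transition": NodeKind.STATE_TRANSITION,
    "next_state": NodeKind.NEXT_STATE,
}
EDGES = [
    ("state", "agent_1"), ("static", "agent_1"),
    ("state", "agent_2"), ("agent_1", "agent_2"),
    ("agent_2", "env_agent"),
    ("state", "transition"), ("agent_1", "transition"),
    ("agent_2", "transition"), ("env_agent", "transition"),
    ("transition", "next_state"),
]
ORDER = validate(NODES, EDGES)
```

  • Adding an edge such as `("agent_2", "agent_1")` to this sketch would create a cycle and make `validate` raise `ValueError`, which mirrors the requirement that the data flow must not form a loop.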
  • This embodiment can automatically construct a target decision flow diagram based on target business characteristics to improve construction efficiency.
  • This embodiment can also manually construct a target decision flow diagram based on the configuration operation triggered by the user on the visual interface, so as to meet the user's personalized needs and achieve dynamic configuration.
  • This embodiment describes the decision-making process of multiple business parameters in a unified format by constructing a decision flow diagram in a more standardized manner, so that subsequent environment modeling can be performed more conveniently and accurately based on the decision flow diagram.
  • S120 may include: performing feature analysis on the target business features to determine dependencies between multiple target business features; creating multiple business nodes based on the dependencies and determining data flow information between the multiple business nodes; and constructing a target decision flow diagram corresponding to the target business scenario.
  • Feature analysis can be performed on the target business features at each moment in the time series to determine the feature type of each target business feature (such as environmental state feature, decision-making agent, environmental agent or static variable) and the dependencies between multiple target business features. For example, business feature A needs to be determined based on business feature B and business feature C. The corresponding business node is then created based on the feature type of each target business feature.
  • If the target business feature is an environmental state feature, an environment state node corresponding to the target business feature is created; if the target business feature is a decision-making agent, a decision-making agent node corresponding to the target business feature is created; if the target business feature is an environmental agent, an environmental agent node corresponding to the target business feature is created; if the target business feature is a static variable, a static variable node corresponding to the target business feature is created.
  • The target decision flow graph can represent the data flow relationship from time T to time T+1. As long as time T is not the termination time, each trajectory that meets the requirements satisfies the data flow relationship in the target decision flow graph at each time T.
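  • The automatic construction described above (one node per feature, one data flow edge per dependency) can be sketched as follows. This is a minimal sketch; the feature types assigned to A, B and C and the `build_decision_flow` helper are hypothetical, and only the dependency "A is determined from B and C" comes from the text.

```python
from typing import Dict, List, Tuple

# Hypothetical output of feature analysis: a type per feature, plus dependencies.
FEATURE_TYPES: Dict[str, str] = {
    "A": "decision_agent",
    "B": "environment_state",
    "C": "static_variable",
}
DEPENDS_ON: Dict[str, List[str]] = {"A": ["B", "C"]}  # A is determined from B and C


def build_decision_flow(
    feature_types: Dict[str, str], depends_on: Dict[str, List[str]]
) -> Tuple[List[Dict[str, str]], List[Tuple[str, str]]]:
    """Create one node per feature and one edge (source, target) per dependency."""
    nodes = [{"name": name, "kind": kind} for name, kind in feature_types.items()]
    edges: List[Tuple[str, str]] = []
    for target, sources in depends_on.items():
        for source in sources:
            if source not in feature_types or target not in feature_types:
                raise KeyError(f"unknown feature in dependency {source} -> {target}")
            edges.append((source, target))
    return nodes, edges


nodes, edges = build_decision_flow(FEATURE_TYPES, DEPENDS_ON)
```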
  • S120 may also include: obtaining multiple empty nodes added by the user based on a node adding operation triggered by the user on the visual interface; determining the business configuration information corresponding to each empty node based on a node information configuration operation triggered by the user on the visual interface, where the business configuration information includes node name information and the business characteristics bound to the node; configuring the corresponding empty node based on the business configuration information to obtain the corresponding business node; and obtaining the data flow direction information between multiple business nodes based on a connection operation triggered by the user on the multiple business nodes, and constructing a target decision flow diagram corresponding to the target business scenario.
  • The user can sort out multiple target business characteristics in the time series, determine each node involved in the target business scenario, and add the corresponding empty nodes (such as environment state nodes, decision agent nodes, environment agent nodes or static variable nodes) through node adding operations on the visual interface, such as node dragging. The user then configures node information for each added empty node, such as the node name, and binds the node to the corresponding business characteristics through a node binding operation, so that each configured business node is obtained. Based on the behavioral influence relationships between multiple business parameters, the user connects the business nodes; from these connection operations the data flow information between the business nodes is obtained, so that users can manually build a target decision flow diagram based on business needs to meet their personalized needs.
  • The node configuration information may also include: node data type, data value range and inserted function information.
  • Node data types include continuous, discrete and default types.
  • Discrete types include discrete ordered and discrete unordered types.
  • The inserted function information can be a function built from expert experience, so that expert experience can be incorporated into the decision flow graph by inserting the function, improving the flexibility and accuracy of construction.
  • Users can also dynamically configure the node data type, data value range and inserted function information of each node, so that a more realistic and accurate target decision flow graph can be constructed.
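  • One possible shape for such a node configuration is sketched below. This is an illustration under stated assumptions: the `NodeConfig` and `DataType` names, the field names and the `picking_time` expert rule are hypothetical and are not defined by the application.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Tuple


class DataType(Enum):
    CONTINUOUS = "continuous"
    DISCRETE_ORDERED = "discrete_ordered"
    DISCRETE_UNORDERED = "discrete_unordered"
    DEFAULT = "default"


@dataclass
class NodeConfig:
    name: str
    bound_features: Tuple[str, ...]
    data_type: DataType = DataType.DEFAULT
    value_range: Optional[Tuple[float, float]] = None       # (low, high), inclusive
    inserted_function: Optional[Callable[..., float]] = None  # expert-built function

    def in_range(self, value: float) -> bool:
        """Return True when no range is configured or the value lies inside it."""
        if self.value_range is None:
            return True
        low, high = self.value_range
        return low <= value <= high


def picking_time(distance_m: float, speed_m_per_s: float = 1.0) -> float:
    """Hypothetical expert rule: picking time is walking distance over speed."""
    return distance_m / speed_m_per_s


picking_node = NodeConfig(
    name="picking_time",
    bound_features=("distance_m",),
    data_type=DataType.CONTINUOUS,
    value_range=(0.0, 3600.0),
    inserted_function=picking_time,
)
```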
  • The target calculation graph may refer to a computable decision flow graph.
  • The target decision flow graph corresponds to a target calculation graph.
  • The target calculation graph can be directly used in the construction of the virtual environment model corresponding to the target business scenario.
  • For example, based on the business characteristics bound to each business node in the target decision flow graph and the data flow information between multiple business nodes, the target decision flow graph can be converted into a target calculation graph that can be directly used in environment modeling.
  • S130 may include: performing format conversion on the target decision flow graph to determine target decision data in a structured data format; and, based on the business characteristics bound to each business node in the target decision data and the data flow direction information between multiple business nodes, determining multiple computing nodes and the computing relationships between them, and building the target calculation graph.
  • The structured data format can be, but is not limited to, YAML (YAML Ain't Markup Language) format or JSON (JavaScript Object Notation) format.
  • The target decision flow graph can be converted into target decision data in a structured data format; for example, a target decision file in YAML format is obtained and stored.
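  • A conversion to a structured data format can be sketched with the Python standard library. JSON is used here only because it needs no third-party package; a YAML file would carry the same structure. The schema (the keys `nodes`, `edges`, `name`, `kind` and `features`) and the example values are hypothetical and are not taken from the application.

```python
import json

# Hypothetical schema for a small decision flow graph.
decision_flow = {
    "nodes": [
        {"name": "state", "kind": "current_state", "features": ["stock", "queue"]},
        {"name": "agent", "kind": "decision_agent", "features": ["assignment"]},
        {"name": "transition", "kind": "state_transition", "features": []},
        {"name": "next_state", "kind": "next_state", "features": ["stock", "queue"]},
    ],
    "edges": [
        ["state", "agent"],
        ["state", "transition"],
        ["agent", "transition"],
        ["transition", "next_state"],
    ],
}

serialized = json.dumps(decision_flow, indent=2)  # structured text form
restored = json.loads(serialized)                  # parsed back for later steps
```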
  • All computing nodes in the deep learning network framework, such as TensorFlow or PyTorch, and the calculation relationships between them are used to construct the target calculation graph.
  • Each computing node is a computable function with parameters, such as a deep neural network or other parameterized function.
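  • The idea of a calculation graph whose nodes are parameterized functions can be sketched without any deep learning framework. In this minimal sketch, `LinearNode` (a weighted sum plus bias) stands in for a neural network, and `run_graph` evaluates the nodes in data flow order; both names and all numeric values are hypothetical. It assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter
from typing import Dict, List


class LinearNode:
    """A computing node: a parameterized function of its named inputs."""

    def __init__(self, inputs: List[str], weights: List[float], bias: float = 0.0):
        self.inputs = inputs
        self.weights = weights
        self.bias = bias

    def __call__(self, values: Dict[str, float]) -> float:
        weighted = sum(w * values[name] for w, name in zip(self.weights, self.inputs))
        return weighted + self.bias


def run_graph(compute_nodes: Dict[str, LinearNode], sources: Dict[str, float]) -> Dict[str, float]:
    """Evaluate every computing node after all of its inputs are available."""
    values = dict(sources)
    predecessors = {name: set(node.inputs) for name, node in compute_nodes.items()}
    for name in TopologicalSorter(predecessors).static_order():
        if name in compute_nodes:  # source values are given, not computed
            values[name] = compute_nodes[name](values)
    return values


# Hypothetical graph: the action depends on the state; the next state on both.
graph = {
    "action": LinearNode(["state"], [2.0]),
    "next_state": LinearNode(["state", "action"], [1.0, 0.5], bias=1.0),
}
outputs = run_graph(graph, {"state": 3.0})
```

  • In a framework such as TensorFlow or PyTorch, each node would instead be a trainable module, but the evaluation order would still follow the data flow of the decision flow graph.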
  • This embodiment can use the characteristic information of the target business characteristics to verify the correctness of the node decision-making logic in the target calculation graph.
  • Characteristic information in the form of time series can be used to verify the accuracy of the data format of each computing node in the target calculation graph and the accuracy of the data flow between computing nodes.
  • Constructing a data flow graph requires writing code to configure the data flow direction and define function nodes. A data flow diagram usually only helps R&D staff check which nodes are involved in the business scenario and how the nodes are related, after which they write code based on their own understanding. The hand-written implementation may therefore deviate from the actual data flow diagram.
  • S140 Perform environment modeling based on the target calculation graph and the characteristic information of the target business characteristics, and determine the target virtual environment model corresponding to the target business scenario.
  • The target virtual environment model can be a deep learning network model that imitates the operation of the real environment in the target business scenario.
  • An initial virtual environment model can be constructed based on the target calculation graph and trained on the characteristic information of the target business characteristics to obtain the trained target virtual environment model. The target virtual environment model can then replace the actual target business environment for reinforcement learning, which improves the effectiveness of reinforcement learning, meets the personalized needs of users, and in turn enables reinforcement learning to be applied in real business scenarios.
  • After S140, the method also includes: based on the target virtual environment model, performing reinforcement learning on the preset decision-making model in the target business scenario to obtain the target decision-making model after reinforcement learning.
  • The preset decision-making model may refer to a decision-making agent node in the target decision flow graph.
  • The preset decision-making model is set to determine behavioral action information in different environmental states so as to maximize the cumulative reward on the trajectory.
  • The preset decision-making model continuously interacts with the virtual environment over a continuous period of time to generate an interaction trajectory. By maximizing the cumulative reward on the interaction trajectory, reinforcement learning is performed on the preset decision-making model to train the optimal decision-making method and obtain the final target decision-making model. In this way, reinforcement learning of the preset decision-making model can be performed more conveniently in the target virtual environment model without disturbing real users, while ensuring the learning effect of the target decision-making model.
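  • The objective of maximizing cumulative reward inside a virtual environment can be illustrated with a toy example. This sketch deliberately swaps a real reinforcement learning algorithm for an exhaustive search over a single policy parameter; the one-dimensional environment, the reward (staying close to zero) and the candidate gains are all hypothetical.

```python
def cumulative_reward(gain: float, start: float = 4.0, n_steps: int = 5) -> float:
    """Roll out the policy action = -gain * state and sum the rewards."""
    state, total = start, 0.0
    for _ in range(n_steps):
        action = -gain * state   # decision-making model (one parameter)
        state = state + action   # toy virtual environment transition
        total += -abs(state)     # reward: stay close to zero
    return total


# Try each candidate policy in the virtual environment and keep the best one.
candidates = [0.0, 0.25, 0.5, 0.75, 1.0]
best_gain = max(candidates, key=cumulative_reward)
```

  • Because every rollout runs against the virtual environment, poor candidates such as `gain = 0.0` are tried at no cost to real users, which is the motivation stated above.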
  • The business nodes in the target decision flow diagram may include at least one environment state node and at least one decision-making agent node, wherein the at least one environment state node may include a current environment state sub-node, an environment state transition sub-node and a next environment state sub-node.
  • A target calculation graph that can directly participate in environment modeling is constructed.
  • Environment modeling based on the target calculation graph and the characteristic information of the target business characteristics makes it easier to determine the target virtual environment model corresponding to the target business scenario, so that the target virtual environment model can replace the actual target business environment for reinforcement learning. This greatly improves the efficiency of environment modeling and reduces the cost of trial and error in the actual target business environment, thereby improving the reinforcement learning effect and meeting the personalized needs of users.
  • FIG. 3 is a flow chart of an environment modeling method based on a decision flow diagram provided in Embodiment 2 of the present application. Based on the above embodiment, this embodiment describes in detail the construction process of the target virtual environment model. The explanation of terms that are the same as or corresponding to the above embodiments will not be repeated here. Referring to Figure 3, this embodiment provides an environment modeling method based on a decision flow diagram including:
  • An initial virtual environment model corresponding to a preset deep learning network framework can be created based on the target calculation graph, or a corresponding initial virtual environment model can be created based on the machine learning framework currently configured by the user.
  • Hyperparameters can be configured based on a preset hyperparameter space, and different hyperparameters can be configured for different business scenarios to build the best initial virtual environment model. For example, if automatic parameter tuning is configured, optimal parameters can be searched for automatically during environment model training.
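  • Automatic parameter search over a preset hyperparameter space can be sketched as a plain grid search. The hyperparameter names, the candidate values and the `validation_error` stand-in (which would normally train and score an environment model) are hypothetical; the application does not specify a search strategy.

```python
import itertools
from typing import Dict, Iterator, List

# Hypothetical preset hyperparameter space.
SPACE: Dict[str, List[float]] = {
    "learning_rate": [0.01, 0.1],
    "hidden_units": [16, 32],
}


def grid(space: Dict[str, List[float]]) -> Iterator[Dict[str, float]]:
    """Yield every combination of hyperparameter values in the space."""
    keys = sorted(space)
    for combo in itertools.product(*(space[key] for key in keys)):
        yield dict(zip(keys, combo))


def validation_error(config: Dict[str, float]) -> float:
    """Stand-in score; a real run would train the model and measure its error."""
    return abs(config["learning_rate"] - 0.1) + abs(config["hidden_units"] - 16) / 100


best_config = min(grid(SPACE), key=validation_error)
```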
  • The model structure in the deep learning network framework may include, but is not limited to, at least one of: a convolutional neural network (CNN), a long short-term memory network (LSTM) and a residual network (ResNet).
  • One round in which the agent acts and the virtual environment responds is called an interaction or a step, and the series of data generated by continuous interaction between the decision-making agent and the virtual environment over a continuous period of time is called a trajectory.
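  • The relationship between a step and a trajectory can be sketched as a rollout loop. The `rollout` helper, the toy environment and the toy policy are hypothetical and only illustrate the definitions above.

```python
from typing import Callable, List, Tuple

Transition = Tuple[float, float, float]  # (state, action, reward)


def rollout(
    env_step: Callable[[float, float], Tuple[float, float]],
    policy: Callable[[float], float],
    state: float,
    n_steps: int,
) -> Tuple[List[Transition], float]:
    """Run n_steps interactions and return the trajectory and the final state."""
    trajectory: List[Transition] = []
    for _ in range(n_steps):
        action = policy(state)                        # the agent acts
        next_state, reward = env_step(state, action)  # the environment responds
        trajectory.append((state, action, reward))    # one step of the trajectory
        state = next_state
    return trajectory, state


def toy_env_step(state: float, action: float) -> Tuple[float, float]:
    """Hypothetical virtual environment: move by the action, reward closeness to 0."""
    next_state = state + action
    return next_state, -abs(next_state)


def toy_policy(state: float) -> float:
    """Hypothetical decision-making agent: move halfway towards zero."""
    return -state / 2


steps, final_state = rollout(toy_env_step, toy_policy, state=4.0, n_steps=3)
```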
  • The interaction sample data corresponding to the optimization target and the interaction sample data corresponding to the target business can be extracted from the target business characteristic information.
  • S360 Input the interactive sample data into the initial virtual environment model, and obtain the simulation trajectory based on the output of the initial virtual environment model.
  • After the interaction sample data is input into the initial virtual environment model to be trained, the environment state data obtained after each interaction between the decision-making agent and the virtual environment is determined, and the simulation trajectory in the initial virtual environment model is obtained based on the environment state data from the multiple interactions generated over a continuous period of time.
  • S370 Determine the trajectory similarity based on the simulation trajectory and the actual trajectory, and adjust the parameter weights in the initial virtual environment model based on the trajectory similarity until the training ends when the preset convergence condition is reached, and the target virtual environment model corresponding to the target business scenario is obtained.
  • Trajectory similarity can be used to characterize the difference between the virtual environment and the real environment. The higher the trajectory similarity, the closer the virtual environment is to the real environment.
  • this embodiment can determine the trajectory similarity between the simulated trajectory and the actual trajectory, that is, the environment score, based on a preset error function such as the average absolute error function or the average squared error function, and When the trajectory similarity is greater than the preset threshold, the parameter weights in the initial virtual environment model can be adjusted, and the adjusted initial virtual environment model can continue to be trained. When the trajectory similarity is less than the preset threshold or the change tends to be stable, it can be determined that the preset convergence condition is reached, the initial virtual environment model training is completed, and the target virtual environment model is obtained.
  • In the technical solution of this embodiment, the trajectory similarity is determined from the simulated trajectory and the actual trajectory, and the parameter weights of the initial virtual environment model are adjusted according to the trajectory similarity until training ends when the preset convergence condition is reached, yielding the target virtual environment model corresponding to the target business scenario. Environment modeling can thus be carried out in a supervised-learning manner, and the target virtual environment model can be trained more accurately and conveniently.
  • The following is an embodiment of the environment modeling apparatus based on a decision flow graph provided by the embodiments of the present application. The apparatus belongs to the same inventive concept as the environment modeling method based on a decision flow graph in the above embodiments. For details not described in the apparatus embodiment, refer to the above embodiments of the method.
  • Figure 4 is a schematic structural diagram of an environment modeling apparatus based on a decision flow graph provided in Embodiment 3 of the present application.
  • As shown in Figure 4, the apparatus includes: a target business feature acquisition module 410, a target decision flow graph construction module 420, a target computation graph construction module 430, and a target virtual environment model determination module 440.
  • The target business feature acquisition module 410 is configured to acquire the target business features in the target business scenario to be modeled and the feature information of the target business features.
  • The target decision flow graph construction module 420 is configured to construct, based on the target business features, the target decision flow graph corresponding to the target business scenario, where the business nodes in the target decision flow graph include at least one environment state node and at least one decision-making agent node, and the at least one environment state node includes a current environment state sub-node, an environment state transition sub-node, and a next environment state sub-node.
  • The target computation graph construction module 430 is configured to construct a target computation graph based on the business feature bound to each business node in the target decision flow graph and the data flow information among the multiple business nodes.
  • The target virtual environment model determination module 440 is configured to perform environment modeling based on the target computation graph and the feature information of the target business features, and to determine the target virtual environment model corresponding to the target business scenario.
  • In the technical solution of this embodiment, the target decision flow graph corresponding to the target business scenario is constructed from the target business features in the target business scenario to be modeled. The business nodes in the target decision flow graph may include at least one environment state node and at least one decision-making agent node, where the at least one environment state node may include a current environment state sub-node, an environment state transition sub-node, and a next environment state sub-node. Based on the business feature bound to each business node in the target decision flow graph and the data flow information among the multiple business nodes, a target computation graph that can take part directly in environment modeling is constructed. Environment modeling based on the target computation graph and the feature information of the target business features makes it easier to determine the target virtual environment model corresponding to the target business scenario, so that the target virtual environment model can replace the actual target business environment for reinforcement learning, which improves the reinforcement learning effect and meets users' individual needs.
  • Optionally, the current environment state sub-node supports data flow output; the environment state transition sub-node supports data flow input and outputs to the next environment state sub-node; the decision-making agent node supports both data flow input and output.
  • Optionally, the business nodes in the target decision flow graph also include at least one environment agent node and/or at least one static variable node, where the environment agent node supports both data flow input and output, and the static variable node supports only data flow output, not data flow input.
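  • Optionally, there are multiple target business features, and the target decision flow graph construction module 420 is configured to: perform feature analysis on the multiple target business features to determine the dependencies among them; and, based on the dependencies, create multiple business nodes and determine the data flow information among the multiple business nodes, thereby constructing the target decision flow graph corresponding to the target business scenario.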
  • Optionally, the target decision flow graph construction module 420 is also configured to: obtain multiple empty nodes added by the user, based on node-adding operations triggered by the user on a visual interface; determine the business configuration information of each empty node based on the node information configuration operation triggered by the user for that node, where the business configuration information includes node name information and the business feature bound to the node; configure each empty node according to its business configuration information to obtain the corresponding business node; and obtain the data flow information among the multiple business nodes from the connection operations triggered by the user on those nodes, thereby constructing the target decision flow graph corresponding to the target business scenario.
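  • The manual construction steps (add empty nodes, configure them, connect them) can be mirrored by a small in-memory builder. The sketch below only illustrates that sequence; the class name `DecisionFlowGraphBuilder` and its methods are hypothetical and stand in for whatever the visual interface does internally.

```python
class DecisionFlowGraphBuilder:
    """Hypothetical stand-in for the state kept behind a visual graph editor."""

    def __init__(self):
        self.nodes = {}      # node id -> {"kind", "name", "feature"}
        self.edges = []      # (source id, target id) data-flow pairs
        self._next_id = 0

    def add_empty_node(self, kind):
        """Node-adding operation: create an unconfigured node of the given kind."""
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = {"kind": kind, "name": None, "feature": None}
        return node_id

    def configure(self, node_id, name, feature):
        """Node information configuration: set the name and bind a business feature."""
        self.nodes[node_id].update(name=name, feature=feature)

    def connect(self, source_id, target_id):
        """Connection operation: record data flowing from source to target."""
        for node_id in (source_id, target_id):
            if self.nodes[node_id]["name"] is None:
                raise ValueError(f"node {node_id} must be configured before connecting")
        self.edges.append((source_id, target_id))
```

  • Requiring configuration before connection is a design choice of this sketch, not something the source prescribes.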
  • Optionally, the node configuration information also includes: node data type, data value range, and inserted function information. The node data types include a continuous type, a discrete type, and a default type, where the discrete type includes a discrete ordered type and a discrete unordered type.
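  • Optionally, the target computation graph construction module 430 is configured to: convert the format of the target decision flow graph to obtain target decision data in a structured data format; and, based on the business feature bound to each business node in the target decision data and the data flow information among the multiple business nodes, determine multiple computation nodes and the computation relationships among them, thereby constructing the target computation graph.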
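  • Optionally, the target virtual environment model determination module 440 is configured to: create an initial virtual environment model based on the target computation graph; determine the interaction sample data and the actual trajectory corresponding to the interaction samples based on the feature information of the target business features; input the interaction sample data into the initial virtual environment model and obtain a simulated trajectory from its output; and determine the trajectory similarity from the simulated trajectory and the actual trajectory, adjusting the parameter weights of the initial virtual environment model according to the trajectory similarity until training ends when the preset convergence condition is reached, yielding the target virtual environment model corresponding to the target business scenario.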
  • Optionally, the apparatus also includes:
  • a reinforcement learning module, configured to, after the target virtual environment model corresponding to the target business scenario is determined, perform reinforcement learning on the preset decision model in the target business scenario based on the target virtual environment model, and obtain the target decision model after reinforcement learning.
  • The environment modeling apparatus based on a decision flow graph provided by the embodiments of the present application can execute the environment modeling method based on a decision flow graph provided by any embodiment of the present application, and has the functional modules corresponding to that method.
  • FIG. 5 shows a schematic structural diagram of an electronic device 10 that can be used to implement embodiments of the present application.
  • Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches), and other similar computing devices.
  • The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementation of the present application as described and/or claimed herein.
  • As shown in Figure 5, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 can perform various appropriate actions and processes according to a computer program stored in the ROM 12 or loaded from the storage unit 18 into the RAM 13.
  • The RAM 13 can also store various programs and data required for the operation of the electronic device 10.
  • the processor 11, the ROM 12 and the RAM 13 are connected to each other via the bus 14.
  • An input/output (I/O) interface 15 is also connected to the bus 14 .
  • Multiple components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard or a mouse; an output unit 17, such as various types of displays and speakers; a storage unit 18, such as a magnetic disk or an optical disk; and a communication unit 19, such as a network card, a modem, or a wireless communication transceiver.
  • The communication unit 19 allows the electronic device 10 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
  • The processor 11 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, or microcontroller.
  • The processor 11 executes the methods and processes described above, such as the environment modeling method based on a decision flow graph.
  • In some embodiments, the environment modeling method based on a decision flow graph can be implemented as a computer program that is tangibly embodied in a computer-readable storage medium, such as the storage unit 18.
  • In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, at least one step of the environment modeling method based on a decision flow graph described above can be performed.
  • Alternatively, in other embodiments, the processor 11 may be configured to execute the environment modeling method based on a decision flow graph in any other suitable manner (e.g., by means of firmware).
  • Various implementations of the systems and techniques described above may be realized in digital electronic circuit systems, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Parts (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • These various implementations may include implementation in at least one computer program that is executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • To provide interaction with a user, the systems and techniques described herein may be implemented on an electronic device having a display device (e.g., a cathode ray tube (CRT) or liquid crystal display (LCD) monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the electronic device.
  • Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual, auditory, or tactile feedback), and input from the user may be received in any form (including acoustic, voice, or tactile input).
  • The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer with a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include: a Local Area Network (LAN), a Wide Area Network (WAN), a blockchain network, and the Internet.
  • Computing systems may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact over a communications network.
  • the relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
  • The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and virtual private server (VPS) services.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Databases & Information Systems (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Game Theory and Decision Science (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

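Embodiments of the present application disclose an environment modeling method and apparatus based on a decision flow graph, and an electronic device. The method includes: acquiring target business features in a target business scenario to be modeled and feature information of the target business features; constructing, based on the target business features, a target decision flow graph corresponding to the target business scenario, where the business nodes in the target decision flow graph include at least one environment state node and at least one decision-making agent node; constructing a target computation graph based on the business feature bound to each business node in the target decision flow graph and the data flow information among the business nodes; and performing environment modeling based on the target computation graph and the feature information of the target business features to determine a target virtual environment model corresponding to the target business scenario.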

Description

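Environment modeling method and apparatus based on decision flow graph, and electronic device
This application claims priority to Chinese Patent Application No. 202210434180.9, filed with the China Patent Office on April 24, 2022, the entire contents of which are incorporated herein by reference.
This application claims priority to Chinese Patent Application No. 202210579742.9, filed with the China Patent Office on May 25, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to computer technology, for example to an environment modeling method and apparatus based on a decision flow graph, and an electronic device.
Background
With the rapid development of computer technology, reinforcement learning, as one form of machine learning, has attracted growing attention. In reinforcement learning an agent learns by trial and error: rewards obtained by interacting with the environment guide its behavior, and the goal is for the agent to maximize the reward it obtains.
At present, in a closed operating environment or an environment with very clear rules, such as a game, reinforcement learning can be carried out through large amounts of trial-and-error sampling and achieve good results. However, the business environment in most business scenarios is open, uncertain, and has fuzzy boundaries, so reinforcement learning in these environments is hard to carry out and incurs a high learning cost. A more convenient way of modeling the environment of a business scenario is therefore urgently needed to support reinforcement learning in different business scenarios.
Summary
Embodiments of the present application provide an environment modeling method and apparatus based on a decision flow graph, and an electronic device, so that virtual environment models for different business scenarios can be built more conveniently from a decision flow graph, thereby meeting users' individual needs.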
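According to one aspect of the present application, an environment modeling method based on a decision flow graph is provided, including:
acquiring target business features in a target business scenario to be modeled and feature information of the target business features;
constructing, based on the target business features, a target decision flow graph corresponding to the target business scenario, where the business nodes in the target decision flow graph include at least one environment state node and at least one decision-making agent node, and the at least one environment state node includes a current environment state sub-node, an environment state transition sub-node, and a next environment state sub-node;
constructing a target computation graph based on the business feature bound to each business node in the target decision flow graph and the data flow information among the multiple business nodes; and
performing environment modeling based on the target computation graph and the feature information of the target business features to determine a target virtual environment model corresponding to the target business scenario.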
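According to another aspect of the present application, an environment modeling apparatus based on a decision flow graph is provided, including:
a target business feature acquisition module, configured to acquire target business features in a target business scenario to be modeled and feature information of the target business features;
a target decision flow graph construction module, configured to construct, based on the target business features, a target decision flow graph corresponding to the target business scenario, where the business nodes in the target decision flow graph include at least one environment state node and at least one decision-making agent node, and the at least one environment state node includes a current environment state sub-node, an environment state transition sub-node, and a next environment state sub-node;
a target computation graph construction module, configured to construct a target computation graph based on the business feature bound to each business node in the target decision flow graph and the data flow information among the multiple business nodes; and
a target virtual environment model determination module, configured to perform environment modeling based on the target computation graph and the feature information of the target business features to determine a target virtual environment model corresponding to the target business scenario.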
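According to yet another aspect of the present application, an electronic device is provided, including:
at least one processor; and
a memory communicatively connected to the at least one processor; where
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can perform the environment modeling method based on a decision flow graph described in any embodiment of the present application.
According to yet another aspect of the present application, a computer-readable storage medium is provided, including a computer program that, when executed by a processor, implements the environment modeling method based on a decision flow graph described in any embodiment of the present application.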
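Brief Description of the Drawings
Figure 1 is a flowchart of an environment modeling method based on a decision flow graph provided in Embodiment 1 of the present application;
Figure 2 is an example of a decision flow graph involved in Embodiment 1 of the present application;
Figure 3 is a flowchart of an environment modeling method based on a decision flow graph provided in Embodiment 2 of the present application;
Figure 4 is a schematic structural diagram of an environment modeling apparatus based on a decision flow graph provided in Embodiment 3 of the present application;
Figure 5 is a schematic structural diagram of an electronic device that implements the environment modeling method based on a decision flow graph of the embodiments of the present application.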
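Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings.
It should be noted that terms such as "target" in the description, claims, and drawings of this application are used to distinguish similar objects and not necessarily to describe a particular order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to those expressly listed, but may include other steps or units not expressly listed or inherent to such a process, method, product, or device.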
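Embodiment 1
Figure 1 is a flowchart of an environment modeling method based on a decision flow graph provided in Embodiment 1 of the present application. This embodiment is applicable to environment modeling for any business scenario. The method may be executed by an environment modeling apparatus based on a decision flow graph, which may be implemented in hardware and/or software and may be configured in an electronic device. As shown in Figure 1, the method includes the following steps:
S110: Acquire the target business features in the target business scenario to be modeled and the feature information of the target business features.
The target business scenario may be any business scenario in which decisions need to be made. In this embodiment it may be open, uncertain, and have fuzzy boundaries. For example, the target business scenario may be an item search scenario: after a user enters search content and issues a search request, the recommended items found and the order in which they are displayed are determined from that request. The display order of recommended items matters a great deal because it directly affects the user's purchasing behavior. To apply reinforcement learning to the recommendation order without disturbing normal use, a virtual environment model close to the real item search scenario needs to be built. In this virtual environment, items are recommended and the interaction with virtual users, namely their purchasing behavior, is observed, so that a recommendation order can be learned through reinforcement learning, and the learned decisions can then raise the purchase rate in the real item search scenario. As another example, the target business scenario may be an order-picking dispatch scenario: orders are assigned to pickers so as to find the assignment with the shortest picking time. To apply reinforcement learning to the order assignment without disturbing normal use, a virtual environment model close to the real dispatch scenario needs to be built, so that virtual orders can interact with this environment and the assignment with the shortest picking time can be learned in it.
The target business features may be all business features collected in the target business scenario and may be represented by business parameter identifiers. The feature information of the target business features refers to the concrete data of those features, that is, the concrete business parameter values. The target business features may include business environment features and business decision features. The business environment features may include environment parameter information before a decision and environment parameter information after the decision. The business decision features may be decision parameter information obtained by interacting with the environment under a preset decision method, that is, information on the actions performed by the agent. The preset decision method may be a decision method from the related art. For example, in the item search scenario it may be ranking by item sales and/or item rating; in the order-picking dispatch scenario it may be assigning orders by shortest path.
Illustratively, this embodiment may apply feature processing to the target business features and their feature information to obtain feature information in time-series form. For example: {trajectory 1: state at time 1, decision action 1, decision result 1, state at time 2, decision action 2, …, state at terminal time N}, {trajectory 2: state at time 1, decision action 1, decision result 1, state at time 2, decision action 2, …, state at terminal time N, …}, and so on.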
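One way to hold the time-series form in memory is sketched below. It is only an illustration under stated assumptions: the raw records are assumed to be dictionaries with the hypothetical keys `trajectory_id`, `t`, `state`, `action`, and `result`, and the last record of each trajectory is assumed to carry only the terminal state.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Step:
    state: Dict[str, Any]    # environment state at time t
    action: Dict[str, Any]   # decision action taken at time t
    result: Dict[str, Any]   # decision result observed after the action


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    terminal_state: Dict[str, Any] = field(default_factory=dict)


def build_trajectories(records: List[Dict[str, Any]]) -> Dict[Any, Trajectory]:
    """Group records by trajectory id, order them by time, split off the terminal state."""
    grouped: Dict[Any, List[Dict[str, Any]]] = {}
    for rec in sorted(records, key=lambda r: (r["trajectory_id"], r["t"])):
        grouped.setdefault(rec["trajectory_id"], []).append(rec)
    trajectories = {}
    for tid, recs in grouped.items():
        *body, last = recs  # the last record holds the terminal state only
        steps = [Step(r["state"], r["action"], r["result"]) for r in body]
        trajectories[tid] = Trajectory(steps, last["state"])
    return trajectories
```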
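S120: Based on the target business features, construct the target decision flow graph corresponding to the target business scenario, where the business nodes in the target decision flow graph include at least one environment state node and at least one decision-making agent node, and the at least one environment state node includes a current environment state sub-node, an environment state transition sub-node, and a next environment state sub-node.
The target decision flow graph may be a directed acyclic graph that represents the decision relationships between different business features at each point in time. The inputs and outputs of data flows in the target decision flow graph must not form a cycle; that is, its structure conforms to that of a directed acyclic graph. Each business node in the target decision flow graph represents the decision process used to compute that node's parameters, and the connections between business nodes represent the direction of data flow. An environment state node is a composite node: it includes a current environment state sub-node, an environment state transition sub-node, and a next environment state sub-node, or alternatively a current environment state sub-node and a next environment state sub-node. The current environment state sub-node is the environment observation at the starting time of a complete business interaction. The environment state transition sub-node is the process by which the environment computes the next environment state from the current environment state and the agent's action. The next environment state sub-node is the environment observation that, once the complete business interaction has finished, can serve as the starting state of the next round of interaction. A decision-making agent node may be the key decision-making entity in the target business scenario and decides which action to perform in each environment state. For example, in a car racing scenario, the game scene is the environment, the car is the decision-making agent, the car's position is the state, operating the car is the action, how to operate the car is the decision, and the race score is the reward. The target decision flow graph built in this embodiment may include at least one environment state node and at least one decision-making agent node, with the numbers determined by the actual target business scenario.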
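The acyclicity requirement can be checked mechanically. The sketch below assumes the graph is given as a list of (source, target) data-flow pairs, which is an illustrative encoding rather than the source's format, and uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) to either return an evaluation order or report a cycle.

```python
from graphlib import CycleError, TopologicalSorter


def evaluation_order(edges):
    """Return node names in an order where every node follows its inputs.

    edges: iterable of (source, target) pairs, meaning data flows source -> target.
    Raises ValueError if the data flows form a cycle.
    """
    sorter = TopologicalSorter()
    for source, target in edges:
        sorter.add(target, source)  # target depends on source
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the detected cycle
        raise ValueError(f"decision flow graph contains a cycle: {exc.args[1]}") from exc
```

For a graph with one state node and one agent, the order places the current state first, then the agent, then the transition, then the next state.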
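The current environment state sub-node supports data flow output. The environment state transition sub-node supports data flow input and outputs to the next environment state sub-node. The decision-making agent node supports both data flow input and output, so environment state nodes and decision-making agent nodes together describe the data flows and decision process of the target business scenario more accurately.
Illustratively, the business nodes in the target decision flow graph also include at least one environment agent node and/or at least one static variable node. An environment agent node refers to another entity in the target business scenario that has decision-making ability and assists in deciding which action to perform in each environment state. A static variable node refers to a business feature in the target business scenario that stays fixed; it can take part in and influence the business environment and decisions, so that the decision process is represented more accurately. In this embodiment, an environment agent node supports both data flow input and output, while a static variable node supports only data flow output and not data flow input. The target decision flow graph may include at least one environment agent node and at least one static variable node, with the numbers determined by the actual target business scenario. Illustratively, Figure 2 gives an example of a decision flow graph. As shown in Figure 2, the decision flow graph may include one environment state node (for example, comprising a current environment state sub-node, an environment state transition sub-node, and a next environment state sub-node), two decision-making agent nodes, one environment agent node, and one static variable node, with the data flows among the nodes as shown in Figure 2.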
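The input and output rules for the node kinds can be written down as a small edge check. This is a sketch under assumptions the source does not spell out: the current-state sub-node is taken to accept no input, the next-state sub-node to emit no output within one step, and the transition sub-node to feed only the next-state sub-node. All identifiers are hypothetical.

```python
from enum import Enum


class NodeKind(Enum):
    CURRENT_STATE = "current_state"
    STATE_TRANSITION = "state_transition"
    NEXT_STATE = "next_state"
    DECISION_AGENT = "decision_agent"
    ENV_AGENT = "environment_agent"
    STATIC_VARIABLE = "static_variable"


# Kinds that may receive a data flow (static variables and, by assumption,
# the current-state sub-node are pure sources).
CAN_RECEIVE = {NodeKind.STATE_TRANSITION, NodeKind.NEXT_STATE,
               NodeKind.DECISION_AGENT, NodeKind.ENV_AGENT}
# Kinds that may emit a data flow (by assumption, next-state is a pure sink).
CAN_EMIT = {NodeKind.CURRENT_STATE, NodeKind.STATE_TRANSITION,
            NodeKind.DECISION_AGENT, NodeKind.ENV_AGENT, NodeKind.STATIC_VARIABLE}


def edge_is_allowed(source: NodeKind, target: NodeKind) -> bool:
    """Check one data-flow edge against the per-kind input/output rules."""
    if source not in CAN_EMIT or target not in CAN_RECEIVE:
        return False
    # The transition sub-node feeds only the next-state sub-node, and the
    # next-state sub-node is fed only by the transition sub-node (assumption).
    if source is NodeKind.STATE_TRANSITION or target is NodeKind.NEXT_STATE:
        return source is NodeKind.STATE_TRANSITION and target is NodeKind.NEXT_STATE
    return True
```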
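This embodiment can construct the target decision flow graph automatically from the target business features to improve construction efficiency. It can also construct the target decision flow graph manually, from configuration operations triggered by the user on a visual interface, to meet users' individual needs and allow dynamic configuration. By building a decision flow graph, this embodiment describes the decision process of multiple business parameters in a uniform, more standardized format, so that environment modeling based on the graph can later be carried out more conveniently and accurately.
Illustratively, S120 may include: performing feature analysis on the target business features to determine the dependencies among multiple target business features; and, based on the dependencies, creating multiple business nodes and determining the data flow information among them, thereby constructing the target decision flow graph corresponding to the target business scenario.
Illustratively, feature analysis may be performed on the target business features at each time step of the time-series form to determine the feature type of each target business feature, such as environment state feature, decision-making agent, environment agent, or static variable, as well as the dependencies among the target business features, for example that business feature A must be determined from business feature B and business feature C. A business node is created according to the feature type of each target business feature: an environment state node for an environment state feature, a decision-making agent node for a decision-making agent, an environment agent node for an environment agent, and a static variable node for a static variable. The data flow information among the business nodes is determined from the dependencies among the target business features; for example, the data flows of business feature B and business feature C are output to business feature A. The target decision flow graph can thus be constructed automatically. The graph represents the data flow relationships from time T to time T+1. If time T is not the terminal time, every valid trajectory satisfies the data flow relationships of the target decision flow graph at each time T.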
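Once feature types and dependencies are known, turning them into nodes and data-flow edges is mechanical. The sketch below assumes both are already available as dictionaries; how the feature analysis derives them is not specified in the source and is not modeled here. The function name and dictionary layout are hypothetical.

```python
def build_decision_flow_graph(feature_types, dependencies):
    """Create one business node per feature and one edge per dependency.

    feature_types: {feature name: node kind}, e.g. {"A": "environment_state"}.
    dependencies:  {feature name: [features it is computed from]}.
    """
    nodes = {name: {"kind": kind, "feature": name}
             for name, kind in feature_types.items()}
    edges = []
    for target, sources in dependencies.items():
        for source in sources:
            if source not in nodes or target not in nodes:
                raise KeyError(f"unknown feature in dependency {source!r} -> {target!r}")
            edges.append((source, target))  # data flows from source into target
    return {"nodes": nodes, "edges": edges}
```

With feature A depending on B and C, as in the example above, the result contains the edges B to A and C to A.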
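Illustratively, S120 may also include: obtaining multiple empty nodes added by the user, based on node-adding operations triggered by the user on a visual interface; determining the business configuration information of each empty node based on the node information configuration operation triggered by the user for that node, where the business configuration information includes node name information and the business feature bound to the node; configuring each empty node according to its business configuration information to obtain the corresponding business node; and obtaining the data flow information among the business nodes from the connection operations triggered by the user on those nodes, thereby constructing the target decision flow graph corresponding to the target business scenario.
Illustratively, the user can arrange the target business features in time order, identify each node involved in the target business scenario, and add the corresponding empty nodes on the visual interface through node-adding operations such as dragging, for example an environment state node, a decision-making agent node, an environment agent node, or a static variable node. The user then configures the node information of each added empty node, such as its node name, and binds the node to the corresponding business feature through a binding operation, yielding the configured business nodes. The user connects the business nodes according to how the business parameters influence one another, and the data flow information among the business nodes is obtained from these connection operations. The user can thus build the target decision flow graph manually according to business needs, meeting individual requirements.
The node configuration information may also include: node data type, data value range, and inserted function information. The node data types include a continuous type, a discrete type, and a default type, where the discrete type includes a discrete ordered type and a discrete unordered type. The inserted function information may be a function built from expert experience; inserting it lets the decision flow graph incorporate expert knowledge, improving the flexibility and accuracy of construction. Illustratively, the user can also dynamically configure the node data type, data value range, and inserted function information of each node, so that the resulting target decision flow graph is more accurate and closer to the actual situation.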
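The per-node configuration fields could be grouped as in the sketch below. The field names, the string labels for the data types, and the `accepts` helper are all hypothetical; the source names the kinds of information but not a concrete schema.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# String labels chosen for this sketch; the discrete type is split into its two subtypes.
DATA_TYPES = {"continuous", "discrete_ordered", "discrete_unordered", "default"}


@dataclass
class NodeConfig:
    name: str                                          # node name information
    feature: str                                       # business feature bound to the node
    data_type: str = "default"                         # node data type
    value_range: Optional[Tuple[float, float]] = None  # inclusive (low, high), if any
    expert_fn: Optional[Callable[..., float]] = None   # inserted expert-knowledge function

    def __post_init__(self):
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"unsupported data type: {self.data_type!r}")

    def accepts(self, value: float) -> bool:
        """True when the value lies inside the configured range (or no range is set)."""
        if self.value_range is None:
            return True
        low, high = self.value_range
        return low <= value <= high
```

A node with an `expert_fn` would be computed by that function instead of a learned model, which is one way to mix expert experience into the graph.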
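S130: Construct the target computation graph based on the business feature bound to each business node in the target decision flow graph and the data flow information among the multiple business nodes.
The target computation graph may refer to a computable decision flow graph. A target decision flow graph may correspond to one target computation graph. The target computation graph can be used directly in building the virtual environment model for the target business scenario.
Illustratively, based on the business feature bound to each business node in the target decision flow graph and the data flow information among the business nodes, the target decision flow graph can be converted into a target computation graph that can be used directly in environment modeling.
Illustratively, S130 may include: converting the format of the target decision flow graph to obtain target decision data in a structured data format; and, based on the business feature bound to each business node in the target decision data and the data flow information among the business nodes, determining multiple computation nodes and the computation relationships among them, thereby constructing the target computation graph.
The structured data format may be, but is not limited to, the YAML (Yet Another Markup Language) format or the JSON (JavaScript Object Notation) format. Illustratively, the target decision flow graph can be converted into target decision data in a structured data format, for example by producing and storing a target decision file in YAML format. From the business feature bound to each business node in the target decision data and the data flow information among the business nodes, all computation nodes in a deep learning framework (such as TensorFlow or PyTorch) and the computation relationships among them can be determined, thereby constructing the target computation graph. Each computation node is a computable function with parameters, for example a deep neural network or another parameterized function.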
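The two stages (serialize the graph, then derive computation nodes and their evaluation order) can be sketched with the standard library alone. The example uses JSON rather than YAML only because `json` ships with Python, and plain Python callables stand in for the parameterized functions (such as neural networks) that a deep learning framework would provide. The graph layout and function names are assumptions of this sketch.

```python
import json
from graphlib import TopologicalSorter


def to_structured(graph: dict) -> str:
    """Serialize a decision flow graph ({"nodes": ..., "edges": ...}) to JSON text."""
    return json.dumps(graph, sort_keys=True, indent=2)


def build_computation_graph(spec_text: str, node_functions: dict):
    """Return a function that evaluates every node in dependency order.

    node_functions maps a node name to a callable taking that node's inputs,
    in the order the edges list them. Nodes whose values are supplied by the
    caller (for example the current state) need no function.
    """
    spec = json.loads(spec_text)
    inputs = {name: [] for name in spec["nodes"]}
    for source, target in spec["edges"]:
        inputs[target].append(source)
    order = list(TopologicalSorter(inputs).static_order())

    def run(values: dict) -> dict:
        values = dict(values)
        for name in order:
            if name in values:  # value supplied externally
                continue
            args = (values[src] for src in inputs[name])
            values[name] = node_functions[name](*args)
        return values

    return run
```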
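It should be noted that this embodiment can use the feature information of the target business features to verify the correctness of the decision logic between nodes in the target computation graph. For example, the time-series feature information can be used to check the data format of each computation node and the data flow between computation nodes. In the related art, building a data flow graph requires writing code to configure the data flow and define function nodes. A data flow graph there usually just lets developers see which nodes a business scenario involves and how they relate, and then write code according to their own understanding; the resulting code tends to deviate from the actual data flow graph, is disconnected from the actual business feature information, and does not take compatibility with training deep learning network models into account. By building a target computation graph that can be used directly in environment modeling, this embodiment allows the environment to be modeled more accurately and reasonably, ensuring the accuracy of the virtual environment model.
S140: Perform environment modeling based on the target computation graph and the feature information of the target business features, and determine the target virtual environment model corresponding to the target business scenario.
The target virtual environment model may be a deep learning network model that imitates the operation of the real environment in the target business scenario.
Illustratively, an initial virtual environment model can be built from the target computation graph and trained on the feature information of the target business features to obtain the trained target virtual environment model. The target virtual environment model can then replace the actual target business environment for reinforcement learning, which improves the reinforcement learning effect, meets users' individual needs, and allows reinforcement learning to be deployed in real business scenarios.
Illustratively, after S140 the method also includes: performing reinforcement learning on the preset decision model in the target business scenario based on the target virtual environment model, and obtaining the target decision model after reinforcement learning.
The preset decision model may refer to one decision-making agent node in the target decision flow graph. The preset decision model is configured to decide which action to take in each environment state so as to maximize the cumulative reward along a trajectory. Illustratively, within the target virtual environment model, the preset decision model interacts continuously with the virtual environment over a period of time to produce an interaction trajectory; reinforcement learning maximizes the cumulative reward on that trajectory to train the best decision method and obtain the final target decision model. The preset decision model can thus be trained more conveniently inside the target virtual environment model, without disturbing real users, while the learning quality of the target decision model is ensured.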
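The idea of scoring a policy by its cumulative reward inside the virtual environment can be sketched as below. This is deliberately not a reinforcement learning algorithm: it only rolls out each candidate policy in a stand-in environment function and keeps the one with the highest return, which is the quantity a real algorithm would optimize. `env_step`, the candidate policies, and the horizon are hypothetical.

```python
def rollout_return(env_step, policy, initial_state, horizon):
    """Interact for `horizon` steps and sum the rewards along the trajectory."""
    state, total = initial_state, 0.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = env_step(state, action)  # virtual environment: (next state, reward)
        total += reward
    return total


def pick_best_policy(env_step, candidates, initial_state, horizon):
    """Return the name of the candidate policy with the highest cumulative reward."""
    return max(
        candidates,
        key=lambda name: rollout_return(env_step, candidates[name], initial_state, horizon),
    )
```

Because every rollout runs against the learned environment model, no real user is affected while the candidates are compared.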
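In the technical solution of this embodiment, the target decision flow graph corresponding to the target business scenario is constructed from the target business features in the target business scenario to be modeled. The business nodes in the target decision flow graph may include at least one environment state node and at least one decision-making agent node, where the at least one environment state node may include a current environment state sub-node, an environment state transition sub-node, and a next environment state sub-node. Based on the business feature bound to each business node in the target decision flow graph and the data flow information among the business nodes, a target computation graph that can take part directly in environment modeling is constructed. Environment modeling based on the target computation graph and the feature information of the target business features makes it easier to determine the target virtual environment model corresponding to the target business scenario, so that the target virtual environment model can replace the actual target business environment for reinforcement learning. This greatly reduces the cost of trial and error in the actual target business environment, improves the reinforcement learning effect, and meets users' individual needs.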
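Embodiment 2
Figure 3 is a flowchart of an environment modeling method based on a decision flow graph provided in Embodiment 2 of the present application. Building on the above embodiment, this embodiment describes the construction of the target virtual environment model in detail. Terms that are the same as or correspond to those in the above embodiment are not explained again here. Referring to Figure 3, the environment modeling method based on a decision flow graph provided in this embodiment includes:
S310: Acquire the target business features in the target business scenario to be modeled and the feature information of the target business features.
S320: Based on the target business features, construct the target decision flow graph corresponding to the target business scenario.
S330: Construct the target computation graph based on the business feature bound to each business node in the target decision flow graph and the data flow information among the multiple business nodes.
S340: Create the initial virtual environment model based on the target computation graph.
Illustratively, an initial virtual environment model for a preconfigured deep learning framework may be created from the target computation graph, or one may be created for the machine learning framework currently configured by the user. For example, hyperparameters may be configured from a preset hyperparameter space, with different hyperparameters for different business scenarios, so as to build the best initial virtual environment model. If automatic tuning is configured, the optimal parameters can be searched for automatically during training of the environment model. The model structure in the deep learning framework may include, but is not limited to, at least one of a convolutional neural network (CNN), a long short-term memory network (LSTM), and a residual network (ResNet), so that initial virtual environment models with different structures can be built.
S350: Based on the feature information of the target business features, determine the interaction sample data and the actual trajectory corresponding to the interaction samples.
Illustratively, one execution each by the agent and the virtual environment is called an interaction, or a step, and the series of data produced as the decision-making agent and the virtual environment interact continuously over a period of time is called a trajectory. Based on the business feature information bound to each business node in the target computation graph and the data flow information among the business nodes, the interaction sample data corresponding to the optimization objective, and the actual trajectory corresponding to that interaction sample data in the target business scenario, can be extracted from the target business feature information.
S360: Input the interaction sample data into the initial virtual environment model, and obtain a simulated trajectory from the output of the initial virtual environment model.
Illustratively, the interaction sample data is input into the initial virtual environment model to be trained, and the environment state data obtained after each interaction between the decision-making agent and the virtual environment is determined. The simulated trajectory produced by the initial virtual environment model is then obtained from the multiple pieces of environment state data generated by continuous interaction over a period of time.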
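Producing a simulated trajectory amounts to replaying the logged decisions through the environment model one step at a time. The sketch below assumes the interaction sample consists of an initial state plus the logged actions, and that the model is a callable mapping (state, action) to the next state; both are illustrative assumptions.

```python
def simulate_trajectory(transition_model, initial_state, logged_actions):
    """Replay logged actions through the model and collect the visited states."""
    trajectory = [initial_state]
    state = initial_state
    for action in logged_actions:
        state = transition_model(state, action)  # one interaction (one step)
        trajectory.append(state)
    return trajectory
```

The returned list has one more entry than there are actions, so it can be compared state by state with the actual trajectory recorded in the business data.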
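S370: Determine the trajectory similarity from the simulated trajectory and the actual trajectory, and adjust the parameter weights of the initial virtual environment model according to the trajectory similarity until a preset convergence condition is reached, at which point training ends and the target virtual environment model corresponding to the target business scenario is obtained.
The trajectory similarity can be used to characterize the difference between the virtual environment and the real environment. The higher the trajectory similarity, the closer the virtual environment is to the real environment.
Illustratively, in a manner similar to supervised learning, this embodiment can compute the trajectory similarity between the simulated trajectory and the actual trajectory, that is, the environment score, using a preset error function such as the mean absolute error or the mean squared error. When the trajectory similarity is greater than the preset threshold, the parameter weights of the initial virtual environment model are adjusted and training of the adjusted model continues. When the trajectory similarity is less than the preset threshold or its change levels off, the preset convergence condition is deemed to be reached: training of the initial virtual environment model ends and the target virtual environment model is obtained.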
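A toy version of this training loop is sketched below. It assumes a one-parameter transition model, next = state + w * action, fitted by plain gradient descent on the mean squared error, and it stops when the error falls below a threshold or stops changing. The model form, learning rate, and thresholds are illustrative assumptions; a real environment model would be a neural network trained in a deep learning framework.

```python
def fit_transition_weight(samples, lr=0.05, threshold=1e-6,
                          plateau=1e-12, max_iters=10_000):
    """Fit w in `next = state + w * action` to (state, action, next_state) samples.

    Returns (w, final mean squared error).
    """
    w, previous = 0.0, float("inf")
    mse = float("inf")
    for _ in range(max_iters):
        errors = [(s + w * a) - nxt for s, a, nxt in samples]
        mse = sum(e * e for e in errors) / len(errors)
        # Convergence: error below the threshold, or its change has levelled off.
        if mse < threshold or abs(previous - mse) < plateau:
            break
        grad = 2 * sum(e * sample[1] for e, sample in zip(errors, samples)) / len(samples)
        w -= lr * grad  # adjust the parameter weight
        previous = mse
    return w, mse
```

Here the score is treated as an error, so "greater than the threshold" means keep training and "less than the threshold" means stop, matching the rule stated above.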
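In the technical solution of this embodiment, the trajectory similarity is determined from the simulated trajectory and the actual trajectory, and the parameter weights of the initial virtual environment model are adjusted according to the trajectory similarity until training ends when the preset convergence condition is reached, yielding the target virtual environment model corresponding to the target business scenario. Environment modeling can thus be carried out in a supervised-learning manner, and the target virtual environment model can be trained more accurately and conveniently.
The following is an embodiment of the environment modeling apparatus based on a decision flow graph provided by the embodiments of the present application. The apparatus belongs to the same inventive concept as the environment modeling method based on a decision flow graph in the above embodiments. For details not described in the apparatus embodiment, refer to the above embodiments of the method.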
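Embodiment 3
Figure 4 is a schematic structural diagram of an environment modeling apparatus based on a decision flow graph provided in Embodiment 3 of the present application. As shown in Figure 4, the apparatus includes: a target business feature acquisition module 410, a target decision flow graph construction module 420, a target computation graph construction module 430, and a target virtual environment model determination module 440.
The target business feature acquisition module 410 is configured to acquire the target business features in the target business scenario to be modeled and the feature information of the target business features. The target decision flow graph construction module 420 is configured to construct, based on the target business features, the target decision flow graph corresponding to the target business scenario, where the business nodes in the target decision flow graph include at least one environment state node and at least one decision-making agent node, and the at least one environment state node includes a current environment state sub-node, an environment state transition sub-node, and a next environment state sub-node. The target computation graph construction module 430 is configured to construct a target computation graph based on the business feature bound to each business node in the target decision flow graph and the data flow information among the multiple business nodes. The target virtual environment model determination module 440 is configured to perform environment modeling based on the target computation graph and the feature information of the target business features, and to determine the target virtual environment model corresponding to the target business scenario.
In the technical solution of this embodiment, the target decision flow graph corresponding to the target business scenario is constructed from the target business features in the target business scenario to be modeled. The business nodes in the target decision flow graph may include at least one environment state node and at least one decision-making agent node, where the at least one environment state node may include a current environment state sub-node, an environment state transition sub-node, and a next environment state sub-node. Based on the business feature bound to each business node in the target decision flow graph and the data flow information among the multiple business nodes, a target computation graph that can take part directly in environment modeling is constructed. Environment modeling based on the target computation graph and the feature information of the target business features makes it easier to determine the target virtual environment model corresponding to the target business scenario, so that the target virtual environment model can replace the actual target business environment for reinforcement learning, which improves the reinforcement learning effect and meets users' individual needs.
Optionally, the current environment state sub-node supports data flow output; the environment state transition sub-node supports data flow input and outputs to the next environment state sub-node; the decision-making agent node supports both data flow input and output.
Optionally, the business nodes in the target decision flow graph also include at least one environment agent node and/or at least one static variable node, where the environment agent node supports both data flow input and output, and the static variable node supports only data flow output, not data flow input.
Optionally, there are multiple target business features, and the target decision flow graph construction module 420 is configured to:
perform feature analysis on the multiple target business features to determine the dependencies among them; and, based on the dependencies, create multiple business nodes and determine the data flow information among the multiple business nodes, thereby constructing the target decision flow graph corresponding to the target business scenario.
Optionally, the target decision flow graph construction module 420 is also configured to:
obtain multiple empty nodes added by the user, based on node-adding operations triggered by the user on a visual interface; determine the business configuration information of each empty node based on the node information configuration operation triggered by the user for that node, where the business configuration information includes node name information and the business feature bound to the node; configure each empty node according to its business configuration information to obtain the corresponding business node; and obtain the data flow information among the multiple business nodes from the connection operations triggered by the user on those nodes, thereby constructing the target decision flow graph corresponding to the target business scenario.
Optionally, the node configuration information also includes: node data type, data value range, and inserted function information. The node data types include a continuous type, a discrete type, and a default type, where the discrete type includes a discrete ordered type and a discrete unordered type.
Optionally, the target computation graph construction module 430 is configured to:
convert the format of the target decision flow graph to obtain target decision data in a structured data format; and, based on the business feature bound to each business node in the target decision data and the data flow information among the multiple business nodes, determine multiple computation nodes and the computation relationships among them, thereby constructing the target computation graph.
Optionally, the target virtual environment model determination module 440 is configured to:
create an initial virtual environment model based on the target computation graph;
determine the interaction sample data and the actual trajectory corresponding to the interaction samples based on the feature information of the target business features;
input the interaction sample data into the initial virtual environment model, and obtain a simulated trajectory from the output of the initial virtual environment model;
determine the trajectory similarity from the simulated trajectory and the actual trajectory, and adjust the parameter weights of the initial virtual environment model according to the trajectory similarity until training ends when the preset convergence condition is reached, yielding the target virtual environment model corresponding to the target business scenario.
Optionally, the apparatus also includes:
a reinforcement learning module, configured to, after the target virtual environment model corresponding to the target business scenario is determined, perform reinforcement learning on the preset decision model in the target business scenario based on the target virtual environment model, and obtain the target decision model after reinforcement learning.
The environment modeling apparatus based on a decision flow graph provided by the embodiments of the present application can execute the environment modeling method based on a decision flow graph provided by any embodiment of the present application, and has the functional modules corresponding to that method.
It is worth noting that, in the above apparatus embodiment, the units and modules are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only for ease of distinguishing them from one another and do not limit the scope of protection of the present application.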
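Embodiment 4
Figure 5 shows a schematic structural diagram of an electronic device 10 that can be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementation of the present application as described and/or claimed herein.
As shown in Figure 5, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 can perform various appropriate actions and processes according to a computer program stored in the ROM 12 or loaded from the storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12, and the RAM 13 are connected to one another via the bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
Multiple components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16, such as a keyboard or a mouse; an output unit 17, such as various types of displays and speakers; a storage unit 18, such as a magnetic disk or an optical disk; and a communication unit 19, such as a network card, a modem, or a wireless communication transceiver. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
The processor 11 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various processors that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, or microcontroller. The processor 11 executes the methods and processes described above, such as the environment modeling method based on a decision flow graph.
In some embodiments, the environment modeling method based on a decision flow graph can be implemented as a computer program that is tangibly embodied in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, at least one step of the environment modeling method based on a decision flow graph described above can be performed. Alternatively, in other embodiments, the processor 11 may be configured to execute the environment modeling method based on a decision flow graph in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuit systems, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Parts (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in at least one computer program that is executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
To provide interaction with a user, the systems and techniques described herein may be implemented on an electronic device having a display device (e.g., a cathode ray tube (CRT) or liquid crystal display (LCD) monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the electronic device. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual, auditory, or tactile feedback), and input from the user may be received in any form (including acoustic, voice, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer with a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include: a Local Area Network (LAN), a Wide Area Network (WAN), a blockchain network, and the Internet.
Computing systems may include clients and servers. Clients and servers are generally remote from each other and typically interact over a communications network. The relationship of client and server arises from computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and virtual private server (VPS) services.
It should be understood that steps may be reordered, added, or removed using the various forms of flow shown above. For example, the steps described in this application may be executed in parallel, sequentially, or in a different order, as long as the result desired by the technical solution of this application can be achieved; no limitation is imposed here.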

Claims (12)

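  1. An environment modeling method based on a decision flow graph, comprising:
    acquiring target business features in a target business scenario to be modeled and feature information of the target business features;
    constructing, based on the target business features, a target decision flow graph corresponding to the target business scenario, wherein the business nodes in the target decision flow graph comprise at least one environment state node and at least one decision-making agent node, and the at least one environment state node comprises a current environment state sub-node, an environment state transition sub-node, and a next environment state sub-node;
    constructing a target computation graph based on the business feature bound to each business node in the target decision flow graph and the data flow information among multiple business nodes;
    performing environment modeling based on the target computation graph and the feature information of the target business features to determine a target virtual environment model corresponding to the target business scenario.
  2. The method according to claim 1, wherein the current environment state sub-node supports data flow output; the environment state transition sub-node supports data flow input and outputs to the next environment state sub-node; and the decision-making agent node supports data flow input and output.
  3. The method according to claim 1, wherein the business nodes in the target decision flow graph further comprise at least one of: at least one environment agent node and at least one static variable node; wherein each environment agent node supports data flow input and output, and each static variable node supports only data flow output and does not support data flow input.
  4. The method according to claim 1, wherein there are multiple target business features, and constructing, based on the target business features, the target decision flow graph corresponding to the target business scenario comprises:
    performing feature analysis on the multiple target business features to determine dependencies among the multiple target business features;
    creating multiple business nodes based on the dependencies and determining data flow information among the multiple business nodes, thereby constructing the target decision flow graph corresponding to the target business scenario.
  5. The method according to claim 1, wherein constructing, based on the target business features, the target decision flow graph corresponding to the target business scenario comprises:
    obtaining multiple empty nodes added by a user, based on node-adding operations triggered by the user on a visual interface;
    determining business configuration information corresponding to each empty node based on a node information configuration operation triggered by the user for that empty node, wherein the business configuration information comprises node name information and the business feature bound to the node;
    configuring the corresponding empty node based on the business configuration information to obtain the corresponding business node;
    obtaining data flow information among multiple business nodes based on connection operations triggered by the user on the multiple business nodes, thereby constructing the target decision flow graph corresponding to the target business scenario.
  6. The method according to claim 5, wherein the node configuration information further comprises: node data type, data value range, and inserted function information; the node data types comprise a continuous type, a discrete type, and a default type, wherein the discrete type comprises a discrete ordered type and a discrete unordered type.
  7. The method according to claim 1, wherein constructing the target computation graph based on the business feature bound to each business node in the target decision flow graph and the data flow information among multiple business nodes comprises:
    converting the format of the target decision flow graph to determine target decision data in a structured data format;
    determining multiple computation nodes and computation relationships among the multiple computation nodes based on the business feature bound to each business node in the target decision data and the data flow information among the multiple business nodes, thereby constructing the target computation graph.
  8. The method according to claim 1, wherein performing environment modeling based on the target computation graph and the feature information of the target business features to determine the target virtual environment model corresponding to the target business scenario comprises:
    creating an initial virtual environment model based on the target computation graph;
    determining interaction sample data and an actual trajectory corresponding to the interaction samples based on the feature information of the target business features;
    inputting the interaction sample data into the initial virtual environment model, and obtaining a simulated trajectory from the output of the initial virtual environment model;
    determining a trajectory similarity from the simulated trajectory and the actual trajectory, and adjusting the parameter weights of the initial virtual environment model according to the trajectory similarity until training ends when a preset convergence condition is reached, thereby obtaining the target virtual environment model corresponding to the target business scenario.
  9. The method according to any one of claims 1-8, further comprising, after determining the target virtual environment model corresponding to the target business scenario:
    performing reinforcement learning on a preset decision model in the target business scenario based on the target virtual environment model, and obtaining a target decision model after reinforcement learning.
  10. An environment modeling apparatus based on a decision flow graph, comprising:
    a target business feature acquisition module, configured to acquire target business features in a target business scenario to be modeled and feature information of the target business features;
    a target decision flow graph construction module, configured to construct, based on the target business features, a target decision flow graph corresponding to the target business scenario, wherein the business nodes in the target decision flow graph comprise at least one environment state node and at least one decision-making agent node, and the at least one environment state node comprises a current environment state sub-node, an environment state transition sub-node, and a next environment state sub-node;
    a target computation graph construction module, configured to construct a target computation graph based on the business feature bound to each business node in the target decision flow graph and the data flow information among multiple business nodes;
    a target virtual environment model determination module, configured to perform environment modeling based on the target computation graph and the feature information of the target business features to determine a target virtual environment model corresponding to the target business scenario.
  11. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor so that the at least one processor can perform the environment modeling method based on a decision flow graph according to any one of claims 1-9.
  12. A computer-readable storage medium, comprising a computer program that, when executed by a processor, implements the environment modeling method based on a decision flow graph according to any one of claims 1-9.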
PCT/CN2022/101444 2022-04-24 2022-06-27 基于决策流图的环境建模方法、装置和电子设备 WO2023206771A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP22891179.8A EP4290351A1 (en) 2022-04-24 2022-06-27 Environment modeling method and apparatus based on decision flow graph, and electronic device

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210434180.9 2022-04-24
CN202210434180 2022-04-24
CN202210579742.9A CN114924684A (zh) 2022-04-24 2022-05-25 基于决策流图的环境建模方法、装置和电子设备
CN202210579742.9 2022-05-25

Publications (1)

Publication Number Publication Date
WO2023206771A1 true WO2023206771A1 (zh) 2023-11-02

Family

ID=82811180

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/101444 WO2023206771A1 (zh) 2022-04-24 2022-06-27 基于决策流图的环境建模方法、装置和电子设备

Country Status (3)

Country Link
EP (1) EP4290351A1 (zh)
CN (1) CN114924684A (zh)
WO (1) WO2023206771A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
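CN115903548B (zh) * 2022-12-27 2024-03-08 南栖仙策(南京)科技有限公司 Optimization method, apparatus, device and storage medium for a coal mill unit controller
CN117389659A (zh) * 2023-09-06 2024-01-12 苏州数设科技有限公司 Method library management method and apparatus for industrial software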

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
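CN109947567A (zh) * 2019-03-14 2019-06-28 深圳先进技术研究院 Multi-agent reinforcement learning scheduling method, system and electronic device
CN111814050A (zh) * 2020-07-08 2020-10-23 上海携程国际旅行社有限公司 Method, system, device and medium for constructing a reinforcement learning simulation environment for travel scenarios
CN112597217A (zh) * 2021-03-02 2021-04-02 南栖仙策(南京)科技有限公司 Intelligent decision platform driven by historical decision data and implementation method thereof
CN112801430A (zh) * 2021-04-13 2021-05-14 贝壳找房(北京)科技有限公司 Task dispatching method and apparatus, electronic device and readable storage medium
US20210200923A1 (en) * 2019-12-31 2021-07-01 Electronics And Telecommunications Research Institute Device and method for providing a simulation environment for training ai agent
CN113157422A (zh) * 2021-04-29 2021-07-23 清华大学 Cloud data center cluster resource scheduling method and apparatus based on deep reinforcement learning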

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
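CN110717647A (zh) * 2019-09-03 2020-01-21 深圳壹账通智能科技有限公司 Decision flow construction method and apparatus, computer device and storage medium
CN110704045A (zh) * 2019-09-20 2020-01-17 凡普数字技术有限公司 Decision process construction method, apparatus and storage medium
CN110705622A (zh) * 2019-09-26 2020-01-17 支付宝(杭州)信息技术有限公司 Decision method, system and electronic device
CN110942338A (zh) * 2019-11-01 2020-03-31 支付宝(杭州)信息技术有限公司 Recommendation method and apparatus for marketing empowerment strategies, and electronic device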

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
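CN117197281A (zh) * 2023-11-08 2023-12-08 国网浙江省电力有限公司 Method for constructing a dynamic full-life-chain portrait of asset data based on business scenarios
CN117197281B (zh) * 2023-11-08 2024-02-23 国网浙江省电力有限公司 Method for constructing a dynamic full-life-chain portrait of asset data based on business scenarios
CN117574111A (zh) * 2024-01-15 2024-02-20 大秦数字能源技术股份有限公司 Scene-state-based BMS algorithm selection method, apparatus, device and medium
CN117574111B (zh) * 2024-01-15 2024-03-19 大秦数字能源技术股份有限公司 Scene-state-based BMS algorithm selection method, apparatus, device and medium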

Also Published As

Publication number Publication date
CN114924684A (zh) 2022-08-19
EP4290351A1 (en) 2023-12-13

Legal Events

Date Code Title Description
WWE WIPO information: entry into national phase

Ref document number: 18253138

Country of ref document: US

ENP Entry into the national phase

Ref document number: 2022891179

Country of ref document: EP

Effective date: 20230517

121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22891179

Country of ref document: EP

Kind code of ref document: A1