WO2022035441A1

WO2022035441A1 - Dynamic dispatching with robustness for large-scale heterogeneous mining fleet via deep reinforcement learning

Info

Publication number: WO2022035441A1
Application number: PCT/US2020/046482
Authority: WO
Inventors: Chi Zhang; Shuai ZHENG; Hamed KHORASGANI; Susumu Serita; Chetan Gupta; Philip ODONKOR
Original assignee: Hitachi, Ltd.
Priority date: 2020-08-14
Filing date: 2020-08-14
Publication date: 2022-02-17

Abstract

An Episodic Memory Deep Q Neural Network (EM-DQN) simulator to a dispatcher for a mining system, which involves initializing a state for the mining system through execution of the simulator; executing on the simulator, for each time shift of the mining system, for each truck of the plurality of trucks that need to be dispatched during the each time shift, executing an action associated with the each truck for the each time shift to obtain a reward for the each truck and a transition for the state; storing the transition in a memory; retrieving ones of the plurality of trucks that are delayed based on the transition for the state and dump sites associated with the plurality of trucks for the time shift; and executing memory tailoring on the transition stored in the memory based on the ones of the plurality of trucks that are delayed.

Description

DYNAMIC DISPATCHING WITH ROBUSTNESS FOR LARGE-SCALE HETEROGENEOUS MINING FLEET VIA DEEP REINFORCEMENT LEARNING

BACKGROUND

Field

[0001] The present disclosure is generally directed to mining systems, and more specifically to facilitating dynamic dispatching of a heterogeneous mining fleet through deep reinforcement learning.

Related Art

[0002] The mining sector, an industry typed by a strong aversion to risk and change today, finds itself on the cusp of an unprecedented transformation; one focused on embracing digital technologies such as artificial intelligence (Al) and the Internet of Things (loT) to improve operational efficiency, productivity, and safety. While still in its nascent stage, the adoption of date-driven automation is already reshaping core mining operations. Advanced analytics and sensors for example, are helping lower maintenance costs and decrease downtime, while boosting output and chemical recovery. The potential of automation however extends far beyond.

[0003] In the related art, there is the Open-Pit Mining Operational Planning (OPMOP) problem, an NP-hard problem which seeks to balance the trade-offs between mine productivity and operational costs. While OPMOP encapsulates a wide range of operational planning tasks, the most critical task involves the dynamic allocation of truck-shovel resources.

[0004] In the open-pit mine operations, dispatch decisions orchestrate trucks to shovels for one loading, and to dumps for ore/waste delivery. This process, referred to as a truck cycle, is repeated continually over a 12-hour operational shift.

[0005] FIG. 1A illustrates an example sequence of events contained within a single truck cycle. Specifically, FIG. 1A illustrates example truck activities in one complete cycle in mining operations, namely driving empty to a shovel, spotting and loading, haulage, and maneuvering and dumping load. [0006] FIG. 1B illustrates an example graph representation of dynamic dispatching problem in mining. When trucks finish loading or dumping (highlighted in dashed circles), they need to be dispatched to a new dump or shovel destination. In dynamic allocation systems, trucks are not restricted to fixed, pre-defined shovel/dump routes, instead, they can be dispatched to any shovel/dump as illustrated in FIG. 1B. An additional queuing step is introduced when the arrival rate of trucks to a given shovel/dump exceeds its loading/dumping rate. Queuing represents a major inefficiency for trucks since queued trucks are not contributing to the productivity of the mine. Another form of inefficiency known as shovel starvation occurs when the truck arrival rate falls below the shovel loading rate and results in idle shovels. Consequently, the goal of a dispatch policy is to minimize both starvation for shovels and queuing for trucks, and thereby increase the overall productivity level.

[0007] With a mine constantly evolving, be it through variations in fleet heterogeneity and size, or changing production requirements, open research questions still remain for developing dispatch strategies capable of continually adapting to these changes. This need is further underlined in OPMOP problems focused on dynamic truck allocation. While dynamic allocation makes it possible to actively decrease queue or starvation times, compared to fixed path dispatching strategies, this tends to be computationally more complex and demanding. In fact, efforts in the related art to address such problems using supervised learning approaches have thus far struggled to adequately capture and model the real-time changes involved.

[0008] Particularly, there are a few challenges for dynamic dispatching in OPMOP:

(1) the scale of fleets are often large (e.g., a large size mine can have more than 100 trucks running at the same time so that it is difficult for a dispatcher to make optimal decisions);

(2) heterogeneous fleets with different capacities, driving times, loading/unloading speeds, and so on, make it even more difficult to design a good dispatching strategy; (3) existing heuristic rules such as Shortest Queue (SQ) and Shortest Processing Time First (SPTF) rely on short-term and local indicators (e.g., wait time) to make decisions, leading to short-sighted and sub-optimal solutions. SUMMARY

[0009] To address these challenges, the present disclosure frames the problem as a deep reinforcement learning (RL) problem and develops dispatching strategies to maximize the tonnage of ore delivered and equipment utilization. While RL has proved useful in single agent applications, for problems requiring multi-agent interactions, the convergence guarantees of RL fail owing to the non-stationarity of the environment. Multi-Agent Reinforcement Learning (MARL) provides the mechanisms for addressing this shortcoming. The simplest way to model multi-agent problem is by using an autonomous learner for each agent (e.g., independent DQN) which distinguishes agents by identities. It allows for each extension from a small scale to a large scale of agents and works reasonably well after extensive tuning. However, this approach suffers from high variance in performance, especially when the number of agents is large.

[0010] Contextual DQN (cDQN) tackles these issues in large-scale agent learning as it accelerates the learning procedure by reducing the output dimension of the action value function and by letting agents share contextual information such as geographic context and collaborative context. However, this method is heavily constrained by the geo-based agent definition which cannot be directly applied to more general fleet dispatching domains such as mining and manufacturing, where both the geographical map and the number of agents change with time (e.g., moving from one open pit mine to another, truck failures and/or new truck introduced). In such cases, cDQN needs to be retrained for the new environments. While centralized learning for MARL are typically taking joint observations and individual agent’s actions as inputs, the present disclosure differentiates itself by using partially observed state representations to avoid high-dimensional representations of joint (i.e., global) state. The benefits are not only more dense representation thus more efficient learning, but also facilitating the learner to be robust to the drifting between training and testing environments.

[0011] As experience sharing between independent agents could accelerate learning for multi-agent problems, related art implementations focus on experience defined by each individual, which is different from the present disclosure in which experience is not defined by individuals, but rather an abstract from observations. Joint state listing is another knowledge sharing method investigated in the related art, but is not applicable to the mining applications as the state described are finite and discrete whereas in the present disclosure the state is infinite and continuous.

[0012] In the mining dispatching problem, the main challenge is that the search space increases exponentially with the environment complexity such as the number of trucks, shovels, and dumps, which can make the problem intractable to solve or lead to sub-optimal heuristic solutions. Additionally, in real mine applications, stochastic events such as unpredicted truck downtime can happen and makes the learned dispatch rules less applicable in the new environment and re-learning the new environment is not only inefficient but also unaffordable due to the real-time requirement. Therefore, a good dispatch policy should be robust enough to handle environment fluctuations without scarifying the efficiency.

[0013] In the present disclosure, the example implementations described herein address the above challenges in mining dispatching problem with the ultimate goal of improving productivity level and achieving robustness.

[0014] The example implementations described herein are directed to development of a comprehensive framework with four components. First, example implementations develop a highly-configurable mining simulator with parameters learned from real-world mines to simulate trucks/shovels/dumps and their stochastic activities. Then, a novel state representation to resolve learning efficiency and robustness problems simultaneously are provided in the example implementations. Specifically, the example implementations involve a novel DQN architecture with experience-sharing and memory-tailoring known as Episodic Memory Deep Q Network (EM-DQN) to leverage the proposed state representation, and derive optimal dispatching policies by letting our RL agents learn in the simulated environments. Thus, example implementations propose metrics to effectively evaluate the performance of dispatch rules. Finally, the example implementations propose two modes model inference to not only test out the learned models in training environment but also unseen environments with truck failures to mimic the real scenarios in mines.

[0015] Aspects of the present disclosure can involve a method to generate and integrate an Episodic Memory Deep Q Neural Network (EM-DQN) simulator to a dispatcher for a mining system involving a plurality of trucks and a plurality of dump and shovel sites, the method involving initializing a state for the mining system through execution of the simulator; and executing on the simulator. For each time shift of the mining system, the method can further involve, for each truck of the plurality of trucks that need to be dispatched during the each time shift, executing an action associated with the each truck for the each time shift to obtain a reward for the each truck and a transition for the state; storing the transition in a memory; retrieving ones of the plurality of trucks that are delayed based on the transition for the state and dump sites associated with the plurality of trucks for the time shift; and executing memory tailoring on the transition stored in the memory based on the ones of the plurality of trucks that are delayed.

[0016] Aspects of the present disclosure can involve a non-transitory computer readable medium, storing instructions to generate and integrate an Episodic Memory Deep Q Neural Network (EM-DQN) simulator to a dispatcher for a mining system involving a plurality of trucks and a plurality of dump and shovel sites, the instructions involving initializing a state for the mining system through execution of the simulator; and executing on the simulator for each time shift of the mining system: for each truck of the plurality of trucks that need to be dispatched during the each time shift, executing an action associated with the each truck for the each time shift to obtain a reward for the each truck and a transition for the state; storing the transition in a memory; retrieving ones of the plurality of trucks that are delayed based on the transition for the state and dump sites associated with the plurality of trucks for the time shift; and executing memory tailoring on the transition stored in the memory based on the ones of the plurality of trucks that are delayed.

[0017] Aspects of the present disclosure can involve a system to generate and integrate an Episodic Memory Deep Q Neural Network (EM-DQN) simulator to a dispatcher for a mining system involving a plurality of trucks and a plurality of dump and shovel sites, the system involving means for initializing a state for the mining system through execution of the simulator; means for executing on the simulator, for each time shift of the mining system: for each truck of the plurality of trucks that need to be dispatched during the each time shift, means for executing an action associated with the each truck for the each time shift to obtain a reward for the each truck and a transition for the state; means for storing the transition in a memory; means for retrieving ones of the plurality of trucks that are delayed based on the transition for the state and dump sites associated with the plurality of trucks for the time shift; and means for executing memory tailoring on the transition stored in the memory based on the ones of the plurality of trucks that are delayed. [0018] Aspects of the present disclosure can involve an apparatus configured to generate and integrate an Episodic Memory Deep Q Neural Network (EM-DQN) simulator to a dispatcher for a mining system involving a plurality of trucks and a plurality of dump and shovel sites, the apparatus involving a processor, configured to: initialize a state for the mining system through execution of the simulator; and execute on the simulator, for each time shift of the mining system: for each truck of the plurality of trucks that need to be dispatched during the each time shift, execute an action associated with the each truck for the each time shift to obtain a reward for the each truck and a transition for the state; store the transition in a memory; retrieve ones of the plurality of trucks that are delayed based on the transition for the state and dump sites associated with the plurality of trucks for the time shift; and execute memory tailoring on the transition stored in the memory based on the ones of the plurality of trucks that are delayed.

BRIEF DESCRIPTION OF DRAWINGS

[0019] FIG. 1 A illustrates an example sequence of events contained within a single truck cycle.

[0020] FIG. 1B illustrates an example graph representation of dynamic dispatching problem in mining.

[0021] FIG. 2 illustrates examples of notations that are used in the present disclosure.

[0022] FIG. 3A and 3B illustrate example flow diagrams, in accordance with an example implementation.

[0023] FIG. 4 illustrates an example of DQN configurations, in accordance with an example implementation.

[0024] FIG. 5 illustrates an example diagram of the simulator and interactions with the learner, in accordance with an example implementation.

[0025] FIG. 6 illustrates an example operation of vehicles such as trucks and shovels, in accordance with an example implementation.

[0026] FIG. 7 illustrates a logical view of a vehicle dispatching and simulation system, in accordance with an example implementation. [0027] FIGS. 8A to 8C illustrate example management information that can be utilized to simulate the system, in accordance with an example implementation.

[0028] FIG. 9 illustrates an example computing environment with an example computer device suitable for use in some example implementations.

DETAILED DESCRIPTION

[0029] The following detailed description provides details of the figures and example implementations of the present application. Reference numerals and descriptions of redundant elements between figures are omitted for clarity. Terms used throughout the description are provided as examples and are not intended to be limiting. For example, the use of the term “automatic” may involve fully automatic or semi-automatic implementations involving user or administrator control over certain aspects of the implementation, depending on the desired implementation of one of ordinary skill in the art practicing implementations of the present application. Selection can be conducted by a user through a user interface or other input means, or can be implemented through a desired algorithm. Example implementations as described herein can be utilized either singularly or in combination and the functionality of the example implementations can be implemented through any means according to the desired implementations.

[0030] Example implementations involve a comprehensive procedure to efficiently learn robust truck dispatch rules in highly dynamic mining environments. The mining truck dispatching problem is formulated as Multi-Agent Reinforcement Learning (MARL), and solved by centralized learning. Compared with de-centralized learning, the example implementations significantly reduces the number of learners by training only one single learner and therefore, reduces the learning complexity, which is extremely important for mining dispatching given the size of the problem.

[0031] To realize the centralized learning, example implementations facilitate experience-sharing between agents, where experience is novel abstract state and action representations. The novel representations de-couple state/action from agent identity in MARL problem formulation and therefore, make learning to be no longer constrained by the number of agents. This is especially efficient for large-scale problems such as mining where the complexity is exponential in the number of agents. Additionally, it makes experience sharing among heterogeneous agents possible, as it converts the heterogenous properties of agents into general features. It also enhances robustness when the number of agents in the training environment is different from the testing environment (e.g., trucks failures), which is common in real-world applications but often ignored by existing related art methods.

[0032] In example implementations, there is memory tailoring by coordination to tackle the non-stationarity in MARL and improve Markov property in the shared experiences in which memories are collected individually from each agent are naturally asynchronous. Moreover, due to the proposed abstract state representation, coordination overhead is also minimized. Memory tailoring is different from data cleaning as data is removed on-the-fly by designed rules proposed by the present disclosure, instead of being filtered out by existing data cleaning techniques.

[0033] Example implementations described herein involve a simple yet effective way to combine the above ideas of experience-sharing enabled by novel state/action representations and memory-tailoring for non-stationaiy environment to realize EM-DQN, which effectively and efficiently solves large-scale heterogeneous mining fleet dynamic dispatching problems with robustness.

[0034] Example implementations described herein involve a novel model inference method which emphasizes robustness in mining dispatching. This is a critical aspect for good dispatch models to be applicable to real-world problems, but is often ignored. The model is designed in such a way that when unexpected events (e.g., increase/decrease truck numbers) happen, the learned dispatching rules can still be used without re-training.

[0035] Problem formulation

[0036] FIG. 2 illustrates examples of notations that are used in the present disclosure.

[0037] Agent: The present disclosure considers any dispatchable truck as an agent. Truck fleets can be composed of trucks with varying haulage capacities, driving speeds, loading/unloading time, etc., resulting in truck fleets with heterogeneous agents. Note that shovels and dumps are assumed to be homogeneous.

[0038] State Representation: The present disclosure maintains a local state S_t which captures relevant attributes of the truck queues present each shovel and dump within the mine site. Particularly, when a decision (i.e., dispatching destination) needs to be made for a truck T, the state is represented in a vector as following:

[0039] 1) Truck Capacity: Truck capacity C_r is captured within the state space to allow the learning agent to account for a heterogeneous truck fleet. This affords the agent the ability to develop dispatch strategies aimed at capitalizing the capacity of trucks to maximize productivity.

[0040] 2) Expected Wait Time: For each shovel and dump, example implementations calculate the potential wait time a truck will encounter if it were dispatched to that location. To calculate this, example implementations consider two queue types - an “Actual Queue”, AQ, and an “En-Route Queue”, EQ. As the name suggests, the actual queue accounts for trucks physically queuing for a shovel or dump. The “en-route queue” on the other hand accounts for trucks that have been dispatched to a shovel or dump, but have yet to physically arrive. These two queue distinctions are necessary because they allow us to better predict the expected wait time. Consequently, the expected wait time for shovel k, at time t, WTt^k, is formulated in Eqn. 1 as:

[0041] where LD_i and SP_i represent the average loading and spotting time of truck i ∈ AQ^k (where AQ^k is the set of all trucks in shovel k' actual queue). The second term of this equation focuses on the en-route queue. Specifically, it is concerned with the average loading, spotting and hauling time of truck j ∈ EQ^k* (where EQ^k* is the set of all trucks in shovel k' En Route Queue expected to arrive before truck T if it were dispatched to this location). The following relationship always holds: |EQ^k*| ≤ |EQ^k|, ∀k. The last term is the hauling, spotting and loading time of the current truck T. For dumps, LD and SP are replaced by dumping time DM, and HL is replaced by driving empty time DE in Eqn. 1.

[0042] 3) Total Capacity of Waiting Trucks: For each shovel or dump, the example implementations also calculate TCk_{w, t}, the total capacity of all the trucks in (AQ^k + EQ^k*) which are ahead of truck T. This can be necessary because wait time alone may not be a good indicator of queue length. It is not possible for a queue to have a long wait time despite having few trucks actually queuing. Although simply providing the state space with a count of queuing trucks could have been sufficient, providing the total capacity implicitly achieves the same task, which also provides the learning agent with more useful information.

[0043] 4) Activity Time of Delayed Trucks: Assuming the truck is dispatched to a given location, “Delayed trucks” refer to trucks already en-route for that location which are estimated to arrive after truck T. The number of delayed trucks, DT^k, at shovel k can be derived: DT^k = EQ^k - EQ^k*.

[0044] Based on the number of trucks in DT^k, the activity time, AT can be calculated as follows:

[0045] 5) Capacity of Delayed Trucks: In addition to the activity time, example implementations also calculate the combined capacity TCk_{d, t} of the delayed trucks and make this available within the state vector. Activity time and capacity of delayed trucks is included to allow the learning agent to consider the impact its decisions have on other trucks. Example implementations facilitate the agent to learn when to be selfish and prioritize its interests over other trucks, and also when to perhaps opt for a longer/slower queue for the “greater good”. Accordingly, the state of an agent T at a decision-making time t can be represented as:

[0046] For a mine with N shovels and M dumps, the state vector length is 4 X (N + M ) + 1. Note that when a truck needs to go to a shovel, all dumps related parts in Eqn. 3 are masked as zeros since they have less impact on the current decision making, and vice versa for shovels. This makes the environment always “partially observed” by agents but effectively reduces the computational overheads. The proposed state is different from geo- based state or individual independent state with several benefits: 1) it abstracts properties among heterogeneous agents to ensure a unified representation and consequently, a centralized learning can be implemented easily (as described herein), 2) it is not restricted by the number of agents F so that it does not need re-training when F changes. This can be particularly important as unplanned vehicle downtime is inevitable but re-learning is often undesired due to the processing and memory costs. Note that the change of shovels and dumps are usually rare, so they can be assumed to be fixed.

[0047] Action Representation

[0048] The joint action space encapsulates all possible actions available to all agents. With the context of MARL, each agent takes its own actions without knowing the actions of other agents. Since the dispatch problem inherently tries to determine the best shovel/dump to send a truck, each unique shovel and dump within the mine represents a possible action. Based on this approach, the action space is reduced to a finite and discrete space. The challenge of handling problems with finite, discrete action spaces is well-studied in the related art. Consequently, assuming a mine with n shovels and m dumps, the action space can be formulated according to Eqn. 4.

A = {a^SH1, a^SH2, ... a^SHN, a^DP1, a^DP2, ... , a^DPM} (4)

[0049] Based on this implementation, selecting an action a^SH1 means that the truck in question will be dispatched to Shovel 1. A benefit of using this action space is that it scales very well to any number of shovels and dumps. It is worth noting that the only appropriate dispatch action for a truck currently at a dump is to go to a shovel. A truck is not allowed to go to a different dump if it is currently at a dump. The same applies to shovel locations. Consequently, part of the action space presented to an agent (Eqn. 4) will always be invalid. This can however be addressed in one of two ways: (i) by awarding a large negative reward for invalid actions and ending the learning episode; or (ii) by filtering the actions. Since the latter approach can be more easily implemented by adding simple constraints and avoids unnecessary complexity in learning, example implementations can be directed to such an approach to save on computing resources and memory. However, depending on the desired implementation, the former approach may also be utilized in accordance with an example implementation.

[0050] Reward Function

[0051] The quality of the action from each agent is measured via a reward signal emitted by the environment. In contrast to the related art implementations in multi-agent RL, the reward signal is defined on an individual agent basis as opposed to being shared among agents. Since rewards are not assigned immediately following an action (e.g., owing to varying activity duration times), the approach of reward sharing becomes too cumbersome to compute. Define the individual r associated with taking action a from the reward function R(s_i, a_i) = where C_T is the capacity of truck T, and Δt is the time elapsed to complete the action a (i.e., the time gap between a_t and a_{t -1}).

[0052] Deep Neural Network Building

[0053] In example implementations described herein, there is a novel experience sharing multi-agent learning approach, where the learner collects state, action, and reward (i.e., experience from each individual agent, and then learns in a centralized way. In example implementations, deep neural networks can be used for the learner.

[0054] Experience Sharing in Heterogeneous Agents

[0055] As described above, the state and action are stored in the memory of the learner without distinguishing which agent it comes from and when it is generated. Example implementations described herein involve the premise that even for heterogenous agents, as long as they share the same goal and have similar functionality (i.e., all agents are trucks with loading/driving/dumping capabilities), a proper state representation as described above facilitates abstraction of agent properties and therefore, experience sharing becomes possible among heterogeneous agents. This makes the example implementations described herein significantly different from previous works that learn multiple Q¹ functions where i is agent identity.

[0056] Memory Tailoring by Coordination

[0057] FIG. 3A and 3B illustrate example flow diagrams, in accordance with an example implementation. Since trucks are allowed to cut in line in front of others, and the observation is partially captured by the abstract state representation described herein, these can potentially violate the Markov property for En-Route trucks that are in the delayed queue EQ^k* in Eqn. 1. To address this problem, example implementations involve a memory tailoring algorithm to remove the “corrupted” experience from the memory, as shown in FIG. 3 A.

[0058] In the proposed memory tailoring algorithm of FIG. 3 A, the initial input 300 is the Memory M; delayed truck IDs at shovel/dump k,j = 1, ... DT^k and the memory

tailoring cache MT is initialized. From 301, each shovel/dump site k is retrieved to determine the associated delay truck IDs. At 302, each delayed truck associated with the retrieved shovel/dump site (j = 1 to DT^k ) is retrieved, wherein the transition associated with the delayed truck for the shovel/dump site is extracted at 303 (m =< s, a, r, s' > T_j), and added to the memory tailoring cache MT += m.

[0059] At 304, a determination is made as to whether all of the delayed trucks DT^k have been processed. If so (Yes), the flow proceeds to 305, otherwise (No), the flow proceeds back to 302 to retrieve the next delayed truck and reiterate. At 305, a determination is made as to whether all shovel/dump sites k have been processed. If so (Yes) then the flow proceeds to 306, otherwise (No), the flow proceeds to 301 to process the next shovel/dump site.

[0060] Once all delayed trucks for all shovel/dump sites have been processed, the flow proceeds to 306 to purge the memory of the memory tailoring cache from the input system memory (M = M — MT) to result in new memory M. At 307, the updated new system memory M is provided as output.

[0061] The proposed memory tailoring can be implemented by a coordination mechanism, which is known to be a challenge among large-scale agents due to the high computational costs. However, in the proposed algorithm, this overhead is small because only a small number of trucks in EQ^k* will be affected (i.e., need to be coordinated), where k is the shovel or dump ID at one time. The algorithm of EM-DQN which combines experience sharing and memory tailoring is shown in FIG. 3B.

[0062] FIG. 3B illustrates an example of training an EM-DQN with experience sharing and memory tailoring, in accordance with an example implementation. At 310, state s_t is provided as input, replay memory M is initialized to capacity M_max , and action value function is initialized with random weights θ. At 311, the flow iterates the process from 312 to 319 until max iterations are reached (itr = 1 to max — iterations). At 312, the flow resets the environment and executes the simulation to obtain initial state s₀.

[0063] From 313, one shift duration TS is simulated and processed iteratively from 314 to 317. At 314, each truck T_i in the fleet F of the system is retrieved (i.e., i = 1 to F). At 315, if the truck T_i needs to be dispatched, the flow samples the action a_t by ∈ —greedy policy given s_t, executes a_t in simulator and obtains reward r_t and next state s_t+Δt, stores transition (s_t, a_t, r_t, s_t+Δt)7) in the system memory M, retrieves delayed trucks T_jd given s_t, a_t, and conducts memory tailoring on system memory M by executing the flow of FIG. 3 given T_jd .

[0064] At 316, a determination is made as to whether all trucks have been processed. If so (Yes), then the flow proceeds to 317, otherwise (No), the flow proceeds back to 314 to process the next truck. At 317, a determination is made as to whether the simulation has been executed for the entire shift duration. If so (Yes), then the flow proceeds to 318, otherwise (No), the flow proceeds back to 313 to continue the simulation.

[0065] At 318, the flow samples a batch of transitions (s_t, a_t, r_t, s_t+Δt) from M, where t can be different in one batch, computes the target y_t = r_t + y * ^maxa_t+1Q(s_t+1, a_t+1; θ') > and the error err = y_t — Q(s_t, a_t; θ) and updates the Q-network as θ' ← θ + Vθe². This flow is executed for E number of epochs; e = 1 to E.

[0066] At 319, the action a_t is provided as output.

[0067] Deep Q Network Design

[0068] FIG. 4 illustrates an example of DQN configurations, in accordance with an example implementation. The network (i.e., Q-network in FIG. 3B) is composed of three layers (labeled Layer 1, Layer 2, Layer 3), with all followed by a ReLU activation, except for last layer which has a sigmoid activation. All weights and biases are initialized according to the default initialization. To allow for learning, the Adaptive Moment Estimation (ADAM) optimization algorithm is used, along with a constant learning rate of 10^-5 and a batch size of 1024 samples, number of epochs E of 100, memory size M of 100000, discount factor y of 0.9 in FIG. 3B. Error clipping can also be applied just as in the original DQN. The DQN is trained to minimize the smooth L1 loss. To encourage exploration, a simulated annealing- based epsilon-greedy algorithm is used, decaying from 80% chance of random actions down to 1%. The Q-Network is configured to provide actions involving the scheduling of trucks at dumps and shovels. The optimum action can be extracted by filtering the network output for the shovels and dumps as illustrated in FIG. 4.

[0069] Mining Simulator

[0070] To allow for mining dispatch operations to be simulated, a mining emulator can be developed with frameworks such as SimPy, which is a process-based discrete-event simulation framework. Shovels and dumps are designed as resources with fixed capacity and queuing effect. At the point in time where a truck needs to be dispatched to either a dump or shovel, the state of all dumps and shovels are passed as a state vector to the learned (i.e., neural network). The emulator facilitates the testing of different DQN architectures quickly for developing dispatch strategies.

[0071] Because the underlying systems tend to involve heterogenous fleets, the activity time such as loading, dumping and hauling (as illustrated in FIG. 1A) are a function of destination type (i.e., shovel or dump), activity type, and fleet type. To increase the realism of the simulator, activity times are sampled from a set of Gamma distributions with shape and scale parameters learned from real world data mine.

[0072] FIG. 5 illustrates an example diagram of the simulator and interactions with the learner, in accordance with an example implementation. In the example simulator, there is the neural network 500 as implemented by the EM-DQN that is configured to receive states 501 for each decision block 503 of the trucks in the fleet, and provide actions 502 for the decision block 503 of the trucks as illustrated in FIG. 3B. Example states of the truck can involve driving while empty 504 and hauling 507. Example states of the shovel can include spotting 505 and loading 506. Example state of the dump site can involve dumping 508 from a given truck. The information for each shovel, truck, dump site, and the system in general to facilitate the simulator is provided in the examples of FIG. 8A to 8C. As will be described herein, rewards are provided to the neural network 500 based on the metrics for the simulator.

[0073] Metrics

[0074] In the mining industry, wait time, idle time, utilization, queuing time, etc., are widely recognized metrics to measure the operation efficiency. However, these short-term metrics do not guarantee good overall performance such as production level which are longterm objectives. In example implementations described herein, the following metrics can be utilized:

[0075] Production level: the total amount (tons) of ores delivered from shovels to dumps. This is the one of the more important measurements as it is directly linked to profit mines can make. In an example implementation, the production level is calculated according to each shift in the mining system (e.g., every 12 hours). [0076] Cycle time: the short-term indicator most dispatching rules (e.g., SQ, SPTF) try to minimize. Intuitively, less cycle time yields more cycles and more delivery. However, this may not be true when the system involves heterogeneous trucks with different capacities. Example implementations incorporate cycle time for the purpose of comparing the short-term performance with baselines.

[0077] Matching factor: a mid-term metric that defines the ratio of shovel productivity to truck productivity MF = Since most mining systems

involve heterogeneous trucks and homogeneous shovels, the matching factor is MF =

It is noteworthy that MF

= 1 is the ideal matching of truck and shovel productivities, but it does not guarantee high production levels in heterogeneous settings.

[0078] Model Inference (application)

[0079] Example implementations involve two model inference modes: normal mode and robust mode. In the normal mode, the model is tested in the same environment as training, but with different environment and truck initialization. At any time when a truck finishes a task (e.g., hauling, loading, dumping) and needs to be dispatched, the trained model takes the current environment observation as inputs and generates a destination (e.g., shovel or dump) which is executed by the simulator to dispatch the truck. When a whole shift finishes, the overall productivity level, average matching factor, and average cycle time are calculated and used to evaluate the performance of the model, and/or to compare performance different dispatching rules.

[0080] In the robust mode, stochastic events are introduced to the environment. For example, example implementations randomly remove/add trucks to/from the environment and run the same (trained) model to obtain performance metrics. This model inference mode is particularly important to evaluate the robustness of any learned models.

[0081] FIG. 6 illustrates an example operation of vehicles such as trucks and shovels, in accordance with an example implementation. The mining operation may include a plurality of shovels 601, a plurality of trucks 604, dump sites 603, and other vehicles depending on the desired implementation. Trucks 604 and/or shovels 601 may be communicatively coupled to a computer system 602 through a network 100. Trucks 604 may navigate to shovels 601 to receive a payload and may also form a queue in front of shovels 601 when the shovels are being utilized. Trucks may also navigate to dump sites 603 to offload the payload.

[0082] FIG. 7 illustrates a logical view of a vehicle dispatching and simulation system, in accordance with an example implementation. Sensor data coming from the vehicles 601, 604 can be processed through a complex event processing/streaming engine (CEP) 700 in real time, and processed in batches or windows by computer system 602. Data is processed by the computer system 602 and stored in a relational database 704. Predictor functions 703 may predict: (i) activity durations and (ii) activity scheduling for vehicles based on the using historical data obtained from the database and data obtained from the CEP 700.

[0083] The data from sensors stored in the database are used as input for the EM-DQN model (optimization modules) 701. The outputs of both simulation 702 and predictors 703 along with the data from database 704 can be used in by the EM-DQN model (optimization modules) to generate optimized scheduling. The obtained vehicle activity time forecasts and optimized scheduling can be displayed on a dashboard 705 so that a dispatcher 706 can determine the forecasted activity times and scheduling for the vehicles managed by the vehicle scheduling system. As illustrated in the system of FIG. 7, example implementations can therefore provide predictions and optimized scheduling on any batch of data received from any vehicle at any given point in time.

[0084] FIGS. 8A to 8C illustrate example management information that can be utilized to simulate the system, in accordance with an example implementation. In particular, FIG. 8A illustrates an example of vehicle information in accordance with an example implementation. Vehicle information may include the vehicle identifier, the last known location of the truck, the time stamp of the latest data received, and OEM information. Such OEM information can include the odometer reading, the vehicle model, hauling capacity, and so on according to the desired implementation. Depending on the desired implementation, the vehicle information may include other variables or omit any one of the listed variables. FIG. 8B illustrates an example of topology information, in accordance with an example implementation. In an example implementation of a mining operation, topology information may include shovel identifier, dump site identifier, distance between shovel and dump and route characteristics. Such route characteristics can include the elevation gradient for the route between the shovel and the corresponding dump site and route conditions (e.g., paved, mud, gravel, etc.). Depending on the desired implementation, the topology information may include other variables or omit any one of the listed variables according to the desired implementation. For example, in operations involving railcars, topology information can include distance between stations, rail conditions, and so on. FIG. 8C illustrates an example of vehicle activity information, in accordance with an example implementation. Vehicle activity information can include the vehicle identifier/number, the shovel identifier/number, the dump site identifier/number, shift information, activity information, weather data (e.g., temperature, snow conditions, heavy wind, rain conditions etc.), and activity durations. Depending on the desired implementation, the vehicle activity information may include other variables or omit any one of the listed variables.

[0085] FIG. 9 illustrates an example computing environment with an example computer device suitable for use in some example implementations, such as a computer system 602 configured to facilitate the simulations, or a dispatching system to dispatch schedules to trucks and configured to generate and integrate an Episodic Memory Deep Q Neural Network (EM-DQN) simulator to a dispatcher for a mining system involving a plurality of trucks and a plurality of dump and shovel sites. Computer device 905 in computing environment 900 can include one or more processing units, cores, or processors 910, memory 915 (e.g., RAM, ROM, and/or the like), internal storage 920 (e.g., magnetic, optical, solid state storage, and/or organic), and/or IO interface 925, any of which can be coupled on a communication mechanism or bus 930 for communicating information or embedded in the computer device 905. IO interface 925 is also configured to receive images from cameras or provide images to projectors or displays, depending on the desired implementation.

[0086] Computer device 905 can be communicatively coupled to input/user interface 935 and output device/interface 940. Either one or both of input/user interface 935 and output device/interface 940 can be a wired or wireless interface and can be detachable. Input/user interface 935 may include any device, component, sensor, or interface, physical or virtual, that can be used to provide input (e.g., buttons, touch-screen interface, keyboard, a pointing/cursor control, microphone, camera, braille, motion sensor, optical reader, and/or the like). Output device/interface 940 may include a display, television, monitor, printer, speaker, braille, or the like. In some example implementations, input/user interface 935 and output device/interface 940 can be embedded with or physically coupled to the computer device 905. In other example implementations, other computer devices may function as or provide the functions of input/user interface 935 and output device/interface 940 for a computer device 905.

[0087] Examples of computer device 905 may include, but are not limited to, highly mobile devices (e.g., smartphones, devices in vehicles and other machines, devices carried by humans and animals, and the like), mobile devices (e.g., tablets, notebooks, laptops, personal computers, portable televisions, radios, and the like), and devices not designed for mobility (e.g., desktop computers, other computers, information kiosks, televisions with one or more processors embedded therein and/or coupled thereto, radios, and the like).

[0088] Computer device 905 can be communicatively coupled (e.g., via IO interface 925) to external storage 945 and network 950 for communicating with any number of networked components, devices, and systems, including one or more computer devices of the same or different configuration. Computer device 905 or any connected computer device can be functioning as, providing services of, or referred to as a server, client, thin server, general machine, special-purpose machine, or another label.

[0089] IO interface 925 can include, but is not limited to, wired and/or wireless interfaces using any communication or IO protocols or standards (e.g., Ethernet, 802.1 lx, Universal System Bus, WiMax, modem, a cellular network protocol, and the like) for communicating information to and/or from at least all the connected components, devices, and network in computing environment 900. Network 950 can be any network or combination of networks (e.g., the Internet, local area network, wide area network, a telephonic network, a cellular network, satellite network, and the like).

[0090] Computer device 905 can use and/or communicate using computer-usable or computer-readable media, including transitory media and non-transitory media. Transitory media include transmission media (e.g., metal cables, fiber optics), signals, carrier waves, and the like. Non-transitory media include magnetic media (e.g., disks and tapes), optical media (e.g., CD ROM, digital video disks, Blu-ray disks), solid state media (e.g., RAM, ROM, flash memory, solid-state storage), and other non-volatile storage or memory.

[0091] Computer device 905 can be used to implement techniques, methods, applications, processes, or computer-executable instructions in some example computing environments. Computer-executable instructions can be retrieved from transitory media, and stored on and retrieved from non-transitory media. The executable instructions can originate from one or more of any programming, scripting, and machine languages (e.g., C, C++, C#, Java, Visual Basic, Python, Perl, JavaScript, and others).

[0092] Processor(s) 910 can execute under any operating system (OS) (not shown), in a native or virtual environment. One or more applications can be deployed that include logic unit 960, application programming interface (API) unit 965, input unit 970, output unit 975, and inter-unit communication mechanism 995 for the different units to communicate with each other, with the OS, and with other applications (not shown). The described units and elements can be varied in design, function, configuration, or implementation and are not limited to the descriptions provided. Processor(s) 910 can be in the form of hardware processors such as central processing units (CPUs) or in a combination of hardware and software units.

[0093] In some example implementations, when information or an execution instruction is received by API unit 965, it may be communicated to one or more other units (e.g., logic unit 960, input unit 970, output unit 975). In some instances, logic unit 960 may be configured to control the information flow among the units and direct the services provided by API unit 965, input unit 970, output unit 975, in some example implementations described above. For example, the flow of one or more processes or implementations may be controlled by logic unit 960 alone or in conjunction with API unit 965. The input unit 970 may be configured to obtain input for the calculations described in the example implementations, and the output unit 975 may be configured to provide output based on the calculations described in example implementations.

[0094] Memory 915 can be configured to manage management information as illustrated in FIGS. 8A to 8C to facilitate the implementations described herein, as well as to store the episodic states and transitions in memory.

[0095] Processor(s) 910 can be configured to initialize a state for the mining system through execution of the simulator 702; execute on the simulator 702, for each time shift of the mining system: for each truck of the plurality of trucks that need to be dispatched during the each time shift, execute an action associated with the each truck for the each time shift to obtain a reward for the each truck and a transition for the state; store the transition in a memory 915; retrieve ones of the plurality of trucks that are delayed based on the transition for the state and dump sites associated with the plurality of trucks for the time shift; and execute memory tailoring on the transition stored in the memory 915 based on the ones of the plurality of trucks that are delayed based on FIG. 3B through use of the management information as illustrated in FIGS. 8A to 8C.

[0096] Processor(s) 910 can be configured to execute memory tailoring by, for each of the ones of the plurality of trucks that are delayed, executing an action associated with the each of the ones of the plurality of trucks that are delayed to obtain a reward and a transition for the state, and accumulating the transition in the memory 915; and modifying the transition in the memory 915 from taking a difference between the transition in the memory 915 and the accumulated transitions as illustrated in FIG. 3 A.

[0097] Processor(s) 910 can be configured to sample transitions in the memory to determine error between a target network and the Deep Q Neural Net; and update weights for the Deep Q Neural Network based on the error as illustrated at 318 of FIG. 3B.

[0098] Depending on the desired implementation, the simulator 702 can configured to generate a dispatch schedule for the plurality of trucks for the each time shift of the mining system, and wherein the dispatcher 706 is configured to dispatch the schedule generated by the simulator to the plurality of trucks and provide feedback from the plurality of trucks to the simulator 702 to determine error as illustrated in FIG. 7. Dispatcher 706 can involve a dedicated system connected to the plurality of trucks, such as an Internet of Things (loT) gateway, a function executed by processor(s) 910 to communicate with the trucks via network 950, or otherwise in accordance with the desired implementation. Simulator 702 can also be configured to generate the dispatch schedule based on optimization of one or more of production level, cycle time, or comparison of shovel productivity to truck productivity, or other metrics in accordance with the desired implementation.

[0099] Some portions of the detailed description are presented in terms of algorithms and symbolic representations of operations within a computer. These algorithmic descriptions and symbolic representations are the means used by those skilled in the data processing arts to convey the essence of their innovations to others skilled in the art. An algorithm is a series of defined steps leading to a desired end state or result. In example implementations, the steps carried out require physical manipulations of tangible quantities for achieving a tangible result. [0100] Unless specifically stated otherwise, as apparent from the discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” “displaying,” or the like, can include the actions and processes of a computer system or other information processing device that manipulates and transforms data represented as physical (electronic) quantities within the computer system’s registers and memories into other data similarly represented as physical quantities within the computer system’s memories or registers or other information storage, transmission or display devices.

[0101] Example implementations may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may include one or more general-purpose computers selectively activated or reconfigured by one or more computer programs. Such computer programs may be stored in a computer readable medium, such as a computer-readable storage medium or a computer-readable signal medium. A computer-readable storage medium may involve tangible mediums such as, but not limited to optical disks, magnetic disks, read-only memories, random access memories, solid state devices and drives, or any other types of tangible or non-transitory media suitable for storing electronic information. A computer readable signal medium may include mediums such as carrier waves. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Computer programs can involve pure software implementations that involve instructions that perform the operations of the desired implementation.

[0102] Various general-purpose systems may be used with programs and modules in accordance with the examples herein, or it may prove convenient to construct a more specialized apparatus to perform desired method steps. In addition, the example implementations are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the example implementations as described herein. The instructions of the programming language(s) may be executed by one or more processing devices, e.g., central processing units (CPUs), processors, or controllers.

[0103] As is known in the art, the operations described above can be performed by hardware, software, or some combination of software and hardware. Various aspects of the example implementations may be implemented using circuits and logic devices (hardware), while other aspects may be implemented using instructions stored on a machine-readable medium (software), which if executed by a processor, would cause the processor to perform a method to carry out implementations of the present application. Further, some example implementations of the present application may be performed solely in hardware, whereas other example implementations may be performed solely in software. Moreover, the various functions described can be performed in a single unit, or can be spread across a number of components in any number of ways. When performed by software, the methods may be executed by a processor, such as a general purpose computer, based on instructions stored on a computer-readable medium. If desired, the instructions can be stored on the medium in a compressed and/or encrypted format.

[0104] Moreover, other implementations of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the teachings of the present application. Various aspects and/or components of the described example implementations may be used singly or in any combination. It is intended that the specification and example implementations be considered as examples only, with the true scope and spirit of the present application being indicated by the following claims.

Claims

CLAIMS What is claimed is:

1. A method to generate and integrate an Episodic Memory Deep Q Neural Network (EM-DQN) simulator to a dispatcher for a mining system involving a plurality of trucks and a plurality of dump and shovel sites, the method comprising: initializing a state for the mining system through execution of the simulator; executing on the simulator, for each time shift of the mining system: for each truck of the plurality of trucks that need to be dispatched during the each time shift, executing an action associated with the each truck for the each time shift to obtain a reward for the each truck and a transition for the state; storing the transition in a memory; retrieving ones of the plurality of trucks that are delayed based on the transition for the state and dump sites associated with the plurality of trucks for the time shift; and executing memory tailoring on the transition stored in the memory based on the ones of the plurality of trucks that are delayed.

2. The method of claim 1, wherein the executing memory tailoring comprises: for each of the ones of the plurality of trucks that are delayed, executing an action associated with the each of the ones of the plurality of trucks that are delayed to obtain a reward and a transition for the state, and accumulating the transition in the memory; and modifying the transition in the memory from taking a difference between the transition in the memory and the accumulated transitions.

3. The method of claim 1, further comprising: sampling transitions in the memory to determine error between a target network and the Deep Q Neural Net; and updating weights for the Deep Q Neural Network based on the error.

4. The method of claim 1, wherein the simulator is configured to generate a dispatch schedule for the plurality of trucks for the each time shift of the mining system, and wherein the dispatcher is configured to dispatch the schedule generated by the simulator to the plurality of trucks and provide feedback from the plurality of trucks to the simulator to determine error.

5. The method of claim 4, wherein the simulator is configured to generate the dispatch schedule based on optimization of one or more of production level, cycle time, or comparison of shovel productivity to truck productivity.

6. A non-transitoiy computer readable medium, storing instructions to generate and integrate an Episodic Memory Deep Q Neural Network (EM-DQN) simulator to a dispatcher for a mining system involving a plurality of trucks and a plurality of dump and shovel sites, the instructions comprising: initializing a state for the mining system through execution of the simulator; executing on the simulator, for each time shift of the mining system: for each truck of the plurality of trucks that need to be dispatched during the each time shift, executing an action associated with the each truck for the each time shift to obtain a reward for the each truck and a transition for the state; storing the transition in a memory; retrieving ones of the plurality of trucks that are delayed based on the transition for the state and dump sites associated with the plurality of trucks for the time shift; and executing memory tailoring on the transition stored in the memory based on the ones of the plurality of trucks that are delayed.

7. The non-transitory computer readable medium of claim 6, wherein the executing memory tailoring comprises: for each of the ones of the plurality of trucks that are delayed, executing an action associated with the each of the ones of the plurality of trucks that are delayed to obtain a reward and a transition for the state, and accumulating the transition in the memory; and modifying the transition in the memory from taking a difference between the transition in the memory and the accumulated transitions.

8. The non-transitory computer readable medium of claim 6, further comprising: sampling transitions in the memory to determine error between a target network and the Deep Q Neural Net; and updating weights for the Deep Q Neural Network based on the error.

9. The non-transitory computer readable medium of claim 6, wherein the simulator is configured to generate a dispatch schedule for the plurality of trucks for the each time shift of the mining system, and wherein the dispatcher is configured to dispatch the schedule generated by the simulator to the plurality of trucks and provide feedback from the plurality of trucks to the simulator to determine error.

10. The non-transitory computer readable medium of claim 9, wherein the simulator is configured to generate the dispatch schedule based on optimization of one or more of production level, cycle time, or comparison of shovel productivity to truck productivity.

11. An apparatus configured to generate and integrate an Episodic Memory Deep Q Neural Network (EM-DQN) simulator to a dispatcher for a mining system involving a plurality of trucks and a plurality of dump and shovel sites, the apparatus comprising: a processor, configured to: initialize a state for the mining system through execution of the simulator; execute on the simulator, for each time shift of the mining system: for each truck of the plurality of trucks that need to be dispatched during the each time shift, execute an action associated with the each truck for the each time shift to obtain a reward for the each truck and a transition for the state; store the transition in a memory; retrieve ones of the plurality of trucks that are delayed based on the transition for the state and dump sites associated with the plurality of trucks for the time shift; and execute memory tailoring on the transition stored in the memory based on the ones of the plurality of trucks that are delayed.

12. The apparatus of claim 11, wherein the processor is configured to execute memory tailoring by: for each of the ones of the plurality of trucks that are delayed, executing an action associated with the each of the ones of the plurality of trucks that are delayed to obtain a reward and a transition for the state, and accumulating the transition in the memory; and modifying the transition in the memory from taking a difference between the transition in the memory and the accumulated transitions.

13. The apparatus of claim 11, wherein the processor is configured to: sample transitions in the memory to determine error between a target network and the Deep Q Neural Net; and update weights for the Deep Q Neural Network based on the error.

14. The apparatus of claim 11, wherein the simulator is configured to generate a dispatch schedule for the plurality of trucks for the each time shift of the mining system, and wherein the dispatcher is configured to dispatch the schedule generated by the simulator to the plurality of trucks and provide feedback from the plurality of trucks to the simulator to determine error.

15. The apparatus of claim 14, wherein the simulator is configured to generate the dispatch schedule based on optimization of one or more of production level, cycle time, or comparison of shovel productivity to truck productivity.