GB2575891A - Accelerator subsystem with GPU, transportation route price system, cache and method of acceleration of a permutation analysis therefor - Google Patents


Info

Publication number
GB2575891A
GB2575891A
Authority
GB
United Kingdom
Prior art keywords
gpu
route
cache
accelerator subsystem
transportation
Prior art date
Legal status
Withdrawn
Application number
GB1900299.7A
Other versions
GB201900299D0 (en)
Inventor
Ahiska Yavuz
Current Assignee
Peerless Ltd
Original Assignee
Peerless Ltd
Priority date
Filing date
Publication date
Application filed by Peerless Ltd filed Critical Peerless Ltd
Publication of GB201900299D0
Publication of GB2575891A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q10/047 Optimisation of routes or paths, e.g. travelling salesman problem
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/34 Route searching; Route guidance
    • G01C21/3407 Route searching; Route guidance specially adapted for specific applications
    • G01C21/343 Calculating itineraries, i.e. routes leading from a starting point to a series of categorical destinations using a global route restraint, round trips, touristic trips
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/34 Route searching; Route guidance
    • G01C21/3446 Details of route searching algorithms, e.g. Dijkstra, A*, arc-flags, using precalculated routes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/02 Reservations, e.g. for tickets, services or events
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/02 Reservations, e.g. for tickets, services or events
    • G06Q10/025 Coordination of plural reservations, e.g. plural trip segments, transportation combined with accommodation

Abstract

An accelerator subsystem with graphics processing unit (GPU) 240 for transportation routes includes an input 241 arranged to receive input data from a host processor 220. The input data includes a start point and an end point. A permutation enumeration circuit 242 performs permutation analysis on the routes and is coupled to the input and a bit array 810. A cache 246 stores a route validity indicator for each of the route permutations and is updated by a simplified router circuit 248 which determines whether each of the transportation route permutations is valid, such as having a shortest distance. An output 243 outputs valid journey routes identified by a route validity indicator to the host processor. A cache 246 for use in a transportation routeing system is divided into regions, with a size of respective regions being set according to data stored and a probability associated with a complexity of the routes represented by the data in each region. The permutations may be used for determining a set of train tickets with a lowest overall cost for a journey.

Description

Title: ACCELERATOR SUBSYSTEM WITH GPU, TRANSPORTATION ROUTE PRICE SYSTEM, CACHE AND METHOD OF ACCELERATION OF A PERMUTATION ANALYSIS THEREFOR
Field of the invention
The field of the invention relates to an accelerator subsystem with a graphics processing unit, a transportation route price computation system, a cache and a method for transport permutation analysis. In particular, the field of the invention relates to an accelerator subsystem, a cache and a method for analysing transport permutations using a graphics processing unit in order to reduce latency and increase throughput.
Background of the invention
It is known that the pricing of transportation tickets is a complex problem that requires the evaluation of huge numbers of potential pricing options for large numbers of potential routes, for example for users of public transport who travel long distances. Furthermore, the time taken to determine the best pricing strategy is often critical, as the complex and numerous computations must execute while the passenger is deciding which route to take. A key step of the pricing process involves the computation for journeys that may be taken over a number of journey legs, where these legs are split into different groupings that may be ticketed independently, for example by the same or different transport service providers. Recently, ticketing systems have become more complex, wherein a passenger is issued tokens, instead of tickets, in order to travel a number of journey legs. Subsequently and periodically, the passenger is retrospectively issued one or more tickets that cover all said journey legs at the optimum price.
FIG. 1 illustrates a known system 100 for determining and pricing of transportation routes for transportation users that uses a ticket allocator 110 and a routeing server 140. Here, a user enters a new journey 112 at a terminal, where the new journey 112 is added 114 to details of previous journeys that are held in the user history 120. An optimizer 122 searches for a combination of tickets covering the complete user history 120 stored in memory that results in an optimum ticket 116 (e.g. the lowest overall cost) or set of tickets. The validity of each candidate set of tickets is tested by submitting (and receiving a response thereto) 130 each option to the routeing server 140. The routeing server 140 is typically an external cloud-based online journey planner engine that is used to plan routes, calculate fares and establish ticket availability. In the United Kingdom, National Rail offers 'The Online Journey Planner', which further accesses a real-time information database directly from the GB rail industry's official train running information engine (Darwin). This means that all journey plans take account of all delays, schedule changes and last-minute cancellations made by the train companies. Furthermore, National Rail issues a periodically updated database called the 'Fares and Associated Data Feed Interface Specification ("FADFIS")' to facilitate ticketing services. The system 100 aims to be real-time. However, the inventor of the present invention has recognised that there are at least three factors that adversely affect the responsiveness of the current known system 100. In the implementation of the known system 100, the ticket allocator 110 includes an embodied microservices-based process flow management system with access to the above-mentioned online services, and may have a local copy of FADFIS.
First, in the known route planning system 100, as a user's journeys build up over the selected time period, say a week, the number of route combinations to test becomes very large. Table 1 illustrates, as an example, a journey of fourteen legs having 190,899,322 different groupings most of which are likely to be invalid route options.
Table 1:
Group size Number of permutations
1 1
2 2
3 5
4 15
5 52
6 203
7 877
8 4140
9 21147
10 115975
11 678570
12 4213597
13 27644437
14 190899322
The number of groupings may be derived from Bell's number (see https://en.wikipedia.org/wiki/Partition_of_a_set).
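The counts in Table 1 can be reproduced directly from the Bell-number recurrence, without enumerating any groupings. The following Python sketch is illustrative only (it is not part of the claimed system) and uses the Bell triangle:

```python
def bell(n):
    """Compute the n-th Bell number via the Bell triangle.

    Each new row starts with the last entry of the previous row; each
    subsequent entry is the sum of its left neighbour and the entry of
    the previous row above that neighbour.
    """
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

# The fourteen-leg journey of Table 1:
print(bell(14))  # 190899322
```

For a group of size n, bell(n) matches the corresponding row of Table 1, confirming the figure of 190,899,322 groupings for fourteen legs.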
Therefore, determining the validity of each transport route option and validating each with the routeing server 140, is time consuming. A known technique evaluates all permutations of items in a group by constructing a graph in which each node holds pointers to connected nodes. When the graph is complete, the graph is traversed by recursive descent and back-tracking in order to find each leaf node (of a tree) that represents a unique permutation of items from the group. As highlighted in Table 1, the size of the graph becomes very large as the size of the group increases, and the amount of memory required to hold the data for each option becomes excessive, as does the time taken to construct and traverse the graph. This is typically dominated by the time taken to access memory. Thus, as the number of permutations becomes very large, it becomes difficult and expensive to compute all of the options, store and process the data and select an optimum transportation option and ticket for the user/passenger.
Secondly, there may be a large number of users, which further complicates the computation process. Thirdly, access to the routeing server 140 is also typically problematic, as it is a shared 3rd party resource. Thus, the system performance is typically limited by the routeing server 140 and normal CPU-based routeing servers 140 take too long to compute results for complex journeys.
To overcome these problems in a multi-journey-leg, token-based ticketing system, the ticket issuer typically:
(i) offers non-optimised or approximate pricing during real-time client interaction; and
(ii) performs accurate optimum pricing via non-real-time batch processing.
However, performing accurate optimum ticket price processing is extremely computationally demanding as described above.
Thus, a need exists for an improved route price computation system, a cache management system and a method therefor that will reduce the computational complexity for route determination (including pricing options). A need also exists for a substantial increase in throughput and a reduction in operational costs by minimising the need for making calls to a 3rd party routeing server, such as routeing server 140.
Summary of the invention
The present invention provides a route price computation system, a cache management system and a number of methods therefor, which may be used for analysing transportation permutations, for example, and which may factor in ticket pricing in the route planning, as described in the accompanying claims. Specific embodiments of the invention are set forth in the dependent claims. These and other aspects of the invention will be apparent from and elucidated with reference to the embodiments described hereinafter.
Brief description of the drawings
Further details, aspects and embodiments of the invention will be described, by way of example only, with reference to the drawings. In the drawings, like reference numbers are used to identify like or functionally similar elements. Elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale.
FIG. 1 illustrates a known system for determining and pricing of transportation routes.
FIG. 2 illustrates an example route price computation system, for example a system that uses an accelerator subsystem based on a 3-dimensional graphics processing unit (GPU) in order to accelerate ticket pricing, adapted in accordance with examples of the invention.
FIG. 3 illustrates a simplified transportation route diagram that shows distance travelled according to a connectivity between stations, rather than as a direct Euclidean distance, in accordance with some examples of the invention.
FIG. 4 illustrates a further simplified transportation route diagram that shows a connectivity of possible stations, where identifiers (IDs) are assigned to stations in incrementing order based on connectivity, in accordance with examples of the invention.
FIG. 5 illustrates a simplified table of a first example of 3-dimensional GPU caching, in accordance with examples of the invention.
FIG. 6 illustrates a simplified table of a second modified example of 2-dimensional GPU caching, in accordance with examples of the invention.
FIG. 7 illustrates a simplified example of a behaviour of state-machine permutations with only three items, for example for use in the route price computation system of FIG. 2, in accordance with examples of the invention.
FIG. 8 illustrates a simplified example of a behaviour of a state-machine that is used to generate all permutations of a given number of items, for example for use in the route price computation system of FIG. 2, in accordance with examples of the invention.
FIG. 9 illustrates a simplified example of a flowchart of a permutation computation method, in accordance with some examples of the invention.
FIG. 10 illustrates a simplified example of a cache that is divided by powers of two, for use in the route price computation system of FIG. 2, in accordance with examples of the invention.
Detailed description
In accordance with some example embodiments of the present invention, there is provided a route price computation system, a cache management system and a number of methods therefor, which may be used for analysing transportation permutations, for example, and which may factor in ticket pricing in route planning. In some examples, the route price computation system, cache management system and methods therefor describe a mechanism by which all possible permutations are formed in an arrangement of items such that each item is represented precisely once, but where the items are gathered in sets of various sizes. A substantial increase in throughput and a reduction in operational costs may be achieved by minimising the need for making calls to a 3rd party routeing server by calculating a portion of the inquiries in a local simplified router, as provided for herewith.
Furthermore, the inventor has recognised that the amount of storage required to evaluate all permutations can be reduced by holding the permutation data in registers that are local to a processing unit or state-machine, as described with reference to FIG. 8. In this manner, all possible permutations may be determined in a manner that is efficient, in both resources and time.
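The idea of holding permutation data in registers local to a state-machine, rather than building and traversing a graph, can be illustrated with a restricted growth string. The sketch below is an assumed, generic formulation (FIG. 8's state machine is not reproduced here): all set partitions of n items are generated using only two small arrays of per-item state, with memory that does not grow with the number of permutations.

```python
def partitions_in_place(n):
    """Yield every set partition of n items using two fixed-size arrays,
    analogous to state held in registers rather than a graph in memory.

    code[i] is the set index assigned to item i; a valid 'restricted
    growth string' satisfies code[i] <= max(code[0..i-1]) + 1, and
    high[i] tracks that running maximum.
    """
    code = [0] * n   # set assignment per item
    high = [0] * n   # running maximum of code[0..i]
    while True:
        yield list(code)
        # find the rightmost item that can move to a higher-numbered set
        i = n - 1
        while i > 0 and code[i] > high[i - 1]:
            i -= 1
        if i == 0:
            return  # code was 0,1,2,...: the final partition
        code[i] += 1
        high[i] = max(high[i - 1], code[i])
        for j in range(i + 1, n):  # reset everything to the right
            code[j] = 0
            high[j] = high[i]

print(sum(1 for _ in partitions_in_place(3)))  # 5, as in Table 2
```

Each yielded list maps items to set indices, e.g. [0, 1, 0] places the first and third items in one set and the second in another.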
Examples of the invention are ideally suited for a token-based, multi-journey-leg, post-travel ticketing system wherein accurate optimum pricing can be computed in real time (or by batch processing) cost-effectively, by efficiently evaluating all permutations and maintaining a local cache. However, it is envisaged that other applications or systems may benefit from the concepts described herein.
By improving the efficiency of the cache the number of requests to the 3rd party routeing server may be reduced and therefore overall system performance may be increased substantially. Furthermore, by improving the calculation of permutations in a manner that is suitable for deployment on a GPU, not only may the performance be increased, but the cost of achieving a specified performance may be reduced. In some examples, it is envisaged that the system performance herein described may be sufficient to support real-time pricing enquiries for a new leg of a multi leg journey history.
In some examples of the invention, the permutations of a group of items may be defined such that each group of items is represented precisely once, but where the items are gathered into sets in different combinations of items. For example, the group of items A, B, C may be configured to have the permutations as illustrated in Table 2, noting that not all permutations are valid routes. In the example below, let us assume that A, B, C are the legs of a journey from Guildford Station to Waterloo, London, wherein A represents the leg from Guildford to Surbiton, B the leg from Surbiton to Clapham Junction, and C the leg from Clapham Junction to Waterloo. Clearly, the single-ticketing option AC,B does not make sense. A single ticket (or travel token) can only be issued for a continuous journey, i.e. a single ticket cannot be issued to cover a journey from Guildford to Surbiton and from Clapham Junction to Waterloo. Therefore, in the context of travel ticketing, Permutation 3 below is not a valid route, and this permutation can be eliminated from further pricing considerations.
Table 2:
Set
Permutation 1 ABC
Permutation 2 AB, C
Permutation 3 AC, B
Permutation 4 A, BC
Permutation 5 A, B, C
In Table 2, permutation 1 combines all three items into one set, whereas permutation 2 is made up of two sets by combining A and B but keeping C separate. The inventor has identified that a determination of permutations in this manner is useful in route planning, in which all possible combinations are considered for validity and efficiency.
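Table 2 and the elimination of AC,B can be reproduced with a short Python sketch (illustrative only; the contiguity rule below is the simple "consecutive legs only" assumption from the Guildford example, not necessarily the full routeing guidance):

```python
def set_partitions(items):
    """Enumerate every way of grouping `items` into sets (Table 2)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        # place `first` into each existing set in turn...
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        # ...or into a new set of its own
        yield [[first]] + smaller

def is_valid_route(partition, journey):
    """Reject groupings containing a non-contiguous set such as {A, C}:
    a single ticket can only cover consecutive journey legs."""
    pos = {leg: k for k, leg in enumerate(journey)}
    for group in partition:
        idx = sorted(pos[leg] for leg in group)
        if idx != list(range(idx[0], idx[-1] + 1)):
            return False
    return True

journey = ["A", "B", "C"]
partitions = list(set_partitions(journey))
valid = [p for p in partitions if is_valid_route(p, journey)]
print(len(partitions), len(valid))  # 5 4
```

Of the five permutations of Table 2, only AC,B fails the contiguity test, leaving four candidates for pricing.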
Referring now to FIG. 2, an example route price computation system 200 is illustrated, for example a route price computation system 200 that uses an accelerator subsystem with GPU 240, which is adapted to implement accelerated ticket pricing, in accordance with examples of the invention. In examples of the invention, the incorporation of an accelerator subsystem with GPU 240 in a ticket pricing server may improve performance of the route price computation system 200 by an order of magnitude or more, if the construction of the route price computation system 200 and the method employed fit the hardware constraints of the accelerator subsystem with GPU 240. For clarity, an accelerator subsystem as described herein comprises hardware that accommodates a GPU.
In this regard, a GPU breaks a task into thousands of small work items, each of which is executed in a separate thread. A number of threads, typically 32, are grouped into a 'warp', and all threads in a warp execute in a 'lock-step' manner, for example on single instruction multiple data (SIMD) hardware processors. When a thread in a warp stalls due to an unresolved dependency, such as a memory read, the warp is suspended and another executed. This switch is instantaneous, because all warp state is held 'on-chip', and with this mechanism the accelerator subsystem with GPU 240 can hide memory latency: when a thread stalls waiting for data, it is suspended and another run in its place. Thus, in this manner, the hardware does not sit 'idle' waiting for data. For example, a high-performance Intel Xeon CPU supports 28 threads with 77 Gbytes/s of memory bandwidth; the Nvidia V100 GPU supports 163,840 threads with 900 Gbytes/s of memory bandwidth, yet despite the high bandwidth a GPU typically has more compute performance than bandwidth. On paper, a typical GPU is able to boost performance by a factor of ten. However, in accordance with some examples of the invention, the accelerator subsystem with GPU 240 may be able to exceed this performance level if the route price computation system 200 method employed is structured to exploit special features of the accelerator subsystem with GPU 240, as described herein.
The example route pricing computation system 200 uses an accelerator subsystem with
GPU 240 in order to accelerate ticket pricing computations performed by ticket pricing
circuit 232. A host central processing unit (CPU) 220 receives data relating to a new journey enquiry 210 from a passenger. A resident microservices process flow manager 211 then assigns appropriate queue-managed tasks to that enquiry, one of the tasks being the new journey pricing task circuit 222. The new journey pricing task circuit 222 (which may be implemented in hardware, software or firmware) then checks the validity of the enquiry, say whether journey leg(s) are compliant with routeing guidance of an operator of the railway network, via an information service provider 212 (say Darwin) and/or a locally kept database (say FADFIS). The resident microservices process flow manager 211 then combines the received data with previous journeys made by the same customer/passenger, obtained from a customer context database 213. The new journey pricing task 222 residing on the CPU 220 then submits the combined data to the accelerator subsystem with GPU 240 via a GPU driver 224. In some examples, the GPU enumerates (preferably) all combinations of journey legs in permutation enumeration circuit 242.
In the illustrated example, an input 241 of the accelerator subsystem with GPU 240 is arranged to receive input data from a host processor 220, wherein the input data relates to a journey (or each leg of a journey) that includes at least a start point/station and an end point/station. The accelerator subsystem with GPU 240 also represents an overview of the flow of the software that resides in the GPU. The permutation enumeration circuit 242 is coupled to the input 241 and is configured to perform a permutation analysis on a plurality of transportation routes.
After the permutation enumeration task described in conjunction with 242, the GPU software program proceeds to filter out legs that are obviously invalid, in filter 244, for example where no single ticket can be issued for non-adjacent legs. One example of this would be to filter out legs that need to be in a time-ordered sequence, and to check those legs that remain against data held in a cache 246. The process for storing the filtered data in the cache 246 is further articulated herein, for example in the section under the heading 'Variable line size cache'. If the cache 246 returns a 'hit', the combination of legs is returned to the CPU 220 for pricing, for example in the form of a 'hit-queue' 226. A hit indicates that this particular journey leg has been priced before. In some examples, a cache 'miss' is passed to a simplified router 248 that attempts to determine the validity of simple routes and updates the cache 246 accordingly.
In some examples, the concept of a ‘simple route’ encompasses a specified route that is either the shortest or allowed route that can be priced using a local pricing database (say
FADFIS), without a need for a taxing routeing server enquiry. In some examples, the confirmation of the shortest routes may be determined by comparing the distance of the route to the locally stored distance between the starting and ending stations. In some examples, the allowed route could be determined by searching the FADFIS. However, if the simplified router 248 is unable to evaluate the journey 210, the simplified router 248 returns the route as a 'miss' to the CPU 220 via an output 243, for example in the form of a 'miss-queue' 228. The CPU 220 may optionally check the journey 210 against a larger and slower file cache 230, and (in some example embodiments, only) if that misses, pass the route to the routeing server 234 directly, or via a pricing module or program (not shown). In some examples, the routeing server 234 may be a third-party cloud-based server (say the Online Journey Planner of GB National Rail) or an in-house server which performs a similar task.
The information returned from the routeing server 234 is used to update caches, such as host cache 230 and generate a ticket price in pricing circuit 232. Furthermore, in some examples, if a route that was previously evaluated is extended by another leg, the host cache 230 may be pre-loaded with a state recorded at the time of the previous evaluation, and before the new route is processed in order to improve its ‘hit’-rate.
Some example embodiments of the invention may also be employed to calculate best pricing in a real-time manner, whilst the user/passenger is deciding on the route options, or alternatively as a batch process to calculate the minimum price at the end of the completed journey with multiple legs, with a guarantee that he/she will be offered the best pricing. In this manner, examples of the invention relate to determining a set of potentially valid routes, which may then be priced locally without making a taxing referral to a 3rd party routeing server, such as routeing server 234, where a (minimised) journey 'price' 260 is the final output of the system.
It is envisaged that, in other examples, additional data and/or travel options may be input into, or available within, the example route price computation system 200. For example, it is envisaged that a full implementation of the route price computation system 200 may include other factors or data, for example various ticketing schemes, such as season ticketing, that are integrated into a route decision making process. For example, it is envisaged that any journey legs that benefit from existing season ticketing data of the passenger may be automatically priced at zero.
In operation, a CPU 220 thread may queue each journey as a single 'work packet', and the accelerator subsystem with GPU 240 is configured to read from the queue and dispatch the work to some or all arithmetic logic units (ALUs) (not shown) of the parallel-processing multiple cores of the GPU within the accelerator subsystem with GPU 240. In some examples, in a first operation, preferably all permutations of the journey legs are evaluated. In some examples, this may be performed in a parallel-processing manner, as disclosed later. In some examples, in a second step, preferably each permutation is tested against a cache holding the results of previous queries, where the cache is preferably structured to maximise its efficiency, as described herein with respect to FIG. 10.
Simplified router
In some examples, in the next step, a simple test may be used to check a validity of a set of legs using a routeing rule that states that the shortest route between two stations is valid, for example employed in simplified router 248. Here, it is envisaged in one example that the simplified router 248 may combine distances of (preferably) all legs in a permutation, and then compare this with the (preferably) shortest possible distance between a start point and an end point of the journey. If the values match, the simplified router 248 may determine that a particular combination of legs is a valid journey, which may be priced locally without a need to make a call/referral to the routeing server 234. However, if the values do not match, the simplified router 248 may determine that the particular combination of legs is potentially invalid and that the route may require further analysis. The time taken to determine the distance of each leg is generally a dominant component of the system's performance. Hence, in examples of the invention, it is beneficial to use the texture mapping hardware present in a GPU, e.g. within accelerator subsystem with GPU 240, which is notably not present in a CPU 220. A GPU has many parallel processing cores, each with its own local cache memory. In examples of the invention, therefore, the GPU allows a plurality of valid legs to be processed to determine the shortest route, and then updates a route validity indicator (for example in a form of a validity bit) of each of said route permutations in the local cache 246.
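The shortest-route test described above can be sketched as follows. This is an illustrative Python sketch, not the patented implementation: the station names and distances are hypothetical, expressed in integer tenths of a mile so that the equality comparison is exact, and the shortest-distance table would in practice be large and derived from FADFIS data.

```python
# Hypothetical shortest distances (tenths of a mile) between station pairs.
SHORTEST = {
    ("A", "B"): 20, ("B", "D"): 20,
    ("A", "C"): 10, ("C", "D"): 15,
    ("A", "D"): 25,
}

def simplified_router_valid(legs):
    """legs: ordered (start, end) pairs forming one candidate grouping.

    Valid under the shortest-route rule iff the summed leg distances
    equal the shortest distance from the overall start to the overall
    end; a mismatch means the grouping needs further routeing checks."""
    total = sum(SHORTEST[leg] for leg in legs)
    overall = SHORTEST[(legs[0][0], legs[-1][1])]
    return total == overall

print(simplified_router_valid([("A", "C"), ("C", "D")]))  # True  (25 == 25)
print(simplified_router_valid([("A", "B"), ("B", "D")]))  # False (40 != 25)
```

A grouping that passes may be priced locally; one that fails is not necessarily invalid, merely not resolvable by this simple rule.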
In examples of the invention, texture mapping in a GPU may use specialised hardware to index into a two-dimensional table held in memory. It typically offloads all address calculations from the processor and has dedicated datapaths and caches. In examples of the invention the distance between stations may be held in a 2D table, so that the texture mapping can accelerate this step. For example, the aforementioned FADFIS holds the
physical distance between adjacent stations. Using this, it is possible to build a big table of shortest distances between any two stations. When legs are combined, it is then further possible to test the combined distance against the shortest distance from source to destination stations, and if they are the same the journey is valid.
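Building this 'big table' from adjacent-station distances is a standard all-pairs shortest-path computation. A minimal Python sketch follows; the station names and distances are hypothetical (in integer tenths of a unit, loosely echoing FIG. 3), and Floyd-Warshall is one assumed way to construct the table, not necessarily the method used in the patented system:

```python
INF = float("inf")

def all_pairs_shortest(stations, edges):
    """Build the table of shortest distances between any two stations
    from adjacent-station distances, via Floyd-Warshall. O(n^3), but it
    only needs to run when the source data (e.g. FADFIS) is updated."""
    dist = {a: {b: (0 if a == b else INF) for b in stations} for a in stations}
    for (a, b), d in edges.items():
        dist[a][b] = dist[b][a] = min(dist[a][b], d)  # undirected track
    for k in stations:
        for a in stations:
            for b in stations:
                if dist[a][k] + dist[k][b] < dist[a][b]:
                    dist[a][b] = dist[a][k] + dist[k][b]
    return dist

# Hypothetical adjacent-station distances, in tenths of a unit:
edges = {("A", "B"): 14, ("B", "C"): 13, ("C", "D"): 15, ("B", "D"): 31}
T = all_pairs_shortest("ABCD", edges)
print(T["A"]["D"])  # 42: the A-B-C-D path (4.2) beats A-B-D (4.5)
```

With these values the table records 4.2 for A to D, taking connectivity into account rather than the apparently more direct A-B-D path.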
Referring now to FIG. 3, a simplified transportation route diagram 300 is illustrated that shows distance travelled according to a connectivity between stations, rather than as a direct Euclidean distance, in accordance with examples of the invention. In this regard, the cache 246 in FIG. 2 may be configured to store a table that holds, say, a minimum distance from (preferably) any station to (preferably) any other station within a set of stations that the system may support. In this example, it is noteworthy that the distance recorded is not the direct Euclidean distance, but the distance taking into account the connectivity between (preferably all) intermediary stations. Thus, for example, the distance recorded for the path from node 'A' 302 to node 'D' 308 goes through node 'B' 304 and node 'C' 306, with the distances A-B 320, B-C 322 and C-D 326 resulting in a total distance of 4.2. Despite the extra connectivity, this distance is actually less than that of the apparently more direct path from node 'A' 302 to node 'B' 304 to node 'D' 308, which traverses the distances A-B 320 and B-D 324, resulting in a total distance of 4.5.
In some examples, a two dimensional (2D) table may be built and stored in local cache memory of each of parallel processing the core of multi core GPU, in order to record the point-to-point distances, for example because GPU texture mapping hardware takes a two-dimensional address as an index into a map and returns the data present at that address. The natural format of the 2D table defines rows to represent start points and columns to represent end points, or vice versa. Hence, the distance between node ‘A’ 302 and node ‘D’ 308 would lie at a location T[Ai,DJ where T is the table and the subscript / represents the unique integer index assigned to that station. Using the stations described above as an example, the possible route:
distance = T[Ai,BJ + T[Bi,DJ [1 ] would fail because it is not the shortest, which is found by reading T[Ai,DJ. Hence it would require further checking against additional routeing rules, for example in simplified router 248, whereas the journey:
distance = T[Ai, Bi] + T[Bi, Ci] + T[Ci, Di] [2]
would pass, because the distance matches T[Ai, Di].
Inspection of the table access patterns shows that an entry from a unique row is preferably read for each station visited, resulting in, for example, one memory read per station. As described above, an accelerator subsystem with GPU 240 has more compute performance than memory bandwidth. Therefore, the table in FIG. 5 is preferably restructured (in a novel manner, in two dimensions) to match the local cache memory that is present in each of the GPU’s parallel processing cores.
Referring next to FIG. 4, let us consider a further simplified transportation route diagram 400 that shows connectivity between eight possible stations, e.g. railway stations, where identifiers (IDs) are assigned to stations and sorted in an incrementing order based on connectivity, in accordance with some examples of the invention. This example results in stations that are closely connected having similar ID values, for example stations ‘2’, ‘3’, ‘4’ and ‘5’ being consecutive stations. FIG. 4 illustrates an optimisation that exploits the typical connectivity of stations in which each station connects only to its neighbours, to the extent that it is likely that data from a consecutive run will be needed. This simplified example is in contrast to an example whereby, say, station ‘6’ is connected to station ‘0’, but not directly to station ‘5’ or station ‘7’. In this contrasting example, the transportation options would work, but the optimal route determination would require many more table lookups. Furthermore, in some examples, where paths split, the first path taken may be chosen randomly.
The first row 502 of the table 500 in FIG. 5 holds the distance between consecutive stations, the second row 504 holds distances between stations separated by a hop of one other station, the next row 506 holds stations separated by a hop of two stations, and so on. This organisation advantageously groups data in a manner that is efficient for caching, when stations on a route are more likely to be closer together than far apart. Note that the final row 520 represents the distance from a station to itself, hence is always zero and may be omitted. The coordinates used to index the table now become T[Ai, Bi-Ai].
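A minimal sketch of this hop-indexed layout follows; the station count and the placeholder distances are assumptions for illustration only. Row r of the restructured table holds distances between stations whose IDs differ by r + 1, so the frequently used consecutive-station distances share the first row:

```python
import numpy as np

n = 8  # stations with IDs 0..7, sorted by connectivity as in FIG. 4
conventional = np.zeros((n, n))
for a in range(n):
    for b in range(n):
        conventional[a][b] = abs(a - b)   # placeholder distances

# restructured table: row (b - a - 1) holds distances for stations whose
# IDs differ by (b - a); the always-zero diagonal row is omitted
T = np.zeros((n - 1, n))
for a in range(n):
    for b in range(a + 1, n):
        T[b - a - 1][a] = conventional[a][b]

def distance(a, b):
    # hop-indexed lookup, T[A, B - A], for station IDs a < b
    return T[b - a - 1][a]
```

With this layout, a route over consecutive stations touches only the first row, so its lookups fall within the same cached region.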
In some examples, the GPU caching may be two-dimensional; that is, one line of the cache may hold data from a rectangular patch (e.g. patch 514) of the texture map, which may reduce the efficiency of this process. To compensate, in some example embodiments, the table layout may be modified, as shown in FIG. 6.
Thus, and referring to FIG. 6, a simplified modified table 600 of a second example of 2-dimensional GPU caching is illustrated in accordance with examples of the invention. The simplified modified table 600 spreads data that would otherwise be in one row across multiple rows. The example in FIG. 6 assumes that the cache operates in 2x2 patches, such as patch 620 or patch 622 or patch 604. However, it is envisaged that those skilled in the art will recognise that the same technique may be applied to patches of other dimensions, e.g. 3x3 patches.
The coordinates used to access the table in some example embodiments now become:

X = Ai
Y = Bi - Ai
tmp = X.bit[1]
X.bit[1] = Y.bit[1]
Y.bit[1] = tmp
distance = T[X, Y]
Where X.bit[1] represents bit one of the integer value X. The general method for calculating the coordinates for any sized patch is:

L = log2(patch dimension)
X = Ai
Y = Bi - Ai
tmp = X.bit[L..L*2]
X.bit[L..L*2] = Y.bit[L..L*2]
Y.bit[L..L*2] = tmp
Where X.bit[L..L*2] represents a set of bits taken from integer X starting at position L and ending at position L*2. The two-dimensional organisation of this table makes it suitable for the lossless compression available in GPUs that works on 2D patches, and may further improve performance of the accelerator subsystem by a factor of two. GPU hardware typically includes proprietary lossless compression hardware; e.g. Arm refers to their compression hardware as Arm Frame Buffer Compression (AFBC). In some examples of the invention, if data can be organised such that it has some 2D locality (i.e. similarity) then it may be more amenable to compression. When the simplified router 248 has completed its processing it returns the status of the permutation to the host CPU 220 in FIG. 2 via an output 243.
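The bit-field exchange above may be sketched in Python as follows; the function name and the use of integer masks are illustrative assumptions, and for a 2x2 patch (L = 1) the sketch reduces to the single-bit swap of the first listing:

```python
def swizzle(a, b, patch_dim=2):
    # exchange bits [L..2L) of X and Y, where L = log2(patch dimension),
    # so that logically adjacent entries spread across the 2D cache patches
    L = patch_dim.bit_length() - 1
    x, y = a, b - a                      # X = Ai, Y = Bi - Ai
    mask = ((1 << L) - 1) << L           # selects bits L..2L-1
    tmp = x & mask
    x = (x & ~mask) | (y & mask)         # X.bit[L..L*2] = Y.bit[L..L*2]
    y = (y & ~mask) | tmp                # Y.bit[L..L*2] = tmp
    return x, y
```

For example, swizzle(5, 8) exchanges bit 1 of X = 5 and Y = 3, yielding the coordinates (7, 1).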
Referring now to FIG. 7, a simplified example of permutation generation architecture 700, with only three items represented by the letters ‘A’, ‘B’ and ‘C’, for example for use in the accelerator subsystem with GPU 240 of FIG. 2, is illustrated in accordance with examples of the invention. In examples of the invention, the generation of the various route permutations is determined in an efficient manner. Here, for example, all permutations of items in a group of three are evaluated. Although this example shows items in a group of three, those skilled in the art will recognize that the same principles may be applied to larger or smaller groups. This example is illustrated in the form of a tree and branch structure, in which each subsequent level, from level ‘0’ 706 through level ‘1’ 704, level ‘2’ 702 and so on, introduces another item to permute by, adding it to all existing sets and also to the null set (i.e. it is given its own set).
The simplified example of permutation generation architecture 700 illustrates the permutations of said three items. Letters that are adjacent are allocated to the same set, whilst a comma represents the start of a new set; for example (AB,C) 716 describes a permutation with two sets, one holding ‘A’ and ‘B’, the other holding ‘C’. At level ‘0’ 706 there is only item ‘A’ 718 in a single set. At level ‘1’ 704 the item ‘B’ is introduced and added to the set holding ‘A’, giving (AB) 714, for one permutation, and to the null set, giving (A,B), for another. At level ‘2’ 702 the process is repeated with ‘C’ being introduced and added to the set holding ‘AB’, giving (ABC) 712, and to a new set, giving (AB,C) 716.
In the simplified example of permutation generation architecture 700 of FIG. 7, the computation starts at a first point, e.g. point ‘1’ 712, which corresponds to the first permutation (ABC) 712 and which will be deemed to be the lowest level. Here, all letters are present in one set and the bit array 713 is output, wherein all positions corresponding to all letters in row 1 of the bit array 713 are set to ‘1’ and the remaining positions to ‘0’. In this example, each column of the permutation bit array represents an item in the group (A,B,C), where each row corresponds to a set of items. In this example, the first permutation (ABC) is a single set and therefore can be expressed in the first row, whereas (AB,C) has two sets and can be expressed in the first and the second rows. Similarly, the permutation (A,B,C) has three sets, which can be expressed in the first, the second and the third rows. The algorithm then goes 'up' to the higher level, i.e. point ‘2’ 714. In this example, going ‘up’ removes the last element/item in the bit array, in this case ‘C’, by clearing it in the bit array, which results in the new bit array 715.
The next step then goes 'down' to a lower level, i.e. point ‘3’ 716, where it appends ‘C’ to the next set, which is found by searching for the set that has that bit ‘True’ and moving to the next. In this example, because item ‘C’ has already been appended to (AB) at the first permutation (ABC) point ‘1’ 712, there are no sets left to append it to. Therefore, it is added to the null set to form (AB,C) at point ‘3’ 716, with corresponding permutation bit array 717.
From point ‘3’ 716 the algorithm goes back ‘up’ to point ‘2’ and then, because all nodes below ‘2’ are complete, which is indicated by that level's bit (e.g. the last column of row 2 of bit array 717) being the only one ‘True’ in a set, it continues up to point ‘4’ 718. From point ‘4’ 718 it descends again, in this instance to (A,B) on level ‘1’ 704, until all permutations have been generated.
In addition to saving bandwidth and memory footprint, this implementation is advantageously suited to parallel-processing hardware, because the generation of permutations can be distributed by giving each instantiation a different start vector (e.g. the bit array and level value) and a count of the number of steps to take. Advantageously, these are easily pre-calculated for different sized groups. In one example embodiment a multi-core GPU may be employed, wherein said cores function as the parallel-processing hardware, which can be programmed to run multiple permutation generators at the same time by loading each with a different set of bits in each array and a corresponding level value. This is important in a GPU that will run many permutations of the same route at the same time.
The output of the permutations 720 is an array of sets in the form of bit masks, which is convenient for further manipulation and also as a compact key. For example, when evaluating the validity of journeys the bit masks may be used to rapidly identify journeys already taken. In the route price computation system described herein, the elements A, B, C given in the example above correspond to each leg of a journey, and the permutations 720 correspond to possible leg permutations in a given journey with three legs, in this example. This algorithm forms the basis for the code presented in Listing 1.
Hence, it can be seen that the array (such as bit array 810 in FIG. 8) represents both a permutation and, in conjunction with the level value, the state needed to progress to the next permutation. Furthermore, and advantageously, the storage required to generate all permutations of, for example, 16 items is many orders of magnitude less than known techniques (e.g. the Bell number of 16 is 10,480,142,147, or approximately 10G; with each value taking 2 bytes this leads to some 20G bytes of storage). Those skilled in the art will recognize that the efficiency of the method disclosed herein is suitable for embodiment in, but not limited to, electronic hardware state-machines, microcontrollers, digital signal processors, graphics processing units and central processing units, and that it enables a significant improvement in the rate at which permutations can be evaluated, a lowering of cost and a reduction in power consumption.
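The storage comparison above may be checked with a short sketch; the Bell-triangle recurrence used here is standard, and the 16-item figure matches the value quoted in the text:

```python
def bell(n):
    # Bell number via the Bell triangle: each row starts with the last
    # entry of the previous row, and each subsequent entry adds its left
    # neighbour to the entry diagonally above it
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[-1]

print(bell(16))       # 10480142147 set partitions, roughly 10G
print(16 * 16 // 8)   # bit-array state for 16 items: 32 bytes
```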
Efficient generation of permutations
Referring now to FIG. 8, a simplified example of one approach to generating all permutations of a given number of items, for use in the route price computation system of FIG. 2, is illustrated in accordance with examples of the invention. In this example, the permutation generation is performed by the GPU located within the accelerator subsystem 240 in FIG. 2 within permutation enumeration circuit 242, which enumerates (preferably) all combinations of journey legs, where each leg of a journey has a start station and an end station.
In this example, the permutation generation is performed using a group of n items, which may be represented for example as a two-dimensional bit array 810 of n × n boolean flags, taking n² bits, e.g. 16 items requires 256 bits or 32 bytes. In this example, each column of the array represents an item in the group, each row represents a set of items, and the bit array 810 as a whole embodies a permutation. A row decoder 820 decodes items in rows of the bit array 810 and a column decoder 830 decodes items in columns of the bit array 810.
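As an illustrative sketch (using NumPy purely for convenience), the permutation (AB,C) of FIG. 7 maps onto such an array as follows, with columns 0..2 standing for ‘A’, ‘B’ and ‘C’:

```python
import numpy as np

n = 3
perm = np.zeros((n, n), dtype=bool)   # n x n boolean flags, n*n bits of state
perm[0][0] = perm[0][1] = True        # first row/set holds 'A' and 'B'
perm[1][2] = True                     # second row/set holds 'C'

# each row also serves as a compact bit-mask key for its set
masks = [sum(1 << j for j in range(n) if row[j]) for row in perm]
print(masks)   # [3, 4, 0] -> sets {A,B}, {C} and an unused row
```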
In some examples, the bit array 810 in FIG. 8 is preferably kept in the registers of the GPU. In some examples, the bit array 810 in FIG. 8 may be kept in a local cache memory of each processing core. In some examples, the Table in FIG. 5 and the Table in FIG. 6 may be kept in local cache memory of each processing core.
In some examples of the invention, a state machine 840 is connected to the bit array 810 and configured to generate all permutations of a given number of items. In this example, the state machine 840 is coupled to a level register 850. In some examples of the invention, the novelty of the permutation generator lies in the observation that the tree data structure that is normally used can advantageously be compressed from many megabytes to a few bits, by using the 'level' value to track where in the tree the algorithm is operating. Additionally, an array of bits is used to track the state of the algorithm at that level (i.e. determining what it should do next) whilst also representing the permutation itself. With this approach the tree does not have to be constructed and then traversed, but can be constructed and traversed at the same time. Advantageously, in accordance with examples of the invention, it is this dynamic, in-place generation of permutations that makes it suitable for a graphics processing unit, such as the GPU within the accelerator subsystem with GPU 240 in FIG. 2.
In one example, the behaviour of the state machine 840 may generate all permutations of a given number of items according to the code in Listing 1, which is written in the Python™ programming language for clarity. However, a skilled artisan will appreciate that this code is one of many examples of code type and format that can be written to generate all permutations of a given number of items in accordance with the example embodiments herein described, including, but not limited to, the hardware description language Verilog™.
Listing 1

import numpy as np

def permute(size):
    # size = number of items to permute
    # allocate 2D bit array to hold all the sets in a permutation
    # initialise one row to all True, the rest to False
    sets = np.zeros((size, size), dtype=bool)
    for i in range(size):
        sets[0][i] = True
    # level is the position in the tree of permutations encoded in the table
    # it also represents the column of the array being manipulated
    # start with the last bit
    level = size - 1
    output_permutation(sets)
    while True:
        while True:
            # move up the tree to a node that has unvisited nodes below it
            complete, level = up(sets, level)
            if not complete:
                break
            if complete and level == 0:
                return  # complete
        while True:
            # move down the tree to find a leaf
            complete, level = down(sets, level)
            if complete:
                break
        output_permutation(sets)

def up(sets, level):
    # move up a level in the tree
    complete = False
    index, valid = find_set(sets, level)
    if valid and single(sets, index, level):
        sets[index][level] = False
        complete = True
    return complete, level - 1

def down(sets, level):
    # move down one level of the tree
    # work through nodes below this level
    # use 'sets' array to track which have been visited
    level += 1
    complete = False
    index, valid = find_set(sets, level)
    if valid:
        # move to the next set by clearing this set's bit and setting the next
        sets[index][level] = False
        sets[index + 1][level] = True
    else:
        # first visit, start at zero
        sets[0][level] = True
    if level == len(sets[0]) - 1:
        complete = True  # reached leaf
    return complete, level

def find_set(sets, level):
    # find the set with the level bit set
    for i in range(len(sets[0])):
        if sets[i][level]:
            return i, True
    return 0, False

def single(sets, index, level):
    # test for a single bit in a set, where the bit is the level
    for i in range(len(sets[index])):
        if (i != level) and sets[index][i]:
            return False
    return True

def clear(sets, index):
    # clear all bits in a set
    for i in range(len(sets[index])):
        sets[index][i] = False

def output_permutation(sets):
    # pass the permutation to the next stage of processing
    # in this example, just print the sets
    s = ''
    for i in range(len(sets[0])):
        t = ''
        for j in range(len(sets[0])):
            if sets[i][j]:
                t += str(j) + ' '
        if t != '':
            if s != '':
                s += ', '
            s += t
    print(s)
Referring now to FIG. 9, a simplified flowchart 900 illustrates a first example of a permutation computation method, in accordance with examples of the invention, which executes the behaviour shown in FIG. 7 and Listing 1. In some examples, the simplified flowchart 900 may be implemented within the state machine 840 of FIG. 8.
The process in the simplified flowchart 900 generates each possible permutation of the route, where a permutation is represented by the contents of, say, bit array 810 of FIG. 8: each row of bit array 810 represents a set, and each column represents the legs of the route included within that set. The simplified flowchart 900 starts at 902. At 904, all bits of row ‘0’ are set to be true (e.g. a “1”) and all other rows are set to be false (“0”), thereby initialising the bit array (e.g. bit array 810 of FIG. 8) of the permutation calculation by representing a single set that contains all route legs, say corresponding to position 712 in FIG. 7 and representing the first permutation.
In this example, the simplified flowchart 900 executes a series of “up” and “down” movements that correspond to changes shown in FIG. 7; for example the transition 712 to 714 is an “up” movement and the transition 714 to 716 is a “down” movement, noting that “up” and “down” refer to movement on FIG. 7 and not to changes in the level value, which moves in the opposite direction, i.e. an “up” movement causes the level value to decrease and a “down” movement causes the level value to increase. Through these steps, all permutations are generated in sequence. The general principle in the simplified flowchart 900 is to always execute a “down” movement if possible, and to execute an “up” movement if it is not possible. In FIG. 9 the “up” routine is called at 906 and exited at 929, and the “down” routine is called at 940 and exited at 959.
After initialising the bit array at 904, e.g. bit array 810 of FIG. 8, which is equivalent to the state 712 in FIG. 7 as stated above, it is not possible to execute a “down” process at 940, as the system is initialised at the highest level. Therefore, at step 906 “up” is performed by calling the “up” routine (unconditionally, corresponding to a move to 714 in FIG. 7), followed by a “down” routine unless that is not possible. In some examples, the option to perform a “down” routine is tested at 914 during the “up” routine, which looks for a bit that corresponds to the level being the only entry ‘true’ in a row. Until this condition is met the flowchart 900 continues to 930 and thereafter executes the “down” routine at 940, which corresponds to a move to 716 in FIG. 7.
At 930, after exiting the “up” routine, a determination is made as to whether the ‘up’ process has completed. If it has not, in 930, a determination is made as to whether the level value is equal to ‘0’ in 932. If, in 932, a determination is made that the level is equal to ‘0’, the flowchart exits at 934. However, if in 932 a determination is made that the level is not equal to ‘0’, the flowchart loops back to 930. If, in 930, a determination is made that the ‘up’ process has completed, the flowchart transitions to a ‘down’ process at 940.
The “up” routine 906 starts by setting the logical parameter “complete” to “false” at 908. At 910 the index of the member of the group that will be manipulated in this iteration is found by searching for the first row of the array that has the column corresponding to the level that is set ‘true’. If such an entry is not found, this iteration is identified as not being valid. A further test at 912 determines if the bit found in step 910 is the only bit that is set in that row, and is therefore identified as a single. If the current entry is both valid and single at 914, then processing of this level value is complete. In either case the level value is decremented.
To perform a “down” routine at 940, the level value is used to search the bit array and find the row that has this bit set, whereupon it is cleared and the same bit set in the next row. Having completed this operation in step 952, the process is repeated until the level is equal to the maximum level and it is no longer possible to perform a “down” routine. Hence, an “up” routine is performed at 906. This alternation of the “down” routine at 940 and the “up” routine at 906 continues until the level value reaches zero, at which point the process completes at 934.
At 940, a ‘down’ routine is commenced. At 942, the level value is incremented (i.e. level := level + 1). Thereafter, at 944, a bit index and a validity of a selected route are located. At 946, a test is made as to whether the selected route is a valid route. If the selected route is an invalid route at 946, the bit at index 0 of the current permutation level is set ‘true’, as illustrated at 948. The ‘down’ routine of the example flowchart then loops to 954.
However, if the selected route is a valid route at 946, the bit index at the current permutation level is cleared to ‘false’ 950. At 952, the bit index is incremented (i.e. bit index := bit index + 1) and the new index value at the current level in the permutation is set to a ‘true’ state. For clarification, the level is not set true or false (it is a numerical value); it is the bit at index+1 of the set of bits referenced by the level number that is set true. At 954, a determination is made as to whether the level is at a maximum. If the level is at a maximum at 954, the complete flag is then set to a ‘true’ state at 956, and if the level is not at a maximum at 954, the complete flag is then set to a ‘false’ state at 958. The flowchart then loops to 959.
In a similar manner, a determination is then made at 960, after returning from the ‘down’ routine of 940, as to whether the ‘down’ process has completed. If the ‘down’ process has not completed at 960, the flowchart loops back to 940 and the ‘down’ process is re-run. If the ‘down’ process has completed in 960, the flowchart moves to 962 where an output of the permutation array is performed. In some examples, the output of the permutation may be a constructed database (for example as stored in the cache 246 of FIG. 2), which is much smaller than databases output in known systems. Here, the output constructed database includes data stored on a per journey basis, indicating which legs do not require a pricing enquiry, e.g. from a cloud-based pricing system. Thereafter, the flowchart loops to 906 and an ‘up’ process is re-commenced.
Variable line size cache
In some examples of the invention, the cache 246 in FIG. 2 has been adapted to provide an improved performance of the accelerator subsystem with GPU 240. In this example, each leg of a journey may be identified by a unique code and stored in the cache 246 such that the unique codes may be concatenated in order to produce a ‘key’ with a variable line size that represents the current route of multiple journey legs being analysed. The key is hashed to produce an address in the cache where the route may be stored. A consequence of hashing is that more than one route may map to a given address. In this manner, the addressed location may hold a copy of the route key, as well as any related data, such as validity or cost. In some examples, the route key may be used to check that data for the route being analysed (and no other route) is accessed.
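A programmable sketch of this scheme follows. The 3-byte leg code, the hash function and the set count are all assumptions for illustration (hashlib's BLAKE2 stands in for whatever hash a hardware implementation would use); the essential points are the variable-length concatenated key, the hash to an address, and the key comparison that guards against collisions:

```python
import hashlib

CACHE_SETS = 1024
cache = {}   # address -> list of (key, data); colliding routes can coexist

def route_key(legs):
    # legs: list of (start, common, end) station-ID byte triples
    return b''.join(bytes(leg) for leg in legs)

def cache_address(key):
    digest = hashlib.blake2b(key, digest_size=4).digest()
    return int.from_bytes(digest, 'little') % CACHE_SETS

def store(legs, data):
    key = route_key(legs)
    cache.setdefault(cache_address(key), []).append((key, data))

def lookup(legs):
    key = route_key(legs)
    for k, data in cache.get(cache_address(key), []):
        if k == key:      # route key check: hit only for this exact route
            return data
    return None           # miss

store([(1, 2, 3), (3, 4, 5)], 'valid')
print(lookup([(1, 2, 3), (3, 4, 5)]))   # prints valid
```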
In some examples, the size of the route key depends on the number of legs in the route. However, longer routes are less likely than short ones. Hence, if the cache line is able to accommodate the key of the longest route, a significant amount of cache is wasted for short routes. Thus, some examples of the invention propose an improved cache that aims to further avoid or mitigate this potential waste.
In some examples of the invention, the cache 246 is divided into a number of regions, with a proportion of the cache 246 allocated to each key size being proportional to the probability of routes of that complexity, i.e. dependent upon the number of legs.
In some example embodiments, each leg of a route may be represented by, say, three bytes, which in example embodiments is a sufficient size to hold a key that represents the two legs of the journey being joined. In this example case, three bytes of the code may include, for example, each of: a start station, a common station, an end station. However, it is envisaged that in other embodiments the number of bytes or the purpose of the data may differ.
In some examples, if a second leg is concatenated with the first leg, another three bytes may be added to the key, and so on, for each leg added. Table 3 shows the size of the route key needed for up to (in this illustration) fourteen legs:
Table 3:
Legs    2   3   4   5   6   7   8   9   10  11  12  13  14
Bytes   6   9   12  15  18  21  24  27  30  33  36  39  42
In some examples, the cache 246 holds only the route keys, i.e. for the concatenation of routes. Hence, in some examples, the cache 246 does not hold any single-leg routes. The proportion of possible routes for each number of legs may be calculated to be:
Table 4:
Legs    Proportion of total    Size of key    Weighted proportion
2       0.4763                 6              0.3355
3       0.3068                 9              0.3242
4       0.1442                 12             0.2032
5       0.0526                 15             0.0926
6       0.0154                 18             0.0326
7       0.0037                 21             0.0092
8       0.0008                 24             0.0021
9       0.0001                 27             0.0004
10      1.87E-005              30             6.57E-005
11      2.26E-006              33             8.76E-006
12      2.26E-007              36             9.56E-007
13      1.74E-008              39             7.96E-008
14      1.24E-009              42             6.13E-009
Over 47% of possible routes have only two legs, but the allocation of space in the cache should take account of the size each entry requires. Hence, in accordance with examples of the invention, the weighted proportion column takes into account the key size. Those skilled in the art will recognise that different tables may be produced according to the maximum number of legs that may be encountered, and that minor deviations of the implemented size from the precisely calculated size does not materially affect the benefits of the concepts described herein.
Referring now to FIG. 10, a simplified example of a cache 1000 that is divided by powers of two, for use in the route price computation system of FIG. 2, is illustrated in accordance with examples of the invention. The inventor of the present invention recognised and appreciated that dividing the cache 1000 into many small sections may add complexity with little benefit. Therefore, in some examples, a simpler approach is to divide the cache 1000 according to ‘powers of 2’ 1060, as depicted in FIG. 10. Here, a first percentage 1010 of the cache 1000, say 25% in this example, is allocated to a 6-byte identifier (ID) 1012. A second percentage 1020 of the cache 1000, say 25% in this example, is allocated to a 9-byte ID 1022. A third percentage 1030 of the cache 1000, say 25% in this example, is allocated to a 12-byte ID 1032. A fourth percentage 1040 of the cache 1000, say 25% in this example, is allocated to a 42-byte ID 1042. In this manner, a better usage of the cache may be achieved, as portions of the cache are allocated according to the likely possibility of the route combinations being valid.
Table 5:
Legs    Proportion of cache
2       0.25
3       0.25
4       0.25
>4      0.25
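A region-selection rule matching Table 5 can be sketched as below; the maximum of fourteen legs follows Table 3, and the function name is an illustrative assumption:

```python
def cache_region(legs):
    # returns (region index, key bytes reserved per entry) for a route,
    # assuming 3 bytes per leg and the four equal regions of Table 5
    if legs <= 4:
        return legs - 2, legs * 3   # the 6-, 9- and 12-byte regions
    return 3, 14 * 3                # longer routes share the 42-byte region
```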
In keeping with known cache designs, multiple entries may be held at the same address, such that more than one route with the same key may be resident in the cache 1000 at the same time. When the route is hashed, the resulting address points to a set of lines that are each tested until a matching key is found, or the route is determined to have missed the cache 1000. Unlike the known approach, the concept herein described with respect to FIG. 10 allows a variable number of lines per set, such that the total size of the set remains constant. In this manner, keeping the set size constant for the whole cache simplifies the design and allows systems with different maximum route lengths to use the same RAM. Hence, in a further example embodiment the set size of the route cache is held constant, and the number of entries in each set varies according to their size. An additional benefit of keeping the line size constant is that, in an alternative example embodiment, a processor such as a GPU may implement the cache in a programmable manner and exploit its existing caches. Table 6 shows the number of cache lines in each region of the cache, in one example, assuming a 128-byte set and 3 bytes per leg:
Table 6:
Legs    Cache lines    Bytes used
2       21             126
3       14             126
4       10             120
14      3              126
In some examples, if additional data is stored alongside the key, it may reduce the number of lines, for example the effect of adding one bit to record if the entry is a ‘valid’ or ‘invalid’ journey is shown in Table 7 and in FIG. 10.
Table 7:
Legs    Cache lines    Bytes used
2       20             120
3       14             126
4       10             120
14      3              126
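Tables 6 and 7 can be reproduced with a short calculation, assuming the 128-byte set and 3-byte leg codes stated above; the one extra bit per entry of Table 7 is what shrinks the 2-leg region from 21 to 20 lines:

```python
SET_BITS = 128 * 8   # constant 128-byte set

def lines_per_set(legs, extra_bits=0):
    # number of whole entries (key plus any flag bits) that fit in one set
    entry_bits = legs * 3 * 8 + extra_bits
    return SET_BITS // entry_bits

print([lines_per_set(n) for n in (2, 3, 4, 14)])      # [21, 14, 10, 3] (Table 6)
print([lines_per_set(n, 1) for n in (2, 3, 4, 14)])   # [20, 14, 10, 3] (Table 7)
```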
In some examples, the cache may hold additional data beyond ‘valid’ and ‘invalid’, such as route prices relating to the route accounting for factors, such as the age of the traveller, the time of travel or the use of a discount card; such pricing information will typically also be associated with an expiry date after which the price data should not be used.
In some examples, the hash function for each region may be different, for example because the input key has a different width. Here, the output of the hash is the same for all regions, i.e., a bit-index to a 128 byte aligned address. The indicated address may be loaded and each entry tested in turn for the target key. If the target key is found, the cache has a ‘hit’, and if it is not found it has a ‘miss’. Those skilled in the art will recognise that alternative embodiments may use different set sizes.
Hence, the disclosed cache 1000 improves efficiency in systems that evaluate routes and is suitable for implementation in, but not limited to, any of: electronic hardware state-machines, microcontrollers, digital signal processors, graphics processing units and central processing units.
In the foregoing specification, the invention has been described with reference to specific examples of embodiments of the invention. It will, however, be evident that various modifications and changes may be made therein without departing from the scope of the invention as set forth in the appended claims, and that the claims are not limited to the specific examples described above. For example, it is envisaged that the route price computation system 200 configured to analyse transportation permutations, the cache management system and the methods therefor, which may factor ticket pricing into the route planning, may be employed over a tremendous range of ticketing applications.
In other example embodiments, the route price computation system may be designed to serve a multitude of clients with a multitude of enquiries, which may include other tasks in addition to a pricing enquiry. Those skilled in the art will recognize that the functionality specified in the CPU 220 portion of the route price computation system 200 can be implemented in different ways, so long as the interface to the permutation analysis accelerator system is maintained as specified herein. Furthermore, it is envisaged that the hardware acceleration employed by the accelerator subsystem may be performed in firmware, such as using field programmable logic/gate arrays (FPGAs) instead of a GPU, which may be programmed to provide the hardware acceleration in a similar manner to the hardware functionality described for the GPU herein. In such events, the GPU driver 224 will be replaced by the special driver for said firmware. In other examples, it is envisaged that the hardware acceleration employed by the accelerator subsystem may be performed in software, using one or more processors coupled to the CPU. A skilled artisan will appreciate that the level of integration of circuits or components may be, in some instances, implementation-dependent.
Furthermore, because the illustrated embodiments of the present invention may for the most part, be implemented using electronic components and circuits known to those skilled in the art, details will not be explained in any greater extent than that considered necessary as illustrated above, for the understanding and appreciation of the underlying concepts of the present invention and in order not to obfuscate or distract from the teachings of the present invention.
The connections as discussed herein may be any type of connection suitable to transfer signals from or to the respective nodes, units or devices, for example via intermediate devices. Accordingly, unless implied or stated otherwise, the connections may for example be direct connections or indirect connections. The connections may be illustrated or described in reference to being a single connection, a plurality of connections, unidirectional connections, or bidirectional connections. However, different embodiments
may vary the implementation of the connections. For example, separate unidirectional connections may be used rather than bidirectional connections and vice versa. Also, a plurality of connections may be replaced with a single connection that transfers multiple signals serially or in a time multiplexed manner. Likewise, single connections carrying multiple signals may be separated out into various different connections carrying subsets of these signals. Therefore, many options exist for transferring signals.
Those skilled in the art will recognize that the boundaries between logic blocks are merely illustrative and that alternative embodiments may merge logic blocks or circuit elements or impose an alternate decomposition of functionality upon various logic blocks or circuit elements. Thus, it is to be understood that the architectures depicted herein are merely exemplary, and that in fact many other architectures can be implemented that achieve the same functionality.
Any arrangement of components to achieve the same functionality is effectively ‘associated’, such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as being ‘associated with’ each other, such that the desired functionality is achieved, irrespective of architectures or intermediary components. Likewise, any two components so associated can also be viewed as being ‘operably connected,’ or ‘operably coupled,’ to each other to achieve the desired functionality. Furthermore, those skilled in the art will recognize that boundaries between the above described operations are merely illustrative. The multiple operations may be executed at least partially overlapping in time. Moreover, alternative example embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.
Also for example, in one embodiment, the illustrated examples may be implemented as circuitry located on a single integrated circuit or within a same device. Alternatively, the examples may be implemented as any number of separate integrated circuits or separate devices interconnected with each other in a suitable manner.
Also for example, the examples, or portions thereof, may be implemented as software or code representations of physical circuitry, or of logical representations convertible into physical circuitry, such as in a hardware description language of any appropriate type. Also, examples of the invention are not limited to physical devices or units implemented in non-programmable hardware but can also be applied in wireless programmable devices or
units able to perform the desired device functions by operating in accordance with suitable program code. However, other modifications, variations and alternatives are also possible. The specifications and drawings are, accordingly, to be regarded in an illustrative rather than in a restrictive sense.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word ‘comprising’ does not exclude the presence of other elements or steps than those listed in a claim. Furthermore, the terms ‘a’ or ‘an,’ as used herein, are defined as one, or more than one. Also, the use of introductory phrases such as ‘at least one’ and ‘one or more’ in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles ‘a’ or ‘an’ limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases ‘one or more’ or ‘at least one’ and indefinite articles such as ‘a’ or ‘an.’ The same holds true for the use of definite articles. Unless stated otherwise, terms such as ‘first’ and ‘second’ are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements. The mere fact that certain measures are recited in mutually different claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (19)

1. An accelerator subsystem with graphics processing unit, GPU, (240) for transportation routes, the accelerator subsystem with GPU (240) comprising:
an input (241) arranged to receive input data from a host processor (220) wherein the input data relates to a journey that includes at least a start point and an end point;
a permutation enumeration circuit (242) coupled to the input (241) and comprising or coupled to a bit array (810), wherein the permutation enumeration circuit (242) is configured to perform a permutation analysis on a plurality of transportation routes;
a cache (246) configured to store a route validity indicator for each of the plurality of transportation route permutations;
a simplified router circuit (248) coupled to the cache (246) and configured to determine whether each of the plurality of transportation route permutations is a valid route and in response thereto update said route validity indicator of said transportation route permutation in the cache (246); and an output (243) of the accelerator subsystem (240) configured to output at least valid journey routes as identified by a respective route validity indicator to the host processor (220).
2. The accelerator subsystem with GPU (240) of Claim 1, wherein the simplified router circuit (248) is further configured to employ texture mapping to determine a distance of a plurality of concatenated legs of a journey.
3. The accelerator subsystem with GPU (240) of any preceding Claim, wherein the bit array (810) is an n x n two-dimensional bit array (810) of columns and rows and the permutation enumeration circuit (242) is configured to evaluate all permutations of a group of n items where each row of said bit array (810) represents one set of items to be treated as a single item.
4. The accelerator subsystem with GPU (240) of Claim 3, wherein each column of said array (810) represents a presence or an absence of an item within said set of items.
5. The accelerator subsystem with GPU (240) of any preceding Claim, wherein each leg of a journey is identified by a unique code and stored in the cache (246) such that each unique code is concatenatable to produce a key with a variable line size that represents a current route of multiple journey legs being analysed.
6. The accelerator subsystem with GPU (240) of Claim 5, wherein the key is hashed to produce an address in the cache (246) where the transportation route is stored.
7. The accelerator subsystem with GPU (240) of any preceding Claim, wherein the permutation enumeration circuit (242) is configured to generate a plurality of permutations simultaneously by provision of a plurality of said bit arrays (810) and a plurality of level values, each of said bit arrays (810) and said level values (850) being initialized to unique values.
8. The accelerator subsystem with GPU (240) of any preceding Claim, wherein the simplified router circuit (248) comprises a table (500, 600) arranged to hold a distance from one station to another station according to a plurality of transportation routes and the simplified router circuit (248) is configured to determine a distance of a plurality of concatenated legs of a journey based on records of different journey hop distances in different array rows of the table.
9. The accelerator subsystem with GPU (240) of any preceding Claim, wherein the simplified router comprises a distance cache (500, 600) organised in rectangular patches that is configured to match an organisation of local cache memory of a parallel processing multi core of a GPU within the accelerator subsystem with GPU (240) for efficient processing.
10. The accelerator subsystem with GPU (240) of any preceding Claim wherein the simplified router circuit (248) is implemented within one from a group of: hardware state machine, microcontroller.
11. The accelerator subsystem with GPU (240) of any preceding Claim, wherein the cache (246) is divided into a number of regions, with a size of respective regions being set according to the data that is stored therein and a probability associated with a complexity of the transportation routes represented by the data in each region.
12. The accelerator subsystem with GPU (240) of Claim 11, wherein the size of
respective regions of the cache (246) comprises a power of two division of a total cache size and each of said regions additionally holds a plurality of transportation route complexities.
13. The accelerator subsystem with GPU (240) of Claim 11 or Claim 12, wherein each of said regions uses a set size with a variable number of lines per set, the number of said lines being dependent on the complexity of the transportation route represented by the data in a respective region.
14. The accelerator subsystem with GPU (240) of any preceding Claim, wherein the output of the accelerator subsystem is configured to output both valid and invalid routes to the host processor (220).
15. The accelerator subsystem with GPU (240) of any preceding Claim wherein the simplified router circuit (248) is configured to update the cache (246) and mark a record as valid if the route is a shortest distance from the start point to the end point of a journey.
16. A transportation route price computation system comprising a host processor (220) and an accelerator subsystem with GPU (240) according to any of the preceding Claims.
17. A method (700) for a transportation routeing system using a graphics processing unit, GPU, (240), the method comprising at an accelerator subsystem:
receiving input data from a host processor (220) wherein the input data relates to a journey that includes at least a start point and an end point;
performing a permutation analysis on a plurality of transportation routes;
storing in a cache (246) a route validity indicator for each of the plurality of transportation route permutations;
determining whether each of the plurality of transportation route permutations is a valid route and in response thereto updating said route validity indicator of said transportation route permutation in the cache (246); and
outputting at least valid journey routes as identified by a respective route validity indicator to a host processor.
18. The method (700) for a transportation routeing system according to Claim 17, the method further comprising:
dividing a plurality of tasks associated with transportation route permutation analysis
by a permutation enumeration circuit (242) of the GPU;
employing texture mapping hardware for the divided tasks; and determining a length of a plurality of concatenated legs of a journey.
19. A cache (246) for use in a transportation routeing system that uses a graphics processing unit, GPU, wherein a storage of the cache (246) is divided into a number of regions, with a size of respective regions being set according to data that is stored and a probability associated with a complexity of transportation routes represented by the data in each region.
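For illustration only, and not as part of the claimed subject matter, the route key and cache mechanism of Claims 5, 6 and 15 may be sketched as follows: each journey leg is identified by a unique code, the codes are concatenated into a variable-length key representing the multi-leg route under analysis, and the key is hashed to produce a cache address at which the route validity indicator is stored. This is a minimal software sketch under assumed details (the `RouteCache` class name, the `"|"` separator, the SHA-1 hash and the direct-mapped organisation are all hypothetical choices, not taken from the specification):

```python
import hashlib

class RouteCache:
    """Illustrative sketch: a direct-mapped cache addressed by a hash of
    concatenated per-leg codes, holding a route validity indicator."""

    def __init__(self, size=1024):
        self.size = size
        self.slots = [None] * size  # each slot holds (key, valid_flag) or None

    @staticmethod
    def make_key(leg_codes):
        # Claim 5: unique per-leg codes are concatenated into a key of
        # variable size representing the current multi-leg route.
        return "|".join(leg_codes)

    def address(self, key):
        # Claim 6: the key is hashed to produce an address in the cache.
        digest = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.size

    def set_validity(self, leg_codes, is_valid):
        # Claim 15 (simplified): the router updates the cache, marking the
        # record for this permutation as valid or invalid.
        key = self.make_key(leg_codes)
        self.slots[self.address(key)] = (key, is_valid)

    def get_validity(self, leg_codes):
        key = self.make_key(leg_codes)
        entry = self.slots[self.address(key)]
        if entry is not None and entry[0] == key:
            return entry[1]
        return None  # cache miss

cache = RouteCache()
cache.set_validity(["KGX-YRK", "YRK-EDB"], True)
print(cache.get_validity(["KGX-YRK", "YRK-EDB"]))  # True
```

A hardware realisation would differ, for example by using fixed-width keys, set-associative regions with variable lines per set (Claims 11 to 13), and a hash function suited to the GPU memory layout; the sketch shows only the key-concatenation and hashed-lookup idea.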
GB1900299.7A 2018-07-18 2019-01-09 Accelerator subsystem with GPU, transportation route price system, cache and method of acceleration of a permutation analysis therefor Withdrawn GB2575891A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US201862700085P 2018-07-18 2018-07-18

Publications (2)

Publication Number Publication Date
GB201900299D0 GB201900299D0 (en) 2019-02-27
GB2575891A true GB2575891A (en) 2020-01-29

Family

ID=65528112


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120179674A1 (en) * 2011-01-10 2012-07-12 Microsoft Corporation Hardware accelerated shortest path computation
US20140257697A1 (en) * 2013-03-08 2014-09-11 Zzzoom, LLC Generating transport routes using public and private modes
WO2015187763A1 (en) * 2014-06-05 2015-12-10 Microsoft Technology Licensing, Llc Customizable route planning using graphics processing unit
US20160253773A1 (en) * 2013-10-30 2016-09-01 Nec Corporation Path calculation device, path calculation method and program
WO2018035508A1 (en) * 2016-08-19 2018-02-22 Linear Algebra Technologies Limited Path planning using sparse volumetric data
US20180245934A1 (en) * 2016-05-25 2018-08-30 Uber Technologies, Inc. Identifying a map matched trip from received geographic position information




Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)