WO2021191168A1

WO2021191168A1 - System and method for predicting road crash risk and severity using machine learning trained on augmented datasets

Info

Publication number: WO2021191168A1
Application number: PCT/EP2021/057324
Authority: WO
Inventors: Abimbola ADEGBULU
Original assignee: Extracover Holdings Limited
Priority date: 2020-03-21
Filing date: 2021-03-22
Publication date: 2021-09-30
Also published as: GB202004141D0

Abstract

The present invention relates to systems and methods for synthesising data based on an input dataset in order to generate an augmented dataset for road risk modelling and prediction using machine learning. Particularly, the invention relates to methods of accident data pre-processing and road accident risk modelling to predict road accident frequency and severity using contextual, environmental and road information (and machine learning and deep learning techniques) using synthesised data to enhance an existing dataset in order to create training data for the modelling. Aspects and/or embodiments seek to provide a method for enriching datasets for use as training data, for example to use in predicting road crash risk and severity using historical positive data for crashes and enriched contextual data for roads. Further aspects and/or embodiments seek to provide methods for training models using this training data and using these trained models for inference of classifications of road crash risk and severity.

Description

SYSTEM AND METHOD FOR PREDICTING ROAD CRASH RISK AND SEVERITY USING MACHINE LEARNING TRAINED ON AUGMENTED DATASETS

Field

The present invention relates to systems and methods for synthesising data based on an input dataset in order to generate an augmented dataset for road risk modelling and prediction using machine learning. Particularly, the invention relates to methods of accident data pre-processing and road accident risk modelling to predict road accident frequency and severity using contextual, environmental and road information (and machine learning and deep learning techniques) using synthesised data to enhance an existing dataset in order to create training data for the modelling.

Background

Road crashes (or road accidents) are associated with 1.2 million deaths worldwide each year and cause around 50 million injuries or disabilities each year worldwide.

Many governments and non-profit agencies collect data, such as driving or riding data, to produce a range of statistics and visual graphs to help understand road crash causes and to help to propose solutions for different roads and driving zones to improve road safety and reduce the risk and severity of crashes.

Improved methods and systems for assessing roads and road safety are generally desired as these may help mobility service providers, local governments, fleet operators and transportation agencies as well as drivers and riders of vehicles to understand the range and severity of risks associated with roads.

Summary of Invention

Aspects and/or embodiments seek to provide a method for enriching datasets for use as training data, for example to use in predicting road crash risk and severity using historical positive data for crashes and contextual data. Further aspects and/or embodiments seek to provide methods for training models using this training data and using these trained models for inference of classifications of road crash risk and severity.

According to an aspect there is provided a computer-implemented method, comprising the steps of: receiving map data, wherein the map data comprises a plurality of map segments; receiving a plurality of positive data; determining a correlation between each of the plurality of positive data and at least one of the plurality of map segments; generating negative data for each of the plurality of map segments by sampling the positive data; combining the positive and generated negative data; outputting the combined data.

By generating negative data from the positive data, a substantially more complete dataset can be generated that substantially more accurately represents characteristics of a map and/or the real-world properties/data pertaining to a map. By synthesising the negative data from the positive data, the properties of the synthetic (negative) data can substantially more accurately reflect those of the real (positive) data and can result in a training dataset that can more accurately train a model to predict both positive and negative classifications. The term map portions can be exchanged for similar terminology by the skilled reader, for example with the term map nodes or geographical regions. The term positive data, or portions of that term, can be exchanged by the skilled reader with, for example, historical data and/or crash data and/or positive label data and/or metadata.

Optionally, the method further comprises a step of density smoothing, wherein the step of density smoothing comprises averaging the positive data and/or negative data across one or more of the plurality of map segments. Optionally, the method further comprises a step of spatially smoothing data over at least one or more neighbouring map segments, wherein positive data correlated with a map segment is correlated across a plurality of neighbouring map segments. Optionally, the positive data comprises time data and a plurality of positive label data, and wherein the time data comprises a plurality of time values, further wherein each of the plurality of positive label data is correlated with a time value, and further comprising a step of temporally smoothing data wherein the step of temporally smoothing data comprises re-correlating each of the plurality of positive label data with a plurality of time values.

By performing smoothing on the positive and/or negative data, concentrations of respective positive and/or negative data/data labels can be distributed more smoothly, which can provide a substantially more realistic dataset for training and/or a substantially more realistic resulting trained model.

Optionally, the method further comprises performing pre-processing of any or any combination of: the positive data; and the map data.

By pre-processing the positive data and/or map data, the data can be cleaned and/or formatted in a substantially accurate or predetermined way and can therefore be more accurately and/or predictably used as training data.

Optionally, the method further comprises: receiving a plurality of context data, wherein the context data comprises one or more contextual features; correlating each of the one or more contextual features with each of the map segments and/or positive data; and wherein the step of generating negative data for each of the plurality of map segments by sampling the positive data comprises generating negative data for each of the plurality of map segments by sampling the positive data and context data. Optionally, the one or more contextual features comprises any or any combination of one or more static features or dynamic features. Optionally, one or more static features comprise any or any combination of geohash, historical crash counts, curvature of the road segment, orientation of the road segment, speed limit, number of nearby shops and amenities, proximity to nearby junctions and proximity to road elements. Optionally, one or more dynamic features comprise any or any combination of time of day, day of week, month, bank holidays, solar azimuth, solar altitude, temperature, precipitation, wind speed, wind direction and visibility.

Context data, or contextual features, such as any or any combination of data such as the type of road, the position of the sun in the sky, the road topography, the proximity of buildings, the type of buildings nearby, historical data for population density, and many other contextual data relating to the latitude and longitude of interest at the date and time of interest can be used to enhance, augment or enrich the positive and/or negative data.

Optionally, the positive data comprises any or any combination of: location; time; severity; date; data or metadata relating to a vehicle collision; accident data.

By providing positive data, or positive label data, such as any or any combination of: location; time; severity; date; data or metadata relating to a vehicle collision; accident data, data relevant to a location and/or crash data can be processed to generate a training dataset to model such data.

Optionally, sampling the positive data comprises iteratively and/or randomly selecting one or more of the positive data and modifying one or more values of the selected one or more of the positive data; optionally wherein the sampling continues until a threshold amount of negative data is generated. Optionally, the method comprises a method for generating an augmented dataset for training a machine learning prediction model.

By sampling the positive data, negative data can be generated to create a substantially representative dataset for use as training data.

Optionally, the map segments comprise any or any combination of: points on a map; areas of a map; segments of a road; map nodes; nodes of a road.

The map segments can be any or any combination of map-like features or elements and can therefore be used to represent locations and/or portions of maps or map nodes.

Optionally, the method further comprises a user interface and wherein the user interface is used to display in an overlaid fashion any or any combination of: the map data; the positive data; the generated negative data; the context data; the combined data.

By representing the data in a user interface and overlaying the data, a user can visualise, manually process and label the data and extract insights or query/extract data or metadata of interest. Further a user can identify and/or select data for use in training or modelling.

Optionally, the positive data and/or negative data comprises any or any combination of: latitude; longitude; and time. The positive and/or negative data can indicate a position and/or time for use in determining the properties or modelling of a map or pertaining to a map/mapped area.

According to a further aspect, there is provided a computer-implemented method, comprising the steps of: training a classification model using the combined data output by the method according to any aspect/embodiment, wherein the classification model is trained to output a positive and/or negative classification in response to a time and/or position input.

Using the training data of the above aspect and/or aspects or embodiments described herein can allow for a substantially accurate or representative model (that can be iteratively further trained or re-trained as further data is collected) of a mapped area.

Optionally, the classification model is further trained to determine any or any combination of: a relevant map segment for each position input; a position density for each position input; one or more contextual features for each position input; a risk score for each position and/or time input; a severity for each position and/or time input.

By training the model to provide a variety of outputs, the outputs can be used to determine properties and/or risks associated with positions and/or times and the mapped areas.

According to another aspect, there is provided a computer-implemented method of predicting a risk score and severity of a road crash, comprising the steps of: receiving a position input; receiving a time input; determining a map portion associated with the position input; determining a position density of the position input; determining one or more contextual features associated with the determined map portion; performing classification using a trained model according to an aspect/embodiment using the position input, the position density, the time input and the determined contextual features to determine the risk score and the severity of a road crash for the position input and time input; and outputting the risk score and the severity. Optionally, the position input comprises at least one latitude value and at least one longitude value. Optionally, the time input comprises a time and date value.

By using the trained model of the above aspect and/or the aspects/embodiments described herein, a score can be determined for portions of the mapped area at certain times and/or dates.

Optionally, determining the map portion associated with the position input comprises determining the nearest map portion to the position input.

Optionally, the trained model comprises any or any combination of: a machine learning model; a deep learning model; a neural network; a modular neural network; a fully connected neural network; a fully connected deep learning network.

Optionally, the method further comprises determining a score per map portion of the map data, optionally wherein the score comprises any or any combination of: a risk score; a severity. Other aspects can be provided including an apparatus and/or system and/or computer readable medium providing some of the aspects recited above.

Brief Description of Drawings

Embodiments will now be described, by way of example only and with reference to the accompanying drawings having like-reference numerals, in which:

Figure 1 shows a system architecture able to be used with the described example embodiment;

Figure 2 shows a learning and prediction system according to the example embodiment;

Figure 3 shows more detail about the pre-processing workflow according to the example embodiment;

Figure 4 shows how positive data collected over a three-day window can be correlated to a mapped area using different grid sizes; and

Figure 5 shows how positive data collected over a three-hour window can be correlated to a mapped area using different grid sizes.

Specific Description

Machine learning techniques, including deep learning, have recently made rapid improvements in relation to prediction accuracy in many applications. According to embodiments, machine learning models are trained on road crash datasets to provide driving/riding assistant systems and techniques to help drivers/riders to avoid road crashes by giving them on-line (i.e. during a journey) or off-line (e.g. when planning a journey) warnings.

According to at least some embodiments, a processor circuit may perform processing associated with storing road crashes in a memory, each record comprising an indication of position and an indication of a date and time involved in a road crash. The processor may perform pre-processing of data to create a rich data set and may use machine learning to produce risk scores and crash severity ratings from inputs including position, date and time (or any combination thereof). The processor circuit may perform processing associated with storing each risk score and crash severity rating in memory as a prediction. The aim of the system according to these embodiments is to predict vehicle crash risk score and crash severity for individual road segments (i.e. a region or sub-division of a road or road network) for a given time and date.

Systems and methods according to embodiments use machine learning techniques to predict future road crashes or a probability of future road crashes. In other embodiments, given a series of historical road crash data, a machine learning model is used to predict the risk and severity of a road crash in a specific location at a specific time. Predictions may be based on private data and/or publicly available data (e.g. data collected and made available by governments and non-profit agencies).

According to some embodiments, predicting road crash probability and severity is performed using supervised machine learning combined with the use of a classification model to map a function of input variables to labels indicating a crash or a non-crash. The input variables can be an accident, latitude (in degrees), longitude (in degrees), district (e.g., Kensington and Chelsea), geohash (an encoding of a geographic location into a short string of letters and digits), accident severity (e.g., fatal, serious, slight), year, month, day, day of the week, week of the year, time, road number, road class, road speed limit, junction detail (e.g., roundabout, mini-roundabout, T junction, crossroads, etc.), light conditions (e.g., daylight, lights lit, lights unlit, etc.), solar azimuth (angle along the horizon in degrees), solar altitude (angle of the sun relative to the Earth's horizon in degrees), urban or rural area, special conditions (e.g., auto traffic signal, road signs, etc.), weather summary, temperature, wind speed, visibility, fuel price, bank holiday, one way street, entertainment venues, schools, hospitals, shops, housing, public transport, parking, traffic signals, pedestrian crossing, cycle way, bus stop, give way, forest, vehicle type (e.g., cycle, motorcycle, car, minibus, etc.) and journey type (e.g., commute, social, other, etc). In some embodiments, a machine learning model is trained to estimate crash risk using a set of roads and times where crashes occurred augmented by negative samples (i.e. a sample of roads and times where crashes did not occur). As severe class imbalance distribution is common in vehicle crash data (termed the “severe class imbalance issue”), in some embodiments the training set is kept slightly imbalanced, and an informative sampling approach is used to cause the model to learn the conditions (that can be fine-grained) that are related to road crashes.
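As an illustration of the geohash input variable mentioned above, the following is a minimal sketch of the standard public geohash encoding (alternating longitude and latitude bisection, five bits per base-32 character). The function name `geohash_encode` and the default precision are assumptions made for this example only; the specification does not state which geohash implementation or precision is used.

```python
# Standard geohash base-32 alphabet (omits a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=7):
    """Encode a latitude/longitude pair (degrees) as a geohash string.

    Hypothetical helper for illustration; precision is the number of
    output characters.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


if __name__ == "__main__":
    print(geohash_encode(57.64911, 10.40744, precision=4))
```

Nearby positions tend to share a geohash prefix, which is one reason such a string may be convenient as a coarse location feature.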

Referring to Figure 1, a computer system architecture 100 that can be used with the described embodiment will now be described.

The system architecture 100 comprises one or more processors 110, one or more input devices 120, one or more display devices 130 and one or more network interfaces 140 connected via an interconnection network (or bus) 150. These components 110, 120, 130, 140, 150 are in communication with one or more computer readable media 160 on which is provided an operating system 162, a network communication layer 164, a learning and prediction system 166 according to the example embodiment and one or more applications 168.

In other embodiments, the system architecture 100 may be implemented on any electronic device that runs software applications derived from compiled instructions, including without limitation smartphones, tablet computers, personal computers, servers, media players and game consoles. The system architecture 100 may encompass any architecture that may implement the features and processes described herein in relation to the embodiments described.

In some embodiments, there may be one or more processors 110 and these processors 110 may comprise one or more cores and/or one or more co-processors and/or graphical processing units.

In some embodiments, the one or more input devices 120 can be any known input device technology including, but not limited to, a keyboard (including a virtual keyboard), a mouse, a track ball, a touch-sensitive pad or a touch-sensitive display.

In some embodiments, the one or more display devices 130 can be any known display technology, including but not limited to display devices using liquid crystal display (LCD) technology or light emitting diode (LED) technology.

In some embodiments, the interconnection network 150 can be any known internal or external network or bus technology, including but not limited to ISA, EISA, PCI, PCI-Express, NuBus, USB, Serial-ATA, Firewire, 802.11 a/b/c/ac/etc, wired networking, wireless networking, virtual bus, or virtual network, or any combination thereof.

In some embodiments, the one or more computer readable media 160 may be any medium that participates in providing instructions to the one or more processors 110 for execution, including without limitation non-volatile storage memory such as optical disks, magnetic disks, flash drives or ROM, or volatile media such as SDRAM, or any combination thereof.

In some embodiments, the one or more computer readable media 160 may include instructions for implementing an operating system 162 (for example Linux, MacOS®, Windows®). The operating system 162 may be multi-user, multi-processing, multi-tasking, multi-threading, real-time or any combination thereof. The operating system 162 may perform basic tasks, including but not limited to recognising input from the input devices 120, sending output to the display devices 130, managing storage on the one or more computer readable media 160, controlling peripheral devices (e.g. disk drives, printers, etc), which can be controlled directly or via an input/output controller; and managing traffic on the interconnection network/bus 150. Network communications may use instructions to establish and maintain network connections (e.g. software for implementing communication protocols such as TCP/IP, HTTP, Ethernet, InfiniBand®, etc).

In the example embodiment, the learning and prediction system 166 includes instructions to perform machine learning and prediction as described in more detail below to analyse road crashes and make future crash predictions. In some embodiments, the learning and prediction system 166 may use one or more machine learning models. In the example embodiment, machine learning models are trained using past data in order to predict future probabilities but in other embodiments predictions can be made for future events and current/past events and/or probabilities.

In some embodiments, the one or more applications 168 are software that use or implement the output or capabilities of the learning and prediction system 166 alone or in combination with other capabilities/functionality. In other embodiments the functions of the one or more applications 168 can be implemented in the operating system 162.

Referring to Figure 2, the learning and prediction system 200 of the example embodiment will now be described.

The system 200 is shown as a block diagram of the learning and prediction system architecture that may implement the features and processes described herein and shows the use of a nested modular approach having two steps in the example embodiment. In other embodiments, different approaches can be used and more or fewer steps may be employed.

The latitude and longitude of interest 210 is provided as one input to the system 200, along with the date and time of interest 220 and other features 230. In the example embodiment, the latitude and longitude of interest 210 is provided as two co-ordinates but in other embodiments a range of co-ordinates can be provided, or other indications of location can be provided such as map data or map segments. In the example embodiment, a date and time 220 are provided for the determining of a risk assessment for that date and time in combination with at least the latitude and longitude of interest 210. In the example embodiment, other features 230 are provided as an input but in other embodiments this other feature data may not be provided.

When training the system 200, a set of latitudes and longitudes for positive crash data are fed to the density estimate model 240 to produce a geographical probability density value for each crash. The latitude and longitude values are provided to a density estimation model 240 when the system 200 is being used for prediction in order to output a geographical probability density value for the position indicated by the latitude and longitude values.

The latitude and longitude of interest 210, the date and time of interest 220 and other features 230 are all provided to the pre-processing module 250. The pre-processing module 250 then determines an enriched dataset for the given position indicated by the latitude and longitude of interest 210 at the date and time of interest 220 using the other features 230. An enriched dataset is output by the pre-processing module 250.

The geographical density value and the enriched dataset are both input into the classification model 260 by the density estimate model 240 and the pre-processing module 250 respectively. In other embodiments the enriched dataset can be a set of contextual features, including any or any combination of data such as the type of road, the position of the sun in the sky, the road topography, the proximity of buildings, the type of buildings nearby, historical data for population density, and many other contextual data relating to the latitude and longitude of interest 210 at the date and time of interest 220.

The classification model 260, which has been trained on historical accident/road crash data, then classifies the input geographical density value and the enriched dataset for the given position indicated by the latitude and longitude of interest 210 at the date and time of interest 220 and outputs a risk score 270 with a value between zero and one (zero indicating no risk and one indicating a 100% probability of a traffic collision) and also a crash severity value (which in the example embodiment has a value of zero to indicate no injury, one to indicate a minor injury, two to indicate a major injury and three to indicate a fatality). Other values for the risk score and crash severity score can be used in other embodiments. Because the classification model 260 of the example embodiment is trained on historical accident data, the values for the risk score and crash severity are based on this historical data.

In an example embodiment, the density estimation can be modelled in the following way:

1. Compute density estimation

2. Count the number of accidents that occurred in areas or map segments above a density threshold (hotspot areas): hit rate = % of accidents captured by hotspot areas

3. Since a lower density threshold (larger hotspot areas) leads to a higher “hit rate”, consider the metric inversely proportional to the hotspot area. For example, use the Predictive Accuracy Index (see “The Utility of Hotspot Mapping for Predicting Spatial Patterns of Crime” by Chainey et al., 2008 which is herein incorporated by reference):

$$\mathrm{PAI} = \frac{n / N}{a_h / A}$$

where n = number of captured accidents, N = number of total accidents, a_h = hotspot area and A = total area.

4. Compute PAI as a function of density threshold to get a PAI curve (see “A spatio-temporal kernel density estimation framework for predictive crime hotspot mapping and evaluation” by Hu et al., 2018, which is herein incorporated by reference)
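The four steps above can be sketched as follows, taking the PAI as the hit rate divided by the fraction of the total area covered by hotspots, i.e. (n / N) / (a_h / A), as defined by Chainey et al. The grid-cell representation, the equal cell area and the function names `pai` and `pai_curve` are assumptions for illustration; the density values would come from whichever density estimation is used in step 1.

```python
def pai(n_captured, n_total, hotspot_area, total_area):
    """Predictive Accuracy Index: hit rate divided by area fraction."""
    return (n_captured / n_total) / (hotspot_area / total_area)


def pai_curve(cell_density, cell_counts, cell_area, thresholds):
    """Return (threshold, PAI) pairs for a list of density thresholds.

    cell_density: estimated density per grid cell (step 1)
    cell_counts:  number of accidents observed in each cell
    cell_area:    area of one cell (all cells assumed equal in this sketch)
    """
    total_area = cell_area * len(cell_density)
    n_total = sum(cell_counts)
    curve = []
    for thr in thresholds:
        # Step 2: hotspot cells are those at or above the density threshold.
        hot = [i for i, d in enumerate(cell_density) if d >= thr]
        if not hot:
            continue  # no hotspot area at this threshold, PAI undefined
        n_captured = sum(cell_counts[i] for i in hot)
        # Steps 3 and 4: PAI for this threshold.
        curve.append((thr, pai(n_captured, n_total,
                               cell_area * len(hot), total_area)))
    return curve


if __name__ == "__main__":
    density = [0.9, 0.5, 0.1, 0.1]
    counts = [6, 3, 1, 0]
    for thr, value in pai_curve(density, counts, 1.0, [0.8, 0.4]):
        print(thr, round(value, 3))
```

A PAI above one indicates that the hotspot area captures more accidents than its share of the total area would suggest.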

The classification model 260 in the example embodiment is a machine learning model having a fully connected neural network structure but in other embodiments can be any machine learning model capable of performing classification, having a fully connected, partially connected or hybrid structure.

Referring to Figure 3, the workflow 300 of the pre-processing module 250 of the example embodiment will now be described.

Initially, the road for which predictions are to be made is segmented 310 into a plurality of segments 315 so that for a given latitude and longitude it is possible to select the closest of the road segments 315 stored in the road data. If already segmented, the segments and nodes are retrieved for all roads in a region of interest, for example using a map API such as OpenStreetMap or Overpass. This map data 315 can be stored in a database or a map server can be built to store, update and retrieve this data. As an example, the map can be segmented into smaller chunks approximately 45 meters in length.

The road data 315 is then cleaned and the static features are expanded (i.e. the map data 315 is enriched with further relevant data (similar to the input variables listed above) that is associated with the relevant portions of the map data 315). Cleaning can include excluding large outliers and observations with missing or incomplete data. Specifically, the static features are the parts of the input data that do not change with time, and so include features derived from the road geometry, such as geohash, road segment curvature, road orientation, proximity to nearby junctions, and the number of shops and amenities around each node and nearby nodes. The output of this process is the expanded road data 325.

In parallel, historical crash data 330 is cleaned 335 and each crash point from the crash data 330 is mapped to the nearest node in the map. In some embodiments, historical crash data includes structured metadata about the crash elaborating on the location, time, severity, etc. The output of this process is the cleaned crash data 340.
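Mapping a crash point to the nearest map node can be sketched as below. This is a minimal illustration that assumes great-circle (haversine) distance and a brute-force search over a small dictionary of nodes; the names `haversine_m` and `nearest_node` are hypothetical, and the specification does not state which distance measure or spatial index is used.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_node(lat, lon, nodes):
    """Return the id of the node closest to (lat, lon).

    nodes: dict mapping node id -> (lat, lon). Brute force is used here;
    a spatial index would be preferable for a full road network.
    """
    return min(nodes, key=lambda nid: haversine_m(lat, lon, *nodes[nid]))


if __name__ == "__main__":
    nodes = {"a": (51.5000, -0.1200), "b": (51.5100, -0.1000)}
    print(nearest_node(51.5010, -0.1190, nodes))
```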

The expanded road data 325 and the cleaned crash data 340 are provided to a sampling module 345 to create negative samples which are representative of non-crash records. In the example embodiment, the crash prediction is treated as a binary classification problem, where the positive class is the occurrence of a crash and the negative class is the absence of a crash on a given road segment at a given date and time. Combinations of date/time and road segment pairs corresponding to the historical road crashes are used as positive examples. For the negative examples, the example embodiment uses a sampling approach to build a set of non-crash examples that are substantially similar to the crash examples so that the machine learning model can learn the differences between a crash and the absence of a crash. The sampling approach works as follows:

1. A node is randomly selected from the expanded road data 325 and a day of the year and an hour of the day is randomly chosen and combined with the map node.

2. A crash record is randomly selected from the cleaned crash data 340 and the day of the year is altered.

3. A crash record is randomly selected from the cleaned crash data 340 and the hour of the day is altered.

4. From these examples, the combinations corresponding to a crash in the cleaned crash data 340 are removed in order to create negative examples from the remaining examples.

5. Steps 1 to 4 are repeated until a large number of negative examples has been created (in the example embodiment, a few times the number of positive samples in the cleaned crash data 340 are generated) and these negative examples are output as non-crash data 350.
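The sampling steps above can be sketched as follows. This is an illustrative sketch only: it assumes each record is reduced to a (node, day of year, hour) tuple, that "altering" a value means redrawing it uniformly at random, and that the target is a fixed multiple (`ratio`) of the number of positives. The function name `sample_negatives` and its parameters are hypothetical.

```python
import random


def sample_negatives(nodes, crashes, ratio=3, seed=0):
    """Generate non-crash (node, day_of_year, hour) examples.

    nodes:   list of node ids from the expanded road data
    crashes: list of (node, day_of_year, hour) tuples from the crash data
    ratio:   negatives to generate per positive (assumed value)
    """
    rng = random.Random(seed)
    positives = set(crashes)
    target = ratio * len(crashes)
    negatives = set()
    while len(negatives) < target:
        # Step 1: random node combined with a random day and hour.
        c1 = (rng.choice(nodes), rng.randint(1, 365), rng.randint(0, 23))
        # Step 2: random crash record with the day of the year altered.
        node, _, hour = rng.choice(crashes)
        c2 = (node, rng.randint(1, 365), hour)
        # Step 3: random crash record with the hour of the day altered.
        node, day, _ = rng.choice(crashes)
        c3 = (node, day, rng.randint(0, 23))
        # Step 4: drop any candidate that matches a real crash.
        for cand in (c1, c2, c3):
            if cand not in positives and len(negatives) < target:
                negatives.add(cand)
    # Step 5: the loop repeats until the target count is reached.
    return sorted(negatives)


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3"]
    crashes = [("n1", 10, 8), ("n2", 200, 17)]
    negatives = sample_negatives(nodes, crashes)
    # Combined labelled dataset: 1 = crash, 0 = non-crash.
    combined = [(c, 1) for c in crashes] + [(n, 0) for n in negatives]
    print(len(combined))
```

Because steps 2 and 3 change only one attribute of a real crash record, many negatives stay close to the positives, which is what allows a classifier to learn fine-grained differences between crash and non-crash conditions.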

Thus, in example embodiments, the general logic is to randomly sample negative (non-crash) events over a range of "conditions" such that they would be more reflective of the world. These "conditions" can include geographical distribution (location), temporal distribution (time), road segment (road type), weather conditions, or any other parameters (event types etc). Once these parameters have been selected (or sampled), the following additional logic is applied:

1. Link the sampled dataset to the true crash dataset.

2. Ensure that every geographical region (square/grid box or map segment) and space in time (time interval) has an appropriately balanced event ratio, i.e. there are more negative events than crash events. If this is not the case, the data is re-sampled or the balances are adjusted.

3. Run temporal and spatial smoothing to smooth the events in order to avoid substantially extreme concentrations in one square/grid box or map segment (temporally by smoothing events over neighbouring times or over a time range and spatially by smoothing events over neighbouring square/grid box or map segments).
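The balance check and the smoothing described in steps 2 and 3 can be sketched as below. This is a simplified illustration that assumes events are binned into a rectangular grid (a list of rows) and into consecutive time bins, and that smoothing means sharing each cell's count equally with its immediate neighbours. The function names and the neighbourhood size are assumptions; the example embodiments may use other smoothing techniques.

```python
def unbalanced_cells(pos, neg):
    """Return (row, col) of cells without more negatives than positives.

    Note: empty cells (0 positives, 0 negatives) are also reported.
    """
    return [(r, c)
            for r, row in enumerate(pos)
            for c, p in enumerate(row)
            if neg[r][c] <= p]


def smooth_spatial(grid):
    """Share each cell's count equally among itself and adjacent cells."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neigh = [(rr, cc)
                     for rr in range(max(0, r - 1), min(rows, r + 2))
                     for cc in range(max(0, c - 1), min(cols, c + 2))]
            share = grid[r][c] / len(neigh)
            for rr, cc in neigh:
                out[rr][cc] += share
    return out


def smooth_temporal(series, window=1):
    """Share each time bin's count with bins up to `window` steps away."""
    n = len(series)
    out = [0.0] * n
    for t, count in enumerate(series):
        lo, hi = max(0, t - window), min(n, t + window + 1)
        for u in range(lo, hi):
            out[u] += count / (hi - lo)
    return out


if __name__ == "__main__":
    spike = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
    print(smooth_spatial(spike))
    print(smooth_temporal([0, 6, 0]))
```

Sharing counts in this way keeps the total number of events unchanged while reducing extreme concentrations in a single cell or time bin.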

In example embodiments, the smoothing can be done in a number of different ways, for example using techniques such as kernel density estimation (KDE), Gaussian processes (GP), Bayesian methods, non-volume preserving (NVP) transformations, likelihood-free inference or combinations of the above. In example embodiments, a spatio-temporal framework for predictive hotspot mapping and evaluation is used. This framework has four major features: (1) a spatio-temporal kernel density estimation (STKDE) method is applied to include the temporal component in predictive hotspot mapping, (2) a data-driven optimisation technique, the likelihood cross-validation, is used to select the most appropriate bandwidths, (3) a statistical significance test is designed to filter out false positives in the density estimates, and (4) a new metric, the predictive accuracy index (PAI) curve, is used to evaluate predictive hotspots at multiple areal scales.
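Features (1) and (2) of this framework can be illustrated with the following minimal sketch. It assumes a product of Gaussian kernels in space and time, planar coordinates, and a leave-one-out log-likelihood as the cross-validation score; the kernel choice and the names `stkde` and `loo_log_likelihood` are assumptions for this example and are not taken from the specification. The significance test of feature (3) is not shown.

```python
import math


def stkde(x, y, t, events, hs, ht):
    """Spatio-temporal kernel density estimate at (x, y, t).

    events: list of (xi, yi, ti) crash records
    hs, ht: spatial and temporal bandwidths
    Uses a Gaussian kernel in space multiplied by a Gaussian kernel in time.
    """
    total = 0.0
    for xi, yi, ti in events:
        ds2 = ((x - xi) ** 2 + (y - yi) ** 2) / hs ** 2
        dt2 = ((t - ti) / ht) ** 2
        k_space = math.exp(-0.5 * ds2) / (2 * math.pi)
        k_time = math.exp(-0.5 * dt2) / math.sqrt(2 * math.pi)
        total += k_space * k_time
    return total / (len(events) * hs ** 2 * ht)


def loo_log_likelihood(events, hs, ht):
    """Leave-one-out log-likelihood of the events for given bandwidths.

    Higher is better. math.log raises ValueError if a density underflows
    to zero, which can happen with very small bandwidths.
    """
    score = 0.0
    for i, (x, y, t) in enumerate(events):
        others = events[:i] + events[i + 1:]
        score += math.log(stkde(x, y, t, others, hs, ht))
    return score


if __name__ == "__main__":
    events = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 1.0)]
    candidates = [(0.5, 1.0), (1.0, 1.0), (100.0, 1.0)]
    best = max(candidates, key=lambda b: loo_log_likelihood(events, *b))
    print(best)
```

Evaluating `stkde` over grid cells and time bins gives the density surface to which a threshold can then be applied when computing a PAI curve.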

In another example embodiment, kernel density estimation (KDE) and kriging, can be used for identifying crash hotspots in a road network.

The expanded road data 325, the cleaned crash data 340 and the non-crash data 350 are provided to a static & dynamic feature expansion module 355 which expands the number of features available for performing binary classification by introducing more static and dynamic features. In the example embodiment, more features are used to help to make a more accurate classification. Static features include but are not limited to geohash, historical crash counts, curvature of the road segment, orientation of the road segment, speed limit, number of nearby shops and amenities e.g. pubs and schools, proximity to nearby junctions and proximity to road elements e.g. speed cameras, traffic signals. The dynamic features change depending on when we are making the prediction and include but are not limited to time variables (hour, month, day, etc), solar geometry and the weather feeds. Specifically, dynamic features can also include but are not limited to time of day, day of week, month, bank holidays, solar azimuth, solar altitude, temperature, precipitation, wind speed, wind direction and visibility.
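A few of the dynamic time features listed above can be derived from a timestamp as in the following sketch. The function name `time_features`, the `bank_holidays` argument and the sine/cosine encoding of the hour are assumptions added for illustration; solar geometry and weather features would come from external feeds and are not shown.

```python
import math
from datetime import date, datetime


def time_features(ts, bank_holidays=frozenset()):
    """Derive simple dynamic features from a datetime.

    bank_holidays is a hypothetical set of datetime.date values supplied
    by the caller.
    """
    hour_angle = 2 * math.pi * ts.hour / 24
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),          # Monday = 0, Sunday = 6
        "month": ts.month,
        "is_bank_holiday": ts.date() in bank_holidays,
        # Cyclic encoding (an assumption) so that 23:00 and 00:00 are close.
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
    }


features = time_features(datetime(2020, 12, 25, 18, 0), {date(2020, 12, 25)})
```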

In example embodiments, external data (accident data from multiple sources, both external and internal) can be used to model and understand how accidents relate to contextual information/environmental data, such as road surface, geometry, weather (solar azimuth, precipitation, wind speed, etc.), points of interest (amenities such as bus stops, schools, churches, etc.), and date/time related information (rush hour/peak time, night time driving, working patterns, driving regularity, etc.). These insights, metadata or associated labelled data can be used to understand how environmental or locational factors impact the risk or chances of accidents (actuarial risk) and hence insurance prices.

In example embodiments, with access to telematics data (e.g., GPS location coordinates of trips) the riskiness or risk score of each trip can be determined. In addition, an insurance policy price according to the average locational risk of a user can also be determined. In this way, drivers that drive at safer locations get a lower generated associated price.
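One way to picture trip scoring is sketched below: each GPS point is mapped to a grid cell, the trip score is the mean of the cell risk scores along the trip, and a price scales with the driver's average trip score. The grid mapping `cell_of`, the cell size, the `default` score for unknown cells and the linear `loading` formula are all hypothetical; the source does not state how the score or the price is computed.

```python
import math


def cell_of(lat, lon, size=0.01):
    """Map a coordinate to a square grid cell (hypothetical gridding)."""
    return (math.floor(lat / size), math.floor(lon / size))


def trip_risk(points, cell_risk, default=0.0):
    """Mean risk score over the GPS points (lat, lon) of one trip."""
    if not points:
        return default
    scores = [cell_risk.get(cell_of(lat, lon), default) for lat, lon in points]
    return sum(scores) / len(scores)


def policy_price(base_price, trips, cell_risk, loading=1.0):
    """Scale a base price by the average locational risk across trips.

    The linear loading is an illustrative assumption, not an actuarial model.
    """
    if not trips:
        return base_price
    average = sum(trip_risk(trip, cell_risk) for trip in trips) / len(trips)
    return base_price * (1 + loading * average)


cell_risk = {(5150, -13): 0.2, (5151, -13): 0.6}
trip = [(51.505, -0.125), (51.515, -0.125)]
score = trip_risk(trip, cell_risk)
price = policy_price(100.0, [trip], cell_risk)
```

Under this sketch a driver whose trips pass through lower-scoring cells receives a lower price, which matches the intent described above.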

In example embodiments, this method provides a platform to build a system that proactively enables users to select routes/journeys that are safer. Thus, instead of reactively pricing based on historical routes, drivers can be rewarded if they choose safer options.

In example embodiments, the risk scores can be summarised or aggregated into an area or postcode score (or expressed as a relative difference of scores), which can be used instead of the factor loadings (such as relativities) from standard or traditional postcode models.

In addition to improving safety by scoring map segments for their risk values, embodiments may also be applied to insurance pricing, providing location risk assessment for telematics, mobility providers, car OEMs, etc.

The output of the static & dynamic feature expansion module 355 is an expanded crash and non-crash dataset 360 which can then be used to train 365 the classification module 260 to predict the risk score 270 and crash severity 280.

Referring now to Figure 4, there is shown a three-day windowed set of positive data overlaid on a map in three different representations 410, 420, 430 each with different region/cell sizes.

Referring now to Figure 5, there is shown a three-hour windowed set of positive data overlaid on a map in two different representations 510, 520 each with different region/cell sizes.

The described features, aspects and/or embodiments may be implemented in one or more computer programs that may be executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. A computer program is a set of instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result. A computer program may be written in any form of programming language (e.g., Python, C/C++, Java), including compiled or interpreted languages, and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

Suitable processors for the execution of a program of instructions may include, by way of example, both general and special purpose microprocessors, and the sole processor or one of multiple processors or cores, of any kind of computer. Generally, a processor may receive instructions and data from a read-only memory (ROM) or a random-access memory (RAM) or both. The essential elements of a computer may include a processor for executing instructions and one or more memories for storing instructions and data. Generally, a computer may also include, or be operatively coupled to communicate with, one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data may include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by, or incorporated in, ASICs (application-specific integrated circuits) and FPGAs (field-programmable gate arrays).

To provide for interaction with a user, the features may be implemented on a computer having a display device such as a CRT (cathode ray tube) or LCD (liquid crystal display) monitor for displaying information to the user and a keyboard and a pointing device such as a mouse or a trackball by which the user can provide input to the computer.

The features may be implemented in a computer system that includes a back-end component, such as a data server, or that includes a middleware component, such as an application server or an Internet server, or that includes a front-end component, such as a client computer having a graphical user interface or an Internet browser, or any combination of them. The components of the system may be connected by any form or medium of digital data communication such as a communication network. Examples of communication networks include, e.g., a LAN, a WAN, and the computers and networks forming the Internet.

The computer system may include clients and servers. A client and server may generally be remote from each other and may typically interact through a network. The relationship of client and server may arise by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

One or more features or steps of the disclosed aspects and/or embodiments may be implemented using an API. An API may define one or more parameters that are passed between a calling application and other software code (e.g., an operating system, library routine, function) that provides a service, that provides data, or that performs an operation or a computation. The API may be implemented as one or more calls in program code that send or receive one or more parameters through a parameter list or other structure based on a call convention defined in an API specification document. A parameter may be a constant, a key, a data structure, an object, an object class, a variable, a data type, a pointer, an array, a list, or another call. API calls and parameters may be implemented in any programming language. The programming language may define the vocabulary and calling convention that a programmer will employ to access functions supporting the API. In some implementations, an API call may report to an application the capabilities of a device running the application, such as input capability, output capability, processing capability, power capability, communications capability, etc.

Any system feature as described herein may also be provided as a method feature, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure.

Any feature in one aspect may be applied to other aspects, in any appropriate combination. In particular, method aspects may be applied to system aspects, and vice versa. Furthermore, any, some and/or all features in one aspect can be applied to any, some and/or all features in any other aspect, in any appropriate combination.

It should also be appreciated that particular combinations of the various features described and defined in any aspects can be implemented and/or supplied and/or used independently.

Claims

CLAIMS:

1. A computer-implemented method, comprising the steps of: receiving map data, wherein the map data comprises a plurality of map segments; receiving a plurality of positive data; determining a correlation between each of the plurality of positive data and at least one of the plurality of map segments; generating negative data for each of the plurality of map segments by sampling the positive data; combining the positive and generated negative data; outputting the combined data.

2. The method of any preceding claim, wherein the method further comprises a step of density smoothing, wherein the step of density smoothing comprises averaging the positive data and/or negative data across one or more of the plurality of map segments.

3. The method of any preceding claim, wherein the method further comprises a step of spatially smoothing data over at least one or more neighbouring map segments, wherein positive data correlated with a map segment is correlated across a plurality of neighbouring map segments.

4. The method of any preceding claim, wherein the positive data comprises time data and a plurality of positive label data, and wherein the time data comprises a plurality of time values, further wherein each of the plurality of positive label data is correlated with a time value, and further comprising a step of temporally smoothing data wherein the step of temporally smoothing data comprises re-correlating each of the plurality of positive label data with a plurality of time values.

5. The method of any preceding claim, wherein the method further comprises performing pre-processing of any or any combination of: the positive data; and the map data.

6. The method of any preceding claim, wherein the method further comprises: receiving a plurality of context data, wherein the context data comprises one or more contextual features; correlating each of the one or more contextual features with each of the map segments and/or positive data; and wherein the step of generating negative data for each of the plurality of map segments by sampling the positive data comprises generating negative data for each of the plurality of map segments by sampling the positive data and context data.

7. The method of any preceding claim wherein the one or more contextual features comprises any or any combination of one or more static features or dynamic features.

8. The method of claim 7 wherein one or more static features comprise any or any combination of geohash, historical crash counts, curvature of the road segment, orientation of the road segment, speed limit, number of nearby shops and amenities, proximity to nearby junctions and proximity to road elements.

9. The method of claim 7 wherein one or more dynamic features comprise any or any combination of time of day, day of week, month, bank holidays, solar azimuth, solar altitude, temperature, precipitation, wind speed, wind direction and visibility.

10. The method of any preceding claim wherein the positive data comprises any or any combination of: location; time; severity; date; data or metadata relating to a vehicle collision; accident data.

11. The method of any preceding claim wherein sampling the positive data comprises iteratively and/or randomly selecting one or more of the positive data and modifying one or more values of the selected one or more of the positive data; optionally wherein the sampling continues until a threshold amount of negative data is generated.

12. The method of any preceding claim, wherein the method comprises a method for generating an augmented dataset for training a machine learning prediction model.

13. The method of any preceding claim, wherein the map segments comprise any or any combination of: points on a map; areas of a map; segments of a road; map nodes; nodes of a road.

14. The method of any preceding claim, wherein the method further comprises a user interface and wherein the user interface is used to display in an overlaid fashion any or any combination of: the map data; the positive data; the generated negative data; the context data; the combined data.

15. The method of any preceding claim, wherein the positive data and/or negative data comprises any or any combination of: latitude; longitude; and time.

16. A computer-implemented method, comprising the steps of: training a classification model using the combined data output by the method according to any preceding claim, wherein the classification model is trained to output a positive and/or negative classification in response to a time and/or position input.

17. The method of claim 16, wherein the classification model is further trained to determine any or any combination of: a relevant map segment for each position input; a position density for each position input; one or more contextual features for each position input; a risk score for each position and/or time input; a severity for each position and/or time input.

18. A computer-implemented method, comprising the steps of: receiving a position input; receiving a time input; determining a map portion associated with the position input; determining a position density of the position input; determining one or more contextual features associated with the determined map portion; performing classification using the trained model of any of claims 16 or 17 using the position input, the position density, the time input and the determined contextual features to determine a score for the position input and time input; and outputting the score.

19. The method of claim 18, wherein the position input comprises at least one latitude value and at least one longitude value.

20. The method of any of claims 18 or 19, wherein the time input comprises a time and date value.

21. The method of any of claims 18 to 20, wherein determining the map portion associated with the position input comprises determining the nearest map portion to the position input.

22. The method of any of claims 18 to 21, wherein the trained model comprises any or any combination of: a machine learning model; a deep learning model; a neural network; a modular neural network; a fully connected neural network; a fully connected deep learning network.

23. The method of any of claims 18 to 22, wherein the method further comprises determining a score per map portion of the map data, optionally wherein the score comprises any or any combination of: a frequency risk score; a severity.