WO2022215063A1 - A machine learning model blind-spot detection system and method - Google Patents

A machine learning model blind-spot detection system and method Download PDF

Info

Publication number
WO2022215063A1
WO2022215063A1 PCT/IL2022/050255
Authority
WO
WIPO (PCT)
Prior art keywords
blind
records
training data
spots
spot
Prior art date
Application number
PCT/IL2022/050255
Other languages
French (fr)
Inventor
Maor IVGI
Yuval Dafna
Original Assignee
Stardat Data Science Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Stardat Data Science Ltd. filed Critical Stardat Data Science Ltd.
Publication of WO2022215063A1 publication Critical patent/WO2022215063A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W10/00Conjoint control of vehicle sub-units of different type or different function
    • B60W10/18Conjoint control of vehicle sub-units of different type or different function including control of braking systems


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Combustion & Propulsion (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The presently disclosed subject matter aims to provide a system and method for detecting potential blind spots in a model and assessing the model's behavior within these spots. The system and method provide early warning to allow the implementation of safeguards within the model or to trigger a new cycle of data collection and model training in order to avoid potential prediction errors of the model.

Description

A MACHINE LEARNING MODEL BLIND-SPOT DETECTION SYSTEM AND METHOD
TECHNICAL FIELD
The present invention relates to the field of blind-spot detection systems and methods.
BACKGROUND
As businesses adopt more data-based techniques to optimize and automate performance in their sector's competitive landscape, they expose themselves to the risk of 'black swans' (unpredictable events that are beyond what is normally expected of a situation and that have potentially severe consequences). In particular, Machine Learning (ML) models generated by ML algorithms base their predictions for novel input on past data. If the data used for training the ML models did not include highly rare or improbable events, then upon encountering such events, the ML model may be devastatingly wrong. Moreover, ML models usually tend to be overly confident in such events, making them unlikely to raise any red flags until it is too late. An example of such a case is a model built to operate in the financial sector, trained over data from the past decade. If the inflation rate were to suddenly jump to 5% or, alternatively, if the US federal funds interest rate were raised to 5%, the ML model would be in uncharted waters and could cause a large loss of capital in a short time.
In light of the risks mentioned above, there is a need in the art for a new machine learning model blind-spot detection system and method.
GENERAL DESCRIPTION
In accordance with a first aspect of the presently disclosed subject matter, there is provided a machine learning model blind-spot detection system, the machine learning model blind-spot detection system comprising a processing circuitry configured to: obtain: (a) a training data-set, the training data-set comprising a plurality of records, each record including a plurality of features, wherein one or more of the features are numeric features, and (b) a machine learning model, trained using the training data-set, capable of predicting a target value, and a probability of the target value, based on the collection of features; identify one or more candidate blind-spots, being feasible subspaces within a data representation of the records, wherein the number of records within the feasible subspaces is non-zero and below a threshold; generate, for at least one candidate blind-spot, an ordered synthetic training data-set, including ordered synthetic records that are within the candidate blind-spot, wherein each of the ordered synthetic records is a modified copy of at least one given record within the corresponding candidate blind-spot, modified by changing a given numeric feature of the numeric features of the given record, wherein the change of the numeric feature is gradually increasing or decreasing within the ordered synthetic training data-set; and identify one or more actual blind-spots, if any, being candidate blind-spots wherein at least one difference between sequential values of an ordered list of predicted probabilities of target values, predicted by the machine learning model using the ordered synthetic records of the corresponding candidate blind-spot, does not meet a smoothness condition.
In some cases, the smoothness condition is one or more of: the difference is below a difference threshold, or a trend of the ordered list of predicted probabilities can be described by a second-degree polynomial function.
In some cases, the gradual increase or decrease is determined based on a distribution of the numeric feature within the training data-set.
In some cases, the processing circuitry is further configured to retrain the machine learning model using a complementary training data-set including records covering the actual blind-spots.
In some cases, the processing circuitry is further configured to obtain records that were not part of the training data-set, identify whether these records are within the one or more actual blind-spots or whether these records form new actual blind spots, and inform of it.
In accordance with a second aspect of the presently disclosed subject matter, there is provided a method for detecting blind-spots of a machine learning model, the method comprising: obtaining: (a) a training data-set, the training data-set comprising a plurality of records, each record including a plurality of features, wherein one or more of the features are numeric features, and (b) a machine learning model, trained using the training data-set, capable of predicting a target value, and a probability of the target value, based on the collection of features; identifying one or more candidate blind-spots, being feasible subspaces within a data representation of the records, wherein the number of records within the feasible subspaces is non-zero and below a threshold; generating, for at least one candidate blind-spot, an ordered synthetic training data-set, including ordered synthetic records that are within the candidate blind-spot, wherein each of the ordered synthetic records is a modified copy of at least one given record within the corresponding candidate blind-spot, modified by changing a given numeric feature of the numeric features of the given record, wherein the change of the numeric feature is gradually increasing or decreasing within the ordered synthetic training data-set; and identifying one or more actual blind-spots, if any, being candidate blind-spots wherein at least one difference between sequential values of an ordered list of predicted probabilities of target values, predicted by the machine learning model using the ordered synthetic records of the corresponding candidate blind-spot, does not meet a smoothness condition.
In some cases, the smoothness condition is one or more of: the difference is below a difference threshold, or a trend of the ordered list of predicted probabilities can be described by a second-degree polynomial function.
In some cases, the gradual increase or decrease is determined based on a distribution of the numeric feature within the training data-set.
In some cases, the processing circuitry is further configured to retrain the machine learning model using a complementary training data-set including records covering the actual blind-spots.
In some cases, the processing circuitry is further configured to obtain records that were not part of the training data-set, identify whether these records are within the one or more actual blind-spots or whether these records form new actual blind spots, and inform of it.
In accordance with a third aspect of the presently disclosed subject matter, there is provided a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code executable by at least one processor to perform a method for detecting blind-spots of a machine learning model, the blind-spot detection comprising one or more components, the method comprising: obtaining: (a) a training data-set, the training data-set comprising a plurality of records, each record including a plurality of features, wherein one or more of the features are numeric features, and (b) a machine learning model, trained using the training data-set, capable of predicting a target value, and a probability of the target value, based on the collection of features; identifying one or more candidate blind-spots, being feasible subspaces within a data representation of the records, wherein the number of records within the feasible subspaces is non-zero and below a threshold; generating, for at least one candidate blind-spot, an ordered synthetic training data-set, including ordered synthetic records that are within the candidate blind-spot, wherein each of the ordered synthetic records is a modified copy of at least one given record within the corresponding candidate blind-spot, modified by changing a given numeric feature of the numeric features of the given record, wherein the change of the numeric feature is gradually increasing or decreasing within the ordered synthetic training data-set; and identifying one or more actual blind-spots, if any, being candidate blind-spots wherein at least one difference between sequential values of an ordered list of predicted probabilities of target values, predicted by the machine learning model using the ordered synthetic records of the corresponding candidate blind-spot, does not meet a smoothness condition.
BRIEF DESCRIPTION OF THE DRAWINGS
In order to understand the presently disclosed subject matter and to see how it may be carried out in practice, the subject matter will now be described, by way of non-limiting examples only, with reference to the accompanying drawings, in which:
Fig. 1 is a schematic illustration of an operation of the machine learning model blind-spot detection system, in accordance with the presently disclosed subject matter;
Fig. 2 is a block diagram schematically illustrating one example of a machine learning model blind-spot detection system, in accordance with the presently disclosed subject matter; and,
Fig. 3 is a flowchart illustrating one example of a sequence of operations carried out by a machine learning model blind- spot detection system, in accordance with the presently disclosed subject matter.
DETAILED DESCRIPTION
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the presently disclosed subject matter. However, it will be understood by those skilled in the art that the presently disclosed subject matter may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the presently disclosed subject matter.
In the drawings and descriptions set forth, identical reference numerals indicate those components that are common to different embodiments or configurations.
Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification discussions utilizing terms such as "obtaining", "generating", "identifying", "retraining" or the like, include actions and/or processes of a computer that manipulate and/or transform data into other data, said data represented as physical quantities, e.g., such as electronic quantities, and/or said data representing the physical objects. The terms "computer", "processor", "processing resource", "processing circuitry", and "controller" should be expansively construed to cover any kind of electronic device with data processing capabilities, including, by way of non-limiting example, a personal desktop/laptop computer, a server, a computing system, a communication device, a smartphone, a tablet computer, a smart television, a processor (e.g., a digital signal processor (DSP), a microcontroller, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), etc.), a group of multiple physical machines sharing performance of various tasks, virtual servers co-residing on a single physical machine, any other electronic computing device, and/or any combination thereof.
The operations in accordance with the teachings herein may be performed by a computer specially constructed for the desired purposes or by a general-purpose computer specially configured for the desired purpose by a computer program stored in a non-transitory computer readable storage medium. The term "non-transitory" is used herein to exclude transitory, propagating signals, but to otherwise include any volatile or non-volatile computer memory technology suitable to the application.
As used herein, the phrase "for example," "such as", "for instance" and variants thereof describe non-limiting embodiments of the presently disclosed subject matter. Reference in the specification to "one case", "some cases", "other cases" or variants thereof means that a particular feature, structure or characteristic described in connection with the embodiment(s) is included in at least one embodiment of the presently disclosed subject matter. Thus, the appearance of the phrase "one case", "some cases", "other cases" or variants thereof does not necessarily refer to the same embodiment(s). It is appreciated that, unless specifically stated otherwise, certain features of the presently disclosed subject matter, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the presently disclosed subject matter, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination.
In embodiments of the presently disclosed subject matter, fewer, more and/or different stages than those shown in Fig. 3 may be executed. In embodiments of the presently disclosed subject matter one or more stages illustrated in Fig. 3 may be executed in a different order and/or one or more groups of stages may be executed simultaneously. Fig. 2 illustrates a general schematic of the system architecture in accordance with an embodiment of the presently disclosed subject matter. Each module in Fig. 2 can be made up of any combination of software, hardware and/or firmware that performs the functions as defined and explained herein. The modules in Fig. 2 may be centralized in one location or dispersed over more than one location. In other embodiments of the presently disclosed subject matter, the system may comprise fewer, more, and/or different modules than those shown in Fig. 2.
Any reference in the specification to a method should be applied mutatis mutandis to a system capable of executing the method and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that once executed by a computer result in the execution of the method.
Any reference in the specification to a system should be applied mutatis mutandis to a method that may be executed by the system and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that may be executed by the system.
Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a system capable of executing the instructions stored in the non-transitory computer readable medium and should be applied mutatis mutandis to a method that may be executed by a computer that reads the instructions stored in the non-transitory computer readable medium.
The presently disclosed subject matter provides a system directed to assess existing prediction models by examining their behavior in areas of interest containing insufficient training data. Bearing this in mind, attention is drawn to Fig. 1, showing a schematic illustration of an operation of a machine learning model blind-spot detection system (also interchangeably referred to herein as "system"), in accordance with the presently disclosed subject matter.
As shown in the schematic illustration, a data representation of a training data-set, used in training a machine learning model being evaluated by the system of the presently disclosed subject matter, includes one or more dots 102 dispersed in an environment 100, creating a data manifold form. It is to be understood that the presentation of environment 100 in Fig. 1 as a two-dimensional space is due to technical limitations. In reality, the plurality of dots 102 dispersed in environment 100 may form a multi-dimensional spatial structure.
Each of dots 102 represents a record including a collection of features, which are individual measurable properties or characteristics having values that are, for example, numeric, alphanumeric, alphabetic, etc. Each record further includes a target value and probability of the target value, predicted by the evaluated machine learning model based on the collection of features.
The features of each record may be, for example, characteristics of an image, such as: the number of pixels, wavelength, resolution, size, etc. Alternatively, the features of each record may be, for example, characteristics of a bank loan request, such that they refer to, for example, age, income, family status, gender, race, etc.
The dots 102 dispersed in environment 100 may form subspaces, for example, subspaces 104a-104d (as shown in Fig. 1), each made of a cluster of several dots. Out of these subspaces, the system of the presently disclosed subject matter may particularly focus, for example, on subspaces having a non-zero and below a predetermined threshold number of dots. These subspaces are classified as candidate blind-spots.
Candidate blind-spots are areas within environment 100 in which the machine learning model being evaluated by the system of the presently disclosed subject matter is potentially ill-trained, due to the insufficient number of records/samples within these areas. These ill-trained areas may possibly cause the machine learning model to provide a false prediction when encountering a prediction request that falls within them, given the scarce amount of information on which the model bases its prediction.
By way of example, referring to Fig. 1, assuming the predetermined threshold was defined to be the number 3, subspaces 104a and 104c would be classified as candidate blind-spots as both are non-zero subspaces with a number of records (2) that is below the predetermined threshold (3).
Attention is now drawn to the components of the machine learning model blind- spot detection system 200.
Fig. 2 is a block diagram schematically illustrating one example of the machine learning model blind-spot detection system 200, in accordance with the presently disclosed subject matter.
In accordance with the presently disclosed subject matter, machine learning model blind-spot detection system 200 (also interchangeably referred to herein as "system 200") can comprise a network interface 206. The network interface 206 (e.g., a network card, a WiFi client, a LiFi client, a 3G/4G client, or any other component) enables system 200 to communicate over a network with external systems and handles inbound and outbound communications from such systems. For example, system 200 can receive, through network interface 206, one or more machine learning models and corresponding training data-sets used to train the machine learning models from one or more external systems.
In some cases, where system 200 is installed on a user's device, for example, a laptop, desktop, or tablet computer running an installed on-premises program, system 200 does not include the network interface 206. This provides further protection of the user's potentially sensitive data, as it restricts access to the user's data to a specific computer or organization using the program.
System 200 can further comprise or be otherwise associated with a data repository 204 (e.g., a database, a storage system, a memory including Read Only Memory - ROM, Random Access Memory - RAM, or any other type of memory, etc.) configured to store data, optionally including, inter alia, machine learning models, training data-sets, thresholds, candidate blind spots, actual blind spots, ordered synthetic data-sets, target values, ordered lists of predicted probabilities of target values, etc. Data repository 204 can be further configured to enable retrieval and/or update and/or deletion of the stored data. It is to be noted that in some cases, data repository 204 can be distributed, while the system 200 has access to the information stored thereon, e.g., via a wired or wireless network to which system 200 is able to connect (utilizing its network interface 206). System 200 further comprises processing circuitry 202. Processing circuitry 202 can be one or more processing units (e.g., central processing units), microprocessors, microcontrollers (e.g., microcontroller units (MCUs)) or any other computing devices or modules, including multiple and/or parallel and/or distributed processing units, which are adapted to independently or cooperatively process data for controlling relevant system 200 resources and for enabling operations related to system 200's resources.
The processing circuitry 202 comprises a blind-spot detection module 208, configured to perform a blind-spot detection process, as further detailed herein, inter alia with reference to Fig. 3.
Turning to Fig. 3, there is shown a flowchart illustrating one example of a sequence of operations carried out for detecting actual blind-spots, in accordance with the presently disclosed subject matter.
Accordingly, the machine learning model blind-spot detection system can be configured to perform a blind-spot detection process 300, e.g., using blind-spot detection module 208.
For this purpose, the machine learning model blind-spot detection system 200 obtains a training data-set including one or more records, each including one or more features of which one or more are numeric features, and a machine learning model, trained using the training data-set, capable of predicting a target value and a probability of the target value for each record, based on its collection of features (block 302). In a non-limiting example, the one or more records include information about individuals, such that each record has the information on the following features of an individual: age, income, family status, gender, and race, and the machine learning model is directed to provide an estimation on whether each of the individuals is suited to receive a loan from a designated bank.
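For illustration only, the following is a minimal sketch of the inputs assumed at block 302: a tabular training data-set carrying the example features and a trained model that exposes class-probability predictions. The library choices (pandas, scikit-learn), the file name, and the column names are assumptions, not part of the disclosure.

```python
# Minimal sketch of the inputs to block 302. File and column names are
# hypothetical; any model exposing predict_proba() would fit the description.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

training_df = pd.read_csv("loan_training_data.csv")   # hypothetical file
feature_cols = ["age", "income", "family_status", "gender", "race"]
numeric_cols = ["age", "income"]

X = pd.get_dummies(training_df[feature_cols])          # one-hot encode categoricals
y = training_df["loan_approved"]                       # hypothetical target column

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# model.predict_proba(X) yields, per record, a probability for each target value.
```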
The system 200 operates on the data representation of the training data-set and identifies candidate blind-spots, which are feasible subspaces within the data representation having a number of records (e.g., individuals) that is non-zero and below a predetermined threshold (block 304). In our continuing non-limiting example, the system 200 identifies a single subspace including three records (referring to three individuals), which is below a predetermined threshold of four records.
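One way to realize block 304, sketched here as an assumption rather than the prescribed algorithm, is to bucket the numeric features into a coarse grid and flag cells whose record count is non-zero but below the threshold (four, in the running example). Any other subspace or clustering scheme over the data representation could serve equally well.

```python
# Illustrative grid-based candidate blind-spot search; helper names are assumptions.
import pandas as pd

def candidate_blind_spots(df, numeric_cols, bins=10, threshold=4):
    """Return grid cells holding a non-zero number of records below `threshold`."""
    binned = df[numeric_cols].copy()
    for col in numeric_cols:
        binned[col] = pd.cut(df[col], bins=bins, labels=False)
    counts = binned.groupby(numeric_cols).size()
    # Empty cells never appear in `counts`, so every listed cell is non-zero;
    # keep only those whose record count falls strictly below the threshold.
    return counts[counts < threshold]

# candidates = candidate_blind_spots(training_df, ["age", "income"], threshold=4)
```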
Once the candidate blind-spots are identified, the system 200 generates, for each identified candidate blind-spot, a synthetic training data-set (block 306). The synthetic training data-set is identical to the original training data-set obtained by system 200, with the addition of one or more ordered series of synthetic records located within the identified candidate blind-spots.
The one or more ordered series of synthetic records include modified copies of one or more records (e.g., individuals) within the corresponding candidate blind-spot possessing, for example, a change in a given numeric feature of the numeric features of the given record. In our continuing non-limiting example, a single record (representing a single individual) of the three records within the identified subspace is selected to serve as the basis for a series of ordered synthetic records. Each ordered synthetic record possesses a change in its age feature.
The change in the numeric feature is articulated, for example, by a gradual increase or decrease in the value of the given numeric feature (e.g., age) within the series of ordered synthetic records of the synthetic training data-set. In our continuing non-limiting example, the series of ordered synthetic records represent synthetic individuals having identical values for their income, family status, gender, and race features, except for the age feature, where each individual generated has an age feature value that differs from the value of the previously generated individual and the value of the following generated individual equally (e.g., each individual generated is two months older/younger than the individual previously generated, and is two months younger/older than the following generated individual, respectively).
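A minimal sketch of this generation step, assuming the running example: one record from the candidate blind-spot is copied repeatedly, with only the age feature advanced by a fixed two-month increment per copy. The function name, step size, and number of copies are illustrative assumptions.

```python
# Illustrative generation of an ordered series of synthetic records (block 306).
def ordered_synthetic_series(record, feature="age", step=2.0 / 12.0, n_steps=10):
    """record: dict of feature values; returns copies ordered by a gradually increasing `feature`."""
    series = []
    for i in range(1, n_steps + 1):
        synthetic = dict(record)                          # all other features stay identical
        synthetic[feature] = record[feature] + i * step   # gradual, equal-sized increase
        series.append(synthetic)
    return series

# base = training_df.iloc[0].to_dict()                    # one record from the candidate blind-spot
# synthetic_records = ordered_synthetic_series(base)      # ages spaced two months apart
```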
In some cases, the change in the value of the numeric feature (gradual increase or decrease) is determined based on the distribution of the numeric feature within the training data-set. In our continuing non-limiting example, the distribution of the age feature within the training data-set is a normal distribution, such that the synthetic individuals are generated in accordance with it.
In some cases, the distribution can vary between different numeric features, for example, between the age feature and the income feature, such that a change of a single unit within the age feature, e.g., from 30 to 31, has much more influence than a change of a single unit within the income feature, e.g., from 10,000 USD to 10,001 USD.
In other cases, the distribution can vary within the same numeric feature, for example, the age feature, such that an age change of a specific individual from, for example, 30 to 30.2 (thirty years and two months) has much less influence than a change in the age of a specific individual from, for example, 1 to 1.2 (one year and two months).
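One possible, assumed way to make the step distribution-aware, in the spirit of the preceding paragraphs, is to size each increment as a fraction of the feature's spread in the training data, so that one step in age and one step in income carry comparable influence. This is a sketch under that assumption, not the disclosed method.

```python
# Illustrative distribution-scaled step size; the 5% fraction is an assumption.
def distribution_scaled_step(df, feature, fraction=0.05):
    """Size one increment as a fraction of the feature's standard deviation in the training data."""
    return fraction * df[feature].std()

# step_age = distribution_scaled_step(training_df, "age")
# step_income = distribution_scaled_step(training_df, "income")
# ordered_synthetic_series(base, feature="income", step=step_income)
```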
Following the identification of candidate blind-spots and the generating of one or more series of ordered synthetic records within each corresponding candidate blind-spot, system 200 identifies if any of the candidate blind-spots are actual blind-spots. To do so, the machine learning model obtained by system 200 operates on the one or more series of ordered synthetic records (e.g., individuals) of each corresponding candidate blind-spot, and generates an ordered list of predicted probabilities of target values for the one or more records (e.g., individuals) within the series of ordered synthetic records. In our continuing non-limiting example, system 200 generates an ordered list including, for each ordered synthetic record of the single record selected, a predicted probability, between zero and one, as to whether each ordered synthetic record (synthetic individual) would be able to repay a loan granted by the designated bank, and, by that, whether each ordered synthetic record (synthetic individual) is suited to receive a loan from the designated bank.
System 200 then obtains the ordered list of predicted probabilities of target values of each ordered synthetic record and identifies whether one or more differences between sequential values of the ordered list of predicted probabilities meet a smoothness condition. If the smoothness condition is not met, the candidate blind-spot is identified as an actual blind-spot. The smoothness condition is, for example, one or more of: the difference is below a difference threshold, or a trend of the ordered list of predicted probabilities can be described by a second-degree polynomial function. In our continuing non-limiting example, system 200 obtains the ordered list of predicted probabilities of the series of ordered synthetic records and identifies if there are differences between sequential values of predicted probabilities within the list that are above a threshold difference of 0.2. If the answer is yes, system 200 identifies the candidate blind-spot as an actual blind-spot and informs the user.
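A hedged sketch of this smoothness test follows. The 0.2 difference threshold comes from the running example; the polynomial check is one possible reading of the second condition (a poor second-degree fit also flags the candidate), and the encoding helper and tolerance value are assumptions.

```python
# Illustrative smoothness test over the ordered list of predicted probabilities.
import numpy as np

def is_actual_blind_spot(probabilities, diff_threshold=0.2, poly_tol=0.05):
    """Flag the candidate blind-spot if the ordered probabilities are not smooth."""
    p = np.asarray(probabilities, dtype=float)
    if np.any(np.abs(np.diff(p)) > diff_threshold):
        return True                                   # an abrupt jump breaks smoothness
    # Optional second check: can the trend be described by a second-degree polynomial?
    x = np.arange(len(p))
    coeffs = np.polyfit(x, p, deg=2)
    residual = np.max(np.abs(np.polyval(coeffs, x) - p))
    return residual > poly_tol                        # a poor fit also flags the candidate

# probs = model.predict_proba(encoded_synthetic_records)[:, 1]   # hypothetical encoding step
# if is_actual_blind_spot(probs):
#     print("candidate blind-spot confirmed as an actual blind-spot")
```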
In some cases, system 200 is capable of retraining the machine learning model using a complementary training data-set, including records covering the actual blind-spots.
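A brief sketch of such retraining, under the assumption that complementary records covering the actual blind-spots are available as a second table; the helper names and the one-hot encoding step are illustrative.

```python
# Illustrative retraining on the original data augmented with blind-spot-covering records.
import pandas as pd

def retrain_with_complementary(model, training_df, complementary_df, feature_cols, target_col):
    """Refit the estimator on the original data plus the complementary records."""
    combined = pd.concat([training_df, complementary_df], ignore_index=True)
    X = pd.get_dummies(combined[feature_cols])
    y = combined[target_col]
    return model.fit(X, y)

# model = retrain_with_complementary(model, training_df, new_blind_spot_records,
#                                    feature_cols, "loan_approved")
```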
In some cases, system 200 can receive additional data associated with records that were not part of the training data-set, and identify whether these records fall within subspaces of the data representation that were previously identified as actual blind spots. If one or more of these records fall within the area of an actual blind spot, system 200 informs the user. In other cases, additionally or alternatively to the above, system 200 can identify whether these records, which were not part of the training data-set, form new actual blind spots (in which the machine learning model behaves erratically) within the data representation and inform the system 200's user of it.
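A sketch of the monitoring step, assuming the grid cells flagged earlier are kept as a set and the per-feature bin edges are retained; incoming records that land in a flagged cell trigger an alert. All names here are illustrative assumptions.

```python
# Illustrative check of incoming records against previously confirmed blind-spot cells.
import numpy as np

def check_new_records(new_df, numeric_cols, bin_edges, actual_blind_spot_cells):
    """Report incoming records that land inside previously flagged blind-spot grid cells."""
    alerts = []
    for idx, row in new_df.iterrows():
        cell = tuple(int(np.digitize(row[col], bin_edges[col])) for col in numeric_cols)
        if cell in actual_blind_spot_cells:
            alerts.append((idx, cell))    # record index and the offending cell
    return alerts

# alerts = check_new_records(incoming_df, ["age", "income"], bin_edges, blind_spot_cells)
```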
It is to be noted, with reference to Fig. 3, that some of the blocks can be integrated into a consolidated block or can be broken down to a few blocks and/or other blocks may be added. It is to be further noted that some of the blocks are optional. It should also be noted that whilst the flow diagram is described also with reference to the system elements that realize them, this is by no means binding, and the blocks can be performed by elements other than those described herein.
It is to be understood that the presently disclosed subject matter is not limited in its application to the details set forth in the description contained herein or illustrated in the drawings. The presently disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Hence, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting. As such, those skilled in the art will appreciate that the conception upon which this disclosure is based may readily be utilized as a basis for designing other structures, methods, and systems for carrying out the several purposes of the presently disclosed subject matter.
It will also be understood that the system according to the presently disclosed subject matter can be implemented, at least partly, as a suitably programmed computer. Likewise, the presently disclosed subject matter contemplates a computer program being readable by a computer for executing the disclosed method. The presently disclosed subject matter further contemplates a machine-readable memory tangibly embodying a program of instructions executable by the machine for executing the disclosed method.

Claims

CLAIMS:
1. A machine learning model blind-spot detection system, the machine learning model blind-spot detection system comprising a processing circuitry configured to: obtain: (a) a training data-set, the training data-set comprising a plurality of records, each record including a plurality of features, wherein one or more of the features are numeric features, and (b) a machine learning model, trained using the training data-set, capable of predicting a target value, and a probability of the target value, based on the collection of features; identify one or more candidate blind-spots, being feasible subspaces within a data representation of the records, wherein the number of records within the feasible spaces is non-zero and below a threshold; generate, for at least one candidate blind-spot, an ordered synthetic training data set, including ordered synthetic records that are within the candidate blind-spot, wherein each of the ordered synthetic records is a modified copy of at least one given record within the corresponding candidate blind-spot, modified by changing a given numeric feature of the numeric features of the given record, wherein the change of the numeric feature is gradually increasing or decreasing within the ordered synthetic training data-set; identify, one or more actual blind-spots, if any, being candidate blind-spots wherein at least one difference between sequential values of an ordered list of predicted probabilities of target values predicted by the machine learning model using the ordered synthetic records of the corresponding candidate blind-spot do not meet a smoothness condition.
2. The system of claim 1, wherein the smoothness condition is one or more of: the difference is below a difference threshold, or a trend of the ordered list of predicted probabilities can be described by a second-degree polynomial function.
3. The system of claim 1, wherein the gradual increase or decrease is determined based on a distribution of the numeric feature within the training data-set.
4. The system of claim 1, wherein the processing circuitry is further configured to retrain the machine learning model using a complementary training data set including records covering the actual blind-spots.
5. The system of claim 1, wherein the processing circuitry is further configured to obtain records that were not part of the training data-set, identify whether these records are within the one or more actual blind-spots or whether these records form new actual blind spots, and inform of it.
6. A method for detecting blind-spots of a machine learning model, the method comprising: obtaining: (a) a training data-set, the training data-set comprising a plurality of records, each record including a plurality of features, wherein one or more of the features are numeric features, and (b) a machine learning model, trained using the training data-set, capable of predicting a target value, and a probability of the target value, based on the collection of features; identifying one or more candidate blind-spots, being feasible subspaces within a data representation of the records, wherein the number of records within the feasible spaces is non-zero and below a threshold; generating, for at least one candidate blind-spot, an ordered synthetic training data-set, including ordered synthetic records that are within the candidate blind-spot, wherein each of the ordered synthetic records is a modified copy of at least one given record within the corresponding candidate blind-spot, modified by changing a given numeric feature of the numeric features of the given record, wherein the change of the numeric feature is gradually increasing or decreasing within the ordered synthetic training data-set; identifying, one or more actual blind-spots, if any, being candidate blind-spots wherein at least one difference between sequential values of an ordered list of predicted probabilities of target values predicted by the machine learning model using the ordered synthetic records of the corresponding candidate blind-spot do not meet a smoothness condition.
7. The method of claim 6, wherein the smoothness condition is one or more of: the difference is below a difference threshold, or a trend of the ordered list of predicted probabilities can be described by a second-degree polynomial function.
8. The method of claim 6, wherein the gradual increase or decrease is determined based on a distribution of the numeric feature within the training data-set.
9. The method of claim 6, further comprising retraining the machine learning model using a complementary training data-set including records covering the actual blind-spots.
10. The method of claim 6, further comprising: obtaining records that were not part of the training data-set; identifying whether these records are within the one or more actual blind-spots or whether these records form new actual blind-spots; and informing thereof.
11. A non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code executable by at least one processor to perform a method for detecting blind-spots of a machine learning model, the method comprising: obtaining: (a) a training data-set, the training data-set comprising a plurality of records, each record including a plurality of features, wherein one or more of the features are numeric features, and (b) a machine learning model, trained using the training data-set, capable of predicting a target value, and a probability of the target value, based on the plurality of features; identifying one or more candidate blind-spots, being feasible subspaces within a data representation of the records, wherein the number of records within the feasible subspaces is non-zero and below a threshold; generating, for at least one candidate blind-spot, an ordered synthetic training data-set, including ordered synthetic records that are within the candidate blind-spot, wherein each of the ordered synthetic records is a modified copy of at least one given record within the corresponding candidate blind-spot, modified by changing a given numeric feature of the numeric features of the given record, wherein the change of the numeric feature is gradually increasing or decreasing within the ordered synthetic training data-set; and identifying one or more actual blind-spots, if any, being candidate blind-spots wherein at least one difference between sequential values of an ordered list of predicted probabilities of target values, predicted by the machine learning model using the ordered synthetic records of the corresponding candidate blind-spot, does not meet a smoothness condition.
PCT/IL2022/050255 — A machine learning model blind-spot detection system and method — WO2022215063A1 (en); priority date 2021-04-04, filing date 2022-03-08

Applications Claiming Priority (2)

US 202163170517P — priority date 2021-04-04, filing date 2021-04-04
US 63/170,517 — priority date 2021-04-04

Publications (1)

WO2022215063A1 (en)

Family

Family ID: 83546224

Family Applications (1)

PCT/IL2022/050255 — WO2022215063A1 (en) — A machine learning model blind-spot detection system and method; priority date 2021-04-04, filing date 2022-03-08

Country Status (1)

WO — WO2022215063A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080071721A1 (en) * 2006-08-18 2008-03-20 Haixun Wang System and method for learning models from scarce and skewed training data
US20120084238A1 (en) * 2007-07-31 2012-04-05 Cornell Research Foundation, Inc. System and Method to Enable Training a Machine Learning Network in the Presence of Weak or Absent Training Exemplars
US20150052090A1 (en) * 2013-08-16 2015-02-19 International Business Machines Corporation Sequential anomaly detection
US20150269050A1 (en) * 2014-03-18 2015-09-24 Microsoft Corporation Unsupervised anomaly detection for arbitrary time series
US20160092789A1 (en) * 2014-09-29 2016-03-31 International Business Machines Corporation Category Oversampling for Imbalanced Machine Learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MICHAEL FEINDT, "A Neural Bayesian Estimator for Conditional Probability Densities", arXiv (Cornell University Library), 19 February 2004, XP080144441 *

Legal Events

121 — EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 22784268; Country of ref document: EP; Kind code of ref document: A1

NENP — Non-entry into the national phase
Ref country code: DE

32PN — EP: public notification in the EP bulletin as address of the addressee cannot be established
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 130224)

122 — EP: PCT application non-entry in European phase
Ref document number: 22784268; Country of ref document: EP; Kind code of ref document: A1