WO2024039017A1 - Method and apparatus for managing quality of data - Google Patents

Method and apparatus for managing quality of data

Info

Publication number
WO2024039017A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
model
online
quality rules
offline
Prior art date
Application number
PCT/KR2023/007653
Other languages
French (fr)
Inventor
Rajasekhara Reddy Duvvuru Muni
Karthikeyan Kumaraguru
Payyavula Rao SRINIVASA
Sejoon Oh
Original Assignee
Samsung Electronics Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Samsung Electronics Co., Ltd.
Publication of WO2024039017A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/98 Detection or correction of errors, e.g. by rescanning the pattern or by human intervention; Evaluation of the quality of the acquired patterns

Definitions

  • the present disclosure generally relates to machine learning (ML) models, and more particularly relates to a system and a method for managing quality of data for providing artificial intelligence (AI) based services to one or more user devices.
  • Inferences made based on machine learning models built with bad quality data could cost millions of dollars, cause trust issues, and reduce user engagement in various applications deploying ML models.
  • the present disclosure refers to a method for managing quality of data for providing artificial intelligence (AI) based services to one or more user devices.
  • the method comprises defining a first set of data quality rules based on offline data stored in a database, the offline data being associated with corresponding applications associated with the one or more user devices.
  • the method further comprises generating a machine learning (ML) model based on the first set of data quality rules, wherein the ML model is deployed to provide the AI based services in association with the corresponding applications.
  • the method also comprises monitoring over a period of time, using the first set of data quality rules, incoming real-time, online data received by the deployed ML model across the one or more user devices.
  • the method further comprises continually determining, based on the monitoring of the online data, one or more characteristics of the online data that are divergent from the offline data.
  • the method also comprises dynamically generating a second set of data quality rules, by updating the first set of data quality rules based on the determined one or more characteristics, to facilitate providing the AI based services in association with the corresponding applications.
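The method steps above can be sketched end to end. This is a minimal illustration under the assumption that each data quality rule is a per-field numeric range; the rule format, drift check, and update logic are illustrative, not the claimed implementation.

```python
# Minimal sketch of the claimed method: rules are per-field numeric
# ranges mined from offline data, then re-mined when online data drifts.

def define_rules(offline_rows, field):
    """Step 1: derive a (min, max) quality rule from offline data."""
    values = [r[field] for r in offline_rows]
    return {field: (min(values), max(values))}

def monitor(online_rows, rules, field):
    """Step 3: flag online records that violate the first rule set."""
    lo, hi = rules[field]
    return [r for r in online_rows if not lo <= r[field] <= hi]

def update_rules(rules, violations, field):
    """Step 5: widen the rule to cover divergent online characteristics."""
    lo, hi = rules[field]
    vals = [r[field] for r in violations]
    return {field: (min([lo] + vals), max([hi] + vals))}

offline = [{"age": 25}, {"age": 40}, {"age": 60}]
rules = define_rules(offline, "age")          # {'age': (25, 60)}
online = [{"age": 30}, {"age": 72}]
bad = monitor(online, rules, "age")           # [{'age': 72}] diverges
rules2 = update_rules(rules, bad, "age")      # {'age': (25, 72)}
```

The same loop generalizes to any field for which a range statistic is meaningful; categorical fields would need a different rule shape.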
  • the present disclosure refers to a system for managing quality of data for providing artificial intelligence (AI) based services to one or more user devices.
  • the system comprises a memory and at least one processor communicatively coupled to the memory.
  • the at least one processor being configured to define a first set of data quality rules based on offline data stored in a database, the offline data being associated with corresponding applications associated with the one or more user devices.
  • the at least one processor being further configured to generate a machine learning (ML) model based on the first set of data quality rules, wherein the ML model is deployed to provide the AI based services in association with the corresponding applications.
  • the at least one processor being further configured to monitor over a period of time, using the first set of data quality rules, incoming real-time, online data received by the deployed ML model across the one or more user devices.
  • the at least one processor also being configured to continually determine, based on the monitoring of the online data, one or more characteristics of the online data that are divergent from the offline data and dynamically generate a second set of data quality rules, by updating the first set of data quality rules based on the determined one or more characteristics, to facilitate providing the AI based services in association with the corresponding applications.
  • Figure 1 illustrates image classification using ML models, in accordance with existing art
  • Figure 2 illustrates payment fraud detection using ML models, in accordance with existing art
  • Figure 3 illustrates a flow chart depicting a method for managing quality of data for providing artificial intelligence (AI) based services to one or more user devices, according to an embodiment of the present disclosure
  • Figure 4 illustrates a block diagram of a system for managing quality of data for providing artificial intelligence (AI) based services to one or more user devices, according to an embodiment of the present disclosure
  • Figure 5 illustrates a block diagram of an intelligent rule recommender engine and an associated workflow, according to an embodiment of the present disclosure
  • Figure 6 illustrates a data observability engine, according to an embodiment of the present disclosure
  • Figure 7 illustrates a divergence detection engine and an associated workflow, according to an embodiment of the present disclosure
  • Figures 8(a)-8(b) illustrate a comparison between existing techniques and the proposed method for data observability in real-time log streaming, according to various embodiments of the present disclosure
  • Figure 9 illustrates a use case associated with accurate image classification using enhanced data observability, according to an embodiment of the present disclosure
  • Figure 10 illustrates a use case associated with real time fraud detection in payment applications using enhanced data observability, according to an embodiment of the present disclosure
  • Figure 11 illustrates a use case associated with recommending right content for interest deviation using enhanced data observability, according to an embodiment of the present disclosure.
  • Figure 12 illustrates a use case associated with application recommendation based on user app usage pattern, according to an embodiment of the present disclosure.
  • a smart, dynamic data quality management platform that integrates rule extraction and validation on the data streams of big-data applications, and monitors dynamic drift in data that would impact the machine learning system's performance and, in turn, the dependent services.
  • Figure 1 illustrates image classification using ML models, in accordance with existing art.
  • one or more rules may be manually generated for data cleaning of offline data provided for building of an ML model.
  • the data may be cleaned and an ML model may be built using the cleaned data and the rules. Further, the built ML models may be used on the online data to classify the images.
  • a cat may be predicted as a tiger by the ML model, as the data quality management implemented offline is not used to monitor the online data, and the ML model may have failed.
  • the damage to business may already have been done due to incorrect inferences.
  • Figure 2 illustrates payment fraud detection using ML models, in accordance with existing art.
  • An ML model may be built using manual rules generated for offline data.
  • the user data from financial payment applications may change over time and diverge from offline data used for ML model training, leading to predicting fraud transactions as non-fraud, or vice versa, thereby leading to incorrect predictions.
  • a fraud detection ML model is built using manual data cleaning.
  • customer age should be more than 36, income should be in the range of 40,000-53,500, and credit score should be in the range of 354-695 for a transaction to be detected as legitimate.
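The example rule above can be expressed as a simple predicate. The field names are hypothetical; the threshold values are those stated in the text.

```python
# The legitimacy rule from the text as a predicate. Field names
# ("age", "income", "credit_score") are illustrative assumptions.

def is_legitimate(txn):
    return (txn["age"] > 36
            and 40_000 <= txn["income"] <= 53_500
            and 354 <= txn["credit_score"] <= 695)

print(is_legitimate({"age": 45, "income": 50_000, "credit_score": 600}))  # True
print(is_legitimate({"age": 30, "income": 50_000, "credit_score": 600}))  # False
```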
  • Figure 3 illustrates a flow diagram depicting a method for managing quality of data for providing artificial intelligence (AI) based services to one or more user devices, in accordance with an embodiment of the present disclosure.
  • Figure 4 illustrates a block diagram of a system for managing quality of data for providing artificial intelligence (AI) based services to one or more user devices.
  • the system 400 may include, but is not limited to, at least one processor 402, memory 404, modules 406, and data unit 408.
  • the modules 406 and the memory 404 may be coupled to the at least one processor 402.
  • the at least one processor 402 can be a single processing unit or several processing units, all of which could include multiple computing units.
  • the at least one processor 402 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions.
  • the at least one processor 402 is configured to fetch and execute computer-readable instructions and data stored in the memory 404.
  • the processor 402 may include one or a plurality of processors.
  • the one or more processors may be a general-purpose processor such as a CPU, an AP, or a digital signal processor (DSP), a graphics-only processor such as a GPU or a vision processing unit (VPU), or an artificial intelligence-only processor such as an NPU.
  • the processors dedicated to artificial intelligence may be designed as a hardware structure specialized for processing a specific artificial intelligence model.
  • One or more processors control input data to be processed according to predefined operating rules or artificial intelligence models stored in a memory.
  • a predefined action rule or an artificial intelligence model is characterized in that it is created through learning.
  • being created through learning means that a basic artificial intelligence model is trained with a plurality of training data by a learning algorithm, so that a predefined action rule or artificial intelligence model set to perform a desired characteristic (or purpose) is created.
  • Such learning may be performed in the device itself in which artificial intelligence according to the present disclosure is performed, or through a separate server and/or system.
  • Examples of learning algorithms include supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning, but are not limited to the above examples.
  • An artificial intelligence model may be composed of a plurality of neural network layers.
  • Each of the plurality of neural network layers has a plurality of weight values, and a neural network operation is performed through an operation between an operation result of a previous layer and a plurality of weight values.
  • a plurality of weights possessed by a plurality of neural network layers may be optimized by a learning result of an artificial intelligence model. For example, a plurality of weights may be updated so that a loss value or a cost value obtained from an artificial intelligence model is reduced or minimized during a learning process.
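The weight-update behaviour described above can be illustrated with a single gradient-descent step on one weight. This is a least-squares toy example, not the disclosed training procedure.

```python
# One gradient-descent step on a single weight w for the least-squares
# loss (w*x - y)^2, showing how a weight is updated so the loss shrinks.

def loss(w, x, y):
    return (w * x - y) ** 2

def step(w, x, y, lr=0.1):
    grad = 2 * (w * x - y) * x   # d(loss)/dw
    return w - lr * grad

w, x, y = 0.0, 1.0, 2.0
w_new = step(w, x, y)                     # 0.4
assert loss(w_new, x, y) < loss(w, x, y)  # 2.56 < 4.0: loss reduced
```

Repeating such steps across all weights of all layers is what "updated so that a loss value ... is reduced or minimized" amounts to.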
  • the artificial neural network may include a deep neural network (DNN), for example, a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), or deep Q-networks, but is not limited to the above examples.
  • the memory 404 may include any non-transitory computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read-only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
  • the modules 406 amongst other things, include routines, programs, objects, components, data structures, etc., which perform particular tasks or implement data types.
  • the modules 406 may also be implemented as signal processor(s), state machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions.
  • the modules 406 can be implemented in hardware, instructions executed by a processing unit, or by a combination thereof.
  • the processing unit can comprise a computer, a processor, such as the at least one processor 402, a state machine, a logic array, or any other suitable device capable of processing instructions.
  • the processing unit can be a general-purpose processor which executes instructions to cause the general-purpose processor to perform the required tasks or, the processing unit can be dedicated to performing the required functions.
  • the modules 406 may be machine-readable instructions (software) which, when executed by at least one processor/processing unit, perform any of the described functionalities.
  • the modules 406 may include an intelligent rule recommender engine 410, a data observability engine 412, and a divergence detection engine 414.
  • the various modules 410-414 may be in communication with each other.
  • the various modules 410-414 may be a part of the at least one processor 402.
  • the at least one processor 402 may be configured to perform the functions of modules 410-414.
  • the data unit 408 serves, amongst other things, as a repository for storing data processed, received, and generated by one or more of the modules 406.
  • system 400 may be a part of a user device.
  • system 400 may be connected to the user device.
  • user device refers to any electronic device used by a user, such as a mobile device, a desktop, a laptop, a personal digital assistant (PDA), or similar devices.
  • the method comprises defining a first set of data quality rules based on offline data stored in a database.
  • the offline data is associated with corresponding applications associated with the one or more user devices.
  • the one or more applications may constantly send data and store it in the data warehouse for processing and service development. This data is considered offline data.
  • the intelligent rule recommender engine 410 may define the first set of data quality rules based on the offline data stored in the database. In an alternate embodiment, the intelligent rule recommender engine 410 may directly receive the offline data from the one or more applications.
  • the offline data may refer to data received from applications associated with the one or more user devices, the data being related to one or more interactions of respective users of the one or more user devices with the corresponding application.
  • the offline data may further relate to performance of the AI based services associated with the corresponding application.
  • the offline data may thus be considered as non-real-time data that is received from the application associated with the one or more user devices and stored for further processing.
  • Figure 5 illustrates a block diagram of an intelligent rule recommender engine and an associated workflow, according to an embodiment of the present disclosure.
  • the intelligent rule recommender engine 410 is configured to generate the first set of data quality rules for real-time ML model deployment by analyzing and classifying the retrieved offline data.
  • the intelligent rule recommender engine 410 may include a quality analyzer module 505 and a rule configurator module 507. As can be seen from Figure 5, the intelligent rule recommender engine 410 is connected to the database or data warehouse 501 which stores, over a period of time, the offline data, where the offline data is received from the corresponding one or more applications associated with the one or more user devices.
  • the database 501 is an offline data store capable of receiving data and serving data.
  • the offline data is retrieved by a profiler 503 which is connected to the intelligent rule recommender engine 410.
  • the profiler 503 periodically profiles the offline data and generates profile information associated with the stored offline data.
  • the generated profile information is shared with the quality analyzer module 505.
  • the quality analyzer module 505 determines offline data parameters based on the profile information, wherein the offline data parameters comprise data types within the offline data and data characteristics associated with each of the data types.
  • a type identifier 505a scans the profile information and determines data type associated with the offline data.
  • the type identifier 505a then shares the data type with a rule identifier 507a of the rule configurator 507 and a threshold miner 505b, as depicted in Figure 5.
  • the data may be an image and the offline data parameters may comprise object length, height, colour, pattern, and the like.
  • the data may be employee data and the offline data parameters may comprise user personal information, user preferences, user position, user income, and the like.
  • each offline data parameter may comprise data types and data characteristics.
  • the user personal information may be a parameter, and the user personal information may comprise user age as a data type and specific values, such as, 20 years, as the data characteristic.
  • image object may be a parameter, and the image object may comprise stripes as a data type and brown as a data characteristic.
  • the threshold miner 505b identifies data type specific data thresholds dynamically and validates the same, i.e., the threshold miner 505b determines the data characteristics associated with the offline data.
  • a report generator 505c, along with the type-specific thresholds, may generate a quality report with a quality score for the offline data.
  • the quality report may comprise information related to the data types and data characteristics of the various offline data parameters.
  • the quality reports may include information related to the data types and associated characteristics, such as, white stripes, brown stripes, etc.
  • the quality report may include information related to the data type and associated characteristics, such as, maximum age, median age, fill rate, and the like.
  • the generated data quality reports may be sent to users for review as alerts.
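A quality report of the kind described (maximum age, median age, fill rate) might be computed as follows; the report fields are only those mentioned above, and the function shape is an assumption.

```python
# Hypothetical quality-report computation for an "age"-like numeric
# data type: maximum, median, and fill rate (fraction of non-missing
# values), as mentioned in the text.
from statistics import median

def quality_report(values):
    present = [v for v in values if v is not None]
    return {
        "max": max(present),
        "median": median(present),
        "fill_rate": len(present) / len(values),
    }

report = quality_report([20, 35, None, 41])
# {'max': 41, 'median': 35, 'fill_rate': 0.75}
```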
  • the rule configurator 507 may define the first set of data quality rules based on the offline data parameters. In an embodiment, the rule configurator 507 may automatically define the first set of data quality rules based on the offline data parameters.
  • the rule identifier 507a, with inputs received from the type identifier 505a, automatically identifies appropriate rules and forwards them to a parameter estimator 507b.
  • the parameter estimator 507b, with inputs from the rule identifier 507a and the threshold miner 505b, estimates accurate parameters for each rule.
  • a rule definer 507c receives the rules with relevant parameters from the parameter estimator 507b and defines the first set of data quality rules.
  • the defined first set of data quality rules and optional inputs from users may be stored in a database such as a DQ Rules Repo 509.
  • the DQ Rules Repo 509 may optionally accept custom rules from users for integration with data observability engine 412.
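The profiler → quality analyzer → rule configurator flow described above might be sketched as follows. The rule schema and the numeric/categorical split are illustrative assumptions, not the disclosed format.

```python
# Sketch of the profiler 503 -> quality analyzer 505 -> rule
# configurator 507 flow; the emitted rule dict stands in for an entry
# in the DQ Rules Repo 509.

def profile(rows, field):
    """Profiler 503: collect raw values for a field."""
    return [r[field] for r in rows if field in r]

def identify_type(values):
    """Type identifier 505a: numeric vs categorical (assumed split)."""
    return "numeric" if all(isinstance(v, (int, float)) for v in values) else "categorical"

def mine_thresholds(values, dtype):
    """Threshold miner 505b: type-specific thresholds."""
    if dtype == "numeric":
        return {"min": min(values), "max": max(values)}
    return {"allowed": set(values)}

def define_rule(field, dtype, thresholds):
    """Rule definer 507c: one rule entry for the repository."""
    return {"field": field, "type": dtype, **thresholds}

rows = [{"age": 21}, {"age": 47}, {"age": 38}]
vals = profile(rows, "age")
dtype = identify_type(vals)
rule = define_rule("age", dtype, mine_thresholds(vals, dtype))
# {'field': 'age', 'type': 'numeric', 'min': 21, 'max': 47}
```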
  • the method 300 comprises generating/obtaining an ML model based on the first set of data quality rules.
  • the ML model is deployed to provide the AI based services in association with the corresponding applications.
  • the ML model may be generated/obtained by the processor 402 based on the first set of data quality rules determined by the intelligent rule recommender engine 410.
  • the offline data may be filtered based on the first set of data quality rules determined by the intelligent rule recommender engine 410, and further, the ML model may be generated/obtained based on the filtered offline data.
  • the method 300 comprises monitoring over a period of time, using the first set of data quality rules, incoming real-time, online data received by the deployed ML model across the one or more user devices.
  • the online data is related to one or more of interactions of respective users of the one or more user devices with the corresponding application and performance of the AI based services associated with the corresponding applications.
  • the one or more interactions of the users may comprise interactions done by the users via an application provided on the associated user device.
  • an image classification application may be provided on the user device and the user may interact with the application to purchase an item and/or download other applications on the user device.
  • the online data refers to real-time data being received from the applications associated with the one or more user devices.
  • the online data may differ from the offline data, in that, the online data may have some additional data or some deleted data as compared to the offline data.
  • the data observability engine 412 may monitor incoming real-time, online data received by the deployed ML model across the one or more user devices.
  • Figure 6 illustrates a data observability engine 412, according to an embodiment of the present disclosure.
  • a stream processor associated with the applications may feed the real-time online data, such as in form of a data stream 601, that serves the deployed ML model to the data observability engine 412.
  • the data observability engine 412 may fetch the first set of data quality rules from the intelligent rule recommender engine 410, and may filter the online data to identify clean and unclean data based on the first set of data quality rules.
  • the filtered data may be stored in a database or segregated data store 603 as logs for further processing.
  • the period of time may be configurable and may be configured by the data observability engine 412.
  • the data observability engine 412 may feed the deployed ML model with the clean data and accordingly, the ML model may provide the AI based services based on the clean data.
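The clean/unclean split performed by the data observability engine can be sketched with range rules of the assumed form; the rule schema is hypothetical.

```python
# Sketch of the data observability engine 412: apply the first rule
# set to an incoming stream, routing clean records toward the model
# and unclean records toward the segregated data store.

def satisfies(record, rules):
    return all(r["min"] <= record[r["field"]] <= r["max"] for r in rules)

def split_stream(stream, rules):
    clean, unclean = [], []
    for rec in stream:
        (clean if satisfies(rec, rules) else unclean).append(rec)
    return clean, unclean

rules = [{"field": "age", "min": 18, "max": 65}]
clean, unclean = split_stream([{"age": 30}, {"age": 80}], rules)
# clean == [{'age': 30}], unclean == [{'age': 80}]
```

In the described workflow, `clean` would feed the deployed ML model while `unclean` is logged for the divergence detection engine.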
  • the method 300 comprises determining, based on the monitoring of the online data, one or more characteristics of the online data that are divergent from the offline data.
  • the divergence detection engine 414 may continually determine, based on monitoring of the online data, the one or more characteristics of the online data that are divergent from the offline data.
  • the divergence detection engine 414 may determine the one or more characteristics of the online data that are divergent from the offline data, which is explained in reference to Figure 7.
  • FIG. 7 illustrates a divergence detection engine 414 and an associated workflow, according to an embodiment of the present disclosure.
  • the divergence detection engine 414 may be configured to determine data quality divergence in the incoming real-time data based on the monitoring performed by the data observability engine 412. In particular, the divergence detection engine 414 may access the clean data and the unclean data filtered from the online data. In an embodiment, the divergence detection engine 414 may receive the clean and the unclean data from the data observability engine 412. Then, the divergence detection engine 414 may process the unclean data and the clean data to detect a type of data divergence in the online data.
  • the divergence detection engine 414 may include a label generator 701 to label the clean data and the unclean data to generate labelled data.
  • the clean data is partitioned into reference and real-time data.
  • the label generator 701 may label the clean data as target and reference.
  • the divergence detection engine 414 may include a non-linear transformer 703 to perform non-linear transformations on the labelled data to generate transform data. Further, the divergence detection engine 414 may also include a linear divergence detector 705 to detect presence of data divergence based on the transformed data. Furthermore, the divergence detection engine 414 may include a linear divergence classifier 707 to determine the type of data divergence based on the detected presence of data divergence. In an embodiment, the linear divergence classifier 707 may recommend the ML model management strategy based on the type of data divergence.
  • the divergence detection engine 414 may determine the one or more characteristics of the online data based on the type of data divergence and the first set of data quality rules.
  • the divergence detection engine 414 may also include a non-linear divergence quantifier 709 that may quantify the divergence magnitude and estimate the significance of the divergence, in order to identify and recommend generation of the second set of data quality rules to counter the data divergence.
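The label → transform → detect flow might be approximated as below: reference and real-time samples are labelled, non-linearly transformed, and a trivial linear separator's accuracy signals divergence (accuracy near 0.5 means the two samples are indistinguishable). This is a generic domain-classifier drift check, not the disclosed detector.

```python
# Generic domain-classifier drift check: label reference vs real-time
# samples, apply a non-linear transform (log1p, a stand-in for the
# non-linear transformer 703), then score a midpoint linear separator
# (a stand-in for the linear divergence detector 705).
import math

def detect_divergence(reference, realtime, threshold=0.75):
    tref = [math.log1p(v) for v in reference]   # transformed, label 0
    trt = [math.log1p(v) for v in realtime]     # transformed, label 1
    mean_ref = sum(tref) / len(tref)
    mean_rt = sum(trt) / len(trt)
    mid = (mean_ref + mean_rt) / 2              # linear decision boundary
    if mean_rt >= mean_ref:                     # orient toward larger mean
        correct = sum(x < mid for x in tref) + sum(x >= mid for x in trt)
    else:
        correct = sum(x >= mid for x in tref) + sum(x < mid for x in trt)
    return correct / (len(tref) + len(trt)) >= threshold

print(detect_divergence([10, 12, 11, 13], [11, 12, 10, 13]))  # False: no drift
print(detect_divergence([10, 12, 11, 13], [40, 42, 41, 43]))  # True: drift
```

A production detector would use a proper classifier and a statistical significance test rather than a fixed accuracy threshold.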
  • the method 300 comprises generating a second set of data quality rules, based on the determined one or more characteristics, to facilitate providing the AI based services in association with the corresponding applications.
  • the method 300 comprises dynamically generating a second set of data quality rules, by updating the first set of data quality rules based on the determined one or more characteristics to facilitate providing the AI based services in association with the corresponding applications.
  • the method 300 comprises comparing the one or more characteristics of the online data with corresponding threshold ranges.
  • the first set of data quality rules are updated to dynamically generate the second set of data quality rules based on the one or more characteristics of the online data.
  • the second set of data quality rules may then be used by the data observability engine 412 to monitor the online data being provided by the applications.
  • the method 300 comprises generating an updated ML model, using the second set of data quality rules to provide the AI based services in association with the corresponding applications.
  • the ML model is updated to generate the updated ML model based on the one or more characteristics of the online data.
  • the updated ML model may be generated by updating the previously deployed ML model.
  • the updated model may be generated by selecting an alternate model from a plurality of pre-developed models, the alternate model being developed so as to process the online data that comprises characteristics divergent from the offline data.
  • the plurality of pre-developed models may be stored in an ML model repository.
  • the updated ML model may be generated by developing a new model based on the second set of data quality rules.
  • the online data may comprise characteristics divergent from the offline data such that a pre-developed model may not be available to process the online data.
  • the online data may comprise characteristics that may not have been recorded previously. In such cases, updating the ML model may comprise developing a new ML model based on the second set of data quality rules.
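The three model-management paths above (update the deployed model, deploy a pre-developed alternate, develop a new model) can be sketched as a selection function; the divergence-type keys and repository shape are hypothetical.

```python
# Sketch of selecting a model-management strategy from the detected
# divergence type. Keys such as "minor" and "seasonal_shift" and the
# repository layout are illustrative assumptions.

def choose_strategy(divergence_type, model_repo):
    if divergence_type == "minor":
        return "update_deployed_model"
    if divergence_type in model_repo:           # pre-developed model exists
        return f"deploy_alternate:{model_repo[divergence_type]}"
    return "develop_new_model"                  # previously unseen characteristics

repo = {"seasonal_shift": "model_v2_seasonal"}
print(choose_strategy("minor", repo))           # update_deployed_model
print(choose_strategy("seasonal_shift", repo))  # deploy_alternate:model_v2_seasonal
print(choose_strategy("novel_pattern", repo))   # develop_new_model
```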
  • the updated ML model is deployed to provide the AI based services in association with the corresponding applications. Then, the online data received by the updated ML model is continually monitored across the one or more user devices. The online data is filtered based on the second set of data quality rules, and the updated ML model provides the AI based services based on the filtered online data. Alternatively, the online data is filtered based on the first set of data quality rules, and the updated ML model provides the AI based services based on the filtered online data.
  • in response to the first set of data quality rules being updated and/or the ML model being updated, an alert may be provided to a user associated with the one or more user devices and/or to a remote admin user associated with an admin device.
  • Figures 8(a)-8(b) illustrate a comparison between existing techniques and the proposed method for data observability in real-time log streaming, according to various embodiments of the present disclosure.
  • the offline data is manually cleaned, and an ML model is built to deploy on a user device or server.
  • the deployed model may serve online inferences and the model performance is monitored periodically and degradation will be rectified after some delay.
  • the offline data is used to learn quality issues and rules to fix them.
  • An ML model is built to deploy on a user device or a server. Further, during online inference, incoming data is scanned by a data monitoring module using the learned rules, and quality data reaches the deployed ML model for inference. A deviation detector module linked to the data monitoring module periodically monitors deviations in the incoming data and raises alerts when new rules need to be generated.
  • the data gets automatically monitored by an intelligent rule recommender engine comprising a profiler to automatically profile the data at periodic intervals to find the data characteristics, a quality analyzer to find the quality parameters of the data that might impact the downstream analysis, a rule configurator that will formulate the rules to fix the quality issues detected therein, and a repository to maintain and serve the rules to the required modules.
  • the observability engine may access the rules for data quality management from the rule recommender to apply them and clean the incoming stream data from the application event logger. From the cleaned data, the features necessary for the ML model are extracted and served via the feature generator. In addition to serving the model, the stream processor also serves a classifier to classify clean/unclean data and sends the clean data to the database.
  • Such clean data, dynamically updated, can be used for downstream analysis by the ML model.
  • the clean data persisted in the database is fed to a data divergence detection engine to periodically monitor data divergence and model anomalies in the application data. Further, the data divergence detection engine alerts downstream analyst(s) for corrective action or automatically generates a second set of rules. The data divergence detection engine may also automatically trigger the ML model updating or alternate model deployment process.
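The observability flow described above can be sketched in a few lines. This is a hypothetical illustration, not the claimed implementation: the rule format (per-field value ranges) and the example field names are assumptions, chosen only to show how incoming records can be split between the deployed ML model and the divergence detection engine.

```python
def make_rule(field, min_value, max_value):
    """Build a simple range rule for one field (illustrative rule format)."""
    return {"field": field, "min": min_value, "max": max_value}

def is_clean(record, rules):
    """A record is clean when every rule's field lies within its range."""
    for rule in rules:
        value = record.get(rule["field"])
        if value is None or not (rule["min"] <= value <= rule["max"]):
            return False
    return True

def observe(stream, rules):
    """Split an incoming stream into clean data (forwarded to the ML model)
    and divergent data (sent to the divergence detection engine)."""
    clean, divergent = [], []
    for record in stream:
        (clean if is_clean(record, rules) else divergent).append(record)
    return clean, divergent

# Hypothetical first set of data quality rules and incoming stream records.
rules = [make_rule("age", 18, 90), make_rule("credit_score", 354, 695)]
stream = [{"age": 45, "credit_score": 600},   # passes all rules
          {"age": 19, "credit_score": 178}]   # credit score out of range
clean, divergent = observe(stream, rules)
```

In this sketch, only the clean record would reach the deployed model, while the divergent record would be routed onward for rule regeneration.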
  • Figure 9 illustrates a use case associated with accurate image classification using enhanced data observability, according to an embodiment of the present disclosure.
  • a best performing image identification ML model is built 'offline' based on an automated rule recommender system (intelligent rule recommender engine) (5 rules along with parameters) and deployed in a service online.
  • the data observability engine may use the automatically generated data quality rules to monitor the incoming online data before sending it to the deployed ML model. So, the cat image is dropped as it does not pass the data quality rules.
  • The data divergence detection engine may analyze the dropped/poor quality data (the cat image) and assess whether any new rules have to be made, or whether rule parameters (values; in this example, brown stripes and the minimum and maximum values of the object length) have to be updated, forming the second set of rules.
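The image rule check in this use case can be illustrated as follows. The feature names (`stripe_color`, `object_length`) and the concrete rule values are assumptions derived from the example above (brown stripes, minimum and maximum object length); they stand in for whatever features an upstream extractor would produce.

```python
def passes_image_rules(features, rules):
    """Check extracted image features against the learned data quality rules:
    the stripe color must match and the object length must lie in range."""
    if features["stripe_color"] != rules["stripe_color"]:
        return False
    return rules["min_length"] <= features["object_length"] <= rules["max_length"]

# Illustrative rule parameters and extracted features (values are assumed).
rules = {"stripe_color": "brown", "min_length": 140, "max_length": 310}
tiger = {"stripe_color": "brown", "object_length": 250}
cat = {"stripe_color": "grey", "object_length": 45}
```

Under these assumed rules the tiger image passes and reaches the deployed classifier, while the cat image is dropped and handed to the divergence detection engine for rule assessment.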
  • Figure 10 illustrates a use case associated with real time fraud detection in payment applications using enhanced data observability, according to an embodiment of the present disclosure.
  • the intelligent rule recommender engine is configured on the application.
  • Data quality rules (Employment type - Engineer, Principal, IT; and Age data cleaning by filling zero values with the average grouped by Married), along with parameters, are learned.
  • a best performing ML Model is built and deployed in production.
  • the data observability engine will use the data quality rules to monitor the incoming online data before sending it to the deployed ML model.
  • Good quality data is sent to Prediction.
  • The data observability engine finds that data with missing values (Age) will be filled using the rules learned from the data and sent to the prediction service. In the absence of the proposed system, this record would have been classified as fraud.
  • Bad quality data is sent to the data divergence detection engine. Divergent data [income of 245,000, a new employment type (Freelancer), and minimum (178) and maximum (795) values of the credit score] is detected and sent to the intelligent rule recommender engine to generate the second set of rules.
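The cleaning rule from this use case, filling zero Age values with the group average keyed by marital status, can be sketched as below. The column names and sample values are illustrative, not taken from the actual application data.

```python
def fill_missing_age(records):
    """Replace zero Age values with the average age of records that share
    the same marital status (the learned fill rule from the example)."""
    sums, counts = {}, {}
    for r in records:
        if r["age"] > 0:
            sums[r["married"]] = sums.get(r["married"], 0) + r["age"]
            counts[r["married"]] = counts.get(r["married"], 0) + 1
    for r in records:
        if r["age"] == 0:
            key = r["married"]
            r["age"] = sums[key] / counts[key]
    return records

# Illustrative records; the third has a missing (zero) Age to be filled.
records = [
    {"age": 30, "married": True},
    {"age": 40, "married": True},
    {"age": 0, "married": True},
]
records = fill_missing_age(records)
```

After filling, the previously incomplete record can be sent to the prediction service instead of being misclassified as fraud.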
  • Figure 11 illustrates a use case associated with recommending right content for interest deviation using enhanced data observability, according to an embodiment of the present disclosure.
  • each tab has multiple genres, and each genre has multiple channels. Users frequently switch between them.
  • Multiple ML models are built and stored in a repository using clean data as per the data quality rules, and are ready to be deployed dynamically.
  • the data observability engine and data divergence detection engine work hand in hand in real time to observe the preferred channels and divergence in the preferences, and fetch the relevant model to drive user engagement.
  • Figure 12 illustrates a use case associated with application recommendation based on user app usage pattern, according to an embodiment of the present disclosure.
  • an application store recommends apps based on a user's past installs and usage, captured by the service with user consent. Based on a time slice, there are four different types of users with different preferences.
  • An ML model is built using this data and deployed in production. At some point, GDPR (General Data Protection Regulation) rules are implemented and most of the customers revoke their consent for tracking app usage, leading to skewed or biased data being available for recommending apps. In this situation, the recommender will not be able to recommend relevant apps to all users.
  • The proposed data observability engine and divergence detection engine will detect the divergence in the incoming data and can alert and/or deploy an alternate model. At the same time, a new set of rules will be generated and used to monitor the incoming data by the observability engine.
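The alert/alternate-model behavior above can be sketched as follows. This is a hypothetical simplification: the divergence measure (fraction of records failing the current rules), the threshold value, and the model identifiers in the repository are all assumptions used only to illustrate deploying a pre-developed alternate model without retraining.

```python
def divergence_ratio(records, is_clean):
    """Fraction of incoming records that violate the current quality rules."""
    divergent = sum(1 for r in records if not is_clean(r))
    return divergent / len(records)

def select_model(records, is_clean, repository, current, threshold=0.5):
    """Keep the current model while divergence stays below the threshold;
    otherwise fetch the alternate pre-developed model from the repository."""
    if divergence_ratio(records, is_clean) >= threshold:
        return repository["alternate"]
    return current

# Illustrative repository and stream: most users have revoked consent,
# so usage-based records no longer match the deployed model's training data.
repository = {"alternate": "model_no_consent"}
records = [{"consent": False}] * 3 + [{"consent": True}]
model = select_model(records, lambda r: r["consent"], repository,
                     current="model_full_usage")
```

With three of four records divergent, the sketch switches to the repository's alternate model, mirroring the described behavior under GDPR consent revocation.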
  • the present disclosure provides the following exemplary advantages:
  • the system of the present disclosure is independent of meta-data and is capable of learning from actual data automatically, both offline and in real time;
  • the system of the present disclosure is designed to learn data quality issues from offline data, observe online data to be served to the deployed ML model, and alert the rule generator to learn new rules and/or perform ML model management (e.g., by deploying a suitable model without retraining the existing ML model).
  • Embodiments of the disclosure can also be embodied as a storage medium including instructions executable by a computer such as a program module executed by the computer.
  • a computer readable medium can be any available medium which can be accessed by the computer and includes all volatile/non-volatile and removable/non-removable media.
  • the computer readable medium may include all computer storage and communication media.
  • the computer storage medium includes all volatile/non-volatile and removable/non-removable media embodied by a certain method or technology for storing information such as computer readable instruction code, a data structure, a program module or other data.
  • Communication media may typically include computer readable instructions, data structures, or other data in a modulated data signal, such as program modules.
  • computer-readable storage media may be provided in the form of non-transitory storage media.
  • the 'non-transitory storage medium' is a tangible device and only means that it does not contain a signal (e.g., electromagnetic waves). This term does not distinguish a case in which data is stored semi-permanently in a storage medium from a case in which data is temporarily stored.
  • the non-transitory recording medium may include a buffer in which data is temporarily stored.
  • a method may be provided by being included in a computer program product.
  • the computer program product, which is a commodity, may be traded between sellers and buyers.
  • Computer program products are distributed in the form of device-readable storage media (e.g., compact disc read only memory (CD-ROM)), or may be distributed (e.g., downloaded or uploaded) through an application store or between two user devices (e.g., smartphones) directly and online.
  • at least a portion of the computer program product (e.g., a downloadable app) may be at least temporarily stored in a device-readable storage medium, such as a memory of a manufacturer's server, a server of an application store, or a relay server, or may be temporarily generated.


Abstract

A system and method for managing quality of data for providing AI based services to one or more user devices. The method comprises defining a first set of data quality rules based on offline data stored in a database, and generating an ML model based on the first set of data quality rules, wherein the ML model is deployed to provide the AI based services in association with the corresponding applications. The method comprises monitoring over a period of time, using the first set of data quality rules, incoming real-time, online data received by the deployed ML model across the one or more user devices. The method further comprises continually determining, based on the monitoring of the online data, one or more characteristics of the online data that are divergent from the offline data, and dynamically generating a second set of data quality rules.

Description

METHOD AND APPARATUS FOR MANAGING QUALITY OF DATA
The present disclosure generally relates to machine learning (ML) models, and more particularly relates to a system and a method for managing quality of data for providing artificial intelligence (AI) based services to one or more user devices.
As is widely known, it is a tricky and time-consuming job to maintain good data quality for downstream ML tasks in fast-iterating service domains. Service changes, changes in content, and changes in features associated with the ML task can often cause unexpected data quality degradation, resulting in poor ML task output and misleading data analysis. It is therefore critical to build a system to monitor data/model drift in both offline and online ways for the sake of sustainable ML development, deployment, and innovative experimentation on services.
However, present ML model performance monitoring, and the lack of data quality observability in online deployments, impact user experience and business decisions. Application services dependent on data analytics for identification of potential churners to manage customer relations are sensitive to data quality. When ML models built on poor quality data are compared against ML models built on good quality data, even a minor performance gain impacts the business in a big way. For example, out of 1 bn active users, a 0.02% accuracy gain may detect approximately 400K users as high confidence churners. This is a valuable resource for Customer Relationship Management (CRM) teams, generating significant engagement and revenue gains.
Inferences that are made based on machine learning models built with bad quality data could cost millions of dollars, cause trust issues, and reduce user engagement in various applications deploying ML models.
Accordingly, there is a need to address the aforementioned challenges and specifically, there is a need for methods and systems to improve real-time data readability and observability for ML models. Further, there is a need for automatic data quality management, observing quality issues in online data, detecting the data divergences, and identifying issues impacting the model performance proactively.
This technical solution is provided to introduce a selection of concepts, in a simplified format, that are further described in the detailed description of the disclosure. This technical solution is neither intended to identify key or essential inventive concepts of the invention, nor is it intended for determining the scope of the disclosure.
In an embodiment, the present disclosure refers to a method for managing quality of data for providing artificial intelligence (AI) based services to one or more user devices. The method comprises defining a first set of data quality rules based on offline data stored in a database, the offline data being associated with corresponding applications associated with the one or more user devices. The method further comprises generating a machine learning (ML) model based on the first set of data quality rules, wherein the ML model is deployed to provide the AI based services in association with the corresponding applications. The method also comprises monitoring over a period of time, using the first set of data quality rules, incoming real-time, online data received by the deployed ML model across the one or more user devices. The method further comprises continually determining, based on the monitoring of the online data, one or more characteristics of the online data that are divergent from the offline data. The method also comprises dynamically generating a second set of data quality rules, by updating the first set of data quality rules based on the determined one or more characteristics, to facilitate providing the AI based services in association with the corresponding applications.
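The method steps above can be condensed into a short sketch. This is an illustrative simplification only, not the claimed implementation: the rule representation (per-field minimum/maximum ranges learned from offline data) and the update policy (widening ranges for divergent fields) are assumptions chosen to make the rule-generation and rule-update steps concrete.

```python
def define_rules(offline_data):
    """First set of data quality rules: observed (min, max) per field."""
    rules = {}
    for record in offline_data:
        for field, value in record.items():
            lo, hi = rules.get(field, (value, value))
            rules[field] = (min(lo, value), max(hi, value))
    return rules

def divergent_characteristics(online_data, rules):
    """Fields whose online values fall outside the offline ranges."""
    divergent = set()
    for record in online_data:
        for field, value in record.items():
            lo, hi = rules[field]
            if not (lo <= value <= hi):
                divergent.add(field)
    return divergent

def update_rules(rules, online_data, divergent):
    """Second set of rules: widen the ranges of the divergent fields."""
    updated = dict(rules)
    for record in online_data:
        for field in divergent:
            lo, hi = updated[field]
            v = record[field]
            updated[field] = (min(lo, v), max(hi, v))
    return updated

# Illustrative offline and online data (single numeric field).
offline = [{"income": 40000}, {"income": 53500}]
online = [{"income": 140000}]
rules = define_rules(offline)
second = update_rules(rules, online, divergent_characteristics(online, rules))
```

The sketch shows the claimed loop end to end: rules learned offline, online data monitored against them, divergent characteristics detected, and a second set of rules generated by updating the first.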
In an embodiment, the present disclosure refers to a system for managing quality of data for providing artificial intelligence (AI) based services to one or more user devices. The system comprises a memory and at least one processor communicatively coupled to the memory. The at least one processor is configured to define a first set of data quality rules based on offline data stored in a database, the offline data being associated with corresponding applications associated with the one or more user devices. The at least one processor is further configured to generate a machine learning (ML) model based on the first set of data quality rules, wherein the ML model is deployed to provide the AI based services in association with the corresponding applications. The at least one processor is further configured to monitor over a period of time, using the first set of data quality rules, incoming real-time, online data received by the deployed ML model across the one or more user devices. The at least one processor is also configured to continually determine, based on the monitoring of the online data, one or more characteristics of the online data that are divergent from the offline data, and dynamically generate a second set of data quality rules, by updating the first set of data quality rules based on the determined one or more characteristics, to facilitate providing the AI based services in association with the corresponding applications.
To further clarify the advantages and features of the present disclosure, a more particular description of the disclosure will be rendered by reference to specific embodiments thereof, which is illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the disclosure and are therefore not to be considered limiting of its scope. The present disclosure will be described and explained with additional specificity and detail with the accompanying drawings.
These and other features, aspects, and advantages of the present invention will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
Figure 1 illustrates image classification using ML models, in accordance with existing art;
Figure 2 illustrates payment fraud detection using ML models, in accordance with existing art;
Figure 3 illustrates a flow chart depicting a method for managing quality of data for providing artificial intelligence (AI) based services to one or more user devices, according to an embodiment of the present disclosure;
Figure 4 illustrates a block diagram of a system for managing quality of data for providing artificial intelligence (AI) based services to one or more user devices, according to an embodiment of the present disclosure;
Figure 5 illustrates a block diagram of an intelligent rule recommender engine and an associated workflow, according to an embodiment of the present disclosure;
Figure 6 illustrates a data observability engine, according to an embodiment of the present disclosure;
Figure 7 illustrates a divergence detection engine and an associated workflow, according to an embodiment of the present disclosure;
Figures 8(a)-8(b) illustrate a comparison between existing techniques and the proposed method for data observability in real-time log streaming, according to various embodiments of the present disclosure;
Figure 9 illustrates a use case associated with accurate image classification using enhanced data observability, according to an embodiment of the present disclosure;
Figure 10 illustrates a use case associated with real time fraud detection in payment applications using enhanced data observability, according to an embodiment of the present disclosure;
Figure 11 illustrates a use case associated with recommending right content for interest deviation using enhanced data observability, according to an embodiment of the present disclosure; and
Figure 12 illustrates a use case associated with application recommendation based on user app usage pattern, according to an embodiment of the present disclosure.
Further, skilled artisans will appreciate that elements in the drawings are illustrated for simplicity and may not have necessarily been drawn to scale. For example, the flow charts illustrate the method in terms of the most prominent steps involved to help to improve understanding of aspects of the present disclosure. Furthermore, in terms of the construction of the user device, one or more components of the user device may have been represented in the drawings by conventional symbols, and the drawings may show only those specific details that are pertinent to understanding the embodiments of the present disclosure so as not to obscure the drawings with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
For the purpose of promoting an understanding of the principles of the present disclosure, reference will now be made to the various embodiments and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the disclosure is thereby intended, such alterations and further modifications in the illustrated system, and such further applications of the principles of the disclosure as illustrated therein being contemplated as would normally occur to one skilled in the art to which the disclosure relates.
It will be understood by those skilled in the art that the foregoing general description and the following detailed description are explanatory of the disclosure and are not intended to be restrictive thereof.
Reference throughout this specification to "an aspect", "another aspect" or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of the phrase "in an embodiment", "in another embodiment" and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
The terms "comprises", "comprising", or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such process or method. Similarly, one or more user devices or sub-systems or elements or structures or components proceeded by "comprises... a" does not, without more constraints, preclude the existence of other user devices or other sub-systems or other elements or other structures or other components or additional user devices or additional sub-systems or additional elements or additional structures or additional components.
In accordance with various embodiments of the present disclosure, a smart dynamic data quality management platform is provided that integrates rule extraction and validation on data streams of big-data applications, and monitors dynamic drift in data that would impact the machine learning system's performance and, in turn, dependent services.
Embodiments of the present invention will be described below in detail with reference to the accompanying drawings.
Figure 1 illustrates image classification using ML models, in accordance with existing art. As depicted, one or more rules may be manually generated for data cleaning of offline data provided for building of an ML model. The data may be cleaned and an ML model may be built using the cleaned data and the rules. Further, the built ML models may be used on the online data to classify the images.
As depicted, a cat may be predicted as tiger using the ML model, as the data quality management implemented in offline is not used to monitor the online data and the ML model may have failed. By the time the model monitoring systems detect performance issues, the damage to business may already have been done due to incorrect inferences.
Figure 2 illustrates payment fraud detection using ML models, in accordance with existing art. An ML model may be built using manual rules generated for offline data. However, the user data from financial payment applications may change over time and diverge from the offline data used for ML model training, leading to predicting fraud transactions as non-fraud, or vice versa, thereby leading to incorrect predictions. For example, a fraud detection ML model is built using manual data cleaning. As per the offline data, a Scientist's age should be more than 36, income should be in the range of 40,000-53,500, and credit score can be in the range of 354-695 for a transaction to be detected as legitimate. However, in the online data, we get a Scientist whose age is 19, income is 140,000, and credit score is 354, which violates the deployed model's learnings. So the ML model will detect him as fraud, which is not accurate. What actually happened is divergence in the data: salaries of scientists have increased recently, transactions per month vary around 60 because the person's demographics changed due to marriage, and the credit score decreased due to multiple score checks. Such data divergence should be actively monitored and ML services should be updated.
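The Scientist example can be worked through directly against the thresholds learned offline (age > 36, income 40,000-53,500, credit score 354-695). The sketch below is illustrative only; the field names are assumptions, and it simply enumerates which offline rules the divergent record fails.

```python
def violates_offline_rules(record):
    """Return the names of the offline rule checks (from the example) that
    the record fails: age > 36, income in [40000, 53500], credit score in
    [354, 695]."""
    failures = []
    if not record["age"] > 36:
        failures.append("age")
    if not 40000 <= record["income"] <= 53500:
        failures.append("income")
    if not 354 <= record["credit_score"] <= 695:
        failures.append("credit_score")
    return failures

# The divergent Scientist record from the example above.
scientist = {"age": 19, "income": 140000, "credit_score": 354}
failures = violates_offline_rules(scientist)
```

The record fails the age and income checks (the credit score of 354 sits exactly at the learned lower bound), so the deployed model flags it as fraud even though the true cause is data divergence rather than fraudulent behavior.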
Accordingly, there is a need to address the aforementioned challenges and specifically, there is a need for methods and systems to improve real-time data readability and observability for ML models. Further, there is a need for automatic data quality management, observing quality issues in online data, detecting the data divergences, and identifying issues impacting the model performance proactively.
Figure 3 illustrates a flow diagram depicting a method for managing quality of data for providing artificial intelligence (AI) based services to one or more user devices, in accordance with an embodiment of the present disclosure. Figure 4 illustrates a block diagram of a system for managing quality of data for providing artificial intelligence (AI) based services to one or more user devices. For the sake of brevity, the descriptions of Figures 3 and 4 are explained in conjunction with each other.
The system 400 may include, but is not limited to, at least one processor 402, memory 404, modules 406, and data unit 408. The modules 406 and the memory 404 may be coupled to the at least one processor 402.
The at least one processor 402 can be a single processing unit or several modules, all of which could include multiple computing modules. The at least one processor 402 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the at least one processor 402 is configured to fetch and execute computer-readable instructions and data stored in the memory 404.
Functions related to artificial intelligence according to the present disclosure are operated through at least one processor 402 and a memory 404. The processor 402 may include one or a plurality of processors. In this case, the one or more processors may be a general-purpose processor such as a CPU, an AP, or a digital signal processor (DSP), a graphics-only processor such as a GPU or a vision processing unit (VPU), or an artificial intelligence-only processor such as an NPU. For example, when one or more processors are processors dedicated to artificial intelligence, the processors dedicated to artificial intelligence may be designed as a hardware structure specialized for processing a specific artificial intelligence model.
One or more processors control input data to be processed according to predefined operating rules or artificial intelligence models stored in a memory.
A predefined action rule or an artificial intelligence model is characterized in that it is created through learning. Here, being made through learning means that a basic artificial intelligence model is learned using a plurality of learning data by a learning algorithm, so that a predefined action rule or artificial intelligence model set to perform a desired characteristic (or purpose) is created. Such learning may be performed in the device itself in which artificial intelligence according to the present disclosure is performed, or through a separate server and/or system. Examples of learning algorithms include supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning, but are not limited to the above examples.
An artificial intelligence model may be composed of a plurality of neural network layers. Each of the plurality of neural network layers has a plurality of weight values, and a neural network operation is performed through an operation between an operation result of a previous layer and a plurality of weight values. The plurality of weights possessed by the plurality of neural network layers may be optimized by a learning result of the artificial intelligence model. For example, the plurality of weights may be updated so that a loss value or a cost value obtained from the artificial intelligence model is reduced or minimized during a learning process. The artificial neural network may include a deep neural network (DNN), for example, a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), or deep Q-networks, but is not limited to the above examples.
The memory 404 may include any non-transitory computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read-only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
The modules 406, amongst other things, include routines, programs, objects, components, data structures, etc., which perform particular tasks or implement data types. The modules 406 may also be implemented as signal processor(s), state machine(s), logic circuitries, and/or any other device or component that manipulates signals based on operational instructions.
Further, the modules 406 can be implemented in hardware, instructions executed by a processing unit, or by a combination thereof. The processing unit can comprise a computer, a processor, such as the at least one processor 402, a state machine, a logic array, or any other suitable devices capable of processing instructions. The processing unit can be a general-purpose processor which executes instructions to cause the general-purpose processor to perform the required tasks, or the processing unit can be dedicated to performing the required functions. In an embodiment of the present disclosure, the modules 406 may be machine-readable instructions (software) which, when executed by at least one processor/processing unit, perform any of the described functionalities.
In an embodiment, the modules 406 may include an intelligent rule recommender engine 410, a data observability engine 412, and a divergence detection engine 414.
The various modules 410-414 may be in communication with each other. In an embodiment, the various modules 410-414 may be a part of the at least one processor 402. In an embodiment, the at least one processor 402 may be configured to perform the functions of modules 410-414. The data unit 408 serves, amongst other things, as a repository for storing data processed, received, and generated by one or more of the modules 406.
It should be noted that the system 400 may be a part of a user device. In an embodiment, the system 400 may be connected to the user device. It should be noted that the term "user device" refers to any electronic device used by a user, such as a mobile device, a desktop, a laptop, a personal digital assistant (PDA), or similar devices.
Referring to Figure 3, at step 301, the method comprises defining a first set of data quality rules based on offline data stored in a database. In an embodiment, the offline data is associated with corresponding applications associated with the one or more user devices. In particular, it is assumed that one or more applications are associated with the one or more user devices. The one or more applications may constantly send data and store data in the database warehouse for processing and service development. This data is considered as offline data.
In an embodiment, the intelligent rule recommender engine 410 may define the first set of data quality rules based on the offline data stored in the database. In an alternate embodiment, the intelligent rule recommender engine 410 may directly receive the offline data from the one or more applications.
In other words, the offline data may refer to data received from applications associated with the one or more user devices, the data being related to one or more interactions of respective users of the one or more user devices with the corresponding application. In an embodiment, the offline data may further relate to performance of the AI based services associated with the corresponding application. The offline data may thus be considered as non-real-time data that is received from the application associated with the one or more user devices and stored for further processing.
Figure 5 illustrates a block diagram of an intelligent rule recommender engine and an associated workflow, according to an embodiment of the present disclosure. As shown in Figure 5, the intelligent rule recommender engine 410 is configured to generate the first set of data quality rules for real-time ML model deployment by analyzing and classifying the retrieved offline data.
As shown in FIG. 5, the intelligent rule recommender engine 410 may include a quality analyzer module 505 and a rule configurator module 507. As can be seen from Figure 5, the intelligent rule recommender engine 410 is connected to the database or data warehouse 501 which stores, over a period of time, the offline data, where the offline data is received from the corresponding one or more applications associated with the one or more user devices. Hence, the database 501 is an offline data store capable of receiving data and serving data.
The offline data is retrieved by a profiler 503 which is connected to the intelligent rule recommender engine 410. The profiler 503 periodically profiles the offline data and generates profile information associated with the stored offline data. The generated profile information is shared with the quality analyzer module 505.
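As a rough illustration of this profiling step, the sketch below derives per-field profile information from offline records. The record-as-dictionary layout, the field names, and the exact statistics kept are assumptions made for illustration, not the profiler's actual implementation:

```python
from statistics import mean, median

def profile_offline_data(records):
    """Summarize each field of the offline records into profile information:
    an inferred data type, a fill rate, and numeric thresholds where they apply.

    Illustrative sketch only; the real profiler's data model is not specified here.
    """
    profile = {}
    fields = {key for record in records for key in record}
    for field in fields:
        values = [record.get(field) for record in records]
        present = [v for v in values if v is not None]
        entry = {
            "dtype": type(present[0]).__name__ if present else "unknown",
            "fill_rate": len(present) / len(values) if values else 0.0,
        }
        if present and all(isinstance(v, (int, float)) for v in present):
            # Type-specific characteristics a threshold miner could start from.
            entry.update(min=min(present), max=max(present),
                         mean=mean(present), median=median(present))
        profile[field] = entry
    return profile
```

A profiler run periodically over the offline store would hand such a summary to the quality analyzer module.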
The quality analyzer module 505 then determines offline data parameters based on the profile information, wherein the offline data parameters comprise data types within the offline data and data characteristics associated with each of the data types. In particular, a type identifier 505a scans the profile information and determines the data type associated with the offline data. The type identifier 505a then shares the data type with a rule identifier 507a of the rule configurator 507 and with a threshold miner 505b, as depicted in Figure 5.
In a non-limiting example, the data may be an image and the offline data parameters may comprise object length, height, colour, pattern, and the like. In another non-limiting example, the data may be employee data and the offline data parameters may comprise user personal information, user preferences, user position, user income, and the like. As described above, each offline data parameter may comprise data types and data characteristics. For instance, user personal information may be a parameter comprising user age as a data type and specific values, such as 20 years, as the data characteristic. In another example, an image object may be a parameter comprising stripes as a data type and brown as a data characteristic.
Further, the threshold miner 505b dynamically identifies data type specific thresholds and validates them, i.e., the threshold miner 505b determines the data characteristics associated with the offline data. In an embodiment, a report generator 505c may use the type-specific thresholds to generate a quality report with a quality score for the offline data. In an embodiment, the quality report may comprise information related to the data types and data characteristics of the various offline data parameters.
For instance, in case of offline data comprising image objects, the quality report may include information related to the data types and associated characteristics, such as white stripes, brown stripes, etc. Similarly, in case of offline data comprising user details, the quality report may include information related to the data types and associated characteristics, such as maximum age, median age, fill rate, and the like. In an embodiment, the generated data quality reports may be sent to users for review as alerts.
The rule configurator 507 may define the first set of data quality rules based on the offline data parameters. In an embodiment, the rule configurator 507 may automatically define the first set of data quality rules based on the offline data parameters.
In particular, the rule identifier 507a, using inputs received from the type identifier 505a, automatically identifies appropriate rules and forwards them to a parameter estimator 507b. The parameter estimator 507b, using inputs from the rule identifier 507a and the threshold miner 505b, estimates accurate parameters for each rule. Then, a rule definer 507c receives the rules with the relevant parameters from the parameter estimator 507b and defines the first set of data quality rules. In an embodiment, the defined first set of data quality rules, and optional inputs from users, may be stored in a database such as a DQ Rules Repo 509. In an embodiment, the DQ Rules Repo 509 may optionally accept custom rules from users for integration with the data observability engine 412.
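One way to picture the rule configurator's output is sketched below, where profile-derived thresholds are turned into executable rule records. The dictionary-based rule encoding with 'range' and 'required' kinds is an assumed illustration, not the claimed rule format:

```python
def define_quality_rules(profile):
    """Derive a first set of data quality rules from profile information.

    'profile' maps field name -> stats (optional numeric min/max, fill_rate).
    The rule encoding here is a hypothetical illustration.
    """
    rules = []
    for field, stats in profile.items():
        if "min" in stats and "max" in stats:
            # Numeric fields get a range rule from the mined thresholds.
            rules.append({"field": field, "kind": "range",
                          "min": stats["min"], "max": stats["max"]})
        if stats.get("fill_rate", 0.0) >= 0.99:
            # Fields that were (almost) always present become required.
            rules.append({"field": field, "kind": "required"})
    return rules
```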
Referring back to Figure 3, at step 303, the method 300 comprises generating/obtaining a ML model based on the first set of data quality rules. The ML model is deployed to provide the AI based services in association with the corresponding applications. In an embodiment, the ML model may be generated/obtained by the processor 402 based on the first set of data quality rules determined by the intelligent rule recommender engine 410. In an embodiment, the offline data may be filtered based on the first set of data quality rules determined by the intelligent rule recommender engine 410, and further, the ML model may be generated/obtained based on the filtered offline data.
Then, at step 305, the method 300 comprises monitoring over a period of time, using the first set of data quality rules, incoming real-time, online data received by the deployed ML model across the one or more user devices.
In an embodiment, the online data is related to one or more of interactions of respective users of the one or more user devices with the corresponding application and performance of the AI based services associated with the corresponding applications. In an embodiment, the one or more interactions of the users may comprise interactions done by the users via an application provided on the associated user device. For instance, an image classification application may be provided on the user device and the user may interact with the application to purchase an item and/or download other applications on the user device.
In an embodiment, the online data refers to real-time data being received from the applications associated with the one or more user devices. In an embodiment, the online data may differ from the offline data in that the online data may contain some additional data, or lack some data, as compared to the offline data.
In an embodiment, the data observability engine 412 may monitor incoming real-time, online data received by the deployed ML model across the one or more user devices. Figure 6 illustrates a data observability engine 412, according to an embodiment of the present disclosure.
As shown in Figure 6, a stream processor associated with the applications may feed the real-time online data, such as in the form of a data stream 601 that serves the deployed ML model, to the data observability engine 412. The data observability engine 412 may fetch the first set of data quality rules from the intelligent rule recommender engine 410, and may filter the online data to identify clean and unclean data based on the first set of data quality rules. In an embodiment, the filtered data may be stored in a database or segregated data store 603 as logs for further processing. In an embodiment, the period of time may be configurable and may be configured by the data observability engine 412. The data observability engine 412 may feed the deployed ML model with the clean data and, accordingly, the ML model may provide the AI based services based on the clean data.
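The clean/unclean split performed by the data observability engine can be sketched as follows, assuming rules encoded as dictionaries with a 'range' or 'required' kind (an illustrative encoding, not the engine's actual implementation):

```python
def split_clean_unclean(stream, rules):
    """Partition incoming online records into clean and unclean sets
    according to a set of data quality rules.

    Sketch under an assumed rule encoding; names are hypothetical.
    """
    def passes(record, rule):
        value = record.get(rule["field"])
        if rule["kind"] == "required":
            return value is not None
        if rule["kind"] == "range":
            # A missing value is handled by 'required' rules, not 'range'.
            return value is None or rule["min"] <= value <= rule["max"]
        return True  # Unknown rule kinds are ignored.

    clean, unclean = [], []
    for record in stream:
        target = clean if all(passes(record, r) for r in rules) else unclean
        target.append(record)
    return clean, unclean
```

Clean records would be forwarded to the deployed model, while unclean records would be logged to the segregated data store 603 for the divergence detection engine.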
Then, at step 307, the method 300 comprises determining, based on the monitoring of the online data, one or more characteristics of the online data that are divergent from the offline data. In an embodiment, the divergence detection engine 414 may continually determine, based on monitoring of the online data, the one or more characteristics of the online data that are divergent from the offline data.
In an embodiment, the divergence detection engine 414 may determine the one or more characteristics of the online data that are divergent from the offline data, which is explained in reference to Figure 7.
Figure 7 illustrates a divergence detection engine 414 and an associated workflow, according to an embodiment of the present disclosure. The divergence detection engine 414 may be configured to determine data quality divergence in the incoming real-time data based on the monitoring performed by the data observability engine 412. In particular, the divergence detection engine 414 may access the clean data and the unclean data filtered from the online data. In an embodiment, the divergence detection engine 414 may receive the clean and the unclean data from the data observability engine 412. Then, the divergence detection engine 414 may process the unclean data and the clean data to detect a type of data divergence in the online data.
In order to process the clean and the unclean data, in an embodiment, the divergence detection engine 414 may include a label generator 701 to label the clean data and the unclean data to generate labelled data. In an embodiment, before labelling the data, the clean data is partitioned into reference and real-time data. The label generator 701 may label the clean data as target and reference.
Further, the divergence detection engine 414 may include a non-linear transformer 703 to perform non-linear transformations on the labelled data to generate transformed data. Further, the divergence detection engine 414 may also include a linear divergence detector 705 to detect the presence of data divergence based on the transformed data. Furthermore, the divergence detection engine 414 may include a linear divergence classifier 707 to determine the type of data divergence based on the detected presence of data divergence. In an embodiment, the linear divergence classifier 707 may recommend the ML model management strategy based on the type of data divergence.
Then, the divergence detection engine 414 may determine the one or more characteristics of the online data based on the type of data divergence and the first set of data quality rules. In an embodiment, the divergence detection engine 414 may also include a non-linear divergence quantifier 709 that may quantify the divergence magnitude and estimate the significance of the divergence, in order to identify and recommend generation of a second set of data quality rules to counter the data divergence.
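The detector and quantifier stages can be caricatured for a single numeric feature as follows. Here a squaring function stands in for the non-linear transformer 703 and a standardized mean-difference score for the linear divergence detector 705; the actual transforms, statistics, and threshold are not specified by the disclosure, so all of these are assumptions:

```python
from statistics import mean, stdev

def detect_divergence(reference, target, transform=lambda x: x * x,
                      threshold=1.0):
    """Score divergence between reference (offline-like) and target (recent
    online) samples of one numeric feature, after a non-linear transform.

    Simplified sketch; the threshold value is an arbitrary assumption.
    """
    ref_t = [transform(x) for x in reference]
    tgt_t = [transform(x) for x in target]
    # A simple linear statistic on the transformed data: the absolute
    # mean difference, standardized by the pooled spread.
    pooled = stdev(ref_t + tgt_t) or 1.0
    score = abs(mean(tgt_t) - mean(ref_t)) / pooled
    return {"divergent": score > threshold, "score": score}
```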
Referring back to Figure 3, at step 309, the method 300 comprises generating a second set of data quality rules, based on the determined one or more characteristics, to facilitate providing the AI based services in association with the corresponding applications. In an embodiment, the method 300 comprises dynamically generating a second set of data quality rules, by updating the first set of data quality rules based on the determined one or more characteristics to facilitate providing the AI based services in association with the corresponding applications.
In an embodiment, the method 300 comprises comparing the one or more characteristics of the online data with corresponding threshold ranges.
If it is determined that the one or more characteristics of the online data are within corresponding threshold ranges, the first set of data quality rules are updated to dynamically generate the second set of data quality rules based on the one or more characteristics of the online data. The second set of data quality rules may then be used by the data observability engine 412 to monitor the online data being provided by the applications. Then, the method 300 comprises generating an updated ML model, using the second set of data quality rules to provide the AI based services in association with the corresponding applications.
In an embodiment, if it is determined that the one or more characteristics of the online data are not within the corresponding threshold ranges, the ML model is updated to generate the updated ML model based on the one or more characteristics of the online data. In an embodiment, the updated ML model may be generated by updating the previously deployed ML model.
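The branch described in the preceding paragraphs, regenerating the rules when the divergent characteristics stay inside their threshold ranges and otherwise updating the model, could be coded along these lines (the function name and the tuple-based range encoding are assumptions):

```python
def plan_adaptation(characteristics, threshold_ranges):
    """Return 'update_rules' if every divergent characteristic lies within
    its configured (low, high) threshold range, else 'update_model'.

    Hypothetical sketch of the decision described in the text.
    """
    within = all(
        threshold_ranges[name][0] <= value <= threshold_ranges[name][1]
        for name, value in characteristics.items()
        if name in threshold_ranges
    )
    return "update_rules" if within else "update_model"
```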
In an embodiment, the updated ML model may be generated by selecting an alternate model from a plurality of pre-developed models, the alternate model being developed so as to process online data comprising characteristics divergent from the offline data. In an embodiment, the plurality of pre-developed models may be stored in an ML model repository.
In an embodiment, the updated ML model may be generated by developing a new model based on the second set of data quality rules. In an embodiment, the online data may comprise characteristics divergent from the offline data such that a pre-developed model may not be available to process the online data. For instance, the online data may comprise characteristics that were not recorded previously. In such cases, updating the ML model may comprise developing a new ML model based on the second set of data quality rules.
In an embodiment, the updated ML model is deployed to provide the AI based services in association with the corresponding applications. Then, the online data being received by the updated ML model continues to be monitored across the one or more user devices. The online data is filtered based on the second set of data quality rules, and the updated ML model provides the AI based services based on the filtered online data. Alternatively, the online data may be filtered based on the first set of data quality rules, with the updated ML model providing the AI based services based on the filtered online data.
In an embodiment, in response to the first set of data quality rules being updated and/or in response to the ML model being updated, an alert may be provided to a user associated with the one or more user devices and/or to a remote admin user associated with an admin device.
Figures 8(a)-8(b) illustrate a comparison between existing techniques and the proposed method for data observability in real-time log streaming, according to various embodiments of the present disclosure.
As shown in Figure 8(a), in current techniques, the offline data is manually cleaned, and an ML model is built and deployed on a user device or server. The deployed model serves online inferences, the model performance is monitored periodically, and any degradation is rectified only after some delay.
However, as shown in Figure 8(b), in accordance with the various embodiments of the present disclosure, the offline data is used to learn quality issues and the rules to fix them. An ML model is built and deployed on a user device or a server. Further, during online inference, incoming data is scanned by a data monitoring module using the learned rules, and only quality data reaches the deployed ML model for inference. A deviation detector module linked to the data monitoring module periodically monitors deviations in the incoming data and raises alerts when new rule generation is required.
Thus, in the disclosed techniques, as opposed to the traditional system, the data is automatically monitored by an intelligent rule recommender engine comprising: a profiler to automatically profile the data at periodic intervals to find the data characteristics; a quality analyzer to find the quality parameters of the data that might impact the downstream analysis; a rule configurator to formulate the rules that fix the quality issues detected therein; and a repository to maintain and serve the rules to the components that require them. These quality issues and rules may be sent to a data scientist via an alerts module for optional intervention. The observability engine may access the data quality management rules from the rule recommender to clean the incoming stream data from the application event logger. From the cleaned data, the features necessary for the ML model are extracted and served via a feature generator. In addition to serving the model, the stream processor also serves a classifier that classifies clean/unclean data and sends the clean data to the database. Such dynamically updated clean data can be used for downstream analysis by the ML model.
Further, the clean data persisted in the database is fed to the data divergence detection engine to periodically monitor data divergence and model anomalies in the application data. Further, the data divergence detection engine alerts downstream analyst(s) for corrective action or automatically generates a second set of rules. Also, the data divergence detection engine may automatically trigger updating of the ML model or deployment of an alternate model.
Figure 9 illustrates a use case associated with accurate image classification using enhanced data observability, according to an embodiment of the present disclosure.
As shown in Figure 9, a best performing image identification ML model is built 'offline' based on the automated rule recommender system (intelligent rule recommender engine) (5 rules along with parameters) and deployed in an online service. The data observability engine may use the automatically generated data quality rules to monitor the incoming online data before sending it to the deployed ML model. So, a cat image is dropped because it does not pass the data quality rules. The data divergence detection engine may analyze the dropped/poor quality data (the cat image) and assess whether any new rules have to be made, or whether rule parameters (in this example, the values for brown stripes and the minimum and maximum values of object length) have to be updated, to make the second set of rules.
Figure 10 illustrates a use case associated with real time fraud detection in payment applications using enhanced data observability, according to an embodiment of the present disclosure.
As shown in Figure 10, the intelligent rule recommender engine is configured on the application. Data quality rules (employment type - Engineer, Scientist, IT; and age data cleaning by filling zero values with the average by marital status), along with parameters, are learned. A best performing ML model is built and deployed in production. The data observability engine uses the data quality rules to monitor the incoming online data before sending it to the deployed ML model. Good quality data is sent to prediction. The data observability engine finds that data with missing values (age) can be filled using the rules learned from the data and sent to the prediction service; in the absence of the proposed system, this record would have been classified as fraud. Bad quality data is sent to the data divergence detection engine. Divergent data (an income of 245000, a new employment type (Freelancer), and minimum (178) and maximum (795) values of credit score) are detected and sent to the intelligent rule recommender engine to generate the second set of rules.
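The 'fill zero age values with the average by marital status' rule from this use case might look as follows in a minimal sketch (the record layout and field names are assumptions drawn from the example):

```python
def fill_missing_age(records):
    """Replace zero/missing ages with the average age of records sharing
    the same marital status, leaving other fields untouched.

    Illustrative sketch of one learned cleaning rule, not the actual rule engine.
    """
    groups = {}
    for record in records:
        if record["age"]:
            groups.setdefault(record["married"], []).append(record["age"])
    averages = {status: sum(ages) / len(ages) for status, ages in groups.items()}
    return [dict(record, age=record["age"] or averages.get(record["married"], 0))
            for record in records]
```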
Figure 11 illustrates a use case associated with recommending right content for interest deviation using enhanced data observability, according to an embodiment of the present disclosure.
As shown in Figure 11, the application installed on the user device has multiple tabs, namely Watch, Read, Play and Listen. Each tab has multiple genres, and each genre has multiple channels. Users frequently switch between them. Multiple ML models are built using clean data as per the data quality rules, stored in a repository, and are ready to be deployed dynamically. The data observability engine and the data divergence detection engine work hand in hand in real time to observe the preferred channels and the divergence in preferences, and to fetch the relevant model to drive user engagement.
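Dynamic model fetching of this kind could be sketched as picking, from a repository of pre-built models, the one matching the user's currently dominant tab; the repository keying by tab name and the 'general' fallback model are illustrative assumptions:

```python
from collections import Counter

def pick_model(model_repo, recent_tab_events, default="general"):
    """Return the pre-built model for the tab the user currently favours,
    falling back to a default model when no signal is available.

    Hypothetical sketch; names and repository layout are assumptions.
    """
    if not recent_tab_events:
        return model_repo.get(default)
    # Treat the most frequent recent tab as the user's current preference.
    dominant, _ = Counter(recent_tab_events).most_common(1)[0]
    return model_repo.get(dominant, model_repo.get(default))
```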
Figure 12 illustrates a use case associated with application recommendation based on user app usage pattern, according to an embodiment of the present disclosure.
As shown in Figure 12, an application store recommends apps based on users' past installs and usage, captured by the service with user consent. Based on a time slice, there are 4 different types of users with different preferences. An ML model is built using this data and deployed in production. At some point, GDPR (General Data Protection Regulation) rules are implemented and most of the customers revoke their consent for tracking app usage, leading to skewed or biased data for recommending apps. In this situation, the recommender will not be able to recommend relevant apps to all users. The proposed data observability engine and divergence detection engine will detect the divergence in the incoming data and can alert and/or deploy an alternate model. At the same time, a new set of rules will be generated and used to monitor the incoming data by the observability engine.
Accordingly, the present disclosure provides the following exemplary advantages:
- Rule-based data quality monitoring in real-time online data systems serving data-dependent applications;
- The system of the present disclosure is independent of meta-data and is capable of learning from actual data automatically, both offline and in real-time; and
- The system of the present disclosure is designed to learn data quality issues from offline data, observe the online data to be served to the deployed ML model, and alert the rule generator to learn new rules and/or perform ML model management (e.g., by deploying a suitable model without retraining the existing ML model).
Embodiments of the disclosure can also be embodied as a storage medium including instructions executable by a computer such as a program module executed by the computer. A computer readable medium can be any available medium which can be accessed by the computer and includes all volatile/non-volatile and removable/non-removable media.
Further, the computer readable medium may include all computer storage and communication media. The computer storage medium includes all volatile/non-volatile and removable/non-removable media embodied by a certain method or technology for storing information such as computer readable instruction code, a data structure, a program module or other data. Communication media may typically include computer readable instructions, data structures, or other data in a modulated data signal, such as program modules. In addition, computer-readable storage media may be provided in the form of non-transitory storage media.
The 'non-transitory storage medium' is a tangible device and only means that it does not contain a signal (e.g., electromagnetic waves). This term does not distinguish a case in which data is stored semi-permanently in a storage medium from a case in which data is temporarily stored. For example, the non-transitory recording medium may include a buffer in which data is temporarily stored.
According to an embodiment of the disclosure, a method according to various disclosed embodiments may be provided by being included in a computer program product. The computer program product, which is a commodity, may be traded between sellers and buyers. Computer program products are distributed in the form of device-readable storage media (e.g., compact disc read only memory (CD-ROM)), or may be distributed (e.g., downloaded or uploaded) through an application store or between two user devices (e.g., smartphones) directly and online. In the case of online distribution, at least a portion of the computer program product (e.g., a downloadable app) may be stored at least temporarily in a device-readable storage medium, such as a memory of a manufacturer's server, a server of an application store, or a relay server, or may be temporarily generated.
While specific language has been used to describe the present subject matter, no limitation arising on account thereof is intended. As would be apparent to a person skilled in the art, various working modifications may be made to the method in order to implement the inventive concept as taught herein. The drawings and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment.

Claims (15)

  1. A method for managing quality of data for providing artificial intelligence (AI) based services to one or more user devices, the method comprising:
    defining a first set of data quality rules based on offline data stored in a database, the offline data is associated with corresponding applications for the one or more user devices;
    obtaining a machine learning (ML) model based on the first set of data quality rules, wherein the ML model provides the AI based services associated with the corresponding applications;
    monitoring over a period of time, in real-time, using the first set of data quality rules, online data received by the ML model across the one or more user devices;
    determining, based on the monitoring of the online data, one or more characteristics of the online data that are divergent from the offline data; and
    generating a second set of data quality rules, based on the determined one or more characteristics, to facilitate providing the AI based services associated with the corresponding applications.
  2. The method of claim 1, further comprising:
    obtaining an updated ML model, using the second set of data quality rules, to provide the AI based services associated with the corresponding applications.
  3. The method of any of claims 1 to 2, wherein the obtaining the ML model comprises:
    filtering the offline data based on the first set of data quality rules; and
    developing the ML model based on the filtered offline data, thereby generating the ML model.
  4. The method of any of claims 1 to 3, wherein the defining the first set of data quality rules comprises:
    storing, over the period of time, the offline data in the database, the offline data is received from the corresponding applications for the one or more user devices;
    retrieving the offline data from the database;
    generating profile information associated with the stored offline data;
    determining offline data parameters based on the profile information, wherein the offline data parameters comprise data types within the offline data and data characteristics associated with each of the data types; and
    defining the first set of data quality rules based on the offline data parameters.
  5. The method of any of claims 1 to 4, further comprising:
    accessing the online data received by the ML model across the one or more user devices;
    filtering, based on the first set of data quality rules, the online data to identify clean data and unclean data; and
    providing, using the ML model, the AI based services based on the clean data.
  6. The method of claim 5, wherein determining the one or more characteristics of the online data comprises:
    accessing the clean data and the unclean data filtered from the online data;
    processing the unclean data and the clean data to detect a type of data divergence in the online data; and
    determining the one or more characteristics of the online data based on the type of data divergence and the first set of data quality rules.
  7. The method of claim 6, wherein processing the clean data and the unclean data to detect the type of data divergence comprises:
    labelling the clean data and the unclean data to generate labelled data;
    subjecting the labelled data to non-linear transformations to generate transformed data;
    detecting presence of data divergence based on the transformed data; and
    determining the type of data divergence based on the detected presence of data divergence.
  8. The method of any of claims 1 to 7, wherein the generating the second set of data quality rules comprises:
    comparing the one or more characteristics of the online data with corresponding threshold ranges;
    upon determining that the one or more characteristics of the online data are within corresponding threshold ranges, updating the first set of data quality rules to generate the second set of data quality rules based on the one or more characteristics of the online data; and
    upon determining that the one or more characteristics of the online data are not within the corresponding threshold ranges, updating the ML model to generate the updated ML model based on the one or more characteristics of the online data, wherein the updated ML model provides the AI based services in association with the corresponding applications.
  9. The method of claim 8, further comprising:
    deploying the updated ML model to provide the AI based services in association with the corresponding applications;
    continuing monitoring the online data received by the updated ML model across the one or more user devices;
    filtering the online data, based on the first set of data quality rules or the second set of data quality rules; and
    providing, using the updated ML model, the AI based services based on the filtered online data.
  10. An apparatus for managing quality of data for providing artificial intelligence (AI) based services to one or more user devices, the apparatus comprising:
    a memory storing instructions; and
    at least one processor configured to execute the instructions to:
    define a first set of data quality rules based on offline data stored in a database, the offline data is associated with corresponding applications for the one or more user devices;
    obtain a machine learning (ML) model based on the first set of data quality rules, wherein the ML model provides the AI based services in association with the corresponding applications;
    monitor over a period of time, in real-time, using the first set of data quality rules, online data received by the deployed ML model across the one or more user devices;
    determine, based on the monitoring of the online data, one or more characteristics of the online data that are divergent from the offline data; and
    generate a second set of data quality rules, based on the determined one or more characteristics, to facilitate providing the AI based services associated with the corresponding applications.
  11. The apparatus of claim 10, wherein the at least one processor is further configured to:
    obtain an updated ML model, using the second set of data quality rules to provide the AI based services associated with the corresponding applications.
  12. The apparatus of any of claims 10 to 11, wherein the at least one processor is configured to:
    access the online data being received by the deployed ML model across the one or more user devices;
    filter, based on the first set of data quality rules, the online data to identify clean data and unclean data; and
    provide, by the deployed ML model, the AI based services based on the clean data.
  13. The apparatus of any of claims 10 to 12, wherein to determine the one or more characteristics of the online data, the at least one processor is configured to:
    access the clean data and the unclean data filtered from the online data;
    process the unclean data and the clean data to detect a type of data divergence in the online data; and
    determine the one or more characteristics of the online data based on the type of data divergence and the first set of data quality rules.
  14. The apparatus of any of claims 11 to 13, wherein the at least one processor is configured to:
    deploy the updated ML model to provide the AI based services in association with the corresponding applications;
    continue monitoring the online data being received by the updated ML model across the one or more user devices;
    filter the online data, based on the first set of data quality rules or the second set of data quality rules; and
    provide, by the updated ML model, the AI based services based on the filtered online data.
  15. A computer readable medium containing instructions that, when executed, cause at least one processor of an electronic device to perform operations corresponding to the method of any one of claims 1-9.
PCT/KR2023/007653 2022-08-16 2023-06-02 Method and apparatus for managing quality of data WO2024039017A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN202241046489 2022-08-16
IN202241046489 2023-04-05

Publications (1)

Publication Number Publication Date
WO2024039017A1 true WO2024039017A1 (en) 2024-02-22

Family

Family ID: 89942423

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2023/007653 WO2024039017A1 (en) 2022-08-16 2023-06-02 Method and apparatus for managing quality of data

Country Status (1)

Country Link
WO (1) WO2024039017A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190155797A1 (en) * 2016-12-19 2019-05-23 Capital One Services, Llc Systems and methods for providing data quality management
US20190370233A1 (en) * 2018-05-29 2019-12-05 Accenture Global Solutions Limited Intelligent data quality
US20200374305A1 (en) * 2019-05-24 2020-11-26 Bank Of America Corporation System and method for machine learning-based real-time electronic data quality checks in online machine learning and ai systems
CN113342791A (en) * 2021-05-31 2021-09-03 中国工商银行股份有限公司 Data quality monitoring method and device
US20220076078A1 (en) * 2020-09-08 2022-03-10 Koninklijke Philips N.V. Machine learning classifier using meta-data


Similar Documents

Publication Publication Date Title
Fan et al. Demystifying big data analytics for business intelligence through the lens of marketing mix
US10438135B2 (en) Performance model adverse impact correction
US20190327259A1 (en) Vulnerability profiling based on time series analysis of data streams
US11017330B2 (en) Method and system for analysing data
Salamai et al. Risk identification-based association rule mining for supply chain big data
US20180101797A1 (en) Systems and methods for improving sales process workflow
CN111459922A (en) User identification method, device, equipment and storage medium
Baier et al. Handling concept drift for predictions in business process mining
US20240086736A1 (en) Fault detection and mitigation for aggregate models using artificial intelligence
JP7470235B2 (en) Vocabulary extraction support system and vocabulary extraction support method
JP7170689B2 (en) Output device, output method and output program
CN111260219A (en) Asset class identification method, device, equipment and computer readable storage medium
CN112085281A (en) Method and device for detecting safety of business prediction model
WO2023168222A1 (en) Systems and methods for predictive analysis of electronic transaction representment data using machine learning
US20230281635A1 (en) Systems and methods for predictive analysis of electronic transaction representment data using machine learning
CN114385121B (en) Software design modeling method and system based on business layering
US11151581B2 (en) System and method for brand protection based on search results
CN111651753A (en) User behavior analysis system and method
Abakouy et al. Machine Learning as an Efficient Tool to Support Marketing Decision-Making
US20240070269A1 (en) Automatic selection of data for target monitoring
US11568084B2 (en) Method and system for sequencing asset segments of privacy policy using optimization techniques
Leal et al. Towards adaptive and transparent tourism recommendations: A survey
KR102619522B1 (en) Method and apparatus for detecting leakage of confidention information based on artificial intelligence
US11481368B2 (en) Automatically rank and route data quality remediation tasks

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23854988

Country of ref document: EP

Kind code of ref document: A1