AU2020102190A4 - AML- Data Cleaning: AUTOMATIC DATA CLEANING USING MACHINE LEARNING PROGRAMMING - Google Patents


Info

Publication number
AU2020102190A4
Authority
AU
Australia
Prior art keywords
data
documents
classifier
document
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2020102190A
Inventor
S. B. Chordiya
Anupam Kumar
Jambi Ratna Raja Kumar
Narendra Mohan Mishra
Dilkeshwar Pandey
Manoj Kumar Pandey
Vrushsen Purushottam Pawar
M. Nagabhushana Rao
Pushpender Sarao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kumar Anupam Dr
Kumar Jambi Ratna Raja Dr
Mishra Narendra Mohan Dr
Pandey Dilkeshwar Dr
Pandey Manoj Kumar Dr
Rao M Nagabhushana Dr
Sarao Pushpender Dr
Original Assignee
Kumar Anupam Dr
Kumar Jambi Ratna Raja Dr
Mishra Narendra Mohan Dr
Pandey Dilkeshwar Dr
Pandey Manoj Kumar Dr
Rao M Nagabhushana Dr
Sarao Pushpender Dr
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kumar Anupam Dr, Kumar Jambi Ratna Raja Dr, Mishra Narendra Mohan Dr, Pandey Dilkeshwar Dr, Pandey Manoj Kumar Dr, Rao M Nagabhushana Dr, Sarao Pushpender Dr filed Critical Kumar Anupam Dr
Priority to AU2020102190A priority Critical patent/AU2020102190A4/en
Application granted granted Critical
Publication of AU2020102190A4 publication Critical patent/AU2020102190A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12Classification; Matching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Abstract

Our invention "AML-Data Cleaning" provides systems and techniques for improving the training and mapping of machine learning classifiers. A classifier is trained using a set of validated documents that are accurately associated with a set of Class-1, Class-2, Class-3 labels. A subset of non-validated documents is also identified and used to further train and improve the accuracy of the classifier. The invented system also provides a system and method enabling automated data cleansing and scrubbing at the attribute level. A consolidated view may be provided, on a single user interface, of the scrubbed data or narratives that get promoted to a final copy together with the data or narratives received from multiple sources. The technology automatically decides to clean the data as required by the storage system rather than waiting for a user interrupt. After cleaning, some unused data can be deleted or shifted without user acknowledgment, and the shifted data is automatically fetched by the private storage server without delay.
FIG. 1: is a schematic of an exemplary computer-based classifier system.

Description

FIG. 1: is a schematic of an exemplary computer-based classifier system.
AML- Data Cleaning: AUTOMATIC DATA CLEANING USING MACHINE LEARNING PROGRAMMING.
FIELD OF THE INVENTION
Our invention "AML-Data Cleaning" relates to automatic data cleaning using machine learning programming and to methods for improving the accuracy of machine learning classifiers. The invention also relates to data processing and management processes and, more particularly, to an autonomic system and method of performing data cleansing and scrubbing at the field level and presenting a consolidated view thereof on a single screen.
BACKGROUND OF THE INVENTION
Today, there is increasing interest in the use of machine learning for analyzing data. Machine learning refers to the design and development of computer algorithms that allow computers to recognize complex patterns and make intelligent decisions based on empirical data.
Typically, a machine learning system that performs text classification on documents includes a classifier. The classifier is provided training data in which each document is already labeled (e.g., identified) with a correct label or class. The labeled document data is used to train a learning algorithm of the classifier which is then used to label/classify similar documents. The accuracy of the classifier is inextricably dependent upon the quality and quantity of correctly labeled documents included in the training data.
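The labeled-training workflow described above can be sketched in a few lines of Python. This is an illustrative stand-in, not the patent's implementation: the tokenizer, the nearest-centroid scoring, and the example labels are all assumptions made for the sketch.

```python
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer (illustrative only)."""
    return text.lower().split()

def train(labeled_docs):
    """Build one bag-of-words centroid per class from labeled documents."""
    centroids = {}
    for text, label in labeled_docs:
        centroids.setdefault(label, Counter()).update(tokenize(text))
    return centroids

def classify(centroids, text):
    """Label a document with the class whose centroid shares the most tokens."""
    tokens = Counter(tokenize(text))
    def overlap(label):
        return sum(min(tokens[w], centroids[label][w]) for w in tokens)
    return max(centroids, key=overlap)

# Hypothetical training data: each document is already labeled with a class.
training_data = [
    ("interest rate payment date", "corporate-action"),
    ("rate of interest record date", "corporate-action"),
    ("server disk memory processor", "hardware"),
]
model = train(training_data)
print(classify(model, "record date and interest rate"))  # corporate-action
```

As the text notes, the accuracy of such a classifier depends directly on the quality and quantity of the labeled documents used to build the model.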
Typically, training data for the classifier is derived from experts that manually assign class labels to documents. Manual assignment, however, inherently exhibits a certain level of inconsistency because experts with varying levels of domain knowledge and experience may interpret the same class differently. In addition, the tedious nature of manual assignment can further aggravate the requirement that large amounts of correctly labeled documents be provided to classifiers in order to generalize well. Furthermore, manual assignment of class labels by experts can be an expensive process.
In a typical Corporate Action Announcement process, multiple messages are usually procured from numerous sources, which may even involve message interpretation and translation by several intermediaries. This most often brings about inconsistency and inaccuracy in the information gathered. Further, the currently available methodologies provide source-level precedence for incoming messages, which restricts entities from choosing the best data available in the market for any event type, particularly in cases where the reliability of the information provided varies from source to source. Since multiple heterogeneous sources extract and transmit messages ostensibly relating to the same data element, much discrepancy has been observed in the field/attribute-level information of theoretically the same data.
There has been no robust solution that can automate the process by configuring it to choose one attribute from one source (e.g. Record Date) and another attribute from a different source (e.g. Rate of Interest) for the same announcement. This forces an entity handling heavy message volumes to rely upon a single best source for data acquisition. Additionally, the user of the entity cannot readily determine whether potentially conflicting information is available in the market. This raises the concern of obtaining standardized information using best market practices, the said standardization involving scrubbing of data at the field or attribute level.
Furthermore, there is no single-point view of data or narratives involving all unformatted and unstructured text gathered from numerous sources, nor of the data/narratives that get promoted to the Golden Copy. This makes it difficult for the user to have a consolidated and uniform view of the data/narratives provided from different sources, including registrars, depositories, exchanges or data vendors, and of the final data/narrative to be present in the Golden Copy. Also, in the absence of any standardized message processing system, it becomes difficult to even identify the best data that can be promoted to the Golden Copy. Usability is missing for users to manage the narratives from a single point, as the user may not be able to obtain a complete view of the data/narrative received from various sources and present in the Golden Copy. The current environment also does not provide any process that can identify the presence of a narrative and identify the differences in narratives provided from replacement messages.
In light of the foregoing, there exists a need for a method of reconciling and cleansing or scrubbing data obtained from various heterogeneous sources at the field level to produce a single authoritative Golden Copy of a Corporate Event. There has further arisen a need to provide a system that is capable of presenting a consolidated view of data/narratives received from various sources and the one that shall be present in the Golden Copy for enhanced analysis and better management.
PRIOR ART SEARCH
US20020111755A1* (2000-10-19 / 2002-08-15) TTI-Team Telecom International Ltd. Topology-based reasoning apparatus for root-cause analysis of network faults.
WO2003021421A1 (2001-08-30 / 2003-03-13) Kana Software, Inc. Classification learning system.
US20040024769A1 (2002-08-02 / 2004-02-05) Forman, George H. System and method for inducing a top-down hierarchical categorizer.
US20060282415A1 (2005-06-09 / 2006-12-14) Fuji Xerox Co., Ltd. Document retrieval apparatus.
EP1903479A1 (2006-08-25 / 2008-03-26) Research In Motion Limited. Method and system for data classification using a self-organizing map.
US20080162390A1* (2000-09-25 / 2008-07-03) Kapoor, Harsh. Systems and methods for processing data flows.
US20100332474A1 (2009-06-25 / 2010-12-30) University of Tennessee Research Foundation. Method and apparatus for predicting object properties and events using similarity-based information retrieval and model.
US5506897A* (1993-02-22 / 1996-04-09) Murex Securities, Ltd. Automatic routing system for telephonic services.
WO2001029780A1 (1999-10-19 / 2001-04-26) Stamps.Com. Address matching system and method.
US6272495B1* (1997-04-22 / 2001-08-07) Hetherington, Greg. Method and apparatus for processing free-format data.
US6311178B1* (1997-09-29 / 2001-10-30) Webplus, Ltd. Multi-element confidence matching system and the method therefor.
US20020059142A1* (2000-04-21 / 2002-05-16) Krause, Robert. Systems and methods for providing change of address services over a network.
US6457012B1* (1997-06-10 / 2002-09-24) Pitney Bowes Inc. Method and system of updating address records utilizing a client-server interface.
US20020136381A1* (1996-06-10 / 2002-09-26) Shaffer, James D. One number, intelligent call processing system.
US5893111A* (1997-06-13 / 1999-04-06) Sharon, Jr., Paul A. Ad taking pagination information system.
US20020087515A1* (2000-11-03 / 2002-07-04) Swannack, Christopher Martyn. Data acquisition system.
US20020091685A1* (2001-01-10 / 2002-07-11) LetMeKnow Technology. System and method for filtering data events.
US20030061212A1* (2001-07-16 / 2003-03-27) Applied Materials, Inc. Method and apparatus for analyzing manufacturing data.
US20030115080A1* (2001-10-23 / 2003-06-19) Kasravi, Kasra. System and method for managing contracts using text mining.
US20040236620A1* (2003-05-19 / 2004-11-25) Chauhan, S. K. Automated utility supply management system integrating data sources including geographic information systems (GIS) data.
US20050033726A1* (2003-05-19 / 2005-02-10) Wu, Ju. Apparatus and method for accessing diverse native data sources through a metadata interface.
US20050060313A1* (2003-09-15 / 2005-03-17) Oracle International Corporation. Data quality analyzer.
US20050071370A1* (2001-11-01 / 2005-03-31) Altschul, Jacob Falkentorp. Automatic machine for production of sequences based on profiles as well as method for automatic production of sequences.
US6970881B1* (2001-05-07 / 2005-11-29) Intelligenxia, Inc. Concept-based method and system for dynamically analyzing unstructured information.
US20060238919A1* (2005-04-20 / 2006-10-26) The Boeing Company. Adaptive data cleaning.
US20070038657A1* (2005-08-11 / 2007-02-15) International Business Machines Corporation. Method, system and program product for determining objective function coefficients of a mathematical programming model.
OBJECTIVES OF THE INVENTION
1. The objective of the invention is to provide systems and techniques for improving the training and mapping of machine learning classifiers, wherein a classifier is trained using a set of validated documents that are accurately associated with a set of Class-1, Class-2, Class-3 labels.
2. The objective of the invention is to identify a subset of non-validated documents and use it to further train and improve the accuracy of the classifier, and to provide a system and method enabling automated data cleansing and scrubbing at the attribute level.
3. The objective of the invention is to provide, on a single user interface, a consolidated view of the scrubbed data or narratives promoted to a final copy together with the data or narratives received from multiple sources.
4. The objective of the invention is for the technology to automatically decide to clean the data as required by the storage system rather than waiting for a user interrupt; after cleaning, some unused data can be deleted or shifted without user acknowledgment, and the shifted data is automatically fetched by the private storage server without delay.
5. The objective of the invention is to apply a plurality of profiles included in the set of profiles to each document of the UNVDS in a decreasing order of profile strictness.
6. The objective of the invention is to delete at least one document from the UNVDS if that document is similar to any of the documents included in the first set of documents.
7. The objective of the invention is to compute a cosine similarity for the at least one document when comparing documents.
SUMMARY OF THE INVENTION
Systems and techniques for improving the training of machine learning classifiers are disclosed. A classifier is trained using a set of validated documents that are accurately associated with a set of class labels. A subset of non-validated documents is also identified and is used to further train and improve accuracy of the classifier. Various aspects of the system relate to generating training data and training classifiers using the generated training data.
For example, according to one aspect, a method of training an initially trained classifier (ITC) that was generated using a set of verified documents associated with a set of class labels is disclosed. The set of verified documents is divided into a training set of documents and a test set of documents, and each class of the set of class labels is associated with a class list. The training set of documents having been further divided into an integer number of verified document sets (INVDS), the method including automatically inputting a set of unverified documents into the ITC, the set of unverified documents divided into an integer number of unverified document sets (UNVDS), and automatically identifying a subset of documents from the set of unverified documents. The method also includes automatically generating a final set of training documents based on the subset of documents and the set of verified documents, and training the ITC using the final set of training documents. The method can also include training a plurality of classifiers using the final set of training documents, and applying the plurality of classifiers to the test set of documents.
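The division of the verified documents into a test set and an integer number of verified document sets (INVDS) might be sketched as follows. The 20% test fraction, the fold count, and the round-robin fold assignment are assumptions made for the sketch; the patent does not fix them.

```python
import random

def split_verified(docs, test_fraction=0.2, n_folds=3, seed=7):
    """Divide verified documents into a held-out test set and n verified folds."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    docs = docs[:]
    rng.shuffle(docs)
    cut = int(len(docs) * test_fraction)
    test_set, train_set = docs[:cut], docs[cut:]
    # Round-robin assignment of training documents into n_folds folds (INVDS).
    folds = [train_set[i::n_folds] for i in range(n_folds)]
    return train_set, test_set, folds

docs = [f"doc-{i}" for i in range(10)]
train_set, test_set, folds = split_verified(docs)
print(len(train_set), len(test_set), [len(f) for f in folds])
```

The folds partition the training set exactly, so every verified training document belongs to one and only one verified document set.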
The method further includes executing a first loop code segment including a first loop construct written in a computer programming language, wherein the first loop code segment is executed at run time at least n times, wherein n is a value at run time of a first variable in a first loop termination condition; executing a second loop code segment including a second loop construct written in the computer programming language, wherein the second loop code segment is executed at least p×n times, wherein p is a value at run time of a second variable in a second loop termination condition; and executing a third loop code segment comprising a third loop construct written in the computer programming language, wherein the third loop code segment is executed p×n×q times, wherein q is a value at run time of a third variable in a third loop termination condition.
For each of the n times the first loop code is executed, the method includes assigning a first portion of the INVDS to a first set of documents, assigning a second portion of the INVDS to a second set of documents, the first portion different from the second portion, and training the ITC for each class associated with the set of class labels using documents in the first set of documents. The method also includes applying a plurality of classifiers to the documents in the second set of documents, and computing a first set of F1 scores associated with documents in the second set of documents. The first set of documents and the second set of documents may have no documents in common.
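The per-class F1 scores computed in the first loop follow the standard definition, the harmonic mean of precision and recall for one class. A minimal pure-Python version, written for illustration rather than taken from the patent, is:

```python
def f1_score(true_labels, predicted_labels, target_class):
    """Per-class F1: harmonic mean of precision and recall for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target_class and p == target_class)
    fp = sum(1 for t, p in pairs if t != target_class and p == target_class)
    fn = sum(1 for t, p in pairs if t == target_class and p != target_class)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

true = ["A", "A", "B", "A", "B"]
pred = ["A", "B", "B", "A", "A"]
print(round(f1_score(true, pred, "A"), 3))  # 0.667
```

Computing one such score per class label over the second (held-out) set of documents yields the first set of F1 scores against which later iterations are compared.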
For each of the p times the second loop code is executed, the method includes applying at least one profile of a set of profiles to each document of the UNVDS, the at least one profile defining a rule to be applied to each document of the UNVDS, comparing at least one document of the UNVDS to all of the documents included in the first set of documents, and deleting the at least one document from the UNVDS based on the comparison. The rule may include a class label retention scheme, a class label addition scheme, or a combination thereof.
The method can include applying a plurality of profiles included in the set of profiles to each document of the UNVDS in a decreasing order of profile strictness. In addition, in one embodiment, if the at least one document from the UNVDS is similar to any of the documents included in the first set of documents, the at least one document from the UNVDS is deleted. Comparing the at least one document can include computing a cosine similarity for the at least one document.
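Cosine similarity over bag-of-words term-count vectors is the standard way to compare two documents. The sketch below, including the similarity threshold used to delete near-duplicates from the UNVDS, is illustrative; the patent does not specify a vectorization scheme or a threshold value.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of two documents as bag-of-words term-count vectors."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# A UNVDS document whose similarity to any verified document exceeds the
# threshold would be deleted as a near-duplicate. 0.85 is an assumed value.
THRESHOLD = 0.85
sim = cosine_similarity("interest rate payment date",
                        "interest rate payment date due")
print(round(sim, 3), sim > THRESHOLD)
```

Identical documents score 1.0 and disjoint documents score 0.0, so the threshold directly controls how aggressively near-duplicates are pruned from the unverified sets.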
For each of the p times the second loop code is executed, the method includes forming a third set of documents by merging documents included in the first set of documents with documents not deleted from the UNVDS, and clustering the third set of documents using a clustering algorithm.
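The patent does not name the clustering algorithm applied to the merged third set of documents. A greedy single-pass (leader) clustering over cosine similarity is one simple possibility, sketched here purely for illustration; the 0.5 threshold is an assumption.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two documents as term-count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def leader_cluster(docs, threshold=0.5):
    """Each document joins the first cluster whose leader it resembles,
    otherwise it starts a new cluster."""
    clusters = []
    for doc in docs:
        for group in clusters:
            if cosine(group[0], doc) >= threshold:
                group.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters

merged = ["interest rate notice", "interest rate update", "server disk failure"]
groups = leader_cluster(merged)
print(len(groups))  # 2
```

Any standard clustering algorithm (k-means, hierarchical agglomerative, etc.) could equally serve here; this one was chosen only because it fits in a few lines.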
The method can also include, for each of the q times the third loop code is executed, applying the ITC to all documents in the UNVDS for each class defined in the set of class labels, forming a subset of documents from the UNVDS, each document of the subset of documents having at least one class associated therewith, training a second classifier using documents in the subset of documents and the first set of documents, and applying the second classifier to the second set of documents.
The method further includes computing a second set of F1 scores associated with documents in the second set of documents, comparing an F1 score associated with a class label from the second set of F1 scores to a corresponding F1 score associated with the class label from the first set of F1 scores, updating a best score for the class label based on the comparison, and adding documents from the subset of documents to the list of class labels based on the comparison.
Before the present methods, systems, and hardware enablement are described, it is to be understood that this invention is not limited to the particular systems, and methodologies described, as there can be multiple possible embodiments of the present invention which are not expressly illustrated in the present disclosures. It is also to be understood that the terminology used in the description is for the purpose of describing particular versions or embodiments only, and is not intended to limit the scope of the present invention which will be limited only by the appended claims.
A computer implemented method and system for performing data scrubbing at attribute level may be provided. The method may comprise: receiving data containing at least one significant or non-significant attribute and associated values from distributed data sources, the sources being assigned weight against each of the attribute value; and applying a ranking matrix process to the received data, the ranking process comprising: for sources referring to different values for the same attribute, computing a combined weight therefrom; in response to the combined weight of the significant attribute exceeding a predetermined promotion threshold value, determining if said combined weight exceeds a predetermined confirmation threshold value, in response to the combined weight of the non-significant attribute exceeding the promotion threshold value, promoting the non-significant attribute to a final template copy; in response to the combined weight of the significant or non-significant being less than the confirmation threshold value, computing a total weight of all values for the attribute from all sources; and in response to the total weight exceeding a predetermined task threshold, raising a work item for a user to create a manual source and reapplying the ranking matrix process.
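The ranking-matrix process for a single attribute can be sketched as below. The threshold values, the dictionary layout, and the collapsing of the significant/non-significant distinction into a single path are all simplifying assumptions made for illustration:

```python
def rank_attribute(source_values, promotion_t=0.6, confirmation_t=0.8, task_t=1.0):
    """Apply the ranking-matrix thresholds to one attribute.
    source_values maps a source name to (value, weight); thresholds are assumed.
    """
    # Combine the weights of sources that agree on the same value.
    combined = {}
    for value, weight in source_values.values():
        combined[value] = combined.get(value, 0.0) + weight
    best_value, best_weight = max(combined.items(), key=lambda kv: kv[1])
    if best_weight >= confirmation_t:
        return ("promote", best_value)      # confirmed: promote to the Golden Copy
    if best_weight >= promotion_t:
        return ("confirm", best_value)      # promoted but needs user confirmation
    if sum(combined.values()) >= task_t:
        return ("raise_work_item", None)    # conflicting data: manual source needed
    return ("wait", None)

# Hypothetical sources and weights for a "Rate of Interest" attribute.
sources = {"depository": ("5.25%", 0.5),
           "exchange":   ("5.25%", 0.4),
           "vendor":     ("5.00%", 0.3)}
print(rank_attribute(sources))  # ('promote', '5.25%')
```

Here the depository and exchange agree, so their combined weight (0.9) clears the confirmation threshold and their value would be promoted despite the vendor's conflicting value.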
BRIEF DESCRIPTION OF THE DIAGRAM
FIG. 1: is a schematic of an exemplary computer-based classifier system.
FIGS. 2A-B illustrate an exemplary method of improving training data for classifiers.
FIG.3: System Flow Diagram.
FIG. 4: depicts a block diagram of an exemplary data scrubbing system.
FIGS. 5(a) and (b): are flow diagrams illustrating a high-level view of the data scrubbing operation.
DESCRIPTION OF THE INVENTION
Exemplary embodiments illustrating features of the present disclosure will now be discussed in detail. The words "comprising," "having," "containing," and "including," and other forms thereof, are intended to be equivalent in meaning and be open ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items, or meant to be limited to only the listed item or items.
It must also be noted that as used herein and in the appended claims, the singular forms "a" "an," and "the" include plural references unless the context clearly dictates otherwise. Although any systems and methods similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present invention, the preferred systems and methods are now described.
The disclosed embodiments are mere examples and may be embodied in various forms. Software programming code, which embodies aspects of the present disclosure, is typically maintained in a permanent storage such as a computer readable medium. The software programming code may be embodied on any of a variety of known media for use with a data processing system. This includes, but is not limited to, magnetic and optical storage devices such as disk drives, magnetic tape, compact discs (CD's), digital video discs (DVD's), and computer instruction signals embodied in a transmission medium with or without a carrier wave upon which the signals are modulated. Further, a computerized method may refer to a method whose steps are performed by a computing system containing a suitable combination of one or more processors, memory means and storage means.
The illustrative embodiments are described below with reference to flowchart illustrations and/or block diagrams of methods and systems according to the illustrative embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Definitions:
1. Attribute: Elements of a message signified by key words, e.g. "Interest Rate", "Period of Interest Calculation", "Payment Date".
2. Cleansing and Scrubbing: Process to identify reliable data available from various information sources.
3. EPS: Event Preview Screen.
4. Conflict: Potential difference in information provided by the trusted sources where user confirmation is required. This term may also be used to signify the absence of mandatory information from reliable sources.
5. Golden Copy: Final message comprising the best reliable data available from various information sources.
6. GUI: Graphical User Interface.
7. Narrative: Unformatted text received/added in the message.
8. Ranking: Weight assigned to the source for a specific attribute.
9. Source: Data source (market entities) which provides necessary information to the concerned business entity.
Broadly, an embodiment of the present disclosure may provide an automated system and a method enabling field level scrubbing of data obtained from disparately located heterogeneous set of multiple sources. An embodiment may further provide a consolidated view of data or narratives that are obtained from various sources as well as the data or narratives present in the Golden Copy on a single user interface. The field level data scrubbing system and method, in accordance with one embodiment, may enable the user to decide the ranking of data that should be assigned per attribute received from various sources. In yet another embodiment, the present system and the method may empower the user to decide the ranking based on his heuristics with regard to the reliability of the source to provide correct information. According to configured ranking, the cleansing and scrubbing of the attributes may be performed and it may be decided whether the attribute will be accepted to be part of Golden Copy or not.
The present disclosure may provide a data cleansing and scrubbing system that is suitable for, but not limited to, applications in corporate action announcements. One embodiment of the present disclosure may, apart from data cleansing and scrubbing at the attribute level, provide a user interfacing screen, hereafter referred to as the Event Preview Screen, that provides a consolidated tabular view of the data/narratives received from various sources and the data/narratives present in the Golden Copy.
The Event Preview Screen of the present disclosure may enable the user to decide on the best data to be present in the Golden Copy in case of conflicts between data provided from various sources. Another embodiment may enable narrative management through addition or editing of narratives in the Golden Copy and allow the user to promote or demote attributes to/from the final Golden Copy. Another feature may enable highlighting the difference(s) in the narratives provided in the replacement messages. Also, all the erroneous attributes, along with the changed attributes, may be highlighted.
The Event Preview Screen may also provide the flexibility to perform various operations from a single point, such as, but not limited to, modifying an existing source, creating a new source, creating a duplicate of an existing source, viewing the raw message of an incoming message and the Golden Copy, audit viewing, attribute-level and option-level conflict resolution, and addition of new standard text narratives directly into the Golden Copy.
The invention includes methods and systems which facilitate automatic data cleansing (e.g., removal of noise, inconsistent data and errors) of data for training classifiers. In particular, methods and systems for optimizing the amount of training data available to train an initially trained classifier (ITC) are disclosed. The present invention is particularly beneficial in scenarios in which training data is limited, which can have a significant effect on the accuracy of a classifier.
It is known that data cleansing of training data often yields higher accuracy gains than improvements in machine learning algorithms. This invention is useful because it provides gains in classifier accuracy as a result of training on larger sets of cleaner data. Decreased production costs can also be obtained by not requiring experts to manually assign class labels to all training documents. Advantageously, the present invention can be used with different types of classifiers and domains as a general-purpose method when developing and training machine learning classifiers.
The following definitions are provided to facilitate understanding of the disclosure and are standard machine learning terms.
As used herein, the term "classifier" refers to a software component that accepts unlabeled documents as inputs and returns discrete classes. Classifiers are trained on labeled documents prior to being used on unlabeled documents.
As used herein, the term "training" refers to the process by which a classifier generates models and/or patterns from a training data set. A training data set comprises documents that have been mapped (e.g., labeled) to "known-good" classes. As used herein, the term "document" refers to a set of information input into a classifier. Example documents include, but are not limited to, electronic files and records.
As used herein, the term "class" refers to a discrete category with which a document is associated. The classifier's function is to predict the discrete category (e.g., label, class) to which a document belongs. As used herein, the term "labeling" refers to the process of associating a document to a set of correct classes. As used herein, the term "accuracy" refers to the rate of correct or incorrect labeling performed by a classifier over a test data set.
As used herein, the phrase "test data" refers to a data set used for estimating accuracy.
Turning now to FIG. 1, an example of a suitable computing system 10 within which embodiments of the present invention may be implemented is disclosed. The computing system 10 is only one example and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing system 10 be interpreted as having any dependency or requirement relating to any one or combination of illustrated components.
For example, the present invention is operational with numerous other general purpose or special purpose computing system environments or configurations, including consumer electronics, network PCs, minicomputers, mainframe computers, laptop computers, as well as distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, loop code segments and constructs, etc. that perform particular tasks or implement particular abstract data types. The invention can be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules are located in both local and remote computer storage media including memory storage devices. Tasks performed by the programs and modules are described below and with the aid of figures. Those skilled in the art can implement the description and figures as processor executable instructions, which can be written on any form of a computer readable media.
As shown in FIG. 1, the system 10 includes a server device 12 configured to include a processor 14, such as a central processing unit ('CPU'), random access memory ('RAM') 16, one or more input-output devices 18, such as a display device (not shown) and keyboard (not shown), and non-volatile memory 20, all of which are interconnected via a common bus 22 and controlled by the processor 14. As shown in the FIG. 1 example, in one embodiment, the non-volatile memory 20 is configured to include a classifier analyzer 24 for generating a set of training documents, and one or more machine learning classifiers 26 that are trained using the generated set of training documents. In one embodiment, the machine learning classifiers are binary text classifiers. Additional details of the classifier analyzer 24 and machine learning classifiers 26 are discussed in greater detail below.
The network 28 can include various devices such as routers, servers, and switching elements connected in an Intranet, Extranet or Internet configuration. In one embodiment, the network 28 uses wired communications to transfer information between an access device (not shown), the server device 12, and an operational data store 30. In another embodiment, the network 28 employs wireless communication protocols to transfer information between the access device, the server device 12, and the operational data store 30. In yet other embodiments, the network employs a combination of wired and wireless technologies to transfer information between the access device, the server device 12, and the operational data store 30.
The operational data store 30 is a repository that maintains and stores information utilized by the classifier analyzer 24 and the machine learning classifiers 26. In one embodiment, the operational data store 30 is a relational database. In another embodiment, the operational data store 30 is a directory server, such as a Lightweight Directory Access Protocol ('LDAP') server. In yet another embodiment, the operational data store 30 is an area of non-volatile memory 20 of the server 12.
As shown in FIG. 1, in one embodiment, the operational data store 30 includes a class data store (CLS) 32 that stores a taxonomy of class labels that are used in classifying a document into one or more discrete categories. An uncertified document data store (UDS) 33 is also provided that includes a set of documents having zero (0) or more class labels associated with each document, and which has not been independently validated by an expert, and a certified document data store (CDS) 34 that includes a set of documents having one (1) or more class labels associated with each document and which has been independently verified by an expert.
As shown in the FIG. 1 example, the operational data store 30 is also configured to maintain and store a profile data store (PROFILES) 35 that include profiles defining rules/schemas for adding and retaining class labels associated with documents, and a result data store (FTDS) 36 that includes training data generated by the classifier analyzer 24 by combining documents stored in the CDS 34 with a select subset of documents stored in the UDS 33. Additional details of each of these data stores 32-36 are discussed in connection with FIGS. 2A-B.
Although the operational data store 30 shown in FIG. 1 is connected to the network 28, it will be appreciated by one skilled in the art that the operational data store 30 and/or any of the data stores 32-36 shown in FIG. 1, can be distributed across various servers and be accessible to the server 12 over the network 28, be coupled directly to the server 12, or be configured in an area of non-volatile memory 20 of the server 12.
Further, it should be noted that the system 10 shown in FIG. 1 is one embodiment of the disclosure. Other system embodiments of the disclosure may include additional structures that are not shown, such as secondary storage and additional computational devices. In addition, various other embodiments of the disclosure include fewer structures than those shown in FIG. 1. For example, in one embodiment, the disclosure is implemented on a single computing device in a non-networked standalone configuration. Data input is communicated to the computing device via an input device, such as a keyboard and/or mouse. Data output of the system is communicated from the computing device to a display device, such as a computer monitor.
Referring now to FIGS. 2A-B, a computer-implemented method for training an initially trained classifier (ITC) using a set of training documents generated by the classifier analyzer 24 is disclosed. The method includes executing a plurality of loop code segments that include loop constructs written in a computer programming language. Each of the loop code segments is executed at run time a number of times based on a value at run time of variables included in corresponding loop termination conditions. In one embodiment, the method is implemented using three loop code segments and corresponding loop termination conditions. In another embodiment, as set forth below, the method is implemented using four loop code segments with corresponding loop termination conditions.
For example, in one embodiment, as shown at step 40 of FIG. 2A, the classifier analyzer 24 initializes a list of document candidates (LST) for each class defined in the CLS data store 32. The document list LST is used by the classifier analyzer 24 to store document candidates that are to be added to training data. At initialization, the LST includes no documents. Next, at step 42, the classifier analyzer 24 initializes a plurality of counter variables C1, C2, C3 and C4 to a value of one (1). Each of the plurality of counter variables C1, C2, C3, C4 is used to determine the number of times loop code segments are to be executed. Once counter variable and list initializations are complete, at step 44, the classifier analyzer 24 divides documents included in the CDS data store 34 into a training set (TRN) and a test set (TST) of documents. The TRN data set includes documents that have been accurately labeled to known-good classes and the TST data set includes documents that are used for estimating accuracy of machine learning classifiers trained on the TRN data set. In one embodiment, the classifier analyzer 24 randomly splits the set of documents into the TRN and TST set of documents based on a user-defined percentage associated with the TRN and TST set of documents, respectively.
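The random split at step 44 might be sketched as follows. The function name, seed, and default percentage are illustrative assumptions; the disclosure only specifies a shuffle governed by a user-defined percentage.

```python
import random

def split_certified(docs, train_pct=0.8, seed=7):
    """Randomly split certified (CDS) documents into a training set
    (TRN) and a test set (TST) based on a user-defined percentage.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    shuffled = docs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_pct)
    return shuffled[:cut], shuffled[cut:]  # (TRN, TST)
```

With `train_pct=0.8`, a ten-document CDS yields eight TRN and two TST documents.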
Next, at step 46, the classifier analyzer 24 divides the TRN data set into a plurality of K document sets, where K is an integer value. In one embodiment, the number of documents included in each of the K document sets is approximately equal. Once the TRN data set is divided into K document sets, at step 48, the classifier analyzer 24 assigns all documents of the K document sets to a total document set (TD) excluding one of the documents sets referenced by counter variable C1, hereinafter referred to as the K(C1) data set. Next, at step 50, the classifier analyzer 24 assigns the K(C1) data set to a single document set (SD).
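Steps 46-50 amount to a hold-one-out partition over K folds. A minimal sketch, with hypothetical helper names, might be:

```python
def k_folds(trn, k):
    """Step 46: divide TRN into K roughly equal document sets."""
    return [trn[i::k] for i in range(k)]

def td_sd(folds, c1):
    """Steps 48-50: SD is the fold referenced by C1 (1-based);
    TD is every other fold merged together."""
    sd = folds[c1 - 1]
    td = [d for i, f in enumerate(folds) if i != c1 - 1 for d in f]
    return td, sd
```

As C1 advances, each fold serves once as the held-out SD set while the remaining folds form TD.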
Next, at step 52, a first classifier of the machine learning classifiers 26 is trained for each class in the CLS data store 32 using documents in the TD set. The classifier analyzer 24, at step 54, next applies all machine learning classifiers 26 to the SD set of documents. In one embodiment, the classifier analyzer 24 computes and records F1 scores (e.g., a measure of a test's accuracy that considers both the precision (p) and recall (r) of the test, as known in the art) for each class determined. Once the machine learning classifiers 26 are trained, at step 56, the classifier analyzer 24 divides documents from the UDS 33 into N data sets of UDS documents, where N is an integer value.
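The F1 score recorded at step 54 is the harmonic mean of precision (p) and recall (r). A minimal computation from raw counts (true positives, false positives, false negatives) is:

```python
def f1_score(tp, fp, fn):
    """F1 = 2pr / (p + r), where p = tp/(tp+fp) and r = tp/(tp+fn).
    Zero denominators are mapped to a score of 0.0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

For example, 8 true positives with 2 false positives and 2 false negatives gives p = r = 0.8, hence F1 = 0.8.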
Once documents from the UDS 33 are divided into N data sets, the classifier analyzer 24, at step 58, applies a plurality of label retention and addition rules/schemes accessed from the PROFILES data store 35 to one of the N data sets. The particular N data set is identified based on the value stored in counter variable C2, hereinafter referred to as N(C2), and the particular label retention and addition rule/scheme applied to the N data set is based on the value of counter variable C3. Each rule is associated with a criterion that if met, triggers the rule. In one embodiment, the rules are applied to the N data set based on a descending order of strictness (e.g., criteria that must be met).
Referring back to FIG. 2A, at step 60, the classifier analyzer 24 removes any document in the N(C2) data set that is similar to any document in the TD data set. In one embodiment, the classifier analyzer 24 applies cosine similarity to determine document similarity and removes any document from the N(C2) data set where the computed cosine value equals or exceeds a pre-determined value. The classifier analyzer 24, at step 62, then forms a new set L of documents by merging all remaining documents in the N(C2) data set with the TD data set. In one embodiment, at step 64, the classifier analyzer 24 then clusters the set L of documents using a clustering algorithm. Various clustering techniques known in the art can be used to cluster the set L of documents.
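The similarity filter at step 60 can be sketched with a simple bag-of-words cosine; the tokenization, the helper names, and the 0.95 threshold are illustrative assumptions (the disclosure only requires removal when cosine similarity equals or exceeds a pre-determined value).

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two whitespace-tokenized documents
    represented as term-frequency vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def drop_near_duplicates(n_c2, td, threshold=0.95):
    """Step 60: remove documents from N(C2) whose cosine similarity
    to any TD document equals or exceeds the threshold."""
    return [d for d in n_c2
            if all(cosine(d, t) < threshold for t in td)]
```

A production implementation would typically use TF-IDF weighting rather than raw term frequencies, but the filtering logic is the same.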
Next, at step 66, the first classifier of the machine learning classifiers 26 is trained on all remaining documents of the N(C2) data set associated with a class in the CLS data store 32, which is referenced by a value stored in the C4 variable. At step 68, the classifier analyzer 24 then applies label retention and addition rules/schemes accessed from the PROFILES data store 35 to all documents in the N(C2) data set for the associated class.
Once the label retention and addition rules have been applied, at step 70, the classifier analyzer 24 forms a Q document data store from a subset of the N(C2) data set where the class, referenced by the C4 variable, has been assigned. At step 72, a second classifier of the machine learning classifiers 26 is then trained using documents of the TD data set and the Q document data store. Next, at step 74, the second classifier of the machine learning classifiers 26 is applied to the SD data set and an F1 score for the class in the CLS data store 32 referenced by the value C4 is computed by the classifier analyzer 24.
At step 76, the classifier analyzer 24 compares the computed F1 score for the class referenced by the value C4 with a current best score for the class. If the classifier analyzer 24 determines that the computed F1 score for the class is higher than a previously stored best score for the class, as shown in step 78, the classifier analyzer 24 updates the best score value for the class, as indicated in step 80, adds documents from the Q document data store to the list of document candidates LST, as indicated in step 82, and then increments C4, as indicated in step 84.
Next, at step 86, a first loop termination condition is evaluated. The classifier analyzer 24 determines whether the value stored in counter variable C4 is greater than the total number of classes stored in the CLS data store 32. If the value of counter variable C4 is not greater than the total number of classes, steps 66-86 of the above described method are repeated in a first loop code segment. Otherwise, as shown at step 87, the classifier analyzer 24 increments C2.
Next, at step 88, a second loop termination condition is evaluated. The classifier analyzer 24 determines whether the value stored in the counter variable C2 is greater than the number of N data sets. If the value of counter variable C2 is not greater than the number of N data sets, steps 60-88 of a second loop code segment are repeated. Otherwise, if the value of the C2 variable exceeds the number of N data sets, at step 90, the classifier analyzer 24 increments C3.
Next, at step 92, the classifier analyzer 24 compares the value of variable C3 to a total number of label retention and addition rules/schemes stored in the PROFILES data store 35. If the value of counter variable C3 does not exceed the total number of label retention and addition rules/schemes stored in the PROFILES data store 35, steps 58-92 of a third loop code segment are repeated. Otherwise, if the value of the counter variable C3 exceeds the total number of label retention and addition rules/schemes, as shown at step 94, the classifier analyzer 24 increments C1, and determines at step 96 whether the value of counter variable C1 exceeds the total number of K data sets. If the value of counter variable C1 does not exceed the total number of K data sets, steps 48-96 of a fourth loop code segment are repeated.
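Taken together, steps 48-96 form four nested loop code segments. The control-flow skeleton below is a sketch; the callback `body` stands in for the per-class work of steps 66-84 (training, scoring, and candidate collection), which is omitted here.

```python
def run_training_loops(num_k, num_n, num_rules, num_classes, body):
    """Skeleton of the four nested loops. C1 walks the K folds
    (fourth loop, step 96), C3 the label retention/addition rules
    (third loop, step 92), C2 the N uncertified data sets (second
    loop, step 88), and C4 the classes (first loop, step 86)."""
    for c1 in range(1, num_k + 1):
        for c3 in range(1, num_rules + 1):
            for c2 in range(1, num_n + 1):
                for c4 in range(1, num_classes + 1):
                    body(c1, c2, c3, c4)  # steps 66-84
```

Each rule/scheme is thus evaluated against every uncertified data set and every class, once per fold, before the best-scoring candidates are promoted.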
Otherwise, at step 98, for each class defined in the CLS data store, a predetermined number of documents are added to the TRN data set. For example, in one embodiment, the top 20 documents from the list of document candidates LST are added to the TRN data set. Next, at step 100, all machine learning classifiers are retrained using all of the documents in the original TRN data set as both positive and negative documents, and documents identified from the LST as being positive. The positive and negative documents for a class "c" are used by machine learning classifiers to generate a model that is likely to assign class "c" to unlabeled documents that match more of the characteristics of documents that were marked as positive for "c", and fewer of the characteristics of documents that were marked as negative for "c." All retrained machine learning classifiers 26 are then applied to the TST data store, as indicated in step 102. Lastly, as indicated in step 104, the classifier analyzer 24 stores all documents in the TRN data set in the FTDS 36 for subsequent use by the machine learning classifiers 26.
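The per-class selection at step 98 might be sketched as follows. The representation of LST as a mapping from class to (score, document) pairs is an assumption; the disclosure does not specify its structure.

```python
def top_candidates(lst, n=20):
    """Step 98: for each class, take the top-N scoring candidate
    documents from LST for promotion into the TRN data set."""
    return {cls: [doc for _, doc in sorted(pairs, reverse=True)[:n]]
            for cls, pairs in lst.items()}
```

For instance, with n=2 and candidates scored 0.9, 0.5 and 0.1 for a class, only the two highest-scoring documents are promoted.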
FIG. 3 depicts a schematic block diagram of a data scrubbing system, according to an embodiment of the disclosure. Broadly, the system 100 may include three major components, namely an input interface 101, a processor unit 102 and a display interface 103. The input interface may be adapted to receive feeds or data from disparately located heterogeneous data sources. The processor unit 102 may include an executing engine (or a workflow engine) which performs the attribute-level cleansing and scrubbing of data by applying the ranking matrix rule process and provides decision making and control operations. The processor unit 102 may further include a database that maintains a set of predefined rules, namely an event mapping rule, a vendor priority rule, a Golden Copy match rule, a completeness and confirmation rule, a constraint rule and a ranking matrix rule, as well as historical data. The attributes processed by the executing engine and qualifying the ranking matrix process may get promoted to the Golden Copy. The final copy may be displayed by a display interface 103 or output interface 103, also referred to as the Event Preview Screen.
Now turning briefly to FIGS. 5 (a) and (b), there is shown a technical flow diagram of performing the data scrubbing operation. As shown, the corporate announcements may be captured from various sources, such as custodians, vendors or manual entries, by the input interface 101. This may be followed by vendor/data provider mapping, event mapping and security mapping in response to a security request. It may be followed by a cleansing mechanism whereby the duplicate entries for an announcement are dropped. A Heldness check may also be performed to ensure that only relevant entries are allowed, following which a trigger rule may be processed. Next, a narrative matching process may be initiated which checks whether the incoming data for an announcement is the same. Eventually, a Golden Copy template may be created based on the content of the received incoming data. In case the system reports an error, a repair activity may be performed. Upon rectification of the error, the attribute may follow the designated Straight Through Processing (STP) path; otherwise, the announcement may be either dropped completely or modified, as needed.
FIG. 5 (b) shows an exemplary application of the ranking matrix rule after the narrative matching process is performed. Upon finding an exact match, the ranking matrix rule may be applied. The ranking matrix rule may be a combination of a ranking attribute rule and a source weighting rule. Based on its setup, this rule may perform the cleansing and scrubbing of the data received from various sources and finalize the data to be present in the Golden Copy.
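One plausible reading of the source weighting component of the ranking matrix rule is that, for each attribute, the value reported by the highest-weighted source is promoted to the Golden Copy. The function, field names, and weighting scheme below are illustrative assumptions; the disclosure does not specify a scoring formula.

```python
def golden_copy(values_by_source, source_weights):
    """For each attribute, promote the candidate value from the
    source with the highest configured weight to the Golden Copy.

    values_by_source: {attribute: [(source, value), ...]}
    source_weights:   {source: numeric weight}
    """
    golden = {}
    for attr, candidates in values_by_source.items():
        source, value = max(
            candidates,
            key=lambda sv: source_weights.get(sv[0], 0))
        golden[attr] = value
    return golden
```

For example, if vendor B carries a higher configured weight than vendor A, vendor B's value for a conflicting attribute wins promotion.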
WE CLAIM:
1) Our invention "AML- Data Cleaning" is a system and technique for improving the training and mapping of machine learning classifiers, wherein a classifier is trained using a set of validated documents that are accurately associated with a set of Class-1, Class-2, Class-3 labels. In the invented technology, a subset of non-validated documents is also identified and used to further train and improve the accuracy of the classifier. The invented system also provides a system and method enabling automated data cleansing and scrubbing at the attribute level. A consolidated view may be provided of the scrubbed data or narratives that get promoted to a final copy, together with the data or narratives received from multiple sources, on a single user interface. The technology automatically takes the decision to clean the data as per the requirements of the storage system, without waiting for a user interrupt. After data cleaning, some unused data can be deleted and shifted without user acknowledgment. The shifted data is automatically fetched by the private storage server without delay.
2) According to claim 1, the invention is a system and technique for improving the training and mapping of machine learning classifiers, wherein a classifier is trained using a set of validated documents that are accurately associated with a set of Class-1, Class-2, Class-3 labels.
3) According to claims 1 and 2, in the invented technology a subset of non-validated documents is also identified and used to further train and improve the accuracy of the classifier. The invented system also provides a system and method enabling automated data cleansing and scrubbing at the attribute level.
4) According to claims 1, 2 and 3, a consolidated view may be provided of the scrubbed data or narratives that get promoted to a final copy, together with the data or narratives received from multiple sources, on a single user interface.
5) According to claims 1, 2 and 3, the technology automatically takes the decision to clean the data as per the requirements of the storage system, without waiting for a user interrupt. After data cleaning, some unused data can be deleted and shifted without user acknowledgment. The shifted data is automatically fetched by the private storage server without delay. The invention further comprises applying a plurality of profiles included in the set of profiles to each document of the UNVDS in decreasing order of profile strictness, wherein at least one document from the UNVDS is deleted if it is similar to any of the documents included in the first set of documents, and wherein comparing the at least one document comprises computing a cosine similarity for the at least one document.
FIG. 1: IS A SCHEMATIC OF AN EXEMPLARY COMPUTER-BASED CLASSIFIER SYSTEM.
FIG. 2A: IS AN EXEMPLARY METHOD OF IMPROVING TRAINING DATA FOR CLASSIFIERS.
FIG. 2B: IS AN EXEMPLARY METHOD OF IMPROVING TRAINING DATA FOR CLASSIFIERS.
FIG. 3: IS A SYSTEM FLOW DIAGRAM.
FIG. 4: IS AN EXAMPLE OF LABEL RETENTION AND ADDITION SCHEMES.
FIG. 5 (A) IS A FLOW DIAGRAM ILLUSTRATING A HIGH LEVEL VIEW OF PERFORMING DATA SCRUBBING OPERATION.
FIG. 5 (B): IS A FLOW DIAGRAM ILLUSTRATING A HIGH LEVEL VIEW OF PERFORMING DATA SCRUBBING OPERATION.
AU2020102190A 2020-09-09 2020-09-09 AML- Data Cleaning: AUTOMATIC DATA CLEANING USING MACHINE LEARNING PROGRAMMING Ceased AU2020102190A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2020102190A AU2020102190A4 (en) 2020-09-09 2020-09-09 AML- Data Cleaning: AUTOMATIC DATA CLEANING USING MACHINE LEARNING PROGRAMMING


Publications (1)

Publication Number Publication Date
AU2020102190A4 true AU2020102190A4 (en) 2020-10-22

Family

ID=72833567


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297744A (en) * 2021-05-28 2021-08-24 国网浙江省电力有限公司营销服务中心 Charging pile data cleaning method suitable for error monitoring calculation and charging station
CN113297744B (en) * 2021-05-28 2023-11-07 国网浙江省电力有限公司营销服务中心 Charging pile data cleaning method suitable for error monitoring calculation and charging station


Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry