GB2567148A - Methods and apparatuses relating to data classification - Google Patents

Info

Publication number
GB2567148A
Authority
GB
United Kingdom
Prior art keywords
data
data records
records
learning process
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1715752.0A
Other versions
GB201715752D0 (en
Inventor
Louçã André
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WARWICK ANALYTICAL SOFTWARE Ltd
Original Assignee
WARWICK ANALYTICAL SOFTWARE Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WARWICK ANALYTICAL SOFTWARE Ltd
Priority to GB1715752.0A (GB2567148A)
Publication of GB201715752D0
Priority to PCT/GB2018/052772 (WO2019064016A1)
Publication of GB2567148A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A computer-implemented method of data classification comprises receiving S301 a data record definition (202, fig 2) and corresponding processing instruction (201, fig 2) for a plurality of data records, determining S302 a common data format according to the definition for determining the inputs to a machine learning process, receiving one or more data records and processing S303 each record in accordance with the processing instruction, converting S304 the processed data records into the common format, and classifying S305 the converted data records using a data model trained by the machine learning process, or generating a trained data model based on the converted data records. When generating a trained data model, any unlabelled data records received may be labelled in accordance with a machine learning process. Preferably, the definition defines the data fields to be used in a machine learning process and a data record abstractor (204, fig 2) abstracts the definition to determine the common data format. A user interface may allow a user to provide data record definitions and processing instructions and to manipulate classes and records.

Description

Methods and Apparatuses Relating to Data Classification
Technical Field
The present invention relates to classification of data.
Background
Machine learning can be used to analyse data and arrange it into classes. This classification has a variety of applications, for example classifying pictures, text records, customer behaviour, among others, to enable computers to take decisions, automate processes and make recommendations based on fresh input data. Supervised classification involves a human training machine learning algorithms, for example by providing classes to a sample of records. The algorithms learn from this training set and build classification models which can then be used to classify fresh data. This is commonly called predictive analytics. Active Learning is a form of supervised classification machine learning whereby the algorithms interact with a source of true classification (such as a human, for example) to point them to the data to classify next in order to reduce the required input from the true classification source (human), whilst improving the performance of the machine learning models.
Active Learning, and indeed all supervised classification machine learning, remains largely the preserve of highly-skilled data scientists due to the steps that need to be taken to convert raw input data into a form suitable to be input into a machine learning environment. It is therefore difficult for non-data scientists to make use of classification machine learning systems.
Summary of the Invention
According to a first aspect, the specification describes a computer implemented method of data classification comprising:
receiving a data record definition and corresponding processing instruction for a plurality of data records;
determining a common data format according to the received data record definition for determining the inputs for a machine learning process;
receiving one or more data records and processing each of the one or more data records in accordance with the processing instruction;
converting the one or more processed data records into the determined common format; and
classifying the one or more converted data records using a trained data model, wherein said trained data model is trained in accordance with the machine learning process.
According to a second aspect, the specification describes a computer implemented method comprising:
receiving a plurality of training data records wherein at least some of the training data records are not labelled;
labelling the data records which are not labelled in accordance with a machine learning process;
receiving a data record definition and corresponding processing instruction for the plurality of data records;
determining a common data format according to the received data record definition for determining the inputs for a machine learning process;
processing the data records in accordance with the processing instruction;
converting the one or more processed data records into the determined common format; and
generating a trained data model for classifying data records, comprising training said data model based on the converted data records.
Training the data model may comprise using a machine learning process according to the first or second aspect.
The data model may be trained in accordance with an Active Learning process according to the first or second aspect.
The plurality of data records may comprise at least one of unstructured text and different data fields according to the first or second aspect.
The data record definition may be abstracted in order to determine the common data format according to the first or second aspect.
The method according to the first or second aspect may further include providing an indication that a data record comprises data that does not correspond to an existing class, and presenting said data record to a user to be labelled.
Processing the data records may comprise generating a text document suitable for input into a machine learning process according to the first or second aspect.
The method according to the first or second aspect may further comprise receiving a data record definition from a user via a user interface.
The method according to the first or second aspect may further comprise receiving a user input via a user interface to manipulate classes and/or records.
The method according to the first or second aspect may further comprise automatically re-generating a trained data model in response to receiving an input from a user to manipulate classes and/or records.
According to a third aspect, this specification describes apparatus configured to perform any method as described with reference to the first and/or second aspect.
According to a fourth aspect, this specification describes computer readable instructions which, when executed by computing apparatus, cause the computing apparatus to perform any method as described with reference to the first and/or second aspect.
According to a fifth aspect, this specification describes apparatus comprising: at least one processor; and at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to perform: receiving a data record definition and corresponding processing instruction for a plurality of data records; determining a common data format according to the received data record definition for determining the inputs for a machine learning process; receiving one or more data records and processing each of the one or more data records in accordance with the processing instruction; converting the one or more processed data records into the determined common format; and classifying the one or more converted data records using a trained data model, wherein said trained data model is trained in accordance with the machine learning process.
According to a sixth aspect, this specification describes apparatus comprising: at least one processor; and at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to perform: receiving a plurality of training data records wherein at least some of the training data records are not labelled; labelling the data records which are not labelled in accordance with a machine learning process; receiving a data record definition and corresponding processing instruction for the plurality of data records; determining a common data format according to the received data record definition for determining the inputs for a machine learning process; processing the data records in accordance with the processing instruction; converting the one or more processed data records into the determined common format; and generating a trained data model for classifying data records, comprising training said data model based on the converted data records.
Training the data model may comprise using a machine learning process according to the sixth aspect.
The data model may be trained in accordance with an Active Learning process according to the fifth or sixth aspect.
The plurality of data records may comprise at least one of unstructured text and different data fields according to the fifth or sixth aspect.
The data record definition may be abstracted in order to determine the common data format according to the fifth or sixth aspect.
The computer program code, when executed by the at least one processor, may cause the apparatus to perform: providing an indication that a data record comprises data that does not correspond to an existing class, and presenting said data record to a user to be labelled, according to the fifth or sixth aspect.
Processing the data records comprises generating a text document suitable for input into a machine learning process according to the fifth or sixth aspect.
The computer program code, when executed by the at least one processor, may cause the apparatus to perform: receiving a data record definition from a user via a user interface, according to the fifth or sixth aspect.
The computer program code, when executed by the at least one processor, may cause the apparatus to perform: receiving a user input via a user interface to manipulate classes and/or records according to the fifth or sixth aspect.
The computer program code, when executed by the at least one processor, may cause the apparatus to perform: automatically re-generating a trained data model in response to receiving an input from a user to manipulate classes and/or records according to the fifth or sixth aspect.
According to a seventh aspect, this specification describes a computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by the at least one processor, causes the performance of at least: receiving a plurality of training data records wherein at least some of the training data records are not labelled; labelling the data records which are not labelled in accordance with a machine learning process; receiving a data record definition and corresponding processing instruction for the plurality of data records; determining a common data format according to the received data record definition for determining the inputs for a machine learning process; processing the data records in accordance with the processing instruction; converting the one or more processed data records into the determined common format; and generating a trained data model for classifying data records, comprising training said data model based on the converted data records.
According to an eighth aspect, this specification describes apparatus comprising means for: receiving a plurality of training data records wherein at least some of the training data records are not labelled; labelling the data records which are not labelled in accordance with a machine learning process; receiving a data record definition and corresponding processing instruction for the plurality of data records; determining a common data format according to the received data record definition for determining the inputs for a machine learning process; processing the data records in accordance with the processing instruction; converting the one or more processed data records into the determined common format; and generating a trained data model for classifying data records, comprising training said data model based on the converted data records.
Brief Description of the Drawings
For a more complete understanding of the methods, apparatuses and computer-readable instructions described herein, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:
Figure 1a is an example of a data record;
Figure 1b is an example of a data record;
Figure 2 illustrates a system according to embodiments of the invention;
Figure 3 is a flow chart illustrating an example of operations which may be performed according to embodiments of the invention.
Detailed Description
In the following detailed description, only certain exemplary embodiments of the present invention have been shown and described, simply by way of illustration. As those skilled in the art would realise, the described embodiments may be modified in various different ways, all without departing from the scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature and not restrictive. Like reference numerals designate like elements throughout the specification.
Text and other data used to train machine learning models and which is classified may vary enormously from one situation to the next, and for different organisations. It can vary in any and all ways: length, language, style, acronyms, which fields (or partial fields) are used and the subject domain. In addition, data structures may change over time for any one particular situation. In addition, there may be variety in the end-resulting classes (i.e. ‘labels’) themselves. This depends on what the classification is being used for; for example, the classes for automatic routing of customer enquiries to particular departments might differ from those for classifying whether the customer is happy, even for the same dataset. As will be described in more detail below, embodiments of the present invention are able to handle simultaneously such variety in data structure and labelling.
For example, input data may be provided in a number of different formats comprising different structured and unstructured fields. The present invention provides for input data provided in any format to be received by the system and converted into a common format in order to be processed by the machine learning system. In addition, the user may predefine which data is to be used in the machine learning process.
An example of a data structure including structured and unstructured fields is set out in Figures 1a and 1b. Figure 1a illustrates an example of a data record. In this example, the data record corresponds to a hotel review. The review includes structured fields 1 to 5 in which the user can give a rating in relation to specific questions (location, service, etc.), and the final field is a general comments field in which users can comment on the holiday. In addition, the review includes a field for unstructured text comments.
For the machine learning process, the desired classes might be (i) the topics that people are talking about within each response, (ii) the opinions of those topics and (iii) the emotional state of the responder. These resultant classes, or labels, can all then later be analysed, with or without the numerical responses, to identify which topics are important to which customers. Each field may have its own pre-processing rules in terms of acronyms, and other cleansing, Natural Language Processing (“NLP”) rules or data transformations (e.g. synonym replacement, normalization of date formats). Even the ostensibly simple addition of a sixth field to the data structure of Figure 1a (e.g. as depicted by Figure 1b, in which a new field “food” is added) causes challenges in a conventional system which is just classifying the comments field: such a system will no longer work without starting again and retraining, unless it knows that the comments field (which might now have a different reference caused by the addition of the new field) is the one to classify.
Another example may relate to classifying maintenance records of some machinery to try to analyse why certain failures are happening. This might require the concatenation of several written reports (e.g. from a customer and then one or more technicians) to be used in the classification. A change in the data structure such as new or changing fields (e.g. due to new recording processes or IT changes) would mean that the entire analysis would need to be restarted in a conventional system. A generic system that can easily be configured and used by a non-data scientist, for either of the described examples relating to the maintenance records or the holiday reviews, has not been achieved before now. Such a solution has presented many difficulties: in conventional systems each classification workflow is crafted by a data scientist for a particular static dataset, and a significant amount of time is also spent validating and curating these workflows in case of changes.
The present invention provides a method and system for data classification, useable by a non-data scientist, which is able to deal with data provided in a variety of formats, including data formats which may change over time. Embodiments of the present invention may therefore provide a solution for dealing with any format of incoming data, in contrast to prior art systems, which may be configured only to receive data in a single predetermined format, or a small group of separately-defined formats. The present invention provides a technical solution for data records of different structures to be processed in such a way as to be suitable for input into the machine learning system. Therefore, the present invention does not require that the inputs are provided in exactly the same format in order to be processed by the machine learning system, as may be the case for many existing machine learning systems.
Machine learning classification may comprise the steps of text pre-processing, data modelling, and model generation. Text pre-processing relates to transforming the text from the data record fields in order to remove certain features which may add “noise” to the classification process. The text pre-processing may also include steps for transforming the text in such a way as to allow the machine learning classifier to achieve a better classification. Examples of features which may be addressed in the pre-processing stage may include, but are not limited to, typos which may be corrected or removed, abbreviations which may be replaced, and data transformations such as date format normalization. Data modelling relates to transforming the records into a matrix. For example, a row of the matrix may represent a data record, and a column may represent a field of the data record corresponding to a specific feature. Model generation relates to using one or more machine learning algorithms to create a model based on the output matrix from the data modelling step.
The data structure of the input data records may have an impact on each of these steps.
Therefore, the present invention transforms the input data records into a common format in order for the machine learning process to deal with input data records provided in different data structures and formats.
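As a minimal sketch of the text pre-processing step described above, the following Python replaces abbreviations and normalises dates. The abbreviation table, the day/month/year input convention and all function names are assumptions made for illustration; the specification does not prescribe any of them.

```python
import re
from datetime import date

# Hypothetical abbreviation table; a real deployment would supply its own.
ABBREVIATIONS = {"rm": "room", "mgr": "manager"}

# Assumes dates are written day/month/year with a four-digit year.
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def expand_abbreviations(text: str) -> str:
    """Replace whole-word abbreviations found in the lookup table."""
    def swap(match):
        word = match.group(0)
        return ABBREVIATIONS.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", swap, text)


def normalise_dates(text: str) -> str:
    """Rewrite day/month/year dates in ISO 8601 form (YYYY-MM-DD)."""
    def to_iso(match):
        day, month, year = (int(part) for part in match.groups())
        try:
            return date(year, month, day).isoformat()
        except ValueError:
            return match.group(0)  # leave impossible dates untouched
    return DATE_PATTERN.sub(to_iso, text)


def preprocess(text: str) -> str:
    """Apply both illustrative pre-processing steps in sequence."""
    return normalise_dates(expand_abbreviations(text))
```

Under these assumptions, `preprocess("The rm was cold on 3/1/2017")` yields a string with the abbreviation expanded and the date in ISO form, while a date such as 31/02/2017 is left unchanged.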
Therefore, the embodiments of the present invention provide a flexible system which may be used for different situations and by different users. They may handle data structures which change over time, for example the different structures shown in Figures 1a and 1b. Indeed, the data records may be provided to the system in any format, and the embodiments of the present invention provide for each of the data records to be processed into a common format.
Figure 2 illustrates a system arranged to perform a method embodying the present invention. The system includes a component referred to herein as the “data record abstractor” 204. As described in more detail below, the data record abstractor abstracts the disparities in the different data structures in order to allow the data to be input from a data structure of any format. This enables the system to ingest data of any structure, while still being able to run all of the processes and algorithms that are required for prediction, model validation, and retraining.
The method may make use of Active Learning in which the system may be configured to interact with a user, wherein the user may provide labels to some data in order to improve predictive models. Active Learning may provide the user with an invitation to classify certain records in a certain order to reduce the labeller’s time and to improve the performance of predictive models. The active learning system may select records for the user to classify. In some examples, if a certain syntax has changed its semantic meaning over time, the system may invite a user to classify certain records and, if necessary, generate new classes. Therefore, the system may determine whether data must be classified by a user. The system may be used with or without Active Learning.
That is, other types of machine learning may be used together with embodiments of the present invention. However, combining the method alongside Active Learning may enable machine learning classification to be simplified and may improve the automation of the process.
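One way the selection of records for the user to classify could work is least-confidence sampling, a common Active Learning query strategy. The specification does not state which strategy is used, so the sketch below is an assumption, as are the function name and the shape of the `probabilities` argument (record id mapped to per-class probabilities from some trained model).

```python
from typing import Dict, List


def least_confident(probabilities: Dict[str, Dict[str, float]], k: int = 1) -> List[str]:
    """Return the ids of the k records whose highest class probability is lowest.

    These are the records the model is least sure about, so a human label
    for them is assumed to be the most informative.
    """
    ranked = sorted(probabilities, key=lambda rid: max(probabilities[rid].values()))
    return ranked[:k]
```

Records returned by such a function would be the ones offered to the labeller first.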
The system components and their functions will now be described. The system may comprise a number of inputs, 201, 202, 203. These inputs may comprise, for example, data record pre-processing instructions 201, data record definition 202, and/or a data record view 203.
The data record pre-processing instructions 201 may comprise instructions for pre-processing data records defined in a particular way, i.e. data records having a particular structure. Data record pre-processing instructions 201 may be provided to the system in any suitable way, such as by being input by a user, for example, or extracted from a computer memory. The system may select an appropriate pre-processing instruction for a corresponding data field.
The data record definition 202 includes a definition of data fields to be used in the machine learning process. The data record definition 202 may have a corresponding data record pre-processing instruction 201 for different fields to be classified, in order that the system can identify the relevant information from the data records and obtain the relevant instructions for performing pre-processing on the data records. The data record definitions may be provided to the system in any suitable way, such as by being input by a user, for example, or extracted from a computer memory.
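The pairing of a data record definition with per-field pre-processing instructions might be represented as below. This is an illustrative sketch only: the class names `FieldDefinition` and `DataRecordDefinition`, the `select` method and the choice of plain callables as instructions are assumptions, not structures taken from the specification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FieldDefinition:
    # Name of the field as it appears in incoming records.
    name: str
    # Pre-processing instructions for this field, applied in order.
    instructions: List[Callable[[str], str]] = field(default_factory=list)


@dataclass
class DataRecordDefinition:
    # The fields to be used in the machine learning process.
    fields: List[FieldDefinition]

    def select(self, record: Dict[str, object]) -> Dict[str, str]:
        """Pick the defined fields by name and apply their instructions."""
        selected = {}
        for fd in self.fields:
            value = str(record.get(fd.name, ""))
            for step in fd.instructions:
                value = step(value)
            selected[fd.name] = value
        return selected
```

Because fields are addressed by name rather than by position, a definition naming only the comments field gives the same result for a record shaped like Figure 1a and for one shaped like Figure 1b with the extra “food” field.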
The data record view 203 may be associated to the data record definition and may comprise information on the fields of the data records which may be required to be presented to a user in order for a user to validate or label data in the relevant fields. Therefore, if a data record is presented to a user, the data record view contains the information required to graphically present specific fields of the data records to a user.
The system may comprise a user interface to enable a user to input information and/or instructions to the system. For example, the data record pre-processing instructions may be input by a user through the user interface. The data record definition may be input by a user through the user interface. The data record view may be provided to the user via the user interface.
The data record abstractor 204 receives the data record pre-processing instructions 201, data record definition 202, and data record view 203 corresponding to input data records. The data record definition 202 is converted by the data record abstractor 204 into a format which can be considered as an abstraction or common representation of a structure to be applied to diverse data records. Data in the abstracted format can be processed by a machine learning process. The abstracted format represents predetermined fields that can be added or removed based on the content of incoming data records and processing instructions, and the predetermination of fields is such that the addition or removal of such predetermined fields is an expected variation to be handled by the machine learning process. In other words, the machine learning process is such that it is arranged to be able to process data in any abstracted format, namely a restricted group of combinations of different fields. Due to the N:1 mapping of data record definitions to the abstracted format input, it is not necessary to configure the machine learning process to handle data records in any unrestricted format, and so it is not necessary to define a machine learning process for any format of input data, but only for abstracted formats. The modification of the abstracted format by addition or removal of predetermined fields thus represents a predetermined restricted set of variations which provides full flexibility to the machine learning process without requiring undue complexity.
Features are extracted from the input data records and arranged into the abstracted format and the data record abstractor 204 therefore determines the relevant information required for further processing of the data records, in order that the data records may be processed in the machine learning system. The data record abstractor 204 uses the received information in order to drive the subsequent components of the system in the correct manner, and is thus a dynamic component interfacing between an unrestricted input and a set of user-definable outputs, i.e. selected data fields. As the data record definition is abstracted, the corresponding processing instructions can be mapped on to the corresponding features of the input data records, regardless of the format in which the data record definitions are input.
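The N:1 mapping of data record definitions onto an abstracted format can be pictured with a small sketch. Everything named here is hypothetical: the abstract field names `text` and `rating`, the two example definitions and the `abstract` function are illustrative assumptions, not the abstractor's actual design.

```python
from typing import Dict

# Hypothetical predetermined fields of the abstracted format.
ABSTRACT_FIELDS = ("text", "rating")

# Hypothetical record definitions, each mapping its own source field
# names onto the shared abstract fields (many definitions, one format).
DEFINITIONS: Dict[str, Dict[str, str]] = {
    "hotel_review": {"comments": "text", "service": "rating"},
    "maintenance_report": {"technician_notes": "text"},
}


def abstract(record: Dict[str, object], definition_name: str) -> Dict[str, object]:
    """Map a record onto the abstracted format using its definition.

    Source fields that the definition does not mention are ignored, and
    abstract fields with no source value are simply absent, which mirrors
    the "addition or removal of predetermined fields" described above.
    """
    abstracted = {}
    for source_name, abstract_name in DEFINITIONS[definition_name].items():
        if abstract_name not in ABSTRACT_FIELDS:
            raise ValueError(f"unknown abstract field: {abstract_name}")
        if source_name in record:
            abstracted[abstract_name] = record[source_name]
    return abstracted
```

Downstream components then only need to handle combinations of the predetermined abstract fields, whatever the original record structure was.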
The dynamic data record pre-processor 206 receives data records 207 which may be accompanied by corresponding labels. The dynamic data record pre-processor 206 may be configured to apply the relevant pre-processing steps for received data records 207. The pre-processing for each data record is performed in accordance with the pre-processing information 201 associated with the data record definition 202. The data records 207 may be provided in a plurality of different formats and/or may relate to different subject matters.
If required, data from the data record may be presented to the user in order for the user to interpret and validate or label the data. The dynamic data record renderer 205 displays the data to the user in accordance with the data record view for the given data record.
The dynamic data record pre-processor 206 provides the pre-processed data to the dynamic text document generator 208. This converts the pre-processed data into a common format required for input into the data modeller 209. For example, the dynamic text document generator 208 generates a text document from the pre-processed data in a format suitable for the machine learning analysis to be applied. The dynamic text document generator 208 may, for example, be configured to merge and concatenate certain fields of the data record, or corpuses of text. The pre-processed data records are converted into the common format ready for input into the data modeller 209. The common format is determined according to the data record definition 202.
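The merging and concatenation of fields into a single text document could look like the following sketch. The function name, the newline separator and the rule of skipping empty or missing fields are assumptions for illustration.

```python
from typing import Dict, Sequence


def generate_document(record: Dict[str, str], fields: Sequence[str], separator: str = "\n") -> str:
    """Concatenate the chosen fields of a record, in order, into one document.

    Fields that are missing or blank are skipped so that records with
    different structures still produce a well-formed document.
    """
    parts = [record[name].strip() for name in fields if record.get(name, "").strip()]
    return separator.join(parts)
```

For the maintenance example, a customer report and a technician report held in separate fields would be joined into one document, while unrelated fields are left out.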
The data modeller 209 transforms the processed data records output from the dynamic text document generator into a matrix. Machine learning is applied to the matrix. The matrix comprises a number of rows and columns, where each row represents a data record and the columns represent features of the data records set out in the text document output from the dynamic text document generator 208.
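As an illustration of rows representing records and columns representing features, the sketch below builds a simple term-count (bag-of-words) matrix. The specification does not say which features the data modeller uses, so the term-count representation, the tokenisation rule and the function name are all assumptions.

```python
import re
from typing import List, Tuple


def build_matrix(documents: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokenised = [re.findall(r"[a-z0-9']+", doc.lower()) for doc in documents]
    vocabulary = sorted({token for tokens in tokenised for token in tokens})
    column = {token: i for i, token in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenised:
        row = [0] * len(vocabulary)
        for token in tokens:
            row[column[token]] += 1  # count occurrences of each term
        matrix.append(row)
    return vocabulary, matrix
```

The resulting matrix has the shape described above and could be passed to any classifier that accepts numeric feature vectors.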
The matrix output from the data modeller 209 is input to a classification model generator 210. The classification model generator applies one or more machine learning algorithms to create at least one model based on the matrix output from the data modeller 209. The classification model generator 210 outputs labelled data 211.
In addition, the classification model generator may be configured to output a recommendation 212 of a data record for a human operator to label in order to improve the model, in accordance with an active learning process. The classification model generator 210 may perform processes such as cross-validation of the generated model.
The data record abstractor 204, dynamic data record pre-processor 206 and dynamic text document generator 208 when used together with Active Learning provide for a practical Active Learning system which allows the system 1 to classify data records having different formats by keeping an association between each data record and its definition. In this way, the system 1 is able to combine all data records even when the data records are in different formats, for example if they include different fields, during the classification process. Also as each data record definition 202 has specific data record pre-processing instructions 201 associated with it, the system 1 is able to determine how to process each data record based on the data record definition 202. In addition, in order to allow a user to provide true labels for each data record, even data records containing different fields, each data record definition 202 has an associated data record view 203 which may be viewed by a user on a data record view page.
The system may be fully adaptable for commercial use. For example, the system may be adapted for big-data capabilities. In some embodiments, the system may comprise a distributed system to allow different labellers in different locations to receive different records to classify from the system.
In some embodiments, the system may be configured to target specific classes for Active Learning to operate on and improve.
The system may also be configured to target data records corresponding to specific time periods. For example, the system may be configured to generate models based on data records corresponding to the present and exclude data records that are older than a given date.
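Excluding records older than a given date could be sketched as a simple filter. The record layout (a dictionary with a `date` key) and the function name `select_recent` are assumptions for illustration only.

```python
from datetime import date


def select_recent(records, cutoff):
    """Keep only records dated on or after the cutoff date."""
    return [record for record in records if record["date"] >= cutoff]


# Hypothetical records; only those from 2017 onwards are used for modelling.
records = [
    {"id": 1, "date": date(2016, 3, 1)},
    {"id": 2, "date": date(2017, 9, 1)},
]
recent = select_recent(records, cutoff=date(2017, 1, 1))
```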
The system may be adaptable such that a user such as a data scientist may select or add different machine learning algorithms and/or pre-existing models, for example in order to analyse data in different ways or to provide for further analyses.
The system may include further pre-processing algorithms which could be used to suggest or generate labels automatically at the start where there are not yet any labels.
The system may be configured to indicate to a user when it detects that there may be an entirely new class. This may provide for an “early warning” about new issues that have not previously arisen.
The system may be configured to allow a user to highlight certain portions of text within a corpus of text and to add a label or class. This may enable more accurate training of the machine learning models where the data records may include large corpuses of text and/or multiple labels which may be assigned to a corpus of text.
Therefore, the performance of the models may be improved. The system may allow a user to view and edit labels that have been assigned manually or automatically.
The system may output confusion matrices to a user. The confusion matrices may indicate how well particular classes or labels are performing. The user may be able to select, inspect, and relabel incorrectly classified records such as false positives or false negatives in order to retrain and improve the models. For example, the system may comprise a user interface via which a user may provide an input in order to manipulate classes and/or records.
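A confusion matrix and the list of misclassified records can be computed as in the following minimal sketch. The label names and the helper names `confusion_matrix` and `misclassified` are hypothetical; the description does not state how the system computes or presents these.

```python
from collections import defaultdict


def confusion_matrix(true_labels, predicted_labels):
    """Count (true label, predicted label) pairs."""
    counts = defaultdict(int)
    for true, predicted in zip(true_labels, predicted_labels):
        counts[(true, predicted)] += 1
    return dict(counts)


def misclassified(true_labels, predicted_labels):
    """Return indices of records a user might inspect and relabel."""
    pairs = zip(true_labels, predicted_labels)
    return [i for i, (true, predicted) in enumerate(pairs) if true != predicted]


true_labels = ["leak", "noise", "leak", "noise"]
predicted_labels = ["leak", "leak", "leak", "noise"]
matrix_counts = confusion_matrix(true_labels, predicted_labels)
to_review = misclassified(true_labels, predicted_labels)
```

Off-diagonal entries such as `("noise", "leak")` identify the false positives and false negatives that a user may wish to relabel.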
In response to classes and records being manipulated by a user, the system may be configured to automatically trigger remodelling. For example, a class may be split, or a number of classes may need to be merged. For this purpose, a user interface may be provided in order to allow a user to manipulate the classes and records. Classes within a hierarchy may also therefore be easily manipulated.
Classes labelled from pre-existing data may be imported along with training data in order to automatically generate an initial model.
The assignment of a class label to data may be based on a threshold criterion for the classification confidence coming from the machine learning model.
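Threshold-based label assignment might look like the sketch below. The threshold value of 0.8, the function name `assign_label` and the use of `None` for "no label assigned" are assumptions; the description does not give a specific criterion.

```python
def assign_label(class_probabilities, threshold=0.8):
    """Return the most probable class if its confidence meets the threshold.

    Returns None when the model is not confident enough, in which case the
    record could instead be passed to a user for labelling.
    """
    label, confidence = max(class_probabilities.items(), key=lambda item: item[1])
    return label if confidence >= threshold else None


confident = assign_label({"leak": 0.92, "noise": 0.08})
uncertain = assign_label({"leak": 0.55, "noise": 0.45})
```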
The system may comprise a memory comprising computer readable instructions and a processor. The instructions may be executed by the processor in order to perform required steps for training and/or classifying data, such as, for example, the steps described with reference to Figure 3. The system may comprise a computer readable medium having computer readable code stored thereon which, when executed by at least one processor, causes the performance of any of the operations described with reference to Figure 3.
Figure 3 is a flow chart illustrating steps of a method which may be performed by the system 1.
In step S301, the method comprises receiving a data record definition and corresponding processing instruction for a plurality of data records. The data records may be provided in a plurality of different formats. As described above with reference to Figure 2, the data record abstractor may be configured to receive the data record definition and the pre-processing instructions. The data records may comprise unstructured text, or may comprise data fields which change over time. The data record abstractor may store the data record definitions and data record pre-processing instructions and extract the corresponding features from the data records.
In step S302, a common format is determined based on the data record definition for determining the inputs to a machine learning process.
In step S303, the method comprises processing the data records according to the corresponding processing instruction. As described above with reference to Figure 2, the dynamic data record pre-processor may be configured to process each data record according to the pre-processing instruction. Because the data record definition is abstracted in the data record abstractor, the system is capable of extracting the corresponding features from the data records and applying the pre-processing instruction to the relevant data fields of the input data regardless of the input format of the data record.
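The association between a data record definition and its pre-processing instructions can be sketched as a lookup table that drives the processing of each field. The definition names, field names and instruction names below are hypothetical and are not taken from the description.

```python
# Hypothetical pre-processing operations, keyed by instruction name.
PREPROCESSORS = {"lowercase": str.lower, "strip": str.strip}

# Hypothetical data record definitions: field name -> ordered instructions.
DEFINITIONS = {
    "support_ticket": {"subject": ["strip", "lowercase"], "body": ["strip"]},
    "survey": {"comment": ["strip", "lowercase"]},
}


def preprocess(record, definition_name):
    """Apply the instructions attached to a definition to the matching fields."""
    processed = {}
    for field, steps in DEFINITIONS[definition_name].items():
        value = record.get(field, "")
        for step in steps:
            value = PREPROCESSORS[step](value)
        processed[field] = value
    return processed


ticket = preprocess({"subject": "  Pump LEAK ", "body": " Water on floor "}, "support_ticket")
survey = preprocess({"comment": " Very NOISY "}, "survey")
```

Because the instructions are looked up through the definition, records with different fields pass through the same function without format-specific code.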
In step S304, the pre-processed data is converted into the determined common format. In general, the pre-processed data records may be processed into text documents suitable for applying a machine learning process, in order to input the data records into a data modeller and a classification model generator. The determined common format allows the data to be input into a data modeller regardless of the input format of the data. The text document is generated based on the determined common format.
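One possible common format is a flat text document in which every token is prefixed with its definition and field name, so that records with different fields can share a single feature space. This encoding and the function name `to_text_document` are assumptions made for illustration; the description does not specify the layout of the text document.

```python
def to_text_document(processed_record, definition_name):
    """Flatten a processed record into a single text document.

    Assumption: each token is tagged as "<definition>.<field>:<token>" so the
    origin of every token is preserved in the common format.
    """
    parts = []
    for field in sorted(processed_record):
        for token in processed_record[field].split():
            parts.append(f"{definition_name}.{field}:{token}")
    return " ".join(parts)


document = to_text_document({"subject": "pump leak", "body": "wet floor"}, "ticket")
```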
A data modeller may arrange the processed data records output from the text document generator into a matrix. The matrix may then be input to a classification model generator in order to generate a trained data model in accordance with the Active Learning process.
In step S305, the method comprises classifying the data records in the common format using the data model which has been trained according to the machine learning process. The method may comprise providing a recommendation to a user for a data record to be classified by the user, in accordance with a machine learning process.
If a data record comprises data which does not belong to an existing class, the system may provide an indication that the data record does not correspond to an existing class. The data record in question may then be presented to a user in order for the user to label the data.
The present invention therefore provides for a system and a method that is easy to use and adaptable to a variety of situations. The system requires little set up, and it is constantly analysed for effectiveness. The system may therefore be used by users who do not have specialist knowledge in the field of data science, while providing accurate and detailed information on which further business analysis may be performed. The system and method provide the technical benefit of enabling data in a variety of formats to be utilised in a machine learning process.
The present invention has been described with reference to a number of exemplary embodiments and examples. It should be appreciated that the particular embodiments shown and described herein are illustrative of the invention and are not intended to limit in any way the scope of the invention as set forth in the claims. It will be recognised that changes and modifications may be made to the exemplary embodiments without departing from the scope of the present invention. These and other changes or modifications are intended to be included within the scope of the present invention, as expressed in the following claims.

Claims (26)

1. A computer implemented method of data classification comprising:
receiving a data record definition and corresponding processing instruction for a plurality of data records;
determining a common data format according to the received data record definition for determining the inputs for a machine learning process;
receiving one or more data records and processing each of the one or more data records in accordance with the processing instruction;
converting the one or more processed data records into the determined common format; and classifying the one or more converted data records using a trained data model, wherein said trained data model is trained in accordance with the machine learning process.
2. A computer implemented method comprising:
receiving a plurality of training data records, wherein at least some of the training data records are not labelled;
labelling the data records which are not labelled in accordance with a machine learning process;
receiving a data record definition and corresponding processing instruction for the plurality of data records;
determining a common data format according to the received data record definition for determining the inputs for a machine learning process;
processing the data records in accordance with the processing instruction;
converting the one or more processed data records into the determined common format; and generating a trained data model for classifying data records, comprising training said data model based on the converted data records.
3. A method according to claim 2, wherein training the data model comprises using a machine learning process.
4. A computer implemented method according to any preceding claim, wherein the data model is trained in accordance with an Active Learning process.
5. A method according to any preceding claim, wherein the plurality of data records comprise at least one of unstructured text and different data fields.
6. A method according to any preceding claim, wherein the data record definition is abstracted in order to determine the common data format.
7. A method according to any preceding claim comprising providing an indication that a data record comprises data that does not correspond to an existing class, and presenting said data record to a user to be labelled.
8. A method according to any preceding claim wherein processing the data records comprises generating a text document suitable for input into a machine learning process.
9. A method according to any preceding claim, comprising receiving a data record definition from a user via a user interface.
10. A method according to any preceding claim, comprising receiving a user input via a user interface to manipulate classes and/or records.
11. A method according to claim 10, further comprising automatically re-generating a trained data model in response to receiving an input from a user to manipulate classes and/or records.
12. Apparatus configured to perform the method according to any of claims 1 to 11.
13. Computer readable instructions which, when executed by computing apparatus, cause the computing apparatus to perform a method according to any of claims 1 to 11.
14. Apparatus comprising:
at least one processor; and at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to perform:
receiving a data record definition and corresponding processing instruction for a plurality of data records;
determining a common data format according to the received data record definition for determining the inputs for a machine learning process;
receiving one or more data records and processing each of the one or more data records in accordance with the processing instruction;
converting the one or more processed data records into the determined common format; and classifying the one or more converted data records using a trained data model, wherein said trained data model is trained in accordance with the machine learning process.
15. Apparatus comprising: at least one processor; and at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to perform:
receiving a plurality of training data records, wherein at least some of the training data records are not labelled;
labelling the data records which are not labelled in accordance with a machine learning process;
receiving a data record definition and corresponding processing instruction for the plurality of data records;
determining a common data format according to the received data record definition for determining the inputs for a machine learning process;
processing the data records in accordance with the processing instruction;
converting the one or more processed data records into the determined common format; and generating a trained data model for classifying data records, comprising training said data model based on the converted data records.
16. Apparatus according to claim 15, wherein training the data model comprises using a machine learning process.
17. Apparatus according to claim 14, 15, or 16, wherein the data model is trained in accordance with an Active Learning process.
18. Apparatus according to any of claims 14 to 17, wherein the plurality of data records comprise at least one of unstructured text and different data fields.
19. Apparatus according to any of claims 14 to 18, wherein the data record definition is abstracted in order to determine the common data format.
20. Apparatus according to any of claims 14 to 19, wherein the computer program code, when executed by the at least one processor, causes the apparatus to perform: providing an indication that a data record comprises data that does not correspond to an existing class, and presenting said data record to a user to be labelled.
21. Apparatus according to any of claims 14 to 20, wherein processing the data records comprises generating a text document suitable for input into a machine learning process.
22. Apparatus according to any of claims 14 to 21, wherein the computer program code, when executed by the at least one processor, causes the apparatus to perform: receiving a data record definition from a user via a user interface.
23. Apparatus according to any of claims 14 to 22, wherein the computer program code, when executed by the at least one processor, causes the apparatus to perform: receiving a user input via a user interface to manipulate classes and/or records.
24. Apparatus according to claim 23, wherein the computer program code, when executed by the at least one processor, causes the apparatus to perform: automatically re-generating a trained data model in response to receiving an input from a user to manipulate classes and/or records.
25. A computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing the performance of at least:
receiving a plurality of training data records, wherein at least some of the training data records are not labelled;
labelling the data records which are not labelled in accordance with a machine learning process;
receiving a data record definition and corresponding processing instruction for the plurality of data records;
determining a common data format according to the received data record definition for determining the inputs for a machine learning process;
processing the data records in accordance with the processing instruction;
converting the one or more processed data records into the determined common format; and generating a trained data model for classifying data records, comprising training said data model based on the converted data records.
26. Apparatus comprising means for: receiving a plurality of training data records, wherein at least some of the training data records are not labelled;
labelling the data records which are not labelled in accordance with a machine learning process;
receiving a data record definition and corresponding processing instruction for the plurality of data records;
determining a common data format according to the received data record definition for determining the inputs for a machine learning process;
processing the data records in accordance with the processing instruction;
converting the one or more processed data records into the determined common format; and generating a trained data model for classifying data records, comprising training said data model based on the converted data records.
GB1715752.0A 2017-09-28 2017-09-28 Methods and apparatuses relating to data classification Withdrawn GB2567148A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1715752.0A GB2567148A (en) 2017-09-28 2017-09-28 Methods and apparatuses relating to data classification
PCT/GB2018/052772 WO2019064016A1 (en) 2017-09-28 2018-09-28 Methods and apparatuses relating to processing heterogeneous data for classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1715752.0A GB2567148A (en) 2017-09-28 2017-09-28 Methods and apparatuses relating to data classification

Publications (2)

Publication Number Publication Date
GB201715752D0 GB201715752D0 (en) 2017-11-15
GB2567148A true GB2567148A (en) 2019-04-10

Family

ID=60270320

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1715752.0A Withdrawn GB2567148A (en) 2017-09-28 2017-09-28 Methods and apparatuses relating to data classification

Country Status (2)

Country Link
GB (1) GB2567148A (en)
WO (1) WO2019064016A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9122681B2 (en) * 2013-03-15 2015-09-01 Gordon Villy Cormack Systems and methods for classifying electronic information using advanced active learning techniques
US9390378B2 (en) * 2013-03-28 2016-07-12 Wal-Mart Stores, Inc. System and method for high accuracy product classification with limited supervision
US10262272B2 (en) * 2014-12-07 2019-04-16 Microsoft Technology Licensing, Llc Active machine learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
None *

Also Published As

Publication number Publication date
GB201715752D0 (en) 2017-11-15
WO2019064016A1 (en) 2019-04-04

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)