WO2022144848A1

WO2022144848A1 - System and method for predicting an overall similarity score between two primary entities of a data lake

Info

Publication number: WO2022144848A1
Application number: PCT/IB2021/062515
Authority: WO
Inventors: Malik SOUDED
Original assignee: Alten
Priority date: 2020-12-31
Filing date: 2021-12-31
Publication date: 2022-07-07
Also published as: EP4272090A1; WO2022144852A1; EP4272089A1

Abstract

One of the aims of the invention is to provide an objective and reproducible tool for quantifying redundancy in a data lake. To achieve this, the inventors propose training a machine learning model, using existing data lakes, to predict an overall similarity score which is representative of the similarity between two data entities of a data lake. In practice, instead of comparing each of the data fields of the data entities, the invention proposes determining the overall similarity score from intermediate similarity scores which are calculated for random samples of the data entities.

Description

SYSTEM AND METHOD FOR PREDICTING AN OVERALL SIMILARITY SCORE BETWEEN TWO PRIMARY ENTITIES OF A DATA LAKE

The invention relates to the field of quantifying the similarity of data entities in a data lake. In particular, it relates to a system and a method for predicting an overall similarity score between two primary entities of a data lake. It also relates to a system and method for training a machine learning model that is intended to predict an overall similarity score.

The increase in the volume of digital data has enabled the development of technologies related to big data.

The heterogeneous nature of these digital data, as well as their diverse sources, has required changes to traditional ways of storing data.

It is in this context that data lakes have been introduced.

However, in these data lakes, it is difficult to analyze the data because of the presence of redundancies of data fields which, for example, make it complex to extract subsets of particular data fields.

Indeed, these redundancies increase the data access times, weaken the integrity of the data and prevent the maintenance of data consistency.

Thus, there is a need to quantify redundancy in data lakes.

The invention aims to solve, at least partially, this need.

The invention relates in particular to a method for predicting an overall similarity score which is representative of the similarity between a first primary data entity and a second primary data entity of a data lake, each primary data entity comprising a plurality of data fields.

In particular, the method comprises:
- a step of extracting, for each primary data entity, a plurality of data field characteristics from the content of the plurality of data fields according to at least one predetermined extraction criterion,
- a step of association, for each primary data entity, of each characteristic with, on the one hand, a reference to the set of data fields from which the characteristic is extracted and, on the other hand, the quantity of data fields from which the characteristic is extracted, so as to form a table of characteristics,
- a generation step, for each primary data entity, of the same predetermined number of at least two random samples, called secondary data entities, using stratified sampling, from the associated characteristic table and from an associated predetermined sampling probability, the predetermined sampling probability being, on the one hand, different for the secondary data entities which come from the same primary data entity and, on the other hand, common for pairs of secondary data entities which are derived from the first primary data entity and the second primary data entity,
- a step of defining a plurality of pairs of secondary data entities which are associated with the same predetermined sampling probability, each pair of secondary data entities comprising a first secondary data entity derived from the first primary data entity and a second secondary data entity derived from the second primary data entity,
- a step of calculating, for each pair of secondary data entities, an intermediate similarity score which is representative of the similarity between the first secondary data entity and the second secondary data entity,
- a step of forming a vector comprising a plurality of vector elements, each vector element comprising, for each pair of secondary data entities, the intermediate similarity score and the value of the associated predetermined probability, and
- a prediction step, from the vector and from at least one trained machine learning model, called trained model, to predict an overall similarity score which is representative of the similarity between the first primary data entity and the second primary data entity.

The invention also covers a method of training a machine learning model intended to predict an overall similarity score which is representative of the similarity between a first primary data entity and a second primary data entity of a data lake, each primary data entity comprising a plurality of data fields.

In particular, the method comprises:
- a first step of calculating a plurality of global similarity scores, from a plurality of pairs of primary data entities of at least one data lake, each global similarity score being representative of the similarity between a first primary data entity and a second primary data entity of the plurality of primary data entities,
- a step of extracting, for each primary data entity of each pair of primary data entities, a plurality of characteristics of data fields from the content of the plurality of data fields according to at least one predetermined extraction criterion,
- a step of association, for each primary data entity of each pair of primary data entities, of each characteristic with, on the one hand, the set of data fields from which the characteristic is extracted and, on the other hand, the quantity of data fields from which the characteristic is extracted, so as to form a table of characteristics,
- a generation step, for each primary data entity of each pair of primary data entities, of the same predetermined number of at least two random samples, called secondary data entities, using stratified sampling, from the associated characteristic table and from an associated predetermined sampling probability, the predetermined sampling probability being, on the one hand, different for the secondary data entities which come from the same primary data entity and, on the other hand, common for pairs of secondary data entities which come from the first primary data entity and from the second primary data entity,
- a step of defining, for each primary data entity of each pair of primary data entities, a plurality of pairs of secondary data entities which are associated with the same predetermined sampling probability, each pair of secondary data entities comprising a first secondary data entity from the first primary data entity and a second secondary data entity from the second primary data entity,
- a second step of calculating, for each pair of secondary data entities, an intermediate similarity score which is representative of the similarity between the first secondary data entity and the second secondary data entity,
- a first step of forming, for each pair of primary data entities, a first vector comprising a plurality of vector elements, each vector element comprising, for each pair of secondary data entities, the intermediate similarity score and the value of the associated predetermined probability,
- a second step of forming, for each global similarity score, a second vector comprising the global similarity score, and
- training a machine learning model comprising an input and an output which are associated with the same pair of primary data entities, the input being configured to receive the first vector and the output being configured to receive the second vector.

In a first embodiment, the predetermined extraction criterion is chosen from: a metadata attribute associated with the data fields, a characteristic associated with the content of the data fields, and any combinations thereof.

In a first implementation of the first embodiment, the metadata attribute associated with the data fields is chosen from: a type, a size in memory, an import date, a production date, an update date, an identifier of the original data source, and any combination thereof.

In a second implementation of the first embodiment, the characteristic associated with the content of the data fields is calculated.

In a first example of the second implementation of the first embodiment, when the data field is of the character string type, the characteristic associated with the content of the data fields is a length.

In a first example of the second implementation of the first embodiment, when the data field is of the character string type, the characteristic associated with the content of the data fields is an n-gram of characters of at least one predetermined length.

In a first aspect of the first example of the second implementation of the first embodiment, the length of the character n-gram is fixed and/or variable.

In a second aspect of the first example of the second implementation of the first embodiment, the method further comprises in the association step the formation of a single characteristic class which groups the characteristic classes whose quantity of data fields is below a predetermined threshold.

In a second embodiment, the calculation step comprises the use of a method chosen from: a Levenshtein ratio, a Jaccard index, a Jaro-Winkler index, a cosine similarity and any combination thereof.

The invention also covers a system for predicting an overall similarity score that is representative of the similarity between a first primary data entity and a second primary data entity of a data lake, each primary data entity comprising a plurality of data fields.

In particular, the system includes:
- at least one data storage device configured to store instructions for predicting the global similarity score, and
- at least one processor configured to execute the instructions to implement a method according to the invention.

The invention also covers a system for training a machine learning model for predicting an overall similarity score that is representative of the similarity between a first primary data entity and a second primary data entity of a data lake, each primary data entity comprising a plurality of data fields.

In particular, the system includes:
- at least one data storage device configured to store instructions for training a machine learning model, and
- at least one processor configured to execute the instructions to implement a method according to the invention.

Other characteristics and advantages of the invention will be better understood on reading the following description and with reference to the appended drawings, given by way of illustration and in no way limiting.

The appended figures respectively represent:
- an embodiment of a prediction method according to the invention,
- an embodiment of a table of characteristics according to the invention,
- an embodiment of a stratified sampling according to the invention,
- an embodiment of a system for implementing the prediction method,
- an embodiment of a training method according to the invention, and
- an embodiment of a system for implementing the training method.

The figures are not necessarily drawn to scale, in particular as regards thickness, for purposes of illustration.

In the various figures, the dotted lines and arrows indicate optional elements, steps and sequences.

One of the goals of this invention is to provide an objective and reproducible tool to quantify redundancy in a data lake.

To this end, the inventors propose to train a machine learning model, from existing data lakes, to predict an overall similarity score which is representative of the similarity between two data entities of a data lake.

In practice, instead of comparing each of the data fields of the data entities of a particular data lake, the invention proposes to determine the overall similarity score from intermediate similarity scores which are calculated on random samples of the data lake data entities.

Thus, the invention relates to a method for predicting an overall similarity score which is representative of the similarity between a first primary data entity and a second primary data entity of a data lake.

In the invention, a data lake means a storage space where data is collected in its natural form, whether raw or transformed, for the purpose of analysis (e.g. the establishment of reports, visualizations or analytical structures) or action (e.g. machine learning).

Thus, a data lake can comprise primary data entities in different forms such as structured data (e.g. databases that include rows and columns), semi-structured data (e.g. files such as CSV, logs, XML or JSON), unstructured data (e.g. emails, PDF files), and binary data (e.g. image, audio or video files).

In practice, each primary data entity comprises a plurality of data fields.

As is known, each data field has a data type which can be chosen from: numeric types, time types and character string types.

In a first example, when the data type is numeric, it can be chosen from among the numeric types defined in the standards associated with relational databases of the SQL type, such as an integer or a decimal number and any combination of these.

However, depending on the data available, other numeric types may be used, without requiring substantial modifications to the invention.

In a second example, when the data type is temporal, it can be chosen from among the temporal types defined in the standards associated with relational databases of the SQL type, such as a date, a time, a day, a year, a minute, a second and any combination thereof.

However, depending on the data available, other time types may be used, without requiring substantial modifications to the invention.

In a third example, when the data type is a character string, it can be chosen from among the types of character strings defined in the standards associated with relational databases of the SQL type, such as an ASCII string, a binary string, an enumeration and any combination thereof.

However, depending on the data available, other types of character strings may be used, without requiring substantial modifications to the invention.

In a particular embodiment of the invention, it may be considered that all the fields of the data lake have the same type, for example the character string type.

For this, it is possible to use known techniques for converting data types.
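By way of illustration only, such a conversion can be sketched as follows. The helper name `to_text` and the chosen formats (ISO 8601 for temporal values, UTF-8 decoding for binary strings) are assumptions made for this sketch and are not specified by the invention.

```python
from datetime import date, datetime, time
from decimal import Decimal

def to_text(value):
    """Convert a numeric, temporal or string field value to a character string."""
    if value is None:
        return ""
    if isinstance(value, (datetime, date, time)):
        # Assumption: temporal values are rendered in ISO 8601 form.
        return value.isoformat()
    if isinstance(value, bytes):
        # Assumption: binary strings are decoded as UTF-8.
        return value.decode("utf-8", errors="replace")
    return str(value)

# Hypothetical field values of mixed types.
fields = [42, Decimal("3.50"), date(2021, 12, 31), "Alten", b"raw"]
as_text = [to_text(v) for v in fields]
```

Once every field is a character string, the string-based extraction criteria described below apply uniformly to all fields.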

Returning to the invention, as illustrated in the corresponding figure, the prediction method 100 comprises a step 110 of extracting, for each primary data entity, a plurality of characteristics of data fields from the content of the plurality of data fields.

In particular, the extraction is carried out according to at least one predetermined extraction criterion.

Thus, the invention also covers the performance of an extraction according to one or more predetermined extraction criteria.

In a particular embodiment, the predetermined extraction criterion is selected from: a metadata attribute associated with the data fields, a feature associated with the content of the data fields, and any combinations thereof.

In a first example, when the predetermined extraction criterion is a metadata attribute associated with the data fields, it is chosen from: a type, a size in memory, an import date, a production date, an update date, an identifier of the original data source, and any combination thereof.

In a known manner, the term “metadata” means data which is used to define or describe another data item. Thus, a metadata attribute is a property of metadata.

However, depending on the data available, other metadata attributes may be used, without requiring substantial modifications to the invention.

In a particular embodiment, when the predetermined extraction criterion is the size in memory occupied by the data fields, it will be possible to extract, from each primary data entity, the data fields so that each data field belongs to a predetermined class of memory occupation (e.g. a class chosen from: [1MB-10MB[, [10MB-100MB[, [100MB-200MB[, [200MB-1GB[, [1GB-100GB] and any combination thereof).

In a second example, when the predetermined extraction criterion is a characteristic associated with the content of the data fields, this is calculated.

In a first embodiment of the second example, when the data field is of the character string type, the characteristic associated with the content of the data fields is a length.

In a particular embodiment, when the data field is of the character string type and the predetermined extraction criterion is the length of the data fields, it will be possible to extract, from each primary data entity, the data fields so that each data field belongs to a predetermined class of character string length (e.g. a class chosen from: [1-10 characters[, [10-50 characters[, [50-100 characters[, [100-200 characters[ and any combination thereof).
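As a minimal sketch of this grouping, assuming the four half-open length classes given above, each field can be assigned to its class with a binary search over the class bounds. The names `BOUNDS`, `length_class` and `group_by_length`, and the use of list indices as field references, are assumptions for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

# Bounds of the half-open classes [1-10[, [10-50[, [50-100[, [100-200[.
BOUNDS = [1, 10, 50, 100, 200]

def length_class(text):
    """Return the length class label of a string, or None if it fits no class."""
    i = bisect_right(BOUNDS, len(text)) - 1
    if i < 0 or i >= len(BOUNDS) - 1:
        return None
    return f"[{BOUNDS[i]}-{BOUNDS[i + 1]}["

def group_by_length(fields):
    """Map each length class to the references (here, indices) of its fields."""
    classes = defaultdict(list)
    for ref, text in enumerate(fields):
        label = length_class(text)
        if label is not None:
            classes[label].append(ref)
    return dict(classes)

groups = group_by_length(["abc", "a" * 10, "b" * 49, "c" * 120, ""])
```

A lower bound is included in its class and an upper bound is excluded, which matches the half-open interval notation used above.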

In a second embodiment of the second example, still when the data field is of the character string type, the characteristic associated with the content of the data fields is an n-gram of characters (also called "bags of words" or "token based" in English) of at least a predetermined length.

In practice, an n-gram of characters is a succession of 'n' consecutive characters (e.g. three, four or five characters) of a character string.

In a first embodiment, the length of the character n-gram is fixed.

For example, the French word "texte" ("text") includes the following trigrams: "tex", "ext", and "xte".

In a second embodiment, the length of the n-gram of characters is variable.

For example, the French word "portefeuille" ("portfolio") includes the following trigrams: "por", "ort", "rte", "tef", "efe", "feu", "eui", "uil", "ill", "lle".

In addition, the word "portefeuille" includes the following quadrigrams: "port", "orte", "rtef", "tefe", "efeu", "feui", "euil", "uill", "ille".
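The extraction of character n-grams can be sketched with a sliding window, as below. The function names `char_ngrams` and `char_ngrams_variable` are assumptions for this sketch; the second one simply applies the first for several lengths, which is one possible reading of a variable n-gram length.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def char_ngrams_variable(text, lengths):
    """Return the n-grams of text for each requested length."""
    return {n: char_ngrams(text, n) for n in lengths}

trigrams = char_ngrams("portefeuille", 3)
by_length = char_ngrams_variable("portefeuille", (3, 4))
```

A string shorter than `n` yields an empty list, so short data fields contribute no n-gram of that length.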

Thus, it is possible to envisage extracting a part of the data field characteristics according to an n-gram of characters of a first fixed length, while another part of the data field characteristics can be extracted according to an n-gram of characters of a second fixed length which is different from the first fixed length.

Furthermore, it will be noted that in this second embodiment of the second example, when the data field is of the character string type and the predetermined extraction criterion is an n-gram of characters, it will be possible to extract, from each primary data entity, data fields so that each data field belongs to at least one class of character n-grams.

Indeed, in this particular embodiment, the classes of n-grams of characters may not be distinct, because a data field may comprise several n-grams of characters and thus appear in several classes of n-gram of characters.

Thus, in this particular embodiment, an internal, semantic, and common characteristic of the data fields links the classes of n-grams of characters together.

Returning to the prediction method 100, it comprises a step 120 of association, for each primary data entity, of each characteristic with, on the one hand, a reference to the set of data fields from which the characteristic is extracted and, on the other hand, the quantity of data fields from which the characteristic is extracted, so as to form a table of characteristics.

In practice, in the characteristics table, each row corresponds to a characteristic class.

In a first example, the reference to all of the data fields includes a plurality of data field references, each of the references referring to a single data field.

In a particular embodiment, a data field reference is a unique identifier of a data field.

In a second example, the quantity of data fields from which the characteristic is extracted is a statistical quantity.

For example, the quantity of data fields is a descriptive statistic chosen from: a count (absolute frequency) and a relative frequency.

The count of a characteristic means the number of observations of this characteristic in the associated table of characteristics.

The relative frequency of a characteristic means the ratio between the count of this characteristic and the total count of characteristics in the associated table of characteristics.

However, depending on the needs, the quantity of data fields can be obtained from other descriptive statistics measurements, without requiring substantial modifications of the invention.

Furthermore, in the context of this example, and optionally, the prediction method 100 further comprises, in the association step 120, the formation of a single particular feature class which groups together the feature classes whose quantity of data fields is below a predetermined threshold.

In this way, a particular feature class will be obtained which includes the feature classes which are least represented in each primary data entity.

For example, this particular characteristic class could include characteristics whose count is less than fifty or whose count is less than 10% of the total count.

Of course, depending on the needs, other threshold values may be considered, without requiring substantial modifications of the invention.
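This optional grouping can be sketched as follows, assuming a table that maps each characteristic to its count and its list of field references. The function name `merge_rare_classes`, the label `<other>`, and the choice to count the merged class by its number of distinct fields are assumptions of this sketch.

```python
def merge_rare_classes(table, threshold, label="<other>"):
    """Group every class whose count is below threshold into one class.

    table maps a characteristic to a (count, field_references) pair.
    """
    merged = {}
    rare_refs = set()
    for feature, (count, refs) in table.items():
        if count < threshold:
            rare_refs.update(refs)
        else:
            merged[feature] = (count, sorted(refs))
    if rare_refs:
        # Assumption: the merged class is counted by its distinct fields.
        merged[label] = (len(rare_refs), sorted(rare_refs))
    return merged

# Hypothetical table with one frequent and two rare n-grams.
table = {
    "cti": (17, list(range(17))),
    "coh": (2, [3, 20]),
    "ct": (1, [3]),
}
merged = merge_rare_classes(table, threshold=5)
```

Grouping the rare classes keeps the number of strata manageable when the table is later used for stratified sampling.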

The corresponding figure illustrates an example of a table of characteristics formed from character n-grams extracted from the content of the data fields of a primary data entity.

The table of characteristics 10 comprises a first column 11, a second column 12 and a third column 13.

The first column 11 comprises one line per n-gram of characters.

In the illustrated example, the character n-gram "coh" is three characters long, whereas the character n-gram "ct" is two characters long.

The second column 12 comprises, for each n-gram, the quantity of associated data fields.

In the illustrated example, the quantity of data fields is a count.

In particular, the character n-gram "cti" occurs seventeen times in the data fields of the primary data entity under consideration.

The third column 13 comprises, for each n-gram of characters, a list of references to all the data fields from which the n-gram of characters is extracted.

For example, the field with the reference '7' includes the character n-grams "ct", "cti", and "dir".
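A table with these three columns can be sketched as below. The function name `build_characteristic_table`, the sample field contents, and the choice to count each n-gram by the number of distinct fields containing it are assumptions; the source describes the columns but not an implementation.

```python
from collections import defaultdict

def build_characteristic_table(fields, lengths=(2, 3)):
    """Build rows (n-gram, count, field references) from field contents.

    fields maps a field reference to its string content.
    """
    refs_by_ngram = defaultdict(set)
    for ref, text in fields.items():
        for n in lengths:
            for i in range(len(text) - n + 1):
                refs_by_ngram[text[i:i + n]].add(ref)
    # Assumption: the count is the number of distinct fields per n-gram.
    return [
        (ngram, len(refs), sorted(refs))
        for ngram, refs in sorted(refs_by_ngram.items())
    ]

# Hypothetical contents for two fields; reference 7 echoes the example above.
fields = {7: "direction", 8: "action"}
table = build_characteristic_table(fields)
rows = {ngram: (count, refs) for ngram, count, refs in table}
```

Because one field contains several n-grams, the same reference appears in several rows, which is why the n-gram classes are not disjoint.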

Returning to the prediction method 100, it comprises a step 130 of generating, for each primary data entity, the same predetermined number of at least two random samples, called secondary data entities.

Thus, the invention also covers the generation of two or more secondary data entities.

Furthermore, the same number of secondary data entities is determined for each primary data entity.

For this, the generation step 130 uses stratified sampling, from the characteristic table associated with each primary data entity and from an associated predetermined sampling probability.

As is known, stratified sampling is a method of selecting samples from a population of study data. It is particularly useful if the phenomena studied are irregularly distributed or if the study data are very heterogeneous. In practice, study data is divided into several predefined subsets (also called "strata") that exhibit homogeneity with respect to the spatial distribution of relevant features and attributes. Next, independent samples are selected from each stratum, either randomly or systematically.

For example, consider a primary data entity that includes 50,000 data fields with a sampling probability of 40%.

Further, consider that a particular feature is represented by one thousand data fields, and the total count of all features is one million.

In this case, twenty data fields should be randomly drawn from the thousand fields (i.e., in detail: (50,000*0.4*1000)/1,000,000 = 20).
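The arithmetic above can be sketched as a per-stratum quota followed by a random draw inside each stratum. The names `stratum_quota` and `stratified_sample`, the rounding to the nearest integer, and the fixed seed are assumptions of this sketch.

```python
import random

def stratum_quota(n_fields, probability, stratum_count, total_count):
    """Number of fields to draw from one stratum (rounded, an assumption)."""
    return round(n_fields * probability * stratum_count / total_count)

def stratified_sample(table, n_fields, probability, seed=0):
    """Draw field references stratum by stratum.

    table maps a characteristic to the list of its field references.
    """
    rng = random.Random(seed)
    total = sum(len(refs) for refs in table.values())
    sample = set()
    for refs in table.values():
        quota = stratum_quota(n_fields, probability, len(refs), total)
        sample.update(rng.sample(refs, min(len(refs), quota)))
    return sample

# The worked example: 50,000 fields, 40%, a stratum of 1,000 out of 1,000,000.
quota = stratum_quota(50_000, 0.4, 1_000, 1_000_000)

# Hypothetical table with two disjoint strata of 100 and 50 fields.
table = {"a": list(range(100)), "b": list(range(100, 150))}
sample = stratified_sample(table, n_fields=150, probability=0.4)
```

Calling `stratified_sample` once per predetermined sampling probability would yield one secondary data entity per probability.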

In practice, in the invention, the predetermined sampling probability is, on the one hand, different for the secondary data entities which come from the same primary data entity.

Thus, if several secondary data entities are determined for a first primary data entity, each secondary data entity will be associated with a predetermined sampling probability which is different from that of the others.

On the other hand, the predetermined sampling probability is common for pairs of secondary data entities that are derived from the first primary data entity and the second primary data entity.

Thus, if several secondary data entities are determined for a first primary data entity and for a second primary data entity, it will be ensured that it is possible to form pairs of secondary data entities which come from the first primary data entity and the second primary data entity, wherein each pair comprises secondary data entities that are associated with the same predetermined sampling probability. However, it will be ensured that the predetermined sampling probability associated with each pair is different from that of the other pairs.

The corresponding figure illustrates an example of stratified sampling according to the invention.

First of all, this figure shows a first primary data entity 20 and a second primary data entity 30.

Then, the figure shows the generation, for each primary data entity 20, 30, of four random samples, so as to form secondary data entities.

In practice, the first secondary data entities 21, 22, 23 and 24 are random samples of the first primary data entity 20, while the second secondary data entities 31, 32, 33 and 34 are random samples of the second primary data entity 30.

Furthermore, the sampling probability associated with each secondary data entity is different from those associated with the other secondary data entities of the same primary data entity 20, 30.

In one example, each of the first secondary data entities 21, 22, 23, and 24, respectively, can be associated with the following sampling probabilities: 10%, 30%, 40%, and 60%.

Furthermore, the sampling probability associated with each secondary data entity of a pair formed from first primary data entity 20 and second primary data entity 30, is common.

Thus, and continuing the above example, each of the second secondary data entities 31, 32, 33, and 34 is associated, respectively, with the following sampling probabilities: 10%, 30%, 40%, and 60%.

In this case, a first pair is formed from the first secondary data entity 21 and the second secondary data entity 31; a second pair is formed from the first secondary data entity 22 and the second secondary data entity 32; a third pair is formed from the first secondary data entity 23 and the second secondary data entity 33; and a fourth pair is formed from the first secondary data entity 24 and the second secondary data entity 34.

Returning to the prediction method 100, it comprises a step 140 of defining a plurality of pairs of secondary data entities which are associated with the same predetermined sampling probability.

In particular, each pair of secondary data entities includes a first secondary data entity from the first primary data entity and a second secondary data entity from the second primary data entity.

Thus, we will obtain as many pairs of secondary data entities as the predetermined number of secondary data entities associated with each primary data entity.

For example, if it is considered that four secondary data entities are determined for each primary data entity, then the definition step 140 will make it possible to define four pairs of secondary data entities.
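The definition of pairs can be sketched as a lookup keyed by sampling probability. The function name `pair_by_probability` and the use of the reference numerals as placeholder values for the secondary data entities are assumptions of this sketch.

```python
def pair_by_probability(first_samples, second_samples):
    """Pair secondary data entities that share a sampling probability.

    Each argument maps a sampling probability to the secondary data entity
    drawn with that probability from one primary data entity.
    """
    if set(first_samples) != set(second_samples):
        raise ValueError("both entities must use the same probabilities")
    return [
        (p, first_samples[p], second_samples[p])
        for p in sorted(first_samples)
    ]

# Placeholders named after the reference numerals of the example above.
first = {0.1: "21", 0.3: "22", 0.4: "23", 0.6: "24"}
second = {0.1: "31", 0.3: "32", 0.4: "33", 0.6: "34"}
pairs = pair_by_probability(first, second)
```

Using the probability as the dictionary key enforces that probabilities differ within one primary data entity and are shared across the two.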

Returning to the invention, the prediction method 100 comprises a step 150 of calculating, for each pair of secondary data entities, an intermediate similarity score which is representative of the similarity between the first secondary data entity and the second secondary data entity.

In a particular embodiment, the calculation step 150 comprises the use of a method chosen from among: a Levenshtein ratio, a Jaccard index, a Jaro-Winkler index, a cosine similarity and any combination thereof.

However, depending on the types of data available, other similarity metrics may be used, without requiring substantial modifications to the invention.
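Two of the listed metrics, the Jaccard index and the cosine similarity, can be sketched as follows for secondary data entities represented as collections of characteristics. This representation, and the conventions chosen for empty inputs, are assumptions of this sketch.

```python
from collections import Counter
from math import sqrt

def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # Assumption: two empty collections are identical.
    return len(a & b) / len(a | b)

def cosine_similarity(a, b):
    """Cosine of the angle between the occurrence-count vectors of a and b."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[k] * cb[k] for k in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# Hypothetical n-grams of two secondary data entities.
first = ["ct", "cti", "dir"]
second = ["ct", "cti", "coh"]
jaccard = jaccard_index(first, second)
cosine = cosine_similarity(first, second)
```

The Jaccard index ignores how often a characteristic occurs, whereas the cosine similarity takes the occurrence counts into account.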

Afterwards, the prediction method 100 comprises a step 160 of forming a vector which comprises a plurality of vector elements.

In particular, each vector element includes, for each pair of secondary data entities, the intermediate similarity score and the value of the associated predetermined probability.

Within the meaning of the invention, a vector is a data structure which is configured to store an ordered set of elements which are each identified by an index. In practice, a vector is defined by the number of elements that compose it as well as the type and size of its elements.

Thus, a linked list can also be considered as a vector within the meaning of the invention.
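The formation of the vector can be sketched as follows. Flattening each vector element into two consecutive numbers and ordering the elements by sampling probability are assumptions of this sketch; the source only requires that each element carries the score and its probability.

```python
def form_vector(scored_pairs):
    """Build the vector from (probability, intermediate_score) tuples.

    Elements are ordered by probability so that a given position always
    refers to the same sampling probability (an assumption).
    """
    vector = []
    for probability, score in sorted(scored_pairs):
        vector.extend([score, probability])
    return vector

# Hypothetical intermediate similarity scores for two pairs.
vector = form_vector([(0.3, 0.52), (0.1, 0.40)])
```

A fixed ordering matters because a trained model generally expects each input position to carry the same kind of value at training time and at prediction time.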

Finally, the prediction method 100 comprises a prediction step 170, from the vector and from at least one trained machine learning model, called trained model, to predict an overall similarity score which is representative of the similarity between the first primary data entity and the second primary data entity.
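The prediction step can be sketched with a stand-in model, since the source does not name a model family. The class `LinearModel`, its weights, the function `predict_overall_score` and the clamping of the result to the range 0 to 1 are all assumptions for illustration.

```python
class LinearModel:
    """Hypothetical stand-in for a trained model: score = bias + w . x."""

    def __init__(self, weights, bias):
        self.weights = list(weights)
        self.bias = bias

    def predict(self, vector):
        if len(vector) != len(self.weights):
            raise ValueError("vector length does not match the trained model")
        return self.bias + sum(w * x for w, x in zip(self.weights, vector))

def predict_overall_score(model, vector):
    """Apply the trained model to the vector of (score, probability) values."""
    # Assumption: similarity scores are normalised to the range [0, 1].
    return min(1.0, max(0.0, model.predict(vector)))

# Illustrative weights that average the two intermediate scores.
model = LinearModel(weights=[0.5, 0.0, 0.5, 0.0], bias=0.0)
score = predict_overall_score(model, [0.40, 0.1, 0.60, 0.3])
```

Any regression model exposing a comparable `predict` operation could be substituted for the stand-in class.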

The invention also relates to a system for predicting the global similarity score which is representative of the similarity between a first primary data entity and a second primary data entity of a data lake.

Thus, as shown in the corresponding figure, the system 200 includes at least one data storage device 210 and at least one processor 220.

Data storage device 210 is configured to store instructions for predicting the overall similarity score.

The processor 220 is configured to execute the instructions to implement all or part of the prediction method 100 as described above.

The invention also relates to a method of training a machine learning model which is intended to predict an overall similarity score which is representative of the similarity between a first primary data entity and a second primary data entity of a data lake.

In practice, as indicated above, each primary data entity comprises a plurality of data fields.

First, as shown in the corresponding figure, the training method 300 comprises a first step 310 of calculating a plurality of global similarity scores, from a plurality of pairs of primary data entities of at least one data lake.

In practice, each global similarity score is representative of the similarity between a first primary data entity and a second primary data entity of the plurality of primary data entities.

In a particular embodiment, as indicated above, the first calculation step 310 comprises the use of a method chosen from among: a Levenshtein ratio, a Jaccard index, a Jaro-Winkler index, a cosine similarity and any combinations thereof.

Next, the training method 300 comprises a step 320 of extracting, for each primary data entity of each pair of primary data entities, a plurality of data field characteristics from the content of the plurality of data fields.

In particular, as indicated above, the extraction is carried out according to at least one predetermined extraction criterion.

Then, the training method 300 comprises a step 330 of association, for each primary data entity of each pair of primary data entities, of each characteristic with, on the one hand, the set of data fields from which the characteristic is extracted and, on the other hand, the quantity of data fields from which the characteristic is extracted, so as to form a table of characteristics, as indicated above.

Subsequently, the training method 300 comprises a generation step 340, for each primary data entity of each pair of primary data entities, of the same predetermined number of at least two random samples, called secondary data entities, as indicated above.

For this, the generation step 340 uses stratified sampling, from the associated characteristic table and from an associated predetermined sampling probability, as indicated above.

In practice, as indicated above, the predetermined sampling probability is, on the one hand, different for the secondary data entities which come from the same primary data entity and, on the other hand, common for pairs of secondary data entities that are derived from the first primary data entity and the second primary data entity.
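The generation step can be sketched as follows, under several assumptions that the description does not fix: each characteristic class is treated as a stratum, the number of fields drawn from a stratum is the rounded product of the sampling probability and the stratum size, and at least one field is kept per stratum. The function names, the seed and the sample values are hypothetical.

```python
import random

def stratified_sample(fields, table, probability, rng):
    """Draw about `probability` of the fields from every characteristic class."""
    sample = []
    for indices, count in table.values():
        # Assumption: keep at least one field per class so no stratum vanishes.
        k = max(1, round(probability * count))
        sample.extend(fields[i] for i in rng.sample(indices, k))
    return sample

def generate_secondary_entities(fields, table, probabilities, seed=0):
    """Return one secondary data entity per sampling probability."""
    rng = random.Random(seed)
    return {p: stratified_sample(fields, table, p, rng) for p in probabilities}

# Hypothetical primary data entity and its table of characteristics
# (characteristic = string length, fields referenced by list index).
fields = ["alice", "bob", "carol", "dan", "eve", "frank", "grace", "heidi"]
table = {5: ([0, 2, 5, 6, 7], 5), 3: ([1, 3, 4], 3)}

# Different probabilities within one primary entity, same tuple for both entities.
probabilities = (0.25, 0.5)
secondary = generate_secondary_entities(fields, table, probabilities)
```

Calling `generate_secondary_entities` with the same `probabilities` tuple for the second primary data entity yields secondary data entities that can be paired by sampling probability.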

Afterwards, the training method 300 comprises a step of defining 350, for each primary data entity of each pair of primary data entities, a plurality of pairs of secondary data entities which are associated with the same predetermined sampling probability, as indicated above.

In particular, as indicated above, each pair of secondary data entities comprises a first secondary data entity derived from the first primary data entity and a second secondary data entity derived from the second primary data entity.

Then, as indicated above, the training method 300 comprises a second step of calculating 360, for each pair of secondary data entities, an intermediate similarity score which is representative of the similarity between the first secondary data entity and the second secondary data entity.
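The pairing and scoring steps might look like the sketch below. It assumes that the secondary data entities of each primary data entity are stored in a dictionary keyed by sampling probability and that the Jaccard index is the chosen similarity measure. All names and values are illustrative.

```python
def jaccard_index(a, b):
    """Jaccard index of two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def define_pairs(secondary_1, secondary_2):
    """Pair the samples of both primary entities that share a sampling probability."""
    return [(p, secondary_1[p], secondary_2[p])
            for p in sorted(secondary_1) if p in secondary_2]

def intermediate_scores(pairs, similarity=jaccard_index):
    """Compute one intermediate similarity score per pair of secondary entities."""
    return [(p, similarity(first, second)) for p, first, second in pairs]

# Hypothetical secondary data entities, keyed by sampling probability.
secondary_1 = {0.25: ["alice", "bob"], 0.5: ["alice", "bob", "carol", "dan"]}
secondary_2 = {0.25: ["bob", "erin"], 0.5: ["bob", "carol", "dan", "erin"]}

scores = intermediate_scores(define_pairs(secondary_1, secondary_2))
# One (probability, score) tuple per pair, ordered by probability.
```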

Then, the training method 300 includes a first step 370 of forming, for each pair of primary data entities, a first vector that includes a plurality of vector elements.

In practice, each vector element of the first vector includes, for each pair of secondary data entities, the intermediate similarity score and the value of the associated predetermined sampling probability.
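One possible layout of the first vector is sketched below. Flattening the vector elements into a single list of numbers, ordered by sampling probability, is an assumption made here so that the vector can be fed directly to a learning algorithm; the description does not impose this layout.

```python
def form_first_vector(scored_pairs):
    """Flatten (probability, score) tuples into [score_1, p_1, score_2, p_2, ...]."""
    vector = []
    for probability, score in sorted(scored_pairs):
        # Each vector element carries the score and its sampling probability.
        vector.extend([score, probability])
    return vector

# Hypothetical intermediate scores for two pairs of secondary data entities.
first_vector = form_first_vector([(0.5, 0.6), (0.25, 0.4)])
```

Sorting by probability keeps the position of each element stable from one pair of primary data entities to the next, which matters for models that are sensitive to feature order.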

Subsequently, the training method 300 comprises a second step 380 of forming, for each global similarity score, a second vector which comprises the global similarity score.

Finally, the training method 300 includes a step 390 of training, by machine learning, a machine learning model that includes an input and an output which are associated with the same pair of primary data entities.

In practice, the input is configured to receive the first vector and the output is configured to receive the second vector.
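As a hedged illustration of this input and output arrangement, the sketch below fits a small k-nearest neighbors regressor, one of the algorithm families that the description lists, written in plain Python. The class name, the value of k and the toy training values are assumptions made for the example and are not results from the invention.

```python
from math import dist

class KNNRegressor:
    """Minimal k-nearest neighbors regressor over first vectors."""

    def __init__(self, k=2):
        self.k = k
        self.samples = []

    def fit(self, first_vectors, second_vectors):
        # Each second vector holds the global similarity score as its only element.
        self.samples = [(x, y[0]) for x, y in zip(first_vectors, second_vectors)]
        return self

    def predict(self, first_vector):
        # Average the global scores of the k closest training vectors.
        nearest = sorted(self.samples, key=lambda s: dist(s[0], first_vector))[: self.k]
        return sum(score for _, score in nearest) / len(nearest)

# Toy training set: first vectors laid out as [score_1, p_1, score_2, p_2].
X = [[0.1, 0.25, 0.2, 0.5], [0.4, 0.25, 0.5, 0.5], [0.8, 0.25, 0.9, 0.5]]
y = [[0.15], [0.45], [0.85]]  # second vectors (global similarity scores)

model = KNNRegressor(k=1).fit(X, y)
prediction = model.predict([0.75, 0.25, 0.85, 0.5])
```

With k = 1 the prediction is the global score of the closest training pair. Larger values of k average several neighbors and trade responsiveness for smoothness.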

In a particular embodiment of the invention, the machine learning includes a training algorithm that is selected from: partial least squares regression, linear regression, neural network, decision tree, genetic algorithm, genetic programming, k-nearest neighbors method, radial basis function network, random forest, support vector machine and deep learning.

However, depending on the needs and available resources, other training algorithms may be used, without requiring substantial modifications to the invention.

The invention also relates to a system for training a machine learning model which is intended to predict an overall similarity score which is representative of the similarity between a first primary data entity and a second primary data entity of a data lake.

Thus, as shown in the example of the corresponding figure, the system 400 includes at least one data storage device 410 and at least one processor 420.

Data storage device 410 is configured to store instructions for training a machine learning model.

Processor 420 is configured to execute instructions to implement all or part of training method 300 as described above.

The invention has been described and illustrated. However, it is not limited to the embodiments presented. A person skilled in the art can thus deduce other variants and embodiments on reading the description and the appended figures.

The invention may be subject to numerous variants and applications other than those described above. In particular, unless otherwise indicated, the various structural and functional characteristics of each of the implementations described above should not be considered as combined and/or closely and/or inextricably linked to each other, but, on the contrary, as simple juxtapositions. In addition, the structural and/or functional characteristics of the various embodiments described above may be the subject, in whole or in part, of any different juxtaposition or any different combination.

Claims

A method (100) of predicting an overall similarity score that is representative of the similarity between a first primary data entity and a second primary data entity of a data lake, each primary data entity comprising a plurality of data fields, the method (100) comprising:
- a step of extracting (110), for each primary data entity, a plurality of data field characteristics from the content of the plurality of data fields according to at least one predetermined extraction criterion,
- an association step (120), for each primary data entity, of each characteristic with, on the one hand, a reference to the set of data fields from which the characteristic is extracted and, on the other hand, the quantity of data fields from which the characteristic is extracted, so as to form a table of characteristics,
- a generation step (130), for each primary data entity, of the same predetermined number of at least two random samples, called secondary data entities, using stratified sampling, from the associated characteristic table and an associated predetermined sampling probability, the predetermined sampling probability being, on the one hand, different for the secondary data entities which come from the same primary data entity and, on the other hand, common for pairs of secondary data entities which are derived from the first primary data entity and the second primary data entity,
- a step of defining (140) a plurality of pairs of secondary data entities which are associated with the same predetermined sampling probability, each pair of secondary data entities comprising a first secondary data entity coming from the first primary data entity and a second secondary data entity derived from the second primary data entity,
- a calculation step (150), for each pair of secondary data entities, of an intermediate similarity score which is representative of the similarity between the first secondary data entity and the second secondary data entity,
- a step of forming (160) a vector comprising a plurality of vector elements, each vector element comprising, for each pair of secondary data entities, the intermediate similarity score and the value of the associated predetermined probability, and
- a prediction step (170), from the vector and from at least one trained machine learning model, called trained model, to predict an overall similarity score which is representative of the similarity between the first primary data entity and the second primary data entity.
A method of training (300) a machine learning model for predicting an overall similarity score that is representative of the similarity between a first primary data entity and a second primary data entity of a data lake, each primary data entity comprising a plurality of data fields, the method (300) comprising:
- a first step of calculating (310) a plurality of global similarity scores, from a plurality of pairs of primary data entities of at least one data lake, each global similarity score being representative of the similarity between a first primary data entity and a second primary data entity of the plurality of primary data entities,
- a step of extracting (320), for each primary data entity of each pair of primary data entities, a plurality of characteristics of data fields from the content of the plurality of data fields according to at least one predetermined extraction criterion,
- a step of association (330), for each primary data entity of each pair of primary data entities, of each characteristic with, on the one hand, the set of data fields from which the characteristic is extracted and, on the other hand, the quantity of data fields from which the characteristic is extracted, so as to form a table of characteristics,
- a generation step (340), for each primary data entity of each pair of primary data entities, of the same predetermined number of at least two random samples, called secondary data entities, using stratified sampling, from the associated characteristic table and an associated predetermined sampling probability, the predetermined sampling probability being, on the one hand, different for the secondary data entities which come from the same primary data entity and, on the other hand, common for pairs of secondary data entities which come from the first primary data entity and from the second primary data entity,
- a step of defining (350), for each primary data entity of each pair of primary data entities, a plurality of pairs of secondary data entities which are associated with the same predetermined sampling probability, each pair of secondary data entities comprising a first secondary data entity from the first primary data entity and a second secondary data entity from the second primary data entity,
- a second calculation step (360), for each pair of secondary data entities, of an intermediate similarity score which is representative of the similarity between the first secondary data entity and the second secondary data entity,
- a first step of forming (370), for each pair of primary data entities, a first vector comprising a plurality of vector elements, each vector element comprising, for each pair of secondary data entities, the intermediate similarity score and the value of the associated predetermined probability,
- a second step of forming (380), for each global similarity score, a second vector comprising the global similarity score, and
- training (390) a machine learning model by machine learning comprising an input and an output which are associated with a same pair of primary data entities, the input being configured to receive the first vector and the output being configured to receive the second vector.
Method (100, 300) according to one of claims 1 or 2 wherein the predetermined extraction criterion is chosen from among: a metadata attribute associated with the data fields, a characteristic associated with the content of the data fields, and any combination of these.
A method (100, 300) according to claim 3 wherein the metadata attribute associated with the data fields is selected from: a type, a size in memory, an import date, a production date, an update date, an identifier of the original data source, and any combinations thereof.
Method (100, 300) according to one of Claims 3 or 4, in which the characteristic associated with the content of the data fields is calculated.
Method (100, 300) according to claim 5 wherein, when the data field is of the character string type, the characteristic associated with the content of the data fields is a length.
Method (100, 300) according to claim 5 wherein, when the data field is of the character string type, the characteristic associated with the content of the data fields is an n-gram of characters of at least a predetermined length.
A method (100, 300) according to claim 7 wherein the length of the character n-gram is fixed and/or variable.
Method (100, 300) according to one of claims 7 or 8, further comprising in the step of associating (120, 330) the formation of a single characteristic class which groups together the characteristic classes whose quantity of data fields is below a predetermined threshold.
A method (100, 300) according to any of claims 1 to 8, wherein the calculating step (150, 310, 360) comprises using a method selected from: a Levenshtein ratio, a Jaccard index, a Jaro-Winkler index, a cosine similarity and any combination thereof.
A system (200) for predicting an overall similarity score that is representative of the similarity between a first primary data entity and a second primary data entity of a data lake, each primary data entity comprising a plurality of data fields, the system (200) comprising:
- at least one data storage device (210) configured to store instructions for predicting the global similarity score, and
- at least one processor (220) configured to execute the instructions for implementing a method according to any one of claims 1 and 3 to 10.
System (400) for training a machine learning model to predict an overall similarity score that is representative of the similarity between a first primary data entity and a second primary data entity of a data lake, each primary data entity comprising a plurality of data fields, the system (400) comprising:
- at least one data storage device (410) configured to store instructions for training a machine learning model, and
- at least one processor (420) configured to execute the instructions for implementing a method according to any one of claims 2 to 10.