EP3948579A1 - System and method for data enrichment (Systeme et procede d'enrichissement de donnees) - Google Patents
- Publication number
- EP3948579A1 (application EP20731903.9A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- data
- label
- enriched
- fundamental
- database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Definitions
- the field of the invention relates to the enrichment of data, in particular when the latter present the risk of including inaccuracies or errors due to the conditions of transmission and reception of these data.
- One of the main applications of the invention is mobile banking.
- the need to make the data received reliable is an important issue in all systems in which the transfer of data, sometimes repeatedly, is inevitable.
- the field of mobile banking, which designates all types of financial services accessible from mobile equipment connected to a wide area network, such as a mobile phone, is a field in which data transfers are numerous and the faithful reproduction of this data is a necessary condition for the implementation of the services.
- the data transferred may include information such as the name of a merchant, his activity code, his location, his name, etc. For the services to function properly, it is therefore necessary to ensure the reliability of this data whenever it is at risk of containing errors.
- the present invention improves the situation.
- the present invention relates to a data enrichment method implemented by computer means and comprising:
- a) receive several sets of data, a set of data comprising fundamental data and one or more metadata relating to the fundamental data, ...
- the fundamental data is a series of characters, or a sound signal or a digital image.
- the method further comprises, following the reception of the data sets: generating for each data set, by applying a processing for reducing a noise level to the fundamental datum, a processed datum associated with the dataset.
- the grouping of the data sets is implemented as a function of the processed data respectively associated with the data sets.
- the fundamental data is a sequence of characters and the processed data is generated by deleting from the sequence of characters one or more characters belonging to a predetermined list of characters.
- the grouping of the datasets uses an unsupervised learning algorithm.
- each set of data stored in the at least one database further comprises fundamental data and, if the combination of at least part of the metadata and of the label of an enriched data set is present in at least one database in a corresponding data set, the fundamental data of the enriched data set is replaced if necessary by the fundamental data of the corresponding data set.
- each set of data stored in the at least one database further comprises fundamental data, and the search is carried out on a plurality of databases, each database being characterized by a reliability coefficient, and, if the combination of at least part of the metadata and the label of the enriched data set is present in corresponding data sets respectively stored in separate databases of the plurality of databases, the label of the enriched data set is removed if the fundamental data of the enriched data set is distinct from the fundamental data of the corresponding data set stored in the database characterized by the greatest reliability coefficient.
- each set of data stored in the at least one database further comprises fundamental data, and the search is carried out on a plurality of databases, each database being characterized by a reliability coefficient, and, if the combination of at least part of the metadata and the label of the enriched data set is present in corresponding data sets respectively stored in separate databases of the plurality of databases, each fundamental data present in at least one of the corresponding data sets is associated with a likelihood factor determined as a function of the reliability coefficient of each database storing a corresponding data set comprising the fundamental data in question, and the label of the enriched data set is removed if the fundamental data of the enriched data set is distinct from the fundamental data associated with the highest likelihood factor.
- each metadata of an enriched data set being associated with a weight, the combination of at least part of the metadata and of the label is present in a database if and only if a value of a presence function, calculated as a function of the respective weights of the metadata of the combination present in the database, is greater than or equal to a predetermined threshold.
- the enriched data set is again enriched by data representative of the similarity function and / or at least one database within which the combination of at least part of the metadata and of the label of the enriched data set has been found.
- steps b) to e) are repeated for the data sets from which the label has been removed with a new similarity function, so that a data set cannot be enriched by a label already aggregated then removed previously.
- steps b) to e) are limited to a predetermined maximum number of iterations.
- the fundamental data relates to an individual or an entity
- the metadata comprises at least contact data of the individual or of the entity, and in which the enriched data set is transmitted, using contact data, to the individual or entity for verification of the aggregated label.
- contact data is a postal address, phone number, email address and / or an address of an application user account.
- the present invention also relates to a computer program comprising instructions for implementing the method described above, when the instructions are executed by at least one processor.
- the present invention relates to a data enrichment system comprising:
- a communication module designed to receive several sets of data, a set of data comprising fundamental data and one or more metadata relating to the fundamental data,
- processing unit designed for:
- At least one database configured to store sets of data each comprising metadata and a label
- processing unit being furthermore arranged for:
- FIG. 1 illustrates a data enrichment system according to the invention
- FIG. 2 illustrates a data enrichment method according to the invention.
- FIG. 1 illustrates a data enrichment system, hereinafter SYS system, according to the invention.
- the SYS system is designed to receive data presenting the risk of including errors or inaccuracies and to enrich these data despite these potential errors or inaccuracies.
- the data received by the SYS system is indeed likely to include a certain level of noise.
- this data is liable to contain erroneous characters or inaccuracies. Noise is therefore understood here to designate any error introduced into a character sequence at the source, on transmission or on reception of the data.
- the data received at the input of the SYS system are data allowing a user to access financial services from a mobile device, for example a cell phone.
- the data transferred then makes it possible to consult an online account or to make a transfer.
- the data can correspond to the name of a merchant, to his activity code, to his location, therefore his city, his address and his postal code, or even to his name.
- certain information is limited to a maximum number of characters. The transfer of data representative of such information is therefore necessarily imprecise and incomplete since not all the characters could be entered.
- noise refers, for example, to this type of imprecision.
- the SYS system is designed to allow the service to be provided even when these data contain errors, whether introduced at the source or when the data is sent or received.
- These data can also be a sound signal comprising noise or a digital image comprising digital noise.
- the SYS system is designed to enrich the data despite this potential noise
- the SYS system comprises a processing unit UNT and at least one database, here two databases DB1, DB2.
- the UNT processing unit is designed to, upon receipt of several data sets, enrich each data set and verify the relevance of this data enrichment using databases DB1, DB2. More specifically, the processing unit UNT is designed to generate, for each set of data received, an additional piece of data called a label and to aggregate or append the generated label to the associated set of data. In the literature, such a label is also called a tag.
- the processing unit UNT is also arranged to apply processing to at least part of the data received in order to reduce a level of noise that the data is likely to contain.
- the processing unit UNT is furthermore arranged, once a data set has been enriched, to forward this data set to an address in order to allow a user to review the enriched data and to verify that this data has been correctly enriched.
- Each set of data comprises a fundamental datum D_1, D_2, D_3 and one or more metadata relating to this fundamental datum.
- Metadata accompanying fundamental data is descriptive data used to describe or define fundamental data.
- the fundamental data is the description of a merchant while the metadata characterizes his activity code, his location or any other information concerning the merchant in question.
- the first set of data DAT1 comprises the fundamental data item D_1 and further comprises metadata MD_1^1, ..., MD_1^m.
- the second set of data DAT2 comprises the fundamental data item D_2 and further comprises metadata MD_2^1, ..., MD_2^n.
- the third set of data DAT3 comprises the fundamental data item D_3 and further comprises metadata MD_3^1, ..., MD_3^p.
- m, n and p are natural numbers designating the respective number of metadata of the first, second and third data sets DAT1, DAT2, DAT3.
- the fundamental data of each set of data is likely to present a certain level of noise and therefore to include errors or inaccuracies.
- the processing applied by the UNT to fundamental data to reduce noise can also be applied to metadata.
- the processing unit UNT is designed in particular to generate a processed datum by applying a noise-reduction processing to the fundamental datum of a set of data.
- the metadata can also be noisy, and the processing unit UNT can also be arranged to generate new metadata by applying a noise-reduction processing to the received metadata.
- three enriched data sets DAT1*, DAT2*, DAT3* are generated by the UNT processing unit.
- the processing unit UNT is more particularly designed to generate, for each set of data received, additional data also called a label or tag, and to enrich each set of data by aggregating or adding the generated label to it.
- the first and second data sets DAT1, DAT2 are enriched by the same label, label(C_1), while the third data set DAT3 is enriched by a label, label(C_2).
- the UNT processing unit comprises a COM communication module, a MEM memory and a PROC processor.
- the COM communication module is designed to receive several sets of data.
- the communication module COM is arranged to receive the first, second and third sets of data DAT1, DAT2, DAT3.
- the communication module COM is furthermore designed to send several enriched data sets.
- the communication module COM is arranged to send the first, second and third enriched data sets DAT1*, DAT2*, DAT3*.
- the communication module COM can integrate one or more communication modules, for example radiofrequency communication modules, and be configured for the transmission and reception of radiofrequency signals according to one or more technologies, such as TDMA, FDMA, OFDMA, CDMA, or one or more communication standards, such as GSM, EDGE, CDMA, UMTS, HSPA, LTE, LTE-A, WiFi (IEEE 802.11) and WiMAX (IEEE 802.16), or their variants or evolutions, currently known or developed later.
- the COM communication module is arranged to communicate with a wide area network (WAN), a local area network (LAN) or any other type of network.
- Data sets are, for example, sent to the COM communication module of the UNT processing unit following the use of an application.
- an application is typically implemented on a terminal, for example a mobile terminal such as a smartphone, and is for example intended to be used by a user.
- the user makes a payment via this application and this payment triggers the generation of at least part of the data of a set of data, whether the fundamental data and / or the metadata. It is typically in such a case that noise can be introduced in the form of errors or inaccuracies.
- this information is a series of characters.
- the memory MEM is arranged to store instructions in the form of a computer program whose execution by the processor PROC results in the operation of the processing unit UNT.
- the SYS system also includes at least one database.
- the SYS system includes two databases DB1, DB2. Nevertheless, those skilled in the art will understand that the SYS system may include only a single database.
- Each DB1, DB2 database is configured to store data sets each including metadata and a label.
- one or more data sets stored in a database DB1, DB2 also include, in addition to metadata and a label, fundamental data.
- each DB1, DB2 database is configured to be accessible to the UNT processing unit within the SYS system. As explained in the remainder of the description, this accessibility allows the processing unit UNT to perform a search within each database DB1, DB2 to establish, if possible, a correspondence between an enriched data set and data sets stored in the databases. This search aims in particular to verify the relevance of the enrichment of the data set and its conformity with known databases.
- the databases queried by the processing unit UNT to verify that a received set of data has been correctly enriched are, for example, databases of the SIREN type (for "Identification system of the directory of companies"), of the SIRET type (for "Identification system of the establishment directory") or Infogreffe, which hold information relating to the identification of a business, a company, an establishment, an organization or an association with activities in France.
- DB1 and DB2 databases can refer to any database of this type and not only for France.
- databases DB1, DB2 can also refer to other types of databases accessible by programming interfaces (also known by the acronym API for "Application programming interface").
- a set of data then relates, for example, to a business, a company or a merchant, and the metadata included in the set of data are informative or descriptive data about a fundamental datum giving the name of the business, the company or the merchant.
- This fundamental data is, due to the transfer of the data set, likely to include errors or inaccuracies and therefore to be corrupted by a certain noise level. This noise may have been introduced at the source, on transmission or even on reception.
- the SYS system receives several sets of data. More specifically, the data sets are received by the COM communication module of the processing unit UNT of the SYS system.
- the communication module COM receives a first set of data DAT1, a second set of data DAT2 and a third set of data DAT3.
- this example is purely illustrative and the SYS system may have to process a much larger number of data sets.
- Each data set comprises a fundamental data and one or more metadata relating to the fundamental data.
- metadata makes it possible to define, describe or provide additional information about the fundamental data.
- the first set of data DAT1 comprises metadata MD_1^1, ..., MD_1^m describing the fundamental data item D_1.
- the second data set DAT2 comprises metadata MD_2^1, ..., MD_2^n describing the fundamental data item D_2.
- the third data set DAT3 comprises metadata MD_3^1, ..., MD_3^p describing the fundamental data item D_3.
- the processing unit UNT of the system SYS generates for each set of data, by applying a processing for reducing a noise level to the fundamental datum, a processed datum associated with the data set.
- noise can be introduced at the source, on the transmission or on the reception in the data set and more specifically in the fundamental data.
- the implementation of the service requiring the correct routing of the data set is then compromised by such errors or inaccuracies.
- the UNT processing unit applies any type of data processing allowing the noise level of the fundamental data to be reduced. Those skilled in the art are familiar with the techniques usually employed to decrease the noise level of one or more data or to remove the noise completely.
- the processed data is generated by deleting from the sequence of characters one or more characters belonging to a predetermined list of characters.
- This list of characters is for example stored in the memory MEM of the processing unit UNT so that, when the processing unit detects a character from this list in a fundamental datum taking the form of a series of characters, this character is deleted to generate the processed data.
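The character-deletion processing described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the contents of `NOISE_CHARACTERS`, the function name `reduce_noise` and the whitespace clean-up are assumptions, since the description does not specify the list or any post-processing.

```python
# Hypothetical list of characters treated as noise; the description only says
# that such a predetermined list is stored in the memory MEM.
NOISE_CHARACTERS = frozenset("*#_/\\|~")


def reduce_noise(fundamental: str, noise_characters=NOISE_CHARACTERS) -> str:
    """Return the processed datum: the input with every listed character deleted."""
    kept = "".join(ch for ch in fundamental if ch not in noise_characters)
    # Collapse the whitespace left behind by deleted characters
    # (an added assumption, not stated in the description).
    return " ".join(kept.split())


processed = reduce_noise("CAFE*DU#PORT")   # listed characters are dropped
unchanged = reduce_noise("Boulangerie")    # no noise: output equals input
```

When the fundamental datum contains none of the listed characters, the processed datum is identical to it, which matches the behaviour stated in the description.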
- the fundamental data can also be a sound signal or a digital image.
- the various techniques for reducing or eliminating noise in a sound signal or a digital image are widely known to those skilled in the art so that the UNT processing unit can be configured to be able to apply such techniques on the fundamental data of each set of data received by the SYS system.
- D_1' denotes the processed data item generated from the first set of data DAT1 by reducing the noise level of the fundamental data item D_1.
- D_2' denotes the processed data item generated from the second set of data DAT2 by reducing the noise level of the fundamental datum D_2.
- D_3' denotes the processed datum generated from the third set of data DAT3 by reducing the noise level of the fundamental datum D_3.
- the processed data for a data set can be aggregated or appended to the data set in addition to or in place of the fundamental data, alongside the corresponding metadata.
- in the remainder of the description, it is assumed that this step S2 has been implemented and that the processed data item replaces the fundamental data.
- where step S2 is not implemented, the processed data item is not generated, and whatever is subsequently described as using the processed data item is carried out using the fundamental data instead.
- this processed data can be identical to the fundamental data.
- if the fundamental datum does not include any noise, the processed datum is identical to the fundamental datum.
- a counter i, initialized to 1, is incremented and a similarity function I j is selected.
- the memory MEM stores a set of similarity functions.
- the processing unit UNT groups the data sets according to the processed data respectively associated with the data sets according to the similarity function.
- the grouping of data implemented by the processing unit UNT is better known under the English term “data clustering” or more simply “clustering”. We can also speak here of partitioning or clustering of data.
- the grouping techniques used by the processing unit UNT are techniques known to those skilled in the art.
- the grouping implemented by the processing unit UNT makes it possible to obtain a high intra-group similarity, namely a high homogeneity between the elements, here data sets, of the same group, and a low inter-group similarity, in order to have well-differentiated groups.
- the grouping implemented by the UNT processing unit comprises a partitioning algorithm, a hierarchical algorithm, a density-based algorithm, a grid algorithm or even a model algorithm.
- the grouping of data sets uses an unsupervised learning algorithm.
- Such algorithms are known to those skilled in the art.
- the data sets are grouped into groups, better known under the English term “clusters”, according to the similarity function used.
- the similarity function is a distance function defined on a space of M + 1 dimensions, where M is the number of metadata (M + 1 therefore corresponding to the cardinality of a received set of data with M metadata and one fundamental datum).
- the similarity function can be a Euclidean distance.
- the similarity function can be a Levenshtein distance.
- the similarity function can be a combination of a Euclidean distance and a Levenshtein distance.
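As an illustration of such a combined similarity function, the sketch below computes a Levenshtein distance on the textual datum and a Euclidean distance on numeric metadata, then mixes them with a weight. The weighted sum, the parameter `alpha` and the representation of a data set as a `(text, numbers)` pair are assumptions made for the example; the description does not say how the two distances are combined.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def euclidean(u, v) -> float:
    """Euclidean distance between two numeric vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def combined_distance(set_a, set_b, alpha=0.5) -> float:
    """Weighted mix of both distances; each set is (text, numeric_metadata)."""
    text_a, nums_a = set_a
    text_b, nums_b = set_b
    return alpha * levenshtein(text_a, text_b) + (1 - alpha) * euclidean(nums_a, nums_b)


distance = combined_distance(("kitten", (0, 0)), ("sitting", (3, 4)))
```

In practice the two terms live on different scales, so some normalisation would probably be needed before mixing them; that choice is left open here.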
- the first set of data DAT1 and the second set of data DAT2 are grouped together in the same group or "cluster" C_1.
- the third set of data DAT3 is for its part placed in a group C_2.
- the first, second and third data sets DAT1, DAT2, DAT3 have been grouped together according to their respective processed data D_1', D_2', D_3'.
- the generation of the processed data item is optional.
- the grouping of the datasets is implemented according to the respective fundamental data of the datasets.
- the processing unit UNT enriches each set of data with an additional piece of data called a label characterizing the group to which the set of data considered belongs.
- a set of data receives, at the end of the grouping, an additional datum characterizing the group into which the data set in question has been classified.
- this additional data, also called a label or tag, is aggregated or appended to the data set.
- the first and second data sets DAT1, DAT2 have been classified in the same group or "cluster" C_1. These two data sets DAT1, DAT2 are therefore enriched by the same additional data item, referenced label(C_1). Likewise, the third set of data DAT3 having been classified in the group or "cluster" C_2, it is enriched by the additional data item label(C_2).
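The grouping and labelling steps can be sketched with a very simple threshold-based grouping. This is only one possible stand-in for the unsupervised algorithms mentioned in the description: the greedy strategy, the `threshold` value, the use of `difflib.SequenceMatcher` as similarity measure and the dictionary layout of a data set are all assumptions, and the merchant names are invented sample data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] from the standard library (stand-in measure)."""
    return SequenceMatcher(None, a, b).ratio()


def group_and_label(data_sets, threshold=0.8):
    """Assign each data set to the first group whose representative is similar
    enough, opening a new group otherwise, and append the group label."""
    representatives = []  # processed datum of the first member of each group
    enriched = []
    for data_set in data_sets:
        for index, representative in enumerate(representatives):
            if similarity(data_set["processed"], representative) >= threshold:
                label = f"C{index + 1}"
                break
        else:
            representatives.append(data_set["processed"])
            label = f"C{len(representatives)}"
        enriched.append({**data_set, "label": label})
    return enriched


# Invented sample data standing in for DAT1, DAT2, DAT3.
received = [
    {"name": "DAT1", "processed": "CAFE DU PORT"},
    {"name": "DAT2", "processed": "CAFE DU PORTE"},
    {"name": "DAT3", "processed": "GARAGE MARTIN"},
]
enriched_sets = group_and_label(received)
labels = [data_set["label"] for data_set in enriched_sets]
```

With this sample, the first two data sets share a label while the third receives its own, mirroring the DAT1, DAT2, DAT3 example of the description.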
- the processing unit UNT searches, for each enriched data set, in at least one database storing data sets each comprising metadata and a label, for a combination of at least part of the metadata and the label of the enriched data set considered.
- the first enriched data set DAT1* comprises the fundamental data D_1, metadata MD_1^1, ..., MD_1^m, a label label(C_1) and, optionally, the processed data D_1'.
- the search performed by the processing unit UNT in at least one of the databases DB1, DB2 therefore aims to determine whether the combination of at least part of the metadata MD_1^1, ..., MD_1^m and the label label(C_1) is present in a data set among the data sets stored in the database DB1, DB2.
- a search is carried out in all the databases, here therefore the database DB1 and the database DB2.
- the term "corresponding data set" designates a data set stored in a database and comprising the sought combination.
- such a data set is a corresponding data set of the enriched data set from which the sought combination is derived.
- each metadata of an enriched data set is associated with a weight.
- This weight makes it possible to characterize the importance of a metadata within a set of data.
- the combination of at least part of the metadata and the label is then considered to be present in a database if and only if a value of a presence function, calculated according to the respective weights of the metadata of the combination present in the database in question, is greater than or equal to a predetermined threshold.
- an additional criterion is applied to determine whether a dataset stored in a database can be considered a "corresponding dataset”.
- This criterion consists of verifying whether a potential corresponding dataset is sufficiently meaningful, according to the metadata it contains and shares in common with an enriched dataset.
- the need for the label of this potential corresponding data set to be the same as the enriched data set considered remains in this specific embodiment.
- the metadata MD_1^1, ..., MD_1^m are all respectively associated with a weight P_1^1, ..., P_1^m.
- a set of data includes the metadata MD_1^1, ..., MD_1^k and the label label(C_1), where k is a natural number strictly less than m.
- this data set found in the database DB1 does indeed include at least part of the metadata of the first enriched data set DAT1* as well as the label label(C_1). So this is a potential matching data set.
- V_1 = G(P_1^1, ..., P_1^k)
- This value V_1 is then compared with a predetermined threshold and, if this value is greater than or equal to the predetermined threshold, then the data set found in the database DB1 is relevant and is retained as a corresponding data set.
- the presence function G is an addition or a multiplication.
- the data set found in the DB2 database includes the metadata MD_1^j, ..., MD_1^m, where j is a natural number less than m, and the label label(C_1).
- the processing unit UNT then calculates the value V_2 taken by the function G for this combination. In other words:
- V_2 = G(P_1^j, ..., P_1^m)
- This value V_2 is then compared with the predetermined threshold and, if this value is greater than or equal to the predetermined threshold, then the data set found in the database DB2 is relevant and is retained as a corresponding data set.
- the dataset found in the DB1 database is retained according to this criterion while the one found in the DB2 database is not.
- a corresponding data set is not only a data set stored in a database comprising the combination of at least part of the metadata and the label of an enriched data set but also a data set verifying the criterion described above concerning the respective weights of the metadata that it shares with the enriched data set on the basis of which the search is carried out by the processing unit UNT.
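The weighted presence criterion can be sketched as below. The description states that G is an addition or a multiplication; everything else here (function names, dictionary layout, metadata keys, weights and the threshold of 0.6) is a hypothetical example chosen so that the DB1 record is retained and the DB2 record is not.

```python
import math


def presence_value(enriched, candidate, weights, combine=sum):
    """Value of G over the weights of the metadata shared by both data sets.
    combine=sum models an addition, combine=math.prod a multiplication."""
    shared = [name for name, value in enriched["metadata"].items()
              if candidate["metadata"].get(name) == value]
    return combine(weights[name] for name in shared) if shared else 0.0


def is_corresponding(enriched, candidate, weights, threshold, combine=sum):
    """A candidate is retained only if the label matches and G reaches the threshold."""
    if candidate.get("label") != enriched.get("label"):
        return False
    return presence_value(enriched, candidate, weights, combine) >= threshold


# Invented sample data: an enriched data set and one record per database.
enriched = {"label": "C1",
            "metadata": {"activity_code": "5610A", "city": "Paris", "postal_code": "75011"}}
weights = {"activity_code": 0.5, "city": 0.3, "postal_code": 0.2}
db1_record = {"label": "C1", "metadata": {"activity_code": "5610A", "city": "Paris"}}
db2_record = {"label": "C1", "metadata": {"postal_code": "75011", "city": "Lyon"}}

kept_db1 = is_corresponding(enriched, db1_record, weights, threshold=0.6)
kept_db2 = is_corresponding(enriched, db2_record, weights, threshold=0.6)
```

Note that with a multiplication and weights below 1, sharing more metadata lowers the value of G, so the threshold would have to be chosen with the combination rule in mind.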
- in a step S6, implemented in particular in the case where a combination of at least part of the metadata and of the label of an enriched data set is absent from the at least one database, the label previously assigned is removed from the enriched data set.
- since there is no trace in any database of a combination of at least part of the metadata and the label, the label is considered to have been assigned by mistake to the data set considered during the grouping of step S3.
- The additional data or label that was aggregated or added to this previously enriched data set is therefore removed from it.
- for example, during step S4, the second data set DAT2 was enriched by the additional data item label(C_1).
- the processing unit UNT therefore then searched, during step S5, in at least one of the databases DB1, DB2 whether a set of data stored in one of these databases comprises both at least part of the metadata MD_2^1, ..., MD_2^n and the additional data label(C_1). If no data set stored in the databases DB1, DB2 includes such a combination, the label label(C_1) is removed from the second enriched data set DAT2*.
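The search of step S5 and the label removal of step S6 can be sketched as follows. The in-memory dictionaries standing in for DB1 and DB2, the record layout and the rule "same label and at least one shared metadata value" are assumptions for illustration; a real system would query SIREN-type databases or APIs.

```python
def find_corresponding(enriched, databases):
    """Return (database name, record) pairs whose label matches and which
    share at least one metadata value with the enriched data set."""
    matches = []
    for db_name, records in databases.items():
        for record in records:
            same_label = record.get("label") == enriched.get("label")
            shares_metadata = any(record["metadata"].get(key) == value
                                  for key, value in enriched["metadata"].items())
            if same_label and shares_metadata:
                matches.append((db_name, record))
    return matches


def verify_label(enriched, databases):
    """Keep the enriched data set as is when a match exists, else drop its label."""
    if find_corresponding(enriched, databases):
        return enriched
    stripped = dict(enriched)
    stripped.pop("label", None)
    return stripped


# Invented sample data.
databases = {
    "DB1": [{"label": "C1", "metadata": {"city": "Paris", "activity_code": "5610A"}}],
    "DB2": [{"label": "C2", "metadata": {"city": "Lyon", "activity_code": "4520A"}}],
}
dat1 = {"label": "C1", "metadata": {"city": "Paris", "activity_code": "5610A"}}
dat2 = {"label": "C1", "metadata": {"city": "Nantes", "activity_code": "4711B"}}

checked_dat1 = verify_label(dat1, databases)  # match in DB1: label kept
checked_dat2 = verify_label(dat2, databases)  # no match: label removed
```

A data set whose label has been removed in this way is the kind of data set that the description sends back through grouping with a new similarity function.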
- such a search can be performed on a plurality of databases, here two databases DB1, DB2 and a corresponding set of data can be found in several different databases.
- the processing unit UNT searched in the database DB1 but also in the database DB2 for a data set comprising the combination of at least part of the metadata MD 3 1 , ... MD 3 P and the label label (C 2 ). It is quite possible that a matching dataset was found in the DB1 database, while another matching dataset was found in the DB2 database.
- the processing unit UNT has found a data set stored in the database DB1 comprising the combination of at least part of the metadata and the label of the third enriched data set DAT3 * but also found a dataset stored in the DB2 database including this same combination of metadata and label.
- the processing unit UNT applies a predefined criterion to determine whether the result of this search, which found a corresponding data set in more than one database of the system SYS, makes it possible to conclude whether or not the attributed label is relevant.
- each database is characterized by a reliability coefficient.
- each set of data stored in a database further comprises fundamental data.
- the SYS system includes two databases DB1, DB2. Since there are several databases, each is assigned a reliability coefficient to quantify its relevance or reliability.
- the respective reliability coefficients of two distinct databases are distinct.
- the database DB1 is characterized by a reliability coefficient CF1 while the database DB2 is characterized by a reliability coefficient CF2. It is also considered that, the database DB1 being more reliable than the database DB2, we have: CF1 > CF2.
- the database with the highest reliability coefficient is the database DB1 characterized by the reliability coefficient CF1.
- the processed data D 1 ' of the first enriched data set DAT1 * is therefore compared with the fundamental data of the corresponding data set found in the database DB1.
- the label of the enriched data set is removed during step S6 then implemented by the processing unit UNT if the processed data of the enriched data set is distinct from the fundamental data of the corresponding data set stored in the database characterized by the highest coefficient of reliability.
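The comparison against the most reliable database can be sketched as follows. This is only an illustration under stated assumptions: the `Match` record, the function `label_survives` and the sample values are hypothetical, and equality of strings stands in for whatever comparison of processed and fundamental data the system actually performs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Match:
    database: str       # name of the database holding the corresponding set
    reliability: float  # reliability coefficient of that database
    fundamental: str    # fundamental data of the corresponding data set


def label_survives(processed: str, matches: List[Match]) -> bool:
    """Keep the label only if the processed data equals the fundamental
    data stored in the database with the highest reliability coefficient."""
    most_reliable = max(matches, key=lambda m: m.reliability)
    return processed == most_reliable.fundamental


# Hypothetical search result: both databases hold a corresponding data set.
matches = [
    Match("DB1", reliability=0.9, fundamental="ACME SA"),
    Match("DB2", reliability=0.6, fundamental="ACME Ltd"),
]

print(label_survives("ACME SA", matches))   # compared with DB1, the most reliable
print(label_survives("ACME Ltd", matches))  # differs from DB1, label removed in S6
```

Because the reliability coefficients of two distinct databases are distinct, `max` always selects a single database.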
- each database is again characterized by a reliability coefficient.
- each set of data stored in a database further comprises fundamental data.
- the processing unit UNT takes into account all the databases comprising a corresponding set of data.
- Each fundamental data item present in at least one of the corresponding data sets is associated with a likelihood factor determined as a function of the reliability coefficient of each database storing a corresponding data set comprising the considered fundamental data.
- a third database (not shown here) is included in the SYS system and is searched by the processing unit UNT in addition to the databases DB1, DB2.
- this third database is characterized by a reliability coefficient CF3.
- the database DB1 is characterized by a reliability coefficient CF1 while the database DB2 is characterized by a reliability coefficient CF2.
- the respective reliability coefficients of two distinct databases are distinct.
- the processing unit UNT determines a likelihood factor FV (DF 1 2 ) associated with the fundamental datum DF 1 2 .
- This likelihood factor FV (DF 1 2 ) is calculated as a function of the reliability coefficients of the database DB1 and of the database DB2, namely CF1 and CF2.
- the processing unit UNT determines a likelihood factor FV (DF 3 ) associated with the fundamental datum DF 3 .
- This likelihood factor FV (DF 3 ) is calculated as a function of the reliability coefficient of the third database, namely CF3.
- a likelihood factor is determined by adding the reliability coefficients.
- the processed data D 3 'of the third enriched data set DAT3 * is then compared with the fundamental data associated with the highest likelihood factor.
- each fundamental data item present in at least one of the corresponding data sets is associated with a likelihood factor determined as a function of the reliability coefficient of each database storing a corresponding data set comprising the fundamental data considered, and the label of the enriched data set is removed during step S6, then implemented by the processing unit UNT, if the processed data of the enriched data set is distinct from the fundamental data associated with the highest likelihood factor.
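The likelihood-factor variant can be sketched in a few lines, using the additive rule mentioned above (a likelihood factor obtained by adding reliability coefficients). The function names, the string placeholders for fundamental data and the numeric coefficients are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def likelihood_factors(matches: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum, per fundamental datum, the reliability coefficients of every
    database storing a corresponding data set with that fundamental datum."""
    factors: Dict[str, float] = defaultdict(float)
    for fundamental, coefficient in matches:
        factors[fundamental] += coefficient
    return dict(factors)


def most_likely(matches: Iterable[Tuple[str, float]]) -> str:
    """Return the fundamental datum with the highest likelihood factor."""
    factors = likelihood_factors(matches)
    return max(factors, key=factors.get)


# Hypothetical values: DB1 and DB2 agree on one fundamental datum,
# the third database stores a different one.
CF1, CF2, CF3 = 0.5, 0.3, 0.7
matches = [("DF12", CF1), ("DF12", CF2), ("DF3", CF3)]

reference = most_likely(matches)
processed = "DF12"
print(reference)               # fundamental datum used for the comparison
print(processed == reference)  # False would trigger label removal in step S6
```

With these sample values, two moderately reliable databases that agree outweigh a single more reliable one, which is the effect the additive rule is meant to capture.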
- Step S6 is implemented for an enriched data set either following step S5, if it turns out that the combination of at least part of the metadata and the label of this enriched data set is not present in any database, or following step S7, if this combination has been found in several databases and it turns out that the assigned label is incorrect. During this step S6, therefore, the label of the enriched data set is removed. Then, as illustrated in [Fig. 2], it is determined whether the counter i characterizing the number of iterations of the method is less than or equal to a predetermined maximum number of iterations N. If this maximum number of iterations has not yet been reached, the counter is incremented.
- A new similarity function, for example stored in the memory MEM of the processing unit UNT, is then selected. Steps S3 and following are then repeated with the new similarity function for the data sets whose label has been removed, so that a data set cannot be enriched by a label already aggregated and then removed previously.
- a similarity function calculates a distance between two sets of data so that two sets of data are grouped together in the same group or cluster when the distance between these two sets of data is less than or equal to a certain threshold. Also, when a new similarity function is selected, it is also possible to modify this threshold, for example by increasing it. Furthermore, it is also possible to keep the same similarity function and only change the threshold.
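As an illustration of grouping by distance and threshold, the sketch below uses a Jaccard distance over sets of tokens. The choice of Jaccard distance is an assumption for the example; the text does not name a particular similarity function, and `same_group` is a hypothetical helper.

```python
from typing import Callable, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def same_group(a: Set[str], b: Set[str], threshold: float,
               distance: Callable[[Set[str], Set[str]], float] = jaccard_distance
               ) -> bool:
    """Two data sets fall in the same group when distance <= threshold."""
    return distance(a, b) <= threshold


first = {"acme", "paris", "bakery"}
second = {"acme", "paris", "shop"}

# The two sets share 2 of 4 distinct tokens, so their distance is 0.5.
print(jaccard_distance(first, second))
print(same_group(first, second, threshold=0.4))  # not grouped
print(same_group(first, second, threshold=0.5))  # grouped after raising the threshold
```

Keeping the same distance and only raising the threshold, as in the last two calls, corresponds to the option of changing the threshold without changing the similarity function.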
- the processing unit UNT interrupts the loop and proceeds to step S8 even though some data sets are found without an assigned label.
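The bounded retry loop described in the preceding paragraphs can be sketched as follows. Everything here is a simplification under stated assumptions: each `labeler` callable stands in for steps S3 and S4 run with one similarity function, `verify` stands in for the database search and check of steps S5 to S7, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set


def enrich_with_retries(items: List[str],
                        labelers: List[Callable[[str], str]],
                        verify: Callable[[str, str], bool],
                        max_iterations: int) -> Dict[str, Optional[str]]:
    """Label items, retrying rejected ones with the next labeler, for at
    most max_iterations rounds. Items still unlabeled keep None."""
    labels: Dict[str, Optional[str]] = {item: None for item in items}
    rejected: Dict[str, Set[str]] = {item: set() for item in items}
    pending = list(items)
    for i in range(max_iterations):
        if not pending:
            break
        # Select a new labeler per round; reuse the last one if exhausted.
        labeler = labelers[min(i, len(labelers) - 1)]
        still_pending = []
        for item in pending:
            label = labeler(item)
            # A label already removed once is never assigned again.
            if label not in rejected[item] and verify(item, label):
                labels[item] = label
            else:
                rejected[item].add(label)
                still_pending.append(item)
        pending = still_pending
    return labels


# Hypothetical stand-ins for the grouping and for the database check.
known = {"acme shop": {"shop"}, "bob cafe": {"bakery"}}
labelers = [lambda s: "bakery", lambda s: s.split()[-1]]
verify = lambda item, label: label in known[item]

print(enrich_with_retries(list(known), labelers, verify, max_iterations=2))
print(enrich_with_retries(list(known), labelers, verify, max_iterations=1))
```

With `max_iterations=1` the loop stops after a single round and one item is left without a label, which mirrors the case where the processing unit proceeds to step S8 with some data sets unlabeled.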
- Step S8 is implemented at the end of step S7 if it is determined that the label attributed to a data set during its enrichment is correct in view of the search carried out on the at least one database. It is then considered that this data set has been correctly enriched.
- Step S8 can also be implemented if the predetermined maximum number of iterations N of the method has been reached.
- step S8 can also be implemented in the case where, for an enriched data set, the combination of at least part of the metadata and of the label of this enriched data set has only been found in a single database at the end of step S5.
- the fundamental data of the enriched data set is replaced if necessary by the fundamental data of the corresponding data set.
- the fundamental data of the corresponding data set is distinct from that of the enriched data set.
- this fundamental data present in the corresponding dataset may correspond to the processed data.
- the enriched data set at the output of the system comprises at least one of the original fundamental data, the processed data, or the fundamental data found in the corresponding data set.
- the processing unit UNT compares the fundamental datum D 2 of the second enriched data set DAT2 * with the fundamental data of the corresponding data set stored in the database DB1. If the fundamental data of the corresponding data set is distinct from the fundamental data of the second enriched data set, the latter is replaced in the second enriched data set by the fundamental data of the corresponding data set.
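The replacement of the fundamental data can be illustrated with a short sketch. The dictionary layout and the key name `fundamental` are assumptions made for the example, not a format defined in the text.

```python
from typing import Dict


def reconcile(enriched: Dict[str, str], corresponding: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the enriched data set whose fundamental data is
    replaced by that of the corresponding data set when the two differ."""
    result = dict(enriched)
    if corresponding["fundamental"] != enriched["fundamental"]:
        result["fundamental"] = corresponding["fundamental"]
    return result


# Hypothetical second enriched data set and its match in database DB1.
dat2 = {"fundamental": "ACME S.A", "label": "bakery"}
db1_match = {"fundamental": "ACME SA", "label": "bakery"}

print(reconcile(dat2, db1_match))
```

The function returns a copy so that the data set as received remains available, for example if the original fundamental data must be kept alongside the replacement.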
- each set of data has benefited from at most N iterations of steps S3 and following to be assigned a consistent label in view of the search carried out on one or more databases of the system.
- enriched data sets which, at the end of step S8, have retained their label because of the positive result of the search in the database or databases DB1, DB2, either because the combination of at least part of the metadata and the label was found in a single database, or because this combination was found in several databases and the label finally appeared correct in the light of the search, may also be supplemented by new metadata from the databases.
- the test consists in determining whether, for a data set, the combination of at least part of the metadata and of the generated label is included in at least one data set, called the corresponding data set, of at least one database. Such corresponding sets can of course include other data in addition to the desired combination.
- This additional metadata can then be retrieved by the UNT processing unit to advantageously complement the enriched data sets.
- the enriched datasets DAT1 *, DAT2 *, DAT3 * do not include additional metadata compared to the datasets DAT1, DAT2, DAT3 received by the system.
- the enriched data can comprise additional metadata originating from databases DB1, DB2.
- the enriched data sets can be enriched again so as to keep, for the sake of traceability, a history of the enrichment of the data and of the search within the databases.
- an enriched data set can be completed by a piece of data representative of the similarity function used to implement the grouping in step S2.
- an enriched data set can also or alternatively be supplemented by data representative of the database within which the most relevant corresponding set has been found.
- data representative of the database within which the most relevant corresponding set has been found may be representative of at least part of the databases within which these corresponding data sets are stored.
- a set of enriched data at the output of the SYS system can include, in addition to the label and possibly the data processed with or in place of the original fundamental data, data making it possible to characterize the different steps of the process that led to generation and verification of enriched data sets.
- This additional enrichment of a set of data typically comprises data representative of the similarity function used and / or one or more data representative of databases in which corresponding sets are stored.
- the enriched data set in question is again enriched by data representative of the similarity function and / or at least one database within which the combination of at least part of the metadata and the label of this enriched data set has been found.
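The traceability enrichment can be sketched as attaching a small provenance record to the enriched data set. The key names `provenance`, `similarity_function` and `databases`, as well as the sample values, are hypothetical.

```python
from typing import Any, Dict, Iterable


def add_provenance(enriched: Dict[str, Any],
                   similarity_name: str,
                   databases: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of the enriched data set completed with data
    representative of the similarity function used and of the databases
    in which a corresponding data set was found."""
    result = dict(enriched)
    result["provenance"] = {
        "similarity_function": similarity_name,
        "databases": list(databases),
    }
    return result


# Hypothetical third enriched data set, found in both databases.
dat3 = {"fundamental": "ACME SA", "label": "bakery"}
traced = add_provenance(dat3, "jaccard", ["DB1", "DB2"])
print(traced["provenance"])
```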
- In step S9, optionally implemented at the end of step S8, the metadata of the enriched data sets are used to verify the assigned label. Such a check can also make it possible to correct the fundamental data if necessary.
- the fundamental data relates to an individual or an entity.
- the metadata comprises at least contact data of the individual or of the entity.
- the enriched data set is transmitted, using contact data, for verification of the aggregated label.
- An entity can designate here a business, a company, an organization or an establishment.
- contact data may already be present in the data set received and then enriched, but may also be retrieved from one of the databases DB1, DB2 if the search result is satisfactory.
- contact data is sought in the corresponding set or sets within the database or databases.
- the enriched data sets are processed before transmission in order to keep either the fundamental data as received by the communication module COM or the data processed or the fundamental data recovered from a database.
- the fundamental data item D 1 received is kept alone.
- the processed data item D 2 'generated is kept alone.
- the processed data item D 3 'generated is kept alone.
- the contact data can be for example a postal address, a telephone number and / or an e-mail address.
- the metadata of the first enriched data set comprises contact data relating to an electronic address ADD1.
- the metadata of the second enriched data set comprises contact data relating to a telephone number ADD2, while the metadata of the third enriched data set comprises contact data relating to a postal address ADD3.
- an enriched data set may be, for verification purposes, obviously transmitted to the individual or entity being the subject of these data but may also be sent to the source of the dataset.
- the generation of a data set and its transmission to the SYS system may have been triggered by a user's terminal, for example during a payment. More precisely, this data is generated from a user account of the user on the payment application. These data do not relate to the user in question but to the shop, business or company.
- the enriched data set can therefore of course be transmitted for verification to the shop, business or company via contact data included in the metadata, but can also, still for verification, be sent to the user account at the origin of the generation of the data set as received by the SYS system, and more particularly by the communication module COM.
- the enriched data sets are then transmitted to the addresses provided by the contact data, for example via the communication module COM, for verification of the label, and possibly of the fundamental or processed data and of the data of the enriched data set transmitted.
- the processing unit UNT is for example provided with technologies to automatically send an email or use a call bot to automatically call the retrieved phone number.
- this erroneous data can be corrected and then sent back to the SYS system.
- this application also allows the user to receive the enriched data set at the output of the system and to access, at least in part, certain data in the enriched data set for verification purposes. If a data item, for example the fundamental data, the processed data or a metadata item, is erroneous, the user can correct it and then send this correction to the SYS system.
- the SYS system can then re-implement certain steps of the method described above. For example, the SYS system can perform the grouping or clustering again on multiple corrected enriched data sets, or search one or more databases again.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FR1903406A FR3094508A1 (fr) | 2019-03-29 | 2019-03-29 | Système et procédé d’enrichissement de données |
PCT/FR2020/050609 WO2020201662A1 (fr) | 2019-03-29 | 2020-03-20 | Systeme et procede d'enrichissement de donnees |
Publications (1)
Publication Number | Publication Date |
---|---|
EP3948579A1 (fr) | 2022-02-09 |
Family
ID=67956931
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP20731903.9A Pending EP3948579A1 (fr) | 2019-03-29 | 2020-03-20 | Systeme et procede d'enrichissement de donnees |
Country Status (5)
Country | Link |
---|---|
US (1) | US20220171749A1 (fr) |
EP (1) | EP3948579A1 (fr) |
CN (1) | CN113826091A (fr) |
FR (1) | FR3094508A1 (fr) |
WO (1) | WO2020201662A1 (fr) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11841891B2 (en) * | 2022-04-29 | 2023-12-12 | Content Square SAS | Mapping webpages to page groups |
CN114817229B (zh) * | 2022-06-21 | 2022-09-20 | 布比(北京)网络技术有限公司 | 基于区块链的清分数据处理的方法和区块链系统 |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AUPR033800A0 (en) * | 2000-09-25 | 2000-10-19 | Telstra R & D Management Pty Ltd | A document categorisation system |
GB2395807A (en) * | 2002-11-27 | 2004-06-02 | Sony Uk Ltd | Information retrieval |
US8504456B2 (en) * | 2009-12-01 | 2013-08-06 | Bank Of America Corporation | Behavioral baseline scoring and risk scoring |
US8983954B2 (en) * | 2012-04-10 | 2015-03-17 | Microsoft Technology Licensing, Llc | Finding data in connected corpuses using examples |
US9218546B2 (en) * | 2012-06-01 | 2015-12-22 | Google Inc. | Choosing image labels |
US20140006275A1 (en) * | 2012-06-28 | 2014-01-02 | Bank Of America Corporation | Electronic identification and notification of banking record discrepancies |
CA2892891C (fr) * | 2014-05-27 | 2022-09-06 | The Toronto-Dominion Bank | Systemes et methodes d'alertes de fraude transmises aux marchands |
US9740979B2 (en) * | 2015-12-06 | 2017-08-22 | Xeeva, Inc. | Model stacks for automatically classifying data records imported from big data and/or other sources, associated systems, and/or methods |
CN107133226B (zh) * | 2016-02-26 | 2021-12-07 | 阿里巴巴集团控股有限公司 | 一种区分主题的方法及装置 |
US20180011919A1 (en) * | 2016-07-05 | 2018-01-11 | Kira Inc. | Systems and method for clustering electronic documents |
US20220035862A1 (en) * | 2018-12-19 | 2022-02-03 | jSonar Inc. | Context enriched data for machine learning model |
US11625723B2 (en) * | 2020-05-28 | 2023-04-11 | Paypal, Inc. | Risk assessment through device data using machine learning-based network |
-
2019
- 2019-03-29 FR FR1903406A patent/FR3094508A1/fr not_active Withdrawn
-
2020
- 2020-03-20 WO PCT/FR2020/050609 patent/WO2020201662A1/fr unknown
- 2020-03-20 CN CN202080035793.4A patent/CN113826091A/zh active Pending
- 2020-03-20 EP EP20731903.9A patent/EP3948579A1/fr active Pending
- 2020-03-20 US US17/599,113 patent/US20220171749A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
FR3094508A1 (fr) | 2020-10-02 |
US20220171749A1 (en) | 2022-06-02 |
WO2020201662A1 (fr) | 2020-10-08 |
CN113826091A (zh) | 2021-12-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3443678B1 (fr) | Methode de décodage d'un code polaire avec inversion de bits peu fiables | |
CN108959370B (zh) | 一种基于知识图谱中实体相似度的社区发现方法及装置 | |
US10073876B2 (en) | Bloom filter index for device discovery | |
EP0995272B1 (fr) | Decodage iteratif de codes produits | |
EP3948579A1 (fr) | Systeme et procede d'enrichissement de donnees | |
US20190065518A1 (en) | Context aware delta algorithm for genomic files | |
US11366641B2 (en) | Generating microservices for monolithic system using a design diagram | |
EP3671578A1 (fr) | Procédé d'analyse d'une simulation de l'exécution d'un circuit quantique | |
FR3009462B1 (fr) | Procede ameliore de decodage d'un code correcteur avec passage de message, en particulier pour le decodage de codes ldpc ou codes turbo | |
EP3806548A1 (fr) | Procédé d'optimisation de la quantité de ressources réseau et du nombre de services susceptibles d'utiliser lesdites ressources | |
WO2018067388A1 (fr) | Réparation de données par connaissance de domaine | |
US11789810B2 (en) | Method and system for detecting data corruption | |
EP3970025A1 (fr) | Gestion de données d'événement réseau dans un réseau de télécommunications | |
US8788500B2 (en) | Electronic mail duplicate detection | |
CN114661793A (zh) | 模糊查询方法、装置、电子设备及存储介质 | |
FR2871631A1 (fr) | Procede de decodage iteractif de codes blocs et dispositif decodeur correspondant | |
FR2884661A1 (fr) | Procede et dispositif de decodage d'un code a longueur variable prenant en compte une information de probabilite a priori | |
EP3671577A1 (fr) | Procédé d'analyse d'une simulation de l'exécution d'un circuit quantique | |
EP3869368A1 (fr) | Procede et dispositif de detection d'anomalie | |
WO2018104557A1 (fr) | Procédé d'émission d'un message, procédé de réception, dispositif d'émission, dispositif de réception et système de communication associés | |
CN117093880B (zh) | 一种基于医疗集成平台的单点登录用户管理方法及系统 | |
FR3047580B1 (fr) | Index de table de base de donnees | |
US20210303797A1 (en) | Semantic correction of messages | |
US20230315883A1 (en) | Method to privately determine data intersection | |
EP4117222A1 (fr) | Procédés de comparaison de bases de données biométriques confidentielles |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: UNKNOWN |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
|
17P | Request for examination filed |
Effective date: 20210915 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
DAV | Request for validation of the european patent (deleted) | ||
DAX | Request for extension of the european patent (deleted) | ||
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: EXAMINATION IS IN PROGRESS |
|
17Q | First examination report despatched |
Effective date: 20230818 |
|
RAP3 | Party data changed (applicant data changed or rights of an application transferred) |
Owner name: ORANGE |