CN114969041B - Multi-source main and auxiliary entity identity discrimination and data self-supplementing processing method

Publication number
CN114969041B
Authority
CN
China
Prior art keywords
entity
information
data
database
item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210592302.7A
Other languages
Chinese (zh)
Other versions
CN114969041A (en
Inventor
吴峰
张朝宗
李银生
王红
聂永川
任雁
毋鹏杰
杨扬
刘淼
张义倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Academy Of Science And Technology Information Hebei Academy Of Science And Technology Innovation Strategy
Original Assignee
Hebei Academy Of Science And Technology Information Hebei Academy Of Science And Technology Innovation Strategy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei Academy Of Science And Technology Information Hebei Academy Of Science And Technology Innovation Strategy
Priority to CN202210592302.7A
Publication of CN114969041A
Priority to ZA2022/11776A
Application granted
Publication of CN114969041B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models

Abstract

The invention discloses a processing method for multi-source main and auxiliary entity identity discrimination and data self-supplementing, applied in the field of big data processing. The method strips entities out of multi-source data according to main entity and auxiliary entity, identifies the same entity according to the same scene, entity attribute classification, weights and the like, and processes and stores the results separately according to the identification probability. By calculating the identity probability of main and auxiliary entities, merging the indexes of the same entity, extracting and storing entity directory items, and separating entity sub-directory items, the invention systematically addresses the collection of main and auxiliary entities according to identity probability, the merging and supplementing of data, the unified storage of entity relations, and the separation of entities on demand, and provides a feasible solution for multi-source, large-scale data association operations.

Description

Multi-source main and auxiliary entity identity discrimination and data self-supplementing processing method
Technical Field
The invention relates to the technical field of big data application, in particular to a multi-source main and auxiliary entity identity screening and data self-supplementing processing method.
Background
Existing methods for identifying, extracting and storing entities from multi-source data generally classify the data by source or type and match and identify entities one by one according to the entity attributes of the data. Because they lack discrimination mechanisms such as entity title items, same scene, entity attribute classification and weights, they suffer from data redundancy, non-uniform expression, low matching accuracy, low execution efficiency and loss of information during identification, which manifest mainly in the following aspects:
1) The data is redundant and cannot be expressed uniformly. In the prior art, entities in heterogeneous data are usually collected by source or type. Because the indexes that represent entities vary from one data set to another, the collected entity indexes are often inconsistent, so unified storage, standard expression and external services cannot be provided.
2) Entity matching accuracy is low. Existing entity identification techniques generally match and identify entities according to the entity attributes of the data. Constrained by the variety of entity attributes and the huge data volume, they commonly suffer from a low matching degree and low precision.
3) Entity identification is inefficient. In the prior art, entities are generally judged attribute by attribute in a fixed order. Because the attributes are not classified or weighted, entity identification often takes a long time to compute and is affected by contradictions in the attribute order.
4) Entities are relatively static and data quality cannot be improved. In the prior art, entities are generally separated directly when they are identified and extracted, which limits attribute expansion. Data is rarely or never mutually corrected, supplemented and expanded according to the implicit attributes among the data, so the data cannot perfect itself and its quality cannot be effectively guaranteed.
5) Information from the identification process is lost. In the prior art, usually only the attribute information of entities successfully identified as the same entity is recorded. High-probability events in the identification process, such as two entities that are very probably the same entity but cannot be fully confirmed as such, are rarely recorded, which hinders deep mining and analysis of data relationships.
Disclosure of Invention
The invention provides a processing method for multi-source main entity identity discrimination and data self-supplementing, which is used for solving the problems of multi-source multi-period data main entity identity discrimination, data automatic merging and supplementing and the like and providing a feasible solution for multi-source and large-scale data association operation.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows.
A processing method for identifying identity of a multi-source main entity and attaching entity and self-supplementing data specifically comprises the following steps:
A1. extracting a main entity title item MEFS and an auxiliary entity title item SEFS from the entity title item database EFDB of a source A, extracting the application scene ES between a main entity M (M) and an auxiliary entity S (M) from the entity application scene database ESDB of the source A, and extracting the related entity information from the entity static database RSDB; using a single-source same-entity screening and data supplementing device to extract, according to the main entity and same-scene information, the information characterizing the single-source same entity, storing that information into the same entity database SEDB, and supplementing the data;
A2. extracting the related entity information from the entity static database RSDB, extracting the subordinate entity title item SEFS from the entity title item database EFDB of a source B, extracting the application scene ES between a main entity M (M) and a subordinate entity S (M) from the entity application scene database ESDB of the source B, extracting dynamic library entity data information from the entity dynamic database RVDB, and extracting same-entity data information from the same entity database SEDB; using a heterogeneous same-entity discriminator to judge the identity of heterogeneous entities according to rules, extracting the information characterizing the heterogeneous same entity, passing that information to a heterogeneous entity data supplementer, and storing it into the entity dynamic database RVDB;
A3. extracting dynamic library entity data information from the entity dynamic database RVDB, extracting same-entity data information from the same entity database SEDB, and receiving the heterogeneous same-entity information from the heterogeneous same-entity discriminator; using the heterogeneous entity data supplementer to supplement the heterogeneous entity information according to the time-nearest principle, and storing the supplemented heterogeneous entity information into the entity dynamic database RVDB;
A4. extracting same-entity data information from the same entity database SEDB and dynamic library entity data information from the entity dynamic database RVDB; using an entity directory item automatic extraction generator to extract entity directory ELS information according to the entity directory essential items ELES, and storing the entity directory ELS information into the entity directory database EDDB;
A5. extracting dynamic library entity data information from the entity dynamic database RVDB and entity directory information from the entity directory database EDDB; using a sub-entity automatic separator to separate sub-entity information from the entity directory database EDDB according to rules to form sub-entity directory information, and storing the sub-entity directory information into the entity directory database EDDB.
In the processing method for multi-source main and auxiliary entity identity discrimination and data self-supplementing, the working method of the single-source same-entity screening and data supplementing device in step A1 comprises the following steps:
A11. reading a single-source multi-library data set DSB from an entity static library database RSDB of a source A;
A12. reading the number N1 of libraries to be ingested from the entity title item database EFDB of the source A, and setting n1=1;
A13. reading the main entity title item MEFS of library n1, obtaining the data set DSA of the main entity title item MEFS and the number I1 of records in the data set DSA, and setting i1=1;
A14. reading the i1-th record in the data set DSA and matching the unique item K of that record against the data in the data set DSB; if the matching succeeds, executing step A15, and if it fails, executing step A19;
A15. extracting the related information representing the identity of the single-source entity of the main entity m1 corresponding to the record i1, and writing the related information into the same entity database SEDB;
A16. reading a related information data set DSC of the main entity m1 characterizing the same entity in a source A from the same entity database SEDB;
A17. reading an auxiliary entity information set DSS corresponding to a main entity m1 from an entity application scene database ESDB, and judging whether a specific auxiliary entity s exists in the same entity or not by utilizing a same scene SS rule; if the same entity exists, executing step A18, otherwise, executing step A19;
A18. extracting the same entity related information of a specific subordinate entity s, and writing the same entity related information into the same entity database SEDB;
A19. judging whether I1 > i1 holds; if so, setting i1=i1+1 and jumping to step A14; otherwise, jumping to step A110;
A110. judging whether N1 > n1 holds; if so, setting n1=n1+1 and jumping to step A13; otherwise, ending.
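The loop structure of steps A11 to A110 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the in-memory lists and dictionaries stand in for RSDB, EFDB, ESDB and SEDB, the record field names `K` and `id` are hypothetical, and the same-scene rule is passed in as a callable because the text does not fix its exact form.

```python
from typing import Callable, Dict, List, Set, Tuple

Record = Dict[str, str]


def screen_single_source(
    libraries: List[List[Record]],           # one DSA (main entity title item records) per library
    static_keys: Set[str],                   # unique-item values found in DSB
    subordinates: Dict[str, List[Record]],   # main entity key -> subordinate records (DSS)
    same_scene: Callable[[Record, Record], bool],  # assumed form of the same-scene SS rule
) -> List[Tuple[str, str]]:
    """Return the (kind, identifier) rows that would be written to SEDB."""
    sedb: List[Tuple[str, str]] = []
    for records in libraries:                # outer loop over n1 (A12, A110)
        for record in records:               # inner loop over i1 (A13, A19)
            key = record.get("K")
            if key is None or key not in static_keys:
                continue                     # A14 failed: go to the next record
            sedb.append(("main", key))       # A15
            subs = subordinates.get(key, [])  # A17
            for pos, sub in enumerate(subs):
                # A subordinate that shares a scene with an earlier one is the same entity.
                if any(same_scene(sub, earlier) for earlier in subs[:pos]):
                    sedb.append(("subordinate", sub["id"]))  # A18
    return sedb
```

Replacing the explicit counters n1 and i1 with `for` loops keeps the termination conditions of A19 and A110 implicit, which avoids the off-by-one risk of hand-maintained indexes.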
The specific method for distinguishing the identity of the heterologous entity by the heterologous identical entity discriminator in the step A2 comprises the following steps of:
A21. reading, by entity type, the number N2 of subordinate entity types not yet ingested from the entity title item database EFDB of the source B, and setting n2=1;
A22. reading the related information of the specific subordinate entity type n2, and simultaneously obtaining a warehouse-in threshold TH of the subordinate entity type n2 set by a system;
A23. judging whether the corresponding entity dynamic database RVDB exists according to the affiliated entity type n2, if so, executing the step A24, and if not, jumping to the step A214 for execution;
A24. reading a related information data set DSF representing the same subordinate entity type n2 from the same entity database SEDB according to the subordinate entity type n 2;
A25. reading a dynamic library information data set DSD from an entity dynamic library RVDB;
A26. reading the set DSG of subordinate entity title item records of type n2 from the entity title item database EFDB of the source B, obtaining the record number M2, and setting m2=1;
A27. reading the m2-th subordinate entity title item record from the set DSG;
A28. reading the specific application scene es between the subordinate entity corresponding to the record m2 and its main entity from the entity application scene database ESDB of the source B, according to the subordinate entity type n2 and the record m2;
A29. reading the specific static database data set DSE corresponding to the record m2 from the entity static database RSDB of the source B, according to the subordinate entity type n2 and the record m2;
A210. taking the set DSF from step A24, the set DSD from step A25, the record m2 from step A27, the application scene es from step A28 and the set DSE from step A29, matching within the set DSD according to a set rule on the basis of the unique item, invariant item and common item attributes of the subordinate entity title item record m2 together with the application scene es and the sets DSD, DSE and DSF, and calculating the similarity probability P (A) between entities;
A211. judging whether P (A) > TH holds; if not, jumping to step A213; if so, writing P (A) and the information characterizing the entity into the same entity database SEDB;
A212. judging whether P (A) = 100% holds; if not, jumping to step A213; if so, passing the record m2, the corresponding record item d of the set DSD, the corresponding record item e of the set DSE and the corresponding record item f of the set DSF to the heterogeneous entity data supplementer and starting it;
A213. judging whether M2 > m2 holds; if so, setting m2=m2+1 and jumping to step A26; otherwise, executing step A214;
A214. judging whether N2 > n2 holds; if so, setting n2=n2+1 and jumping to step A22; otherwise, ending.
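The two-stage decision in steps A211 and A212 can be illustrated with a short sketch. It assumes P (A) has already been computed and is expressed as a float in [0, 1], so that 100% is the value 1.0; the function name and the tuple layout are hypothetical and only stand in for writes to SEDB and calls to the heterogeneous entity data supplementer.

```python
from typing import List, Tuple


def route_by_probability(
    candidates: List[Tuple[str, float]],  # (record id, P(A) as a float in [0, 1])
    threshold: float,                     # warehousing threshold TH for this entity type
) -> Tuple[List[Tuple[str, float]], List[str]]:
    """Split candidates into rows stored in SEDB and records sent on for supplementation."""
    stored: List[Tuple[str, float]] = []
    to_supplement: List[str] = []
    for record_id, p in candidates:
        if p > threshold:                 # A211: record the probability in SEDB
            stored.append((record_id, p))
            if p == 1.0:                  # A212: only certain matches are merged
                to_supplement.append(record_id)
    return stored, to_supplement
```

Keeping the rows with TH < P (A) < 100% in SEDB, without merging them, is what preserves the high-probability matches that the Background describes as lost in the prior art.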
The specific method for supplementing the information of the heterologous entity in the step A3 is as follows:
A31. receiving record m2, a specific record item d corresponding to the set DSD, a specific record item e corresponding to the set DSE and specific record item f information corresponding to the set DSF;
A32. aiming at the unique item, the unchanged item and the common item attribute of a specific title item, obtaining the attribute number N3, and setting n3=1;
A33. obtaining the attribute name of the n3-th attribute;
A34. reading the corresponding data dn of the record item d according to the attribute name, and simultaneously sequentially reading the corresponding data of the record m2, the record item e and the record item f, and comparing with the dn;
A35. judging whether dn is empty or not, if so, jumping to the step A36 for execution, and if not, jumping to the step A37 for execution;
A36. supplementing corresponding latest data in record m2, record item e and record item f into dn according to a time nearest principle, and recording a time stamp and source information of the supplementary data;
A37. marking the time stamp and source information of the corresponding attribute data in record m2, record item e and record item f;
A38. forming a temporary record item d', and judging whether N3 > n3 holds; if so, setting n3=n3+1 and jumping to step A33; otherwise, executing step A39;
A39. for other attributes except the unique item, the unchanged item and the common item, corresponding attribute data in the record item m2, the record item e and the record item f are sequentially read and compared with the record item d;
A310. recording the time stamp and the source information to form the latest temporary record item; updating into the entity dynamic database RVDB.
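The time-nearest supplementation of steps A35 and A36 can be sketched as below. The cell layout (value, timestamp, source) and the use of ISO 8601 date strings are assumptions made for illustration; the sketch covers only the filling of empty attributes and omits the marking of non-empty ones in step A37.

```python
from typing import Dict, List, Optional, Tuple

# (value, timestamp, source); ISO 8601 strings sort lexically in time order.
Cell = Tuple[Optional[str], str, str]
Row = Dict[str, Cell]


def supplement(d: Row, others: List[Row], attributes: List[str]) -> Row:
    """Fill empty attributes of record item d from the newest non-empty candidate."""
    result = dict(d)                      # temporary record item d'
    for name in attributes:               # loop over n3 = 1..N3
        current = result.get(name)
        if current is not None and current[0] is not None:
            continue                      # A35: dn is not empty, nothing to fill
        candidates = [
            row[name] for row in others
            if name in row and row[name][0] is not None
        ]
        if candidates:
            # A36: take the most recent value, keeping its timestamp and source.
            result[name] = max(candidates, key=lambda cell: cell[1])
    return result
```

Carrying the timestamp and source alongside each value means a later run can apply the same time-nearest rule without consulting the original libraries again.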
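In the processing method for multi-source main and auxiliary entity identity discrimination and data self-supplementing, the method by which the entity directory item automatic extraction generator extracts entity directory information in step A4 comprises the following steps: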
A41. setting entity types according to a system, obtaining the number N4 of the entity types, and setting n4=1;
A42. reading the entity directory item els and the entity directory essential item eles of the entity n4;
A43. reading the same-entity data set DSH with P (A) = 100% for the entity n4 from the same entity database SEDB;
A44. extracting the related data information of the entity directory item els of the entity n4 from the entity dynamic database according to the set DSH and the latest time principle to form a temporary data set DSI;
A45. filtering the set DSI according to the data non-null principle of the essential items eles of the entity directory of the entity n4 to form a data subset DSJ;
A46. the set DSJ is used as entity directory ELS information of the entity n4 and is written into an entity directory database EDDB;
A47. judging whether N4 > n4 holds; if so, setting n4=n4+1 and jumping to step A42; otherwise, ending.
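Steps A44 to A46 amount to a projection onto the directory items followed by a non-null filter on the essential items. A minimal sketch, assuming the rows for one entity type have already been selected by the time-nearest principle and that attribute names such as `org_name` are purely illustrative:

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def build_directory(rows: List[Record], els: List[str], eles: List[str]) -> List[Record]:
    """Project rows onto the directory items els and drop rows missing an essential item."""
    directory: List[Record] = []
    for row in rows:                                    # rows of the temporary set DSI
        entry = {name: row.get(name) for name in els}   # A44: keep only directory items
        # A45: every essential item must be present and non-empty.
        if all(entry.get(name) not in (None, "") for name in eles):
            directory.append(entry)                     # A46: becomes part of DSJ
    return directory
```

Because the filter runs on the projected entry, an essential item that is not also listed in `els` always fails the check, so `eles` is expected to be a subset of `els`.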
In the processing method for multi-source main and auxiliary entity identity discrimination and data self-supplementing, the method by which the sub-entity automatic separator separates sub-entity information in step A5 comprises the following steps:
A51. starting a sub-entity separation program of the specific entity n5 according to a user instruction;
A52. reading entity separation rules r specified or preset by a user;
A53. reading a directory data set DSO of a specific entity n5 from an entity directory database EDDB, and setting a temporary data set DSP;
A54. obtaining the number I5 of records in the set DSO, and setting i5=1;
A55. reading the i5-th record in the set DSO, reading the corresponding dynamic library entity data information in the entity dynamic database RVDB according to the information of that record, and matching; if the matching succeeds, executing step A56, otherwise executing step A57;
A56. adding the i5-th record to the data set DSP;
A57. judging whether I5 > i5 holds; if so, setting i5=i5+1 and jumping to step A55; otherwise, executing step A58;
A58. writing the data set DSP into the entity directory database EDDB.
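The separation loop of steps A53 to A58 can be sketched as a filter over the directory records. The text does not say what the matching in step A55 compares against, so this sketch assumes that the dynamic-library data of each record is tested against the separation rule r read in step A52; the lookup key `K` and the dictionary standing in for RVDB are likewise hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def separate_sub_entities(
    dso: List[Record],                 # directory records of the chosen entity (set DSO)
    rvdb: Dict[str, Record],           # dynamic-library data, indexed by unique item
    rule: Callable[[Record], bool],    # separation rule r (assumed to test dynamic data)
    key: str = "K",
) -> List[Record]:
    """Return the temporary data set DSP of directory records that satisfy the rule."""
    dsp: List[Record] = []
    for record in dso:                              # loop over i5 = 1..I5
        dynamic = rvdb.get(record.get(key, ""))     # A55: look up the dynamic data
        if dynamic is not None and rule(dynamic):
            dsp.append(record)                      # A56
    return dsp                                      # A58 would write this to EDDB
```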
By adopting the technical scheme, the invention has the following technical progress.
By calculating the identity probability of main and auxiliary entities, merging the indexes of the same entity, extracting and storing entity directory items, and separating entity sub-directory items, the invention systematically addresses the collection of main and auxiliary entities according to identity probability, the merging and supplementing of data, the unified storage of entity relations, and the separation of entities on demand, and provides a feasible solution for multi-source, large-scale data association operations.
Mainly has the following remarkable effects.
1) The data is regular and the expression is uniform. As the identification, extraction and storage are carried out according to the entity directory entries, and the secondary processing and extraction of the data are carried out according to the entity directory entries, compared with the prior art, the index realizes standardization and unification, the data can be stored regularly and uniformly, and the entity expression is more uniform and more flexible to use.
2) The entity matching precision and the execution efficiency are improved. Because the invention classifies the specific attributes of entities, assigns them different weights, and combines information such as the same scene when matching and extracting entities, matching is easier and more precise than in the prior art; fewer attributes need to be computed, so execution is more efficient; and contradictions and inconsistencies among attribute values are effectively alleviated.
3) The data quality is improved. In the process of extracting and storing entity data, the invention realizes self-perfection and correction of the entity data by extracting and identifying the implicit attribute, and compared with the prior art, the invention can automatically compare and correct the data, can automatically supplement and expand the entity attribute, and has richer data and higher quality.
4) The same entity probability is recorded. According to the invention, the data fusion accuracy is improved compared with the prior art by respectively storing and processing according to the probability of the same entity in the identification process; the difficulty of secondary entity identification is reduced; the method is beneficial to the deep mining and data analysis of different scene applications and entity relations.
Drawings
FIG. 1 is a schematic diagram of the structure of the present invention;
FIG. 2 is a flow chart of the present invention;
FIG. 3 is a schematic diagram of a workflow of the single source same entity screening and data supplementation device of the present invention;
FIG. 4 is a schematic diagram of the workflow of the heterogeneous same entity discriminator of the present invention;
FIG. 5 is a schematic workflow diagram of a heterogeneous entity data supplement according to the present invention;
FIG. 6 is a schematic diagram of a workflow of an entity directory entry automatic extraction generator according to the present invention;
FIG. 7 is a schematic diagram of the workflow of the sub-entity automatic separator of the present invention.
Detailed Description
The invention will be described in further detail with reference to the drawings and the detailed description.
A processing method for identifying identity of a multi-source main entity and self-supplementing data is applied to the field of big data processing, and provides a technical scheme for stripping multi-source data entities according to the main entity, identifying the same entity according to the same scene, entity attribute classification, weight and the like, respectively processing and storing the identification probability, and providing feasibility for different scene applications of data, deep mining of entity relations and data analysis.
During actual operation, firstly, information representing the same entity of a single source is extracted; then distinguishing the heterogeneous same entity information, and carrying out data supplementation and expansion; and finally forming entity directory entries and entity sub directory entries.
The invention uses the following databases: 1) the entity static database RSDB (RelativeStaticDatabase), which stores data of multiple libraries originating from the same source (single source); 2) the entity dynamic database RVDB (RelativeVarietyDatabase), which stores the integrated indexes and data of entities from heterogeneous sources; 3) the entity title item database EFDB (EntityFeatureDatabase), which stores the main entity title item MEFS, the subordinate entity title item SEFS and their related data; 4) the entity application scene database ESDB (EntitySenseDatabase), which stores the application scenes ES between the main entity M (M) and the subordinate entity S (M).
The proper nouns used in the invention include:
1) Source S: a set of data sets describing a particular subject, with stability and continuity over a period of time;
2) Library (Data-Set) DS: a set of data generated by a source over a period of time, which may consist of one or more two-dimensional data tables;
3) Table T: a two-dimensional data table in a library;
4) Entity: a research object with relative stability and uniqueness, described by a group of characteristic variables; entities are divided into main entities and subordinate entities according to the attachment relationship among them;
5) Main entity (MainEntity): the research entity described by all or most of the attributes in a source; a source generally has only one main entity, which is written in the format "entity (main entity corresponding to the entity)" and denoted M (M);
6) Subordinate entity (SubsidiaryEntity): an entity in the source that depends on the main entity, typically a part of the main entity or a set of variables describing attributes of the main entity; it is written in the format "entity (main entity to which the entity corresponds)" and denoted S (M);
7) Entity title item EFS (EntityFeatureStructure): a set of indexes that reflects the attributes of an entity;
8) Main entity title item MEFS (MainEntityFeatureStructure): a set of indexes that reflects the attributes of the main entity;
9) Subordinate entity title item SEFS (SubsidiaryEntityFeatureStructure): a set of indexes that reflects the subordinate entity and its association with the main entity, and thus both the attributes of the subordinate entity and the related attributes of the state of the main entity in which the subordinate entity is located;
10) Same scene SS (SameSense): when entities are stripped out, subordinate entities within the same source are in the same scene if their indexes are consistent and their corresponding specific main entities are consistent.
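The same-scene definition reduces to a two-part predicate: consistent indexes and a consistent main entity. A minimal sketch, assuming that "indexes are consistent" means the two subordinate records carry the same set of index names; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class SubordinateRecord:
    main_key: str               # unique item of the main entity the record belongs to
    indexes: FrozenSet[str]     # names of the indexes (attributes) the record carries


def same_scene(a: SubordinateRecord, b: SubordinateRecord) -> bool:
    """Same scene SS: identical index sets and the same specific main entity."""
    return a.indexes == b.indexes and a.main_key == b.main_key
```

Using a frozen set makes the comparison independent of the order in which indexes appear in each library.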
For entity identity discrimination, the attributes of entity title items are divided into unique items, invariant items and common items: the unique item K (Key) is an attribute that characterizes the uniqueness of an entity, such as an identity card number, unified social credit code or organization code; the invariant item UC (UnChange) is an attribute that seldom or never changes, such as the name and sex of a personnel entity or the unit name and address of an institution entity; the common item N (Normal) is any attribute of an entity other than the unique item K and the invariant item UC.
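One way the three attribute classes could feed a similarity probability is sketched below. The source states that the classes carry different weights but gives neither the weights nor the formula, so everything here is an assumption: the integer weights, the rule that a matching unique item settles identity outright, and the weighted agreement ratio over attributes present in both records.

```python
from typing import Dict, Optional

# Hypothetical weights per attribute class (K: unique, UC: invariant, N: common).
DEFAULT_WEIGHTS: Dict[str, int] = {"K": 6, "UC": 3, "N": 1}


def similarity(
    a: Dict[str, Optional[str]],
    b: Dict[str, Optional[str]],
    classes: Dict[str, str],                     # attribute name -> "K", "UC" or "N"
    weights: Optional[Dict[str, int]] = None,
) -> float:
    """Weighted share of agreeing attributes among those present in both records."""
    weights = weights or DEFAULT_WEIGHTS
    compared = 0
    agreed = 0
    for name, cls in classes.items():
        va, vb = a.get(name), b.get(name)
        if va is None or vb is None:
            continue                             # attribute missing on one side
        if cls == "K" and va == vb:
            return 1.0                           # assumed: a unique-item match is decisive
        compared += weights[cls]
        if va == vb:
            agreed += weights[cls]
    return agreed / compared if compared else 0.0
```

With these weights, two records that agree on an invariant item but differ on a common item score 3 / (3 + 1) = 0.75, while a mismatched unique item pulls the score down heavily.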
To serve applications and extract entity directories, the invention uses entity directory items and entity directory essential items: the entity directory item ELS (EntityListStructure) is a limited set of attributes, selected for a particular application, that reflects the basic status of an entity; for example, the basic items of an organization entity can be set as the organization name, unified social credit code, address and the like. The entity directory essential item ELES (EntityListEssentialStructure) is a limited set of attributes, typically name-class attributes, selected for a specific application, that guarantees an entity directory entry is meaningful and whose absence renders a specific entity meaningless; for example, the "organization name" of an "organization" entity or the "name" of a "personnel" entity.
In the invention, heterogeneous data is identified, extracted and processed by entity and then stored in the following two databases: the entity directory database EDDB (EntityDirectoryDatabase), which stores the entity directory information from heterogeneous sources that is provided for external services; and the same entity database SEDB (SameEntityDatabase), which stores information characterizing the same entity.
The invention is implemented by several modules, shown in FIG. 1: a single-source same-entity screening and data supplementing device, a heterogeneous same-entity discriminator, a heterogeneous entity data supplementer, an entity directory item automatic extraction generator, and a sub-entity automatic separator.
A processing method for identifying identity of a multi-source main entity and self-supplementing data is shown in figure 2, and specifically comprises the following steps.
A1. Extracting a main entity title item MEFS and an auxiliary entity title item SEFS from an entity title item database EFDB of a source A, extracting an application scene ES between a main entity M (M) and an auxiliary entity S (M) from an entity application scene database ESDB of the source A, extracting entity information related to an entity static database RSDB, extracting information representing the single source same entity by utilizing a single source same entity screening and data supplementing device according to the information of the main entity, the same scene and the like, storing the information into the same entity database SEDB, and supplementing data.
In this step, the working method of the single-source same entity screening and data supplementing device is shown in fig. 3, and is specifically as follows.
A11. Reading a single-source multi-library data set DSB from an entity static library database RSDB of a source A;
A12. reading the number N1 of libraries to be ingested from the entity title item database EFDB of the source A, and setting n1=1;
A13. reading the main entity title item MEFS of library n1, obtaining the data set DSA of the main entity title item MEFS and the number I1 of records in the data set DSA, and setting i1=1;
A14. reading the i1-th record in the data set DSA and matching the unique item K of that record against the data in the data set DSB; if the matching succeeds, executing step A15, and if it fails, executing step A19;
A15. extracting the related information representing the identity of the single-source entity of the main entity m1 corresponding to the record i1, and writing the related information into the same entity database SEDB;
A16. reading a related information data set DSC of the main entity m1 characterizing the same entity in a source A from the same entity database SEDB;
A17. reading an auxiliary entity information set DSS corresponding to a main entity m1 from an entity application scene database ESDB, and judging whether a specific auxiliary entity s exists in the same entity or not by utilizing a same scene SS rule; if the same entity exists, executing step A18, otherwise, executing step A19;
A18. extracting the same entity related information of a specific subordinate entity s, and writing the same entity related information into the same entity database SEDB;
A19. judging whether the I1> I1 is true, if true, executing i1=i1+1, and jumping to the step A14 for execution; otherwise, jumping to the step A110 for execution;
A110. judging whether N1> N1 is true, if true, executing n1=n1+1, and jumping to the step A13 for executing; otherwise, ending.
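The control flow of steps A11 to A110 can be sketched roughly as two nested loops. This is only an illustrative sketch: the in-memory lists standing in for EFDB, RSDB and SEDB, the callables `auxiliaries_of` and `same_scene`, and the `role` field are all assumptions, since the source does not specify data layouts or the same-scene rule. Step A16 (reading DSC back from SEDB) is omitted for brevity.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def screen_single_source(
    libraries: List[List[Record]],                      # one DSA per library (hypothetical layout)
    dsb: List[Record],                                  # DSB read from RSDB (A11)
    auxiliaries_of: Callable[[Record], List[Record]],   # hypothetical DSS lookup in ESDB
    same_scene: Callable[[Record, Record], bool],       # hypothetical same-scene (SS) rule
    unique_key: str = "K",
) -> List[Record]:
    """Return the records that would be written to SEDB."""
    sedb: List[Record] = []
    dsb_keys = {r[unique_key] for r in dsb if unique_key in r}
    for dsa in libraries:                               # A12/A110: libraries n1 = 1..N1
        for main in dsa:                                # A13/A19: records i1 = 1..I1
            if main.get(unique_key) not in dsb_keys:    # A14: unique-item match failed
                continue
            sedb.append({"role": "main", **main})       # A15
            for aux in auxiliaries_of(main):            # A17
                if same_scene(main, aux):
                    sedb.append({"role": "auxiliary", **aux})  # A18
    return sedb
```

Expressing the counters n1 and i1 as `for` loops removes the explicit jump-back steps A19 and A110 while visiting the same records in the same order.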
A2. Extracting the related entity information from the entity static database RSDB, extracting the auxiliary entity title item SEFS from the entity title item database EFDB of source B, extracting the application scene ES between a main entity M (M) and an auxiliary entity S (M) from the entity application scene database ESDB of source B, extracting the dynamic-library entity data from the entity dynamic database RVDB, and extracting the same-entity data from the same entity database SEDB; judging the identity of heterogeneous-source entities by rule using the heterogeneous same-entity discriminator, extracting the information that characterizes heterogeneous-source same entities, passing it to the heterogeneous entity data supplementer, and storing it into the main entity dynamic database RVDB.
In this step, the procedure by which the heterogeneous same-entity discriminator judges the identity of heterogeneous-source entities is shown in Fig. 4; the specific method is as follows.
A21. Reading, by entity type, the number N2 of auxiliary entity types not yet ingested from the entity title item database EFDB of source B, and setting n2=1;
A22. reading the related information of the specific auxiliary entity type n2, and obtaining the system-set ingestion threshold TH for auxiliary entity type n2;
A23. judging, according to the auxiliary entity type n2, whether a corresponding entity dynamic database RVDB exists; if so, executing step A24, otherwise jumping to step A214;
A24. reading from the same entity database SEDB, according to the auxiliary entity type n2, the data set DSF of information characterizing same entities of type n2;
A25. reading the dynamic-library information data set DSD from the entity dynamic database RVDB;
A26. reading the set DSG of auxiliary entity type n2 from the auxiliary entity title item database EFDB of source B, obtaining its record count M2, and setting m2=1;
A27. reading the m2-th auxiliary entity title item record from the set DSG;
A28. reading, according to the auxiliary entity type n2 and record m2, the specific application scene es between the auxiliary entity corresponding to record m2 and its main entity from the entity application scene database ESDB of source B;
A29. reading, according to the auxiliary entity type n2 and record m2, the specific static-library data set DSE corresponding to record m2 from the entity static database RSDB of source B;
A210. taking the set DSF from step A24, the set DSD from step A25, record m2 from step A27, the application scene es from step A28 and the set DSE from step A29; matching within the set DSD according to the set rules, using the unique-item, invariant-item and common-item attributes of the auxiliary entity title item record m2 together with the application scene es and the sets DSD, DSE and DSF; and calculating the similarity probability P(A) between entities;
In this embodiment, when matching person entities, for the records of two persons: if the identity card numbers are the same, P(A) is 100%; if the name and mobile phone number are the same, P(A) is 100%; if the name and work unit are the same, P(A) is 80%; and so on.
A211. judging whether P(A) > TH holds; if not, jumping to step A213; if so, writing P(A) and the information of the characterizing entity items into the same entity database SEDB;
A212. judging whether P(A) = 100% holds; if not, jumping to step A213; if so, passing record m2, the corresponding record item d of the set DSD, the corresponding record item e of the set DSE and the corresponding record item f of the set DSF to the heterogeneous entity data supplementer and starting it;
A213. judging whether M2 > m2 holds; if so, setting m2=m2+1 and jumping to step A26; otherwise executing step A214;
A214. judging whether N2 > n2 holds; if so, setting n2=n2+1 and jumping to step A22; otherwise ending.
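The person-matching rules of the embodiment and the two decisions in steps A211 and A212 could be sketched as below. The field names (`id_number`, `name`, `mobile`, `unit`) and the score of 0 for pairs that match no rule are assumptions; the source lists only the three rules and says "etc.", so a real rule set would be larger.

```python
from typing import Dict, Optional

Person = Dict[str, Optional[str]]


def _same(a: Person, b: Person, field: str) -> bool:
    """True only when both records carry the field and the values are equal."""
    return a.get(field) is not None and a.get(field) == b.get(field)


def similarity(a: Person, b: Person) -> float:
    """P(A) for two person records, using the three rules from the embodiment."""
    if _same(a, b, "id_number"):
        return 1.0
    if _same(a, b, "name") and _same(a, b, "mobile"):
        return 1.0
    if _same(a, b, "name") and _same(a, b, "unit"):
        return 0.8
    return 0.0  # assumption: the source does not say what unmatched pairs score


def route(p: float, th: float) -> Dict[str, bool]:
    """A211/A212: decide what happens to a candidate pair with probability p."""
    above = p > th                                       # A211
    return {
        "write_to_sedb": above,
        "send_to_supplementer": above and p == 1.0,      # A212 is reached only if A211 holds
    }
```

With a threshold TH of 0.7, a pair scoring 0.8 would be recorded in SEDB with its probability but not passed to the supplementer, which only receives pairs with P(A) = 100%.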
A3. Extracting the dynamic-library entity data from the entity dynamic database RVDB, extracting the same-entity data from the same entity database SEDB, and receiving the heterogeneous-source same-entity information from the heterogeneous same-entity discriminator; supplementing the heterogeneous-source entity information with the heterogeneous entity data supplementer according to principles such as most-recent-time, and storing the supplemented information into the entity dynamic database RVDB.
In this step, the procedure for supplementing the heterogeneous-source entity information is shown in Fig. 5; the specific method is as follows.
A31. Receiving record m2, the corresponding record item d of the set DSD, the corresponding record item e of the set DSE and the corresponding record item f of the set DSF;
A32. for the unique-item, invariant-item and common-item attributes of the specific title item, obtaining the number of attributes N3 and setting n3=1;
A33. obtaining the attribute name of the n3-th attribute;
A34. reading the data dn of record item d for that attribute name, reading in turn the corresponding data of record m2, record item e and record item f, and comparing them with dn;
A35. judging whether dn is empty; if so, jumping to step A36, otherwise jumping to step A37;
A36. supplementing dn with the most recent corresponding data from record m2, record item e and record item f according to the most-recent-time principle, and recording the timestamp and source of the supplemented data;
A37. marking the timestamp and source of the corresponding attribute data in record m2, record item e and record item f;
A38. forming a temporary record item d'; judging whether N3 > n3 holds; if so, jumping to step A33, otherwise executing step A39;
A39. for attributes other than the unique, invariant and common items, reading in turn the corresponding attribute data in record m2, record item e and record item f and comparing them with record item d;
A310. recording the timestamp and source information to form the latest temporary record item, and updating it into the entity dynamic database RVDB.
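The core of steps A33 to A38, filling an empty attribute of record item d from the most recent non-empty value among record m2, record item e and record item f, might look like the sketch below. The `Value` tuple, integer timestamps and the treatment of empty strings as empty are assumptions made for illustration; the marking in step A37 and the remaining attributes in step A39 are not modelled.

```python
from typing import Dict, List, NamedTuple, Optional


class Value(NamedTuple):
    data: Optional[str]
    timestamp: int      # assumption: any comparable time value
    source: str         # where the value came from, kept for provenance (A36)


RecordItem = Dict[str, Value]


def supplement(d: RecordItem, candidates: List[RecordItem], attributes: List[str]) -> RecordItem:
    """Fill empty attributes of d from the most recent non-empty candidate value."""
    result = dict(d)                                    # temporary record item d'
    for name in attributes:                             # A32/A33/A38: loop over attributes
        current = result.get(name)
        if current is not None and current.data:        # A35: dn is not empty, keep it
            continue
        filled = [c[name] for c in candidates if name in c and c[name].data]
        if filled:                                      # A36: the latest value wins
            result[name] = max(filled, key=lambda v: v.timestamp)
    return result
```

Because each `Value` carries its own timestamp and source, the supplemented record keeps the provenance that step A36 asks to be recorded, and the input record item d is left unmodified.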
A4. Extracting the same-entity data from the same entity database SEDB and the dynamic-library entity data from the entity dynamic database RVDB; using the automatic entity directory item extraction generator to extract the entity directory ELS information according to the entity directory essential items ELES, and storing it into the entity directory database EDDB.
In this step, the flow for generating the entity directory information is shown in Fig. 6; the generation method is as follows.
A41. Obtaining the number N4 of entity types from the entity types set by the system, and setting n4=1;
A42. reading the entity directory items els and the entity directory essential items eles of entity n4;
A43. reading from the same entity database SEDB the same-entity data set DSH of entity n4 for which P(A) = 100%;
A44. extracting, according to the set DSH and the most-recent-time principle, the data for the entity directory items els of entity n4 from the entity dynamic database to form a temporary data set DSI;
A45. filtering the set DSI by requiring the entity directory essential items eles of entity n4 to be non-null, forming a data subset DSJ;
A46. writing the set DSJ into the entity directory database EDDB as the entity directory ELS information of entity n4;
A47. judging whether N4 > n4 holds; if so, setting n4=n4+1 and jumping to step A42; otherwise ending.
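Steps A44 to A46 amount to projecting each row onto the directory items els and discarding rows in which any essential item eles is empty. A minimal sketch, assuming rows are plain dictionaries and that both `None` and the empty string count as null:

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def build_directory(dsi: List[Row], els: List[str], eles: List[str]) -> List[Row]:
    """Project rows onto the directory items and drop rows missing an essential item."""
    dsj: List[Row] = []
    for row in dsi:
        # A45: every essential item must be non-null
        if all(row.get(item) not in (None, "") for item in eles):
            # keep only the directory item columns
            dsj.append({item: row.get(item) for item in els})
    return dsj
```

Non-essential directory items may remain empty in DSJ; only the essential items gate whether a row reaches EDDB.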
A5. Extracting the dynamic-library entity data from the entity dynamic database RVDB and the entity directory information from the entity directory database EDDB; using the automatic sub-entity separator to separate sub-entity information from the entity directory database EDDB by rule, forming sub-entity directory information, and storing it into the entity directory database EDDB.
In this step, the method for automatically separating the sub-entity directory information is shown in Fig. 7 and is as follows.
A51. Starting the sub-entity separation program for a specific entity n according to a user instruction;
A52. reading the entity separation rule r specified by the user or preset;
A53. reading the directory data set DSO of the specific entity n from the entity directory database EDDB, and creating a temporary data set DSP;
A54. obtaining the number I5 of records in the set DSO, and setting i5=1;
A55. reading a record n5 in the set DSO, reading the corresponding dynamic-library entity data in the entity dynamic database RVDB according to the information of record n5, and matching; if the match succeeds, executing step A56, otherwise executing step A57;
A56. adding record n5 to the data set DSP;
A57. judging whether I5 > i5 holds; if so, setting i5=i5+1 and jumping to step A55; otherwise executing step A58;
A58. writing the data set DSP into the entity directory database EDDB.
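Steps A53 to A58 can be read as a filter over the directory records. In the sketch below, `rvdb_lookup` and `rule` are hypothetical callables: the source does not define the form of the separation rule r or how "matching" in step A55 is evaluated, so applying r to each directory record and its dynamic-library rows is an assumption.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def separate_sub_entities(
    dso: List[Row],                                 # directory data set DSO from EDDB (A53)
    rvdb_lookup: Callable[[Row], List[Row]],        # hypothetical: RVDB rows for a record (A55)
    rule: Callable[[Row, Row], bool],               # hypothetical form of separation rule r (A52)
) -> List[Row]:
    """Return the temporary data set DSP of records that satisfy the rule."""
    dsp: List[Row] = []
    for record in dso:                              # A54/A57: records i5 = 1..I5
        if any(rule(record, dyn) for dyn in rvdb_lookup(record)):
            dsp.append(record)                      # A56
    return dsp
```

The returned list corresponds to the data set DSP that step A58 writes back to EDDB as the sub-entity directory.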
Applying the invention realizes the following functions.
1) Main and auxiliary entity title items and directory items are proposed. When identifying entities in heterogeneous-source data, the large number of varied data index items are screened and extracted according to the main and auxiliary entity title items, which helps give entities a consistent index representation and store the data uniformly; the data are then processed and extracted a second time according to the entity title items, which supports unified external data services and large-scale data relationship computation.
2) Entity scene matching is proposed. When entity attributes of the data are used for matching and identification, a same-scene identification mechanism based on the entity application scene of the data is introduced, which reduces the difficulty and complexity of entity matching and improves its accuracy.
3) Entity attribute classification and weighting are proposed. According to their characteristics, the attributes of entity title items are divided into unique items, invariant items and common items, each assigned a different weight; the weights are used in entity identification, which reduces its computation time and resolves problems such as contradictions between attributes.
4) Discrimination probabilities are stored and processed separately. During entity identification, both the successfully identified same-entity information and the same-entity probability between entities are recorded, and they are stored and processed separately, which reduces the difficulty of later re-identification and facilitates deep mining and data analysis across application scenes and entity relationships.
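One way the attribute classification and weighting of function 3) might feed a similarity score is sketched below. The weight values and the scoring formula (weighted share of comparable attributes that agree) are purely illustrative assumptions; the source states only that unique, invariant and common items receive different weights.

```python
from typing import Dict, Optional

# Hypothetical weights; the source only says the three classes are weighted differently.
WEIGHTS = {"unique": 1.0, "invariant": 0.5, "common": 0.2}

Entity = Dict[str, Optional[str]]


def weighted_score(a: Entity, b: Entity, classes: Dict[str, str]) -> float:
    """Weighted share of comparable attributes on which a and b agree."""
    matched = 0.0
    total = 0.0
    for attr, cls in classes.items():
        va, vb = a.get(attr), b.get(attr)
        if va is None or vb is None:        # skip attributes missing on either side
            continue
        weight = WEIGHTS[cls]
        total += weight
        if va == vb:
            matched += weight
    return matched / total if total else 0.0
```

Under these assumed weights, two records that agree on a unique and an invariant item but differ on a common item still score high, which is one way a disagreement on a low-weight attribute could be outweighed.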

Claims (5)

1. A processing method for multi-source main and auxiliary entity identity discrimination and data self-supplementing, characterized by comprising the following steps:
A1. extracting a main entity title item MEFS and an auxiliary entity title item SEFS from an entity title item database EFDB of a source A, extracting an application scene es between a main entity m (m) and an auxiliary entity s (m) from an entity application scene database ESDB of the source A, extracting entity information related to an entity static database RSDB, extracting information representing the single source same entity by utilizing a single source same entity screening and data supplementing device according to the main entity and the same scene information, storing the information into the same entity database SEDB, and supplementing data; wherein the entity static database RSDB stores data of multiple libraries from the same source;
the working method of the single-source same entity screening and data supplementing device in the step A1 is as follows:
A11. reading a single-source multi-library data set DSB from an entity static library database RSDB of a source A;
A12. reading the number N1 of the warehouse-in warehouse from the entity title item database EFDB of the source A, and setting n1=1;
A13. reading a main entity title item MEFS of the library n1, obtaining a data set DSA of the main entity title item MEFS, and simultaneously obtaining the number I1 of records of the data set DSA, wherein i1=1;
A14. reading an i1 record in the data set DSA, matching the unique item K of the item data with the data in the data set DSB, if the matching is successful, executing the step A15, and if the matching is unsuccessful, executing the step A19;
A15. extracting the related information representing the identity of the single-source entity of the main entity m1 corresponding to the record i1, and writing the related information into the same entity database SEDB;
A16. reading a related information data set DSC of the main entity m1 characterizing the same entity in a source A from the same entity database SEDB;
A17. reading an auxiliary entity information set DSS corresponding to a main entity m1 from an entity application scene database ESDB, and judging whether a specific auxiliary entity s exists in the same entity or not by utilizing a same scene SS rule; if the same entity exists, executing step A18, otherwise, executing step A19;
A18. extracting the same entity related information of a specific subordinate entity s, and writing the same entity related information into the same entity database SEDB;
A19. judging whether I1 > i1 holds; if so, setting i1=i1+1 and jumping to step A14; otherwise jumping to step A110;
A110. judging whether N1 > n1 holds; if so, setting n1=n1+1 and jumping to step A13; otherwise ending;
A2. extracting the related entity information from the entity static database RSDB, extracting the auxiliary entity title item SEFS from the entity title item database EFDB of source B, extracting the application scene es between a main entity m (m) and an auxiliary entity s (m) from the entity application scene database ESDB of source B, extracting the dynamic-library entity data from the entity dynamic database RVDB, and extracting the same-entity data from the same entity database SEDB; judging the identity of heterogeneous-source entities by rule using the heterogeneous same-entity discriminator, extracting the information that characterizes heterogeneous-source same entities, passing it to the heterogeneous entity data supplementer, and storing it into the main entity dynamic database RVDB; wherein the entity dynamic database RVDB stores the integrated indexes and data of entities from different sources;
A3. extracting the dynamic-library entity data from the entity dynamic database RVDB, extracting the same-entity data from the same entity database SEDB, and receiving the heterogeneous-source same-entity information from the heterogeneous same-entity discriminator; supplementing the heterogeneous-source entity information with the heterogeneous entity data supplementer according to the most-recent-time principle, and storing the supplemented information into the entity dynamic database RVDB;
A4. extracting the same entity data information from the same entity database SEDB, extracting dynamic library entity data information from an entity dynamic database RVDB, automatically extracting and generating entity directory items, extracting entity directory information according to entity directory essential items eles, and storing the entity directory information into the entity directory database EDDB;
A5. extracting dynamic library entity data information from an entity dynamic database RVDB, extracting entity directory information from an entity directory database EDDB, automatically separating the sub-entity information from the entity directory database EDDB according to rules by using a sub-entity automatic separator to form sub-entity directory information, and storing the sub-entity directory information into the entity directory database EDDB.
2. The processing method for multi-source main and auxiliary entity identity discrimination and data self-supplementing according to claim 1, wherein the heterogeneous same-entity discriminator in step A2 judges the identity of heterogeneous-source entities as follows:
A21. reading the number N2 of the unbanked subordinate entity types from the entity title item database EFDB of the source B according to the entity types, and setting n2=1;
A22. reading the related information of the specific subordinate entity type n2, and simultaneously obtaining a warehouse-in threshold TH of the subordinate entity type n2 set by a system;
A23. judging whether the corresponding entity dynamic database RVDB exists according to the affiliated entity type n2, if so, executing the step A24, and if not, jumping to the step A214 for execution;
A24. reading a related information data set DSF representing the same subordinate entity type n2 from the same entity database SEDB according to the subordinate entity type n 2;
A25. reading a dynamic library information data set DSD from an entity dynamic library RVDB;
A26. reading a set DSG of the subordinate entity types n2 from an subordinate entity title database EFDB of the source B to obtain the record number M2, and setting m2=1;
A27. reading the m 2-th record of the subordinate entity title item from the set DSG;
A28. reading a specific application scene es between an auxiliary entity corresponding to the record m2 and a main entity from an entity application scene database ESDB of the source B according to the auxiliary entity type n2 and the record m 2;
A29. reading a specific static database data set DSE corresponding to the record m2 from an entity static database RSDB of the source B according to the affiliated entity type n2 and the record m 2;
A210. obtaining set DSF information from step a24, obtaining set DSD information from step a25, obtaining record m2 information from step a27, obtaining application scene es information from step a28, obtaining set DSE information from step a29, matching in set DSD according to a set rule by using unique item, invariant item and common item attributes of record m2 of the subject item of the subordinate entity, and application scene es, set DSD, set DSE, set DSF information, and calculating similarity probability P (a) between entities;
A211. judging whether P (A) > TH is true, if not, jumping to the step A213 for execution, and if true, writing the information of P (A) and the characterization entity item into the same entity database SEDB;
A212. judging whether P(A) = 100% holds; if not, jumping to step A213; if so, passing record m2, the corresponding record item d of the set DSD, the corresponding record item e of the set DSE and the corresponding record item f of the set DSF to the heterogeneous entity data supplementer and starting it;
A213. judging whether M2 > m2 holds; if so, setting m2=m2+1 and jumping to step A26; otherwise executing step A214;
A214. judging whether N2 > n2 holds; if so, setting n2=n2+1 and jumping to step A22; otherwise ending.
3. The processing method for multi-source main and auxiliary entity identity discrimination and data self-supplementing according to claim 2, wherein the heterogeneous-source entity information in step A3 is supplemented as follows:
A31. receiving record m2, a specific record item d corresponding to the set DSD, a specific record item e corresponding to the set DSE and specific record item f information corresponding to the set DSF;
A32. aiming at the unique item, the unchanged item and the common item attribute of a specific title item, obtaining the attribute number N3, and setting n3=1;
A33. obtaining an attribute name of an nth 3 attribute;
A34. reading the corresponding data dn of the record item d according to the attribute name, and simultaneously sequentially reading the corresponding data of the record m2, the record item e and the record item f, and comparing with the dn;
A35. judging whether dn is empty or not, if so, jumping to the step A36 for execution, and if not, jumping to the step A37 for execution;
A36. supplementing corresponding latest data in record m2, record item e and record item f into dn according to a time nearest principle, and recording a time stamp and source information of the supplementary data;
A37. marking the time stamp and source information of the corresponding attribute data in record m2, record item e and record item f;
A38. forming a temporary record item d'; judging whether N3 > n3 holds; if so, jumping to step A33, otherwise executing step A39;
A39. for other attributes except the unique item, the unchanged item and the common item, corresponding attribute data in the record item m2, the record item e and the record item f are sequentially read and compared with the record item d;
A310. recording the time stamp and the source information to form the latest temporary record item; updating into the entity dynamic database RVDB.
4. The processing method for multi-source main and auxiliary entity identity discrimination and data self-supplementing according to claim 3, wherein the entity directory information in step A4 is generated as follows:
A41. setting entity types according to a system, obtaining the number N4 of the entity types, and setting n4=1;
A42. reading an entity list item els and an entity list essential item eles of the entity n 4;
A43. reading from the same entity database SEDB the same-entity data set DSH of entity n4 for which P(A) = 100%;
A44. extracting the related data information of the entity directory item els of the entity n4 from the entity dynamic database according to the set DSH and the latest time principle to form a temporary data set DSI;
A45. filtering the set DSI according to the data non-null principle of the essential items eles of the entity directory of the entity n4 to form a data subset DSJ;
A46. using the set DSJ as entity directory information of the entity n4, and writing the entity directory information into an entity directory database EDDB;
A47. judging whether N4 > n4 holds; if so, setting n4=n4+1 and jumping to step A42; otherwise ending.
5. The processing method for multi-source main and auxiliary entity identity discrimination and data self-supplementing according to claim 4, wherein the sub-entity directory information in step A5 is automatically separated as follows:
A51. starting a sub-entity separation program of the specific entity n5 according to a user instruction;
A52. reading entity separation rules r specified or preset by a user;
A53. reading a directory data set DSO of a specific entity n5 from an entity directory database EDDB, and setting a temporary data set DSP;
A54. obtaining the number I5 of records in the set DSO, and setting i5=1;
A55. reading a record n5 in the set DSO, reading corresponding dynamic library entity data information in the entity dynamic database RVDB according to the information of the record n5, matching, executing a step A56 if the matching is successful, otherwise, executing a step A57;
A56. adding record n5 into the data set DSP;
A57. judging whether I5 > i5 holds; if so, setting i5=i5+1 and jumping to step A55; otherwise executing step A58;
A58. and writing the data set DSP into an entity directory database EDDB.
CN202210592302.7A 2022-05-27 2022-05-27 Multi-source main and auxiliary entity identity discrimination and data self-supplementing processing method Active CN114969041B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210592302.7A CN114969041B (en) 2022-05-27 2022-05-27 Multi-source main and auxiliary entity identity discrimination and data self-supplementing processing method
ZA2022/11776A ZA202211776B (en) 2022-05-27 2022-10-28 Multisource main-subsidiary entity identity discrimination and data self-supplementation processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210592302.7A CN114969041B (en) 2022-05-27 2022-05-27 Multi-source main and auxiliary entity identity discrimination and data self-supplementing processing method

Publications (2)

Publication Number Publication Date
CN114969041A CN114969041A (en) 2022-08-30
CN114969041B true CN114969041B (en) 2023-06-30

Family

ID=82958053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210592302.7A Active CN114969041B (en) 2022-05-27 2022-05-27 Multi-source main and auxiliary entity identity discrimination and data self-supplementing processing method

Country Status (2)

Country Link
CN (1) CN114969041B (en)
ZA (1) ZA202211776B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112231283A (en) * 2020-09-08 2021-01-15 苏宁金融科技(南京)有限公司 Generation management method and system based on multi-source heterogeneous data unified entity identification code
CN113076306A (en) * 2021-06-07 2021-07-06 航天神舟智慧系统技术有限公司 Data resource automatic collection method and system based on cataloguing rule

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2002367376A1 (en) * 2001-12-28 2003-07-24 Jeffrey James Jonas Real time data warehousing
US7984019B2 (en) * 2007-12-28 2011-07-19 Knowledge Computing Corporation Method and apparatus for loading data files into a data-warehouse system
CN105893526A (en) * 2016-03-30 2016-08-24 上海坤士合生信息科技有限公司 Multi-source data fusion system and method
GB2572541A (en) * 2018-03-27 2019-10-09 Innoplexus Ag System and method for identifying at least one association of entity
CN113342909B (en) * 2021-08-06 2021-11-02 中科雨辰科技有限公司 Data processing system for identifying identical solid models
CN113760996A (en) * 2021-09-09 2021-12-07 上海明略人工智能(集团)有限公司 Data integration method, system, equipment and storage medium

Also Published As

Publication number Publication date
ZA202211776B (en) 2022-12-21
CN114969041A (en) 2022-08-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant