CN114969041B

CN114969041B - Multi-source main and auxiliary entity identity discrimination and data self-supplementing processing method

Info

Publication number: CN114969041B
Application number: CN202210592302.7A
Authority: CN
Inventors: 吴峰; 张朝宗; 李银生; 王红; 聂永川; 任雁; 毋鹏杰; 杨扬; 刘淼; 张义倩
Original assignee: Hebei Academy Of Science And Technology Information Hebei Academy Of Science And Technology Innovation Strategy
Current assignee: Hebei Academy Of Science And Technology Information Hebei Academy Of Science And Technology Innovation Strategy
Priority date: 2022-05-27
Filing date: 2022-05-27
Publication date: 2023-06-30
Anticipated expiration: 2042-05-27
Also published as: ZA202211776B; CN114969041A

Abstract

The invention discloses a processing method for multi-source main and auxiliary entity identity discrimination and data self-supplementation, applied in the field of big data processing. Multi-source data entities are separated into main entities and auxiliary entities; identical entities are identified according to the same scene, entity attribute classification, attribute weights and the like; and the results are processed and stored separately according to the identification probability. By calculating the identity probability of main and auxiliary entities, merging the indexes of identical entities, extracting and storing entity directory items, and separating sub-entity directory items, the invention systematically addresses the separate processing and collection of main and auxiliary entities according to identity probability, the merging and supplementing of data, the unified storage of entity relationships, and the on-demand separation of entities, and provides a feasible solution for multi-source, large-scale data association operations.

Description

Multi-source main and auxiliary entity identity discrimination and data self-supplementing processing method

Technical Field

The invention relates to the technical field of big data application, in particular to a multi-source main and auxiliary entity identity screening and data self-supplementing processing method.

Background

Existing methods for identifying, extracting and storing entities from multi-source data generally classify the data by source or type and then match and identify entities one by one according to their attributes. Because they lack discrimination mechanisms such as entity title items, the same scene, entity attribute classification and attribute weights, they suffer from data redundancy, non-uniform expression, low matching accuracy, low execution efficiency and loss of information during identification. These problems manifest mainly in the following aspects:

1) The data is redundant and cannot be expressed uniformly. In the prior art, entities in heterogeneous data are usually collected according to source or type. Because the indexes used to represent an entity vary widely, the collected entity data indexes are often inconsistent, so unified storage, standard expression and external services cannot be provided.

2) Entity matching accuracy is low. Existing entity identification techniques generally match and identify entities according to the entity attributes of the data. Constrained by the variety of entity attributes and the large data volume, they generally suffer from a low matching degree and low precision.

3) Entity identification executes inefficiently. In the prior art, entities are generally judged attribute by attribute in a fixed order. Because entity attributes lack classification and weight assignment, identification often takes a long time and the attribute sequence can produce contradictions.

4) Entities are relatively static and data quality cannot be improved. In the prior art, identified entities are generally extracted by direct separation, which limits attribute expansion. Data is rarely or never mutually corrected, supplemented and expanded according to the implicit attributes among the data, so the data cannot perfect itself and its quality cannot be effectively guaranteed.

5) Information from the identification process is lost. In the prior art, usually only the attribute information of entities successfully identified as the same entity is recorded. High-probability events in the identification process, such as two entities that are very likely but not certainly the same entity, are rarely recorded, which hinders deep mining and analysis of data relationships.

Disclosure of Invention

The invention provides a processing method for multi-source main and auxiliary entity identity discrimination and data self-supplementation, which addresses problems such as discriminating the identity of main and auxiliary entities in multi-source, multi-period data and automatically merging and supplementing data, and provides a feasible solution for multi-source, large-scale data association operations.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows.

A processing method for identifying identity of a multi-source main entity and attaching entity and self-supplementing data specifically comprises the following steps:

A1. extracting a main entity title item MEFS and an auxiliary entity title item SEFS from an entity title item database EFDB of a source A, extracting an application scene ES between a main entity M (M) and an auxiliary entity S (M) from an entity application scene database ESDB of the source A, extracting entity information related to an entity static library from an entity static database RSDB, extracting information representing the single source same entity according to the main entity and the same scene information by utilizing a single source same entity screening and data supplementing device, storing the information representing the single source same entity into the same entity database SEDB, and supplementing data;

A2. extracting static-library entity information from the entity static database RSDB, extracting the subordinate entity title item SEFS from the entity title item database EFDB of a source B, extracting the application scene ES between the main entity M (M) and the subordinate entity S (M) from the entity application scene database ESDB of the source B, extracting dynamic-library entity data information from the entity dynamic database RVDB, extracting same-entity data information from the same entity database SEDB, judging the identity of heterologous entities according to rules by utilizing a heterologous identical entity discriminator, extracting information representing heterologous identical entities, transmitting the information into a heterologous entity data supplementer, and storing the information into the entity dynamic database RVDB;

A3. extracting dynamic-library entity data information from the entity dynamic database RVDB, extracting same-entity data information from the same entity database SEDB, receiving the heterologous identical entity information from the heterologous identical entity discriminator, supplementing the heterologous entity information according to the time-nearest principle by utilizing the heterologous entity data supplementer, and storing the supplemented heterologous entity information into the entity dynamic database RVDB;

A4. extracting same-entity data information from the same entity database SEDB, extracting dynamic-library entity data information from the entity dynamic database RVDB, extracting entity directory ELS information according to the entity directory essential items ELES by utilizing an entity directory item automatic extraction generator, and storing the entity directory ELS information into an entity directory database EDDB;

A5. extracting dynamic library entity data information from an entity dynamic database RVDB, extracting entity directory information from an entity directory database EDDB, automatically separating the sub-entity information from the entity directory database EDDB according to rules by using a sub-entity automatic separator to form sub-entity directory information, and storing the sub-entity directory information into the entity directory database EDDB.

The specific working method of the single-source same entity screening and data supplementing device in the step A1 is as follows:

A11. reading a single-source multi-library data set DSB from an entity static library database RSDB of a source A;

A12. reading the number N1 of libraries to be warehoused from the entity title item database EFDB of the source A, and setting n1=1;

A13. reading the main entity title item MEFS of library n1, obtaining the data set DSA of the main entity title item MEFS, obtaining the number I1 of records in the data set DSA, and setting i1=1;

A14. reading the i1-th record in the data set DSA, and matching the unique item K of the record against the data in the data set DSB; if the matching is successful, executing step A15, and if the matching is unsuccessful, executing step A19;

A15. extracting the related information representing the identity of the single-source entity of the main entity m1 corresponding to the record i1, and writing the related information into the same entity database SEDB;

A16. reading a related information data set DSC of the main entity m1 characterizing the same entity in a source A from the same entity database SEDB;

A17. reading the auxiliary entity information set DSS corresponding to the main entity m1 from the entity application scene database ESDB, and judging by the same scene SS rule whether a same entity exists for a specific auxiliary entity s; if a same entity exists, executing step A18, otherwise executing step A19;

A18. extracting the same entity related information of a specific subordinate entity s, and writing the same entity related information into the same entity database SEDB;

A19. judging whether I1 > i1 is true; if true, setting i1=i1+1 and jumping to step A14; otherwise, jumping to step A110;

A110. judging whether N1 > n1 is true; if true, setting n1=n1+1 and jumping to step A13; otherwise, ending.
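
The nested loop in steps A11 to A110 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: plain Python lists and dictionaries stand in for RSDB, EFDB, ESDB and SEDB, the field names `key`, `scene` and `name` and the function name `screen_single_source` are hypothetical, and the same scene SS rule is reduced to "subordinate records that share a scene label under the same main entity".

```python
from collections import defaultdict


def screen_single_source(dsb, libraries, scenes):
    """Illustrative sketch of steps A11-A110 (all names are assumptions).

    dsb       : static-library records (stand-in for data set DSB)
    libraries : list of libraries, each a list of main-entity records (DSA)
    scenes    : maps a main-entity key to its subordinate records (DSS)
    Returns the rows that would be written to the same entity database SEDB.
    """
    known_keys = {rec["key"] for rec in dsb}
    sedb = []
    for dsa in libraries:                        # A12 / A110: libraries n1 = 1..N1
        for record in dsa:                       # A13 / A19: records i1 = 1..I1
            if record["key"] not in known_keys:  # A14: match on unique item K
                continue
            sedb.append({"kind": "main", "key": record["key"]})  # A15
            # A17 (assumed rule): group subordinate records by scene label.
            by_scene = defaultdict(list)
            for sub in scenes.get(record["key"], []):
                by_scene[sub["scene"]].append(sub["name"])
            for scene, names in by_scene.items():
                if len(names) > 1:               # A18: a same-entity group exists
                    sedb.append({"kind": "sub", "key": record["key"],
                                 "scene": scene, "names": sorted(names)})
    return sedb
```

Records whose unique item K has no match in DSB are skipped, which corresponds to the jump from A14 directly to A19.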

The specific method for distinguishing the identity of the heterologous entity by the heterologous identical entity discriminator in the step A2 comprises the following steps of:

A21. reading, by entity type, the number N2 of subordinate entity types not yet warehoused from the entity title item database EFDB of the source B, and setting n2=1;

A22. reading the related information of the specific subordinate entity type n2, and obtaining the warehousing threshold TH set by the system for the subordinate entity type n2;

A23. judging whether a corresponding entity dynamic database RVDB exists for the subordinate entity type n2; if so, executing step A24, and if not, jumping to step A214;

A24. reading, according to the subordinate entity type n2, the related information data set DSF characterizing the same entities of the subordinate entity type n2 from the same entity database SEDB;

A25. reading the dynamic library information data set DSD from the entity dynamic database RVDB;

A26. reading the set DSG of the subordinate entity type n2 from the entity title item database EFDB of the source B, obtaining the record number M2, and setting m2=1;

A27. reading the m2-th record of the subordinate entity title item from the set DSG;

A28. reading the specific application scene es between the subordinate entity corresponding to the record m2 and the main entity from the entity application scene database ESDB of the source B, according to the subordinate entity type n2 and the record m2;

A29. reading the specific static database data set DSE corresponding to the record m2 from the entity static database RSDB of the source B, according to the subordinate entity type n2 and the record m2;

A210. obtaining the set DSF from step A24, the set DSD from step A25, the record m2 from step A27, the application scene es from step A28 and the set DSE from step A29; matching in the set DSD according to set rules by using the unique item, invariant item and common item attributes of the subordinate entity title item of the record m2, together with the application scene es and the sets DSD, DSE and DSF; and calculating the similarity probability P(A) between entities;

A211. judging whether P(A) > TH is true; if not, jumping to step A213; if true, writing P(A) and the information of the characterizing entity items into the same entity database SEDB;

A212. judging whether P(A) = 100% is true; if not, jumping to step A213; if true, transmitting the record m2, the specific record item d corresponding to the set DSD, the specific record item e corresponding to the set DSE and the specific record item f corresponding to the set DSF into the heterologous entity data supplementer, and starting its operation;

A213. judging whether M2 > m2 is true; if true, setting m2=m2+1 and jumping to step A26; if not, executing step A214;

A214. judging whether N2 > n2 is true; if true, setting n2=n2+1 and jumping to step A22; if not, ending.
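
The threshold routing in steps A21 to A214 can be sketched as follows. This is an illustration under stated assumptions only: the function name `discriminate`, the dictionary layout of `types` and `dynamic_db`, and the `id` field are hypothetical, the similarity calculation of step A210 is passed in as a callable because the source does not fix its rules here, and each record is compared with every candidate in DSD.

```python
def discriminate(types, similarity, dynamic_db):
    """Illustrative sketch of steps A21-A214 (all names are assumptions).

    types      : list of {"name", "threshold", "records"} per subordinate type
    similarity : callable(record, candidate) -> probability in [0, 1] (A210)
    dynamic_db : maps a type name to its candidate records (stand-in for DSD)
    Returns (sedb_rows, to_supplement).
    """
    sedb_rows, to_supplement = [], []
    for entity_type in types:                       # A21 / A214: types n2 = 1..N2
        candidates = dynamic_db.get(entity_type["name"])
        if candidates is None:                      # A23: no dynamic database
            continue
        for record in entity_type["records"]:       # A26 / A213: records m2 = 1..M2
            for candidate in candidates:
                p = similarity(record, candidate)   # A210: similarity P(A)
                if p <= entity_type["threshold"]:   # A211: requires P(A) > TH
                    continue
                sedb_rows.append((record["id"], candidate["id"], p))
                if p == 1.0:                        # A212: certain match
                    to_supplement.append((record, candidate))
    return sedb_rows, to_supplement
```

Pairs above the threshold but below 100% are kept in `sedb_rows` only, which mirrors the idea of recording probable matches without merging them.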

The specific method for supplementing the information of the heterologous entity in the step A3 is as follows:

A31. receiving record m2, a specific record item d corresponding to the set DSD, a specific record item e corresponding to the set DSE and specific record item f information corresponding to the set DSF;

A32. aiming at the unique item, the unchanged item and the common item attribute of a specific title item, obtaining the attribute number N3, and setting n3=1;

A33. obtaining the attribute name of the n3-th attribute;

A34. reading the corresponding data dn of the record item d according to the attribute name, and simultaneously sequentially reading the corresponding data of the record m2, the record item e and the record item f, and comparing with the dn;

A35. judging whether dn is empty or not, if so, jumping to the step A36 for execution, and if not, jumping to the step A37 for execution;

A36. supplementing corresponding latest data in record m2, record item e and record item f into dn according to a time nearest principle, and recording a time stamp and source information of the supplementary data;

A37. marking the time stamp and source information of the corresponding attribute data in record m2, record item e and record item f;

A38. forming a temporary record item d', and judging whether N3 > n3 is true; if so, setting n3=n3+1 and jumping to step A33, otherwise executing step A39;

A39. for attributes other than the unique items, invariant items and common items, sequentially reading the corresponding attribute data in the record m2, the record item e and the record item f, and comparing them with the record item d;

A310. recording the time stamp and source information to form the latest temporary record item, and updating it into the entity dynamic database RVDB.
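
The time-nearest supplementing of steps A31 to A38 can be sketched as follows. This is a simplified illustration, not the patented implementation: the function name `supplement`, the `source`, `timestamp` and `data` fields, and the use of ISO-8601 date strings (which sort chronologically as plain strings) are assumptions, and the comparison in step A34 is reduced to picking the newest non-empty value.

```python
def supplement(target, sources, attributes):
    """Illustrative sketch of steps A31-A38 (all names are assumptions).

    target     : dict of attribute -> value (stand-in for record item d)
    sources    : list of {"source", "timestamp", "data"} for m2, e and f
    attributes : attribute names to process (unique, invariant, common items)
    Returns (merged_record, provenance).
    """
    merged = dict(target)                       # temporary record item d'
    provenance = {}
    for attr in attributes:                     # A32 / A38: attributes n3 = 1..N3
        candidates = [s for s in sources
                      if s["data"].get(attr) not in (None, "")]
        if not candidates:
            continue
        newest = max(candidates, key=lambda s: s["timestamp"])
        if merged.get(attr) in (None, ""):      # A35 / A36: fill only empty dn
            merged[attr] = newest["data"][attr]
        # A36 / A37: keep the time stamp and source of the newest data.
        provenance[attr] = (newest["source"], newest["timestamp"])
    return merged, provenance
```

Existing values in the target are never overwritten in this sketch; only empty attributes are filled, and the provenance dictionary records where the newest data came from.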

The specific method for generating the entity directory information in the step A4 is as follows:

A41. setting entity types according to a system, obtaining the number N4 of the entity types, and setting n4=1;

A42. reading the entity directory items ELS and the entity directory essential items ELES of the entity n4;

A43. reading the same-entity data set DSH of the entity n4 with P(A) = 100% from the same entity database SEDB;

A44. extracting the related data information of the entity directory item els of the entity n4 from the entity dynamic database according to the set DSH and the latest time principle to form a temporary data set DSI;

A45. filtering the set DSI according to the data non-null principle of the essential items eles of the entity directory of the entity n4 to form a data subset DSJ;

A46. the set DSJ is used as entity directory ELS information of the entity n4 and is written into an entity directory database EDDB;

A47. judging whether N4 > n4 is true; if so, setting n4=n4+1 and jumping to step A42; otherwise, ending.
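
Steps A44 to A46 amount to a projection followed by a non-null filter, which can be sketched as follows. The function name `build_directory` and the example attribute names are hypothetical, and the latest-time selection of step A44 is not modeled: the input records are assumed to be the already selected latest data for entities confirmed with P(A) = 100%.

```python
def build_directory(records, els, eles):
    """Illustrative sketch of steps A44-A46 (all names are assumptions).

    records : entity records already confirmed as the same entity
    els     : names of the entity directory items
    eles    : names of the entity directory essential items
    Returns the rows that would be written to the entity directory database EDDB.
    """
    # A44: keep only the directory items of each record (temporary set DSI).
    dsi = [{name: rec.get(name) for name in els} for rec in records]
    # A45: drop rows in which any essential item is empty (subset DSJ).
    dsj = [row for row in dsi
           if all(row.get(name) not in (None, "") for name in eles)]
    return dsj                                   # A46: written to EDDB
```

A record without an organization name, for example, would be dropped when `eles` contains the organization name attribute.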

The specific method for automatically separating the sub-entity directory information in the step A5 is as follows:

A51. starting a sub-entity separation program of the specific entity n5 according to a user instruction;

A52. reading entity separation rules r specified or preset by a user;

A53. reading a directory data set DSO of a specific entity n5 from an entity directory database EDDB, and setting a temporary data set DSP;

A54. obtaining the number I5 of records in the set DSO, and setting i5=1;

A55. reading the i5-th record in the set DSO, reading the corresponding dynamic library entity data information in the entity dynamic database RVDB according to the information of the record, and matching; executing step A56 if the matching is successful, otherwise executing step A57;

A56. adding the i5-th record into the data set DSP;

A57. judging whether I5 > i5 is true; if so, setting i5=i5+1 and jumping to step A55; if not, executing step A58;

A58. and writing the data set DSP into an entity directory database EDDB.
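
The separation loop of steps A51 to A58 can be sketched as follows. This is an illustration only: the function name `separate_sub_entities`, the `id` field used to look up dynamic-library data, and the representation of the separation rule r as a callable are assumptions.

```python
def separate_sub_entities(dso, dynamic_db, rule):
    """Illustrative sketch of steps A51-A58 (all names are assumptions).

    dso        : directory records of the entity (stand-in for data set DSO)
    dynamic_db : maps a record id to its dynamic-library record (RVDB)
    rule       : callable(directory_record, dynamic_record) -> bool (rule r)
    Returns the data set DSP that would be written back to EDDB.
    """
    dsp = []                                     # A53: temporary data set DSP
    for record in dso:                           # A54 / A57: records 1..I5
        dynamic = dynamic_db.get(record["id"])   # A55: look up dynamic data
        if dynamic is not None and rule(record, dynamic):
            dsp.append(record)                   # A56: matching succeeded
    return dsp                                   # A58: written to EDDB
```

A rule such as "the dynamic record's region equals a given value" would separate a regional sub-directory from the full entity directory.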

By adopting the technical scheme, the invention has the following technical progress.

The invention systematically solves the problems of respectively processing and collecting main and auxiliary entities according to the identity probability, combining and supplementing data, uniformly storing entity relations, separating entities according to the needs and the like by the technical methods of calculating the identity probability of the main entity and the auxiliary entities, combining and merging indexes of the same entity, extracting and storing entity directory items, separating entity sub directory items and the like, and provides a feasible solution for carrying out multi-source and large-scale data association operation.

Mainly has the following remarkable effects.

1) The data is regular and the expression is uniform. As the identification, extraction and storage are carried out according to the entity directory entries, and the secondary processing and extraction of the data are carried out according to the entity directory entries, compared with the prior art, the index realizes standardization and unification, the data can be stored regularly and uniformly, and the entity expression is more uniform and more flexible to use.

2) The entity matching precision and the execution efficiency are improved. Because the invention classifies the specific attributes of entities, assigns them different weights, and combines information such as the same scene when matching and extracting entities, matching is less difficult and more precise than in the prior art; fewer attributes are calculated, so execution is more efficient; and problems such as contradictory or inconsistent attribute values can be effectively alleviated.

3) The data quality is improved. In the process of extracting and storing entity data, the invention realizes self-perfection and correction of the entity data by extracting and identifying the implicit attribute, and compared with the prior art, the invention can automatically compare and correct the data, can automatically supplement and expand the entity attribute, and has richer data and higher quality.

4) The same entity probability is recorded. According to the invention, the data fusion accuracy is improved compared with the prior art by respectively storing and processing according to the probability of the same entity in the identification process; the difficulty of secondary entity identification is reduced; the method is beneficial to the deep mining and data analysis of different scene applications and entity relations.

Drawings

FIG. 1 is a schematic diagram of the structure of the present invention;

FIG. 2 is a flow chart of the present invention;

FIG. 3 is a schematic diagram of a workflow of the single source same entity screening and data supplementation device of the present invention;

FIG. 4 is a schematic diagram of the workflow of the heterologous identical entity discriminator of the present invention;

FIG. 5 is a schematic diagram of the workflow of the heterologous entity data supplementer of the present invention;

FIG. 6 is a schematic diagram of a workflow of an entity directory entry automatic extraction generator according to the present invention;

FIG. 7 is a schematic diagram of the workflow of the sub-entity automatic separator of the present invention.

Detailed Description

The invention will be described in further detail with reference to the drawings and the detailed description.

A processing method for multi-source main and auxiliary entity identity discrimination and data self-supplementation is applied in the field of big data processing. It provides a technical scheme in which multi-source data entities are separated into main and auxiliary entities, identical entities are identified according to the same scene, entity attribute classification, weights and the like, and the results are processed and stored separately according to the identification probability, which makes different scene applications of the data, deep mining of entity relationships, and data analysis feasible.

During actual operation, firstly, information representing the same entity of a single source is extracted; then distinguishing the heterogeneous same entity information, and carrying out data supplementation and expansion; and finally forming entity directory entries and entity sub directory entries.

The invention uses the following databases: 1) the entity static database RSDB (RelativeStaticDatabase), which stores the data of a plurality of libraries originating from the same source (single source); 2) the entity dynamic database RVDB (RelativeVarietyDatabase), which stores the indexes and data of entities after heterologous entities have been integrated; 3) the entity title item database EFDB (EntityFeatureDatabase), which stores the main entity title items MEFS, the subordinate entity title items SEFS, and their related data; 4) the entity application scene database ESDB (EntitySenseDatabase), which stores the application scenes ES between the main entity M (M) and the subordinate entity S (M).

The proper nouns used in the invention include: 1) Source S: a set of data sets describing a particular subject, with stability and continuity over a period of time; 2) Library (Data-Set) DS: a set of data generated by a source over a period of time, which may be composed of one or more two-dimensional data tables; 3) Table T: a two-dimensional data table in a library; 4) Entity: a research object with relative stability and uniqueness that is described by a group of characteristic variables; entities are divided into main entities and subordinate entities according to the attachment relationship among them; 5) Main entity (MainEntity): the research entity described by all or most of the attributes in a source; generally a source has only one main entity, which is represented in the format "entity (main entity to which the entity corresponds)" and denoted M (M); 6) Subordinate entity (SubsidiaryEntity): an entity in the source that depends on the main entity, typically a part of the main entity or a set of variables describing attributes of the main entity; it is represented in the format "entity (main entity to which the entity corresponds)" and denoted S (M); 7) Entity title item EFS (EntityFeatureStructure): a set of indexes that reflects the attributes of an entity; 8) Main entity title item MEFS (MainEntityFeatureStructure): a set of indexes that reflects the attributes of the main entity; 9) Subordinate entity title item SEFS (SubsidiaryEntityFeatureStructure): a set of indexes that reflects a subordinate entity and its association with the main entity, that is, both the attributes of the subordinate entity and the related attributes of the state of the main entity in which the subordinate entity is located; 10) Same scene SS (SameSense): when entities are separated, subordinate entities within the same source are in the same scene if their indexes are consistent and their corresponding specific main entities are consistent.

For entity identity discrimination, the attributes of the entity title items are divided into unique items, invariant items and common items, wherein: the unique item K (Key) is an attribute that characterizes the uniqueness of an entity, for example an identity card number, a unified social credit code or an organization code; the invariant item UC (UnChange) is an attribute of an entity that rarely or never changes, for example the name and sex of a personnel entity or the unit name and address of an institution entity; the common item N (Normal) is any attribute of an entity other than the unique item K and the invariant item UC.
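
The three attribute classes can be represented as a small lookup structure, sketched below. This is only an illustration: the enum `AttrClass`, the schema `PERSON_TITLE_ITEMS`, its attribute names, the assignment of the mobile number and employer to the common items, and the helper `attributes_of` are all hypothetical and do not come from the source.

```python
from enum import Enum


class AttrClass(Enum):
    UNIQUE = "K"       # unique item: characterizes entity uniqueness
    INVARIANT = "UC"   # invariant item: rarely or never changes
    NORMAL = "N"       # common item: everything else


# Hypothetical title-item schema for a personnel entity.
PERSON_TITLE_ITEMS = {
    "id_card_number": AttrClass.UNIQUE,
    "name": AttrClass.INVARIANT,
    "sex": AttrClass.INVARIANT,
    "mobile": AttrClass.NORMAL,
    "employer": AttrClass.NORMAL,
}


def attributes_of(schema, attr_class):
    """Return the attribute names of one class, sorted for stable output."""
    return sorted(name for name, cls in schema.items() if cls is attr_class)
```

Keeping the classification in data rather than in code would let each entity type carry its own schema, so the matching logic can treat unique, invariant and common items differently without hard-coding attribute names.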

In order to serve applications and extract entity directory items, entity directory items and entity directory essential items are used, wherein: the entity directory items ELS (EntityListStructure) are a limited set of attributes, selected according to a particular application, that reflect the basic status of an entity; for example, for an organization entity the basic items can be set as the organization name, the unified social credit code, the address and the like; the entity directory essential items ELES (EntityListEssentialStructure) are a limited set of attributes, typically name-class attributes, selected according to a specific application, that guarantee an entity directory is meaningful and whose absence renders a specific entity meaningless, for example the "organization name" of an "organization" entity or the "name" of a "personnel" entity.

In the invention, the heterologous data are respectively stored in the following two databases after being identified, extracted and processed according to the entity: entity directory database EDDB (EntityDirectoryDatabase) stores entity directory information of heterogeneous external services; the same entity database SEDB (SameEntityDatabase) stores information characterizing the same entity.

The invention is implemented by a plurality of modules which, as shown in fig. 1, comprise a single-source same entity screening and data supplementing device, a heterologous identical entity discriminator, a heterologous entity data supplementer, an entity directory item automatic extraction generator and a sub-entity automatic separator.

A processing method for identifying identity of a multi-source main entity and self-supplementing data is shown in figure 2, and specifically comprises the following steps.

A1. Extracting a main entity title item MEFS and an auxiliary entity title item SEFS from an entity title item database EFDB of a source A, extracting an application scene ES between a main entity M (M) and an auxiliary entity S (M) from an entity application scene database ESDB of the source A, extracting entity information related to an entity static database RSDB, extracting information representing the single source same entity by utilizing a single source same entity screening and data supplementing device according to the information of the main entity, the same scene and the like, storing the information into the same entity database SEDB, and supplementing data.

In this step, the working method of the single-source same entity screening and data supplementing device is shown in fig. 3, and is specifically as follows.

A2. Extracting static-library entity information from the entity static database RSDB, extracting the subordinate entity title item SEFS from the entity title item database EFDB of a source B, extracting the application scene ES between the main entity M (M) and the subordinate entity S (M) from the entity application scene database ESDB of the source B, extracting dynamic-library entity data information from the entity dynamic database RVDB, extracting same-entity data information from the same entity database SEDB, judging the identity of heterologous entities according to rules by utilizing the heterologous identical entity discriminator, extracting information representing heterologous identical entities, transmitting the information into the heterologous entity data supplementer, and storing the information into the entity dynamic database RVDB.

In this step, the procedure of discriminating the identity of the heterologous entity by the heterologous identical entity discriminator is shown in fig. 4, and the specific method is as follows.

A21. Reading, by entity type, the number N2 of subordinate entity types not yet warehoused from the entity title item database EFDB of the source B, and setting n2=1;

A26. reading the set DSG of the subordinate entity type n2 from the entity title item database EFDB of the source B, obtaining the record number M2, and setting m2=1;

A210. obtaining the set DSF from step A24, the set DSD from step A25, the record m2 from step A27, the application scene es from step A28 and the set DSE from step A29; matching in the set DSD according to set rules by using the unique item, invariant item and common item attributes of the subordinate entity title item of the record m2, together with the application scene es and the sets DSD, DSE and DSF; and calculating the similarity probability P(A) between entities;

In this embodiment, when personnel entities are matched, for the information of two persons: if the identity card numbers are the same, P(A) is 100%; if the name and the mobile phone number are the same, P(A) is 100%; if the name and the unit are the same, P(A) is 80%; and so on.
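
The three personnel rules of this embodiment can be written as a short rule cascade. This is a sketch under stated assumptions: the function name `person_similarity` and the field names are hypothetical, empty values are treated as non-matching, and returning 0.0 when no rule fires is an assumption, since the embodiment ends with "and so on" and does not list the remaining rules.

```python
def person_similarity(a, b):
    """Return P(A) for two personnel records using the embodiment's three rules."""

    def same(field):
        # An attribute matches only if it is present and equal in both records.
        value = a.get(field)
        return value not in (None, "") and value == b.get(field)

    if same("id_card_number"):             # same identity card number
        return 1.0
    if same("name") and same("mobile"):    # same name and mobile phone number
        return 1.0
    if same("name") and same("employer"):  # same name and unit
        return 0.8
    return 0.0                             # assumed default; not stated in the source
```

With a warehousing threshold TH of, say, 70%, a pair matching only on name and unit would be recorded in SEDB with P(A) = 80% but would not be passed on for data supplementing, which requires P(A) = 100%.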

A3. Extracting dynamic-library entity data information from the entity dynamic database RVDB, extracting same-entity data information from the same entity database SEDB, receiving the heterologous identical entity information from the heterologous identical entity discriminator, supplementing the heterologous entity information according to principles such as the time-nearest principle by utilizing the heterologous entity data supplementer, and storing the supplemented heterologous entity information into the entity dynamic database RVDB.

In this step, the procedure of the heterogeneous entity information supplementation is shown in fig. 5, and the specific method is as follows.

A33. obtaining the attribute name of the n3-th attribute;

A39. for attributes other than the unique items, invariant items and common items, sequentially reading the corresponding attribute data in the record m2, the record item e and the record item f, and comparing them with the record item d;

A4. Extracting same-entity data information from the same entity database SEDB, extracting dynamic library entity data information from the entity dynamic database RVDB, automatically extracting and generating entity directory items, extracting the entity directory information ELS according to the entity directory essential items ELES, and storing the entity directory information ELS into the entity directory database EDDB.

In this step, the specific flow for generating the entity directory information is shown in Fig. 6, and the generation method is as follows.

A44. extracting, according to the set DSH and the latest-time principle, the data related to the entity directory items els of entity n from the entity dynamic database to form a temporary data set DSI;

A45. filtering the set DSI according to the non-null data principle for the entity directory essential items eles of entity n to form a data subset DSJ;
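Steps A44 and A45 can be sketched as follows. This is only an illustration under stated assumptions: records are plain dictionaries, the keys `entity_id` and `updated_at` and the function names `build_dsi` and `filter_dsj` are hypothetical, and "latest time" is interpreted as keeping the most recently updated record per entity instance.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def build_dsi(records: List[Record], directory_items: List[str]) -> List[Record]:
    """A44 (sketch): keep the latest record per entity, projected onto the directory items."""
    latest: Dict[str, Record] = {}
    for rec in records:
        key = rec["entity_id"]                      # hypothetical key field
        # ISO-8601 date strings compare correctly as plain strings.
        if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
            latest[key] = rec
    return [{item: rec.get(item) for item in directory_items}
            for rec in latest.values()]


def filter_dsj(dsi: List[Record], essential_items: List[str]) -> List[Record]:
    """A45 (sketch): keep only rows whose essential items are all non-null."""
    return [row for row in dsi
            if all(row.get(item) not in (None, "") for item in essential_items)]


records = [
    {"entity_id": "p1", "updated_at": "2021-01-01", "name": "Zhang", "unit": "A"},
    {"entity_id": "p1", "updated_at": "2022-03-01", "name": "Zhang", "unit": "B"},
    {"entity_id": "p2", "updated_at": "2022-01-01", "name": "Li", "unit": None},
]
dsi = build_dsi(records, ["entity_id", "name", "unit"])
dsj = filter_dsj(dsi, ["name", "unit"])
print(dsj)
```

In this example the second record of `p1` supersedes the first, and `p2` is dropped from DSJ because its essential item `unit` is empty.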

In this step, the method for automatically separating the sub-entity directory information is shown in Fig. 7, and is specifically as follows.

A51. Starting a sub-entity separation program of a specific entity n according to a user instruction;

A52. reading entity separation rules r specified or preset by a user;

A53. reading a directory data set DSO of a specific entity n from an entity directory database EDDB, and setting a temporary data set DSP;

A54. obtaining the number I5 of records in the set DSO, and setting i5=1;

A56. adding record n5 into the data set DSP;

A58. writing the data set DSP into the entity directory database EDDB.
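The separation loop outlined in steps A51 to A58 can be sketched as below. The steps between A54 and A56 and between A56 and A58 are not reproduced here, so the test applied to each record (modelled as a predicate `rule`) and the loop termination are assumptions; the function name `separate_sub_entities` and the sample fields are likewise hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def separate_sub_entities(dso: List[Record],
                          rule: Callable[[Record], bool]) -> List[Record]:
    """Collect the records of DSO that satisfy separation rule r into DSP."""
    dsp: List[Record] = []           # A53: temporary data set DSP
    total = len(dso)                 # A54: record count I5
    i5 = 1                           # A54: 1-based counter i5
    while i5 <= total:               # assumed loop condition
        record = dso[i5 - 1]
        if rule(record):             # assumed check against rule r
            dsp.append(record)       # A56: add the record to DSP
        i5 += 1
    return dsp                       # A58 would write DSP into EDDB


# Hypothetical directory of personnel entities and a user-specified rule r.
dso = [
    {"name": "Zhang", "role": "expert"},
    {"name": "Li", "role": "student"},
    {"name": "Wang", "role": "expert"},
]
dsp = separate_sub_entities(dso, lambda rec: rec["role"] == "expert")
print([rec["name"] for rec in dsp])
```

Passing the rule as a callable keeps the loop independent of how the user-specified or preset rule r is actually encoded.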

The application of the present invention can realize the following functions.

1) Main and auxiliary entity title items and directory items are proposed. When entities in heterogeneous data are identified, the large number of varied data index items is screened and extracted according to the main and auxiliary entity title items, which supports consistent indexing of entities and unified storage of the data; the data is then processed and extracted a second time according to the entity title items, which supports unified external data services and large-scale data relation calculation.

2) Same-scene entity matching is introduced. When entity attributes of the data are used for matching and identification, a same-scene identification mechanism based on the entity application scene of the data is introduced, which reduces the difficulty and complexity of entity matching and improves its accuracy.

3) Entity attribute classification and weighting are proposed. According to the characteristics of the entity attributes, the attributes of the entity title items are divided into unique items, invariant items and common items, each class is assigned its own weight value, and the weight values are used for entity identification, which reduces the entity identification calculation time and resolves problems such as contradictions between attributes.

4) The discrimination probabilities are stored and processed separately. During entity identification, the information of successfully identified same entities is recorded together with the same-entity probabilities among multiple entities, and these are stored and processed separately, which reduces the difficulty of secondary entity identification and facilitates different scene applications as well as deep mining and data analysis of entity relations.
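The attribute classification and weighting of function 3) can be illustrated with a short Python sketch. The text does not state the weight values or the scoring formula, so the weights in `WEIGHTS`, the attribute-to-class mapping `CLASSES` and the weighted-agreement formula in `weighted_agreement` are all assumptions chosen only to show how class weights might enter an identification score.

```python
from typing import Dict, Optional

# Hypothetical weights per attribute class (not taken from the patent text).
WEIGHTS: Dict[str, int] = {"unique": 5, "invariant": 3, "common": 1}

# Hypothetical assignment of title item attributes to the three classes.
CLASSES: Dict[str, str] = {
    "id_number": "unique",
    "name": "invariant",
    "unit": "common",
}


def weighted_agreement(a: Dict[str, Optional[str]],
                       b: Dict[str, Optional[str]]) -> float:
    """Weighted share of agreeing attributes among those present in both records."""
    total = 0
    agree = 0
    for attr, cls in CLASSES.items():
        va, vb = a.get(attr), b.get(attr)
        if va is None or vb is None:
            continue                      # attribute missing: not comparable
        total += WEIGHTS[cls]
        if va == vb:
            agree += WEIGHTS[cls]
    return agree / total if total else 0.0


rec_a = {"id_number": "X1", "name": "Zhang", "unit": "Institute A"}
rec_b = {"id_number": "X1", "name": "Zhang", "unit": "Institute B"}
print(round(weighted_agreement(rec_a, rec_b), 3))
```

Because a unique item carries the largest weight, a disagreement on a common item such as the unit lowers the score only slightly, which is one way a contradiction between attributes could be resolved.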

Claims

1. A multi-source main and auxiliary entity identity discrimination and data self-supplementing processing method, characterized by comprising the following steps:

A1. extracting a main entity title item MEFS and an auxiliary entity title item SEFS from an entity title item database EFDB of a source A, extracting an application scene es between a main entity m (m) and an auxiliary entity s (m) from an entity application scene database ESDB of the source A, extracting entity static library related entity information from an entity static database RSDB, extracting information representing the single-source same entity by utilizing a single-source same entity screening and data supplementing device according to the main entity and the same scene information, storing the information into the same entity database SEDB, and supplementing data; wherein the entity static database RSDB stores data of multiple libraries from the same source;

the working method of the single-source same entity screening and data supplementing device in the step A1 is as follows:

A110. judging whether N1 > n1 is true; if true, executing n1=n1+1 and jumping to step A13; otherwise, ending;

A2. extracting entity static library related entity information from an entity static database RSDB, extracting subordinate entity title item SEFS from an entity title item database EFDB of a source B, extracting application scenes es between a main entity m (m) and a subordinate entity s (m) from an entity application scene database ESDB of the source B, extracting dynamic library entity data information from an entity dynamic database RVDB, extracting the same entity data information from the same entity database SEDB, judging the identity of a heterologous entity by utilizing a heterologous identical entity discriminator according to rules, extracting information representing the heterologous identical entity, transmitting the information into a heterologous entity data supplement, and simultaneously storing the information into the main entity dynamic database RVDB; wherein, the entity dynamic database RVDB stores indexes and data of the entities which come from different sources and are integrated;

A4. extracting the same entity data information from the same entity database SEDB, extracting dynamic library entity data information from an entity dynamic database RVDB, automatically extracting and generating entity directory items, extracting entity directory information according to entity directory essential items eles, and storing the entity directory information into the entity directory database EDDB;

2. The multi-source main and auxiliary entity identity discrimination and data self-supplementing processing method according to claim 1, wherein the specific method for discriminating the identity of the heterologous entity by the heterologous identical entity discriminator in step A2 is as follows:

3. The multi-source main and auxiliary entity identity discrimination and data self-supplementing processing method according to claim 2, wherein the specific method for supplementing the heterologous entity information in step A3 is as follows:

A33. obtaining the attribute name of the n3-th attribute;

4. The multi-source main and auxiliary entity identity discrimination and data self-supplementing processing method according to claim 3, wherein the method for generating the entity directory information in step A4 is as follows:

A46. using the set DSJ as entity directory information of the entity n4, and writing the entity directory information into an entity directory database EDDB;

5. The multi-source main and auxiliary entity identity discrimination and data self-supplementing processing method according to claim 4, wherein the method for automatically separating the sub-entity directory information in step A5 is as follows:

A52. reading entity separation rules r specified or preset by a user;

A54. obtaining the number I5 of records in the set DSO, and setting i5=1;

A56. adding record n5 into the data set DSP;

A58. writing the data set DSP into the entity directory database EDDB.