CN114417845B - Same entity identification method and system based on knowledge graph

Info

Publication number: CN114417845B
Application number: CN202210321327.3A
Authority: CN
Inventors: 桂正科; 高率荏; 何雨潇; 张喜; 林昊; 阳进
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2022-03-30
Filing date: 2022-03-30
Publication date: 2022-07-12
Anticipated expiration: 2042-03-30
Also published as: CN114417845A

Abstract

The embodiment of the specification discloses a method and a system for identifying the same entity based on a knowledge graph. Wherein, the method comprises the following steps: respectively acquiring a plurality of characteristics of two or more objects to be identified; the plurality of features comprise graph characteristic features and graph structure features, wherein the graph characteristic features reflect the characteristic information of corresponding nodes of the object in the graph, and the graph structure features reflect the information of at least part of associated nodes of the corresponding nodes of the object in the graph; determining a joint similarity of a plurality of features of two or more objects; based on the joint similarity, it is determined whether the two or more objects correspond to the same entity.

Description

Identical entity identification method and system based on knowledge graph

Technical Field

The specification relates to the technical field of computers, in particular to a method and a system for identifying the same entity based on a knowledge graph.

Background

A knowledge graph (or simply a graph) describes knowledge resources and their carriers using visualization techniques. It aims to describe the various objects that exist in the real world and the relationships among them, which together form a large semantic network graph in which nodes represent objects and edges represent attributes or relationships.

Knowledge graphs have complex structures, diverse attribute types, and multi-level learning tasks, and making full use of them can help solve a variety of application problems.

The specification provides a method and a system for identifying the same entity based on a knowledge graph, so as to solve the problem of identifying the same entity.

Disclosure of Invention

One aspect of embodiments of the present specification provides a method of identical entity identification based on a knowledge-graph. The method comprises the following steps: respectively acquiring a plurality of characteristics of two or more objects to be identified; the plurality of features comprise graph characteristic features and graph structure features, wherein the graph characteristic features reflect the characteristic information of corresponding nodes of the object in the graph, and the graph structure features reflect the information of at least part of associated nodes of the corresponding nodes of the object in the graph; determining a joint similarity of a plurality of features of the two or more objects; determining whether the two or more objects correspond to the same entity based on the joint similarity.

Another aspect of embodiments of the present specification provides a system for same entity identification based on a knowledge graph. The system comprises: a feature acquisition module for respectively acquiring a plurality of features of two or more objects to be identified, wherein the plurality of features comprise graph characteristic features and graph structure features, the graph characteristic features reflect the characteristic information of corresponding nodes of the object in the graph, and the graph structure features reflect the information of at least part of associated nodes of the corresponding nodes of the object in the graph; a joint similarity determination module for determining a joint similarity of the plurality of features of the two or more objects; and a same entity identification module for determining whether the two or more objects correspond to the same entity based on the joint similarity.

Another aspect of embodiments of the present specification provides an apparatus for same entity identification based on a knowledge graph, comprising at least one storage medium and at least one processor, wherein the at least one storage medium is used for storing computer instructions, and the at least one processor is configured to execute the computer instructions to implement the method for same entity identification based on a knowledge graph.

Another aspect of embodiments of the present specification provides a computer-readable storage medium storing computer instructions which, when read by a computer, cause the computer to perform a method for identical entity identification based on a knowledge-graph.

Drawings

The present description will be further explained by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. These embodiments are not intended to be limiting, and in these embodiments like numerals are used to indicate like structures, wherein:

FIG. 1 is a schematic diagram of an application scenario for knowledge-graph based identification of the same entity, according to some embodiments of the present description;

FIG. 2 is an exemplary flow diagram of a method for same entity identification based on a knowledge graph, according to some embodiments of the present description;

FIG. 3 is an exemplary flow diagram illustrating obtaining graph structure features according to further embodiments of the present description;

FIG. 4 is an exemplary flow diagram illustrating the determination of joint similarity according to some embodiments of the present description;

FIG. 5 is an exemplary block diagram of a system for same entity identification based on a knowledge graph, according to some embodiments of the present description.

Detailed Description

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, the present description can also be applied to other similar scenarios on the basis of these drawings without inventive effort. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.

It should be understood that "system", "device", "unit" and/or "module" as used herein are terms for distinguishing different components, elements, parts, portions or assemblies at different levels. However, these words may be replaced by other expressions that accomplish the same purpose.

As used in this specification and the appended claims, the terms "a," "an," and/or "the" do not refer only to the singular and may also include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; these steps and elements do not form an exclusive list, and a method or apparatus may include other steps or elements.

Flowcharts are used in this specification to illustrate the operations performed by the system according to embodiments of the present specification. It should be understood that the preceding or following operations are not necessarily performed in the exact order shown. Rather, the various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to the processes, or one or more steps may be removed from the processes.

A knowledge graph is a data structure that describes, in the form of a graph, various objects and their relationships in the physical world. An object may refer to something that is distinguishable and exists independently, such as a person, a city, a plant, a commodity, an account, a device, or an abstract concept (e.g., gender, occupation, nationality). Objects can be regarded as the most basic elements in a knowledge graph, and different relationships exist among different objects.

The knowledge data used to construct a knowledge graph can come from various services and is characterized by wide service sources, complex relationships, and a huge data volume. In knowledge-graph based business applications, there are many same entity identification problems. Same entity identification refers to determining whether two or more objects belong to the same entity. An entity may be understood as a generalized designation for an object, identity, or attribute that actually exists in physical space. For more about entities, reference may also be made to the explanations of objects in this description. It should be understood that the concepts of object and entity may overlap when each is viewed individually, but when viewed together, an object may belong to or correspond to an entity, which distinguishes the two. Examples include whether two money receiving codes (objects) belong to the same shop (entity), whether two account numbers (objects) belong to the same natural person (entity), and whether two enterprises (objects) collected from multiple sources are the same. In the first example, in an offline marketing scenario, the service platform offers a reward to newly onboarded merchants; a cheating merchant often applies for multiple money receiving codes to obtain more rewards and, to avoid detection by anti-cheating measures, naturally provides different store names and store IDs for those codes. In the related art, one solution to the same entity identification problem is to judge by exact matching of object attributes, for example, whether identity card numbers or enterprise registration numbers are the same; another is to judge by text similarity using a search engine.
However, the first scheme relies on exact attribute matching and therefore depends on the completeness of the object's attribute information. In practice, business data often has missing object dimensions and is sparse, so the first scheme can solve only a very small proportion of cases and has low coverage. The second scheme compares text similarity through a search engine and is easily disturbed by cheating. In addition, texts that differ only slightly may have high similarity yet differ greatly in business semantics; for example, "No. 1, XX Road, XX District, XX City, XX Province" and "No. 2, XX Road, XX District, XX City, XX Province" may be completely different business entities, so the recognition accuracy is low.

In view of the above, it is desirable to provide a method for large-scale same entity identification on a knowledge graph that improves both coverage and recognition accuracy. Therefore, some embodiments of the present disclosure provide a method and a system for identifying the same entity based on a knowledge graph, by which the same entity identification problem is converted into a distance calculation problem over multiple features of the objects. Because the calculation is performed from the perspective of multiple features, the result is accurate, and in actual execution the amount of calculation and the total time consumed can be further reduced compared with the second scheme. It should be noted that the above examples are for illustrative purposes only and are not intended to limit the application scenarios of the technical solutions disclosed in the present specification, which are explained in detail below with reference to the drawings.

FIG. 1 is a schematic diagram of an application scenario for knowledge-graph based identification of the same entity, according to some embodiments of the present description.

As shown in fig. 1, the scenario 100 may include a processing device 110, a network 120, and a user terminal 130.

The processing device 110 may be used to process information and/or data associated with an object to perform one or more of the functions disclosed in this specification. In some embodiments, the processing device may be a server owned by a service platform capable of providing one or more services to the user terminal 130. Further, the processing device may obtain the registration request and the registration information or other information of the user terminal 130 through the network 120, and in some scenarios, the service platform needs to perform same entity identification, through the processing device, based on the information from the terminal. In some embodiments, processing device 110 may include one or more processing engines (e.g., single core processing engines or multiple core processing engines). By way of example only, the processing device 110 may include one or more combinations of a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), an Application Specific Instruction set Processor (ASIP), a Graphics Processing Unit (GPU), a Physics Processing Unit (PPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a microcontroller unit, a Reduced Instruction Set Computer (RISC), a microprocessor, and the like. In some embodiments, one or more storage devices may be included in the processing device for storing data to be processed by the processing device, result data of the processing, and the like. For example, a knowledge graph, a plurality of features of an object, results of same entity identification, and the like may be stored in the storage device.

Network 120 may facilitate the exchange of information and/or data. In some embodiments, one or more components (e.g., processing device 110, user terminal 130) in the scenario 100 may transmit information to other components in the scenario 100 via the network 120. For example, processing device 110 may obtain description data for objects and/or relationships from user terminal 130 via network 120. As another example, the user terminal 130 may request or receive the same entity identification result of the processing device 110 through the network 120. In some embodiments, the network 120 may be any form of wired network, wireless network, or any combination thereof. By way of example only, network 120 may be one or more combinations of a wireline network, a fiber optic network, a telecommunications network, an intranet, the internet, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a Public Switched Telephone Network (PSTN), a Bluetooth network, and so forth.

User terminal 130 may be a device having data acquisition, storage, and/or transmission capabilities. In some embodiments, the user terminal 130 may apply for registration with the aforementioned service platform to become its user. The user terminal 130 may send a registration request along with registration information to the service platform via the network 120. In some embodiments, the user terminal 130 may also be authorized to obtain user data and upload it to the service platform or processing device 110, for example, location information of the user or behavior data of the user on the service platform. In some embodiments, the user data may be used to generate or update a knowledge graph. In some embodiments, the user terminal 130 may receive the same entity identification result determined by the processing device 110. In some embodiments, the user terminal 130 may include, but is not limited to, a mobile device 130-1, a tablet 130-2, a laptop 130-3, a desktop 130-4, and the like, or any combination thereof. Exemplary mobile devices 130-1 may include, but are not limited to, smart phones, Personal Digital Assistants (PDAs), handheld game consoles, smart watches, wearable devices, virtual reality devices, augmented reality devices, and the like, or any combination thereof. In some embodiments, the user terminal 130 may send the acquired data to one or more devices in the scenario 100.

It should be noted that the above description of the various components in the application scenario 100 is for illustration and description only and does not limit the scope of applicability of the present description. It will be apparent to those skilled in the art, given the benefit of this disclosure, that additions or subtractions of components in the application scenario 100 may be made. However, such variations are still within the scope of the present description.

FIG. 2 is an exemplary flow diagram of a method for same entity identification based on a knowledge graph, according to some embodiments of the present description. In some embodiments, flow 200 may be performed by a processing device. For example, the flow 200 may be stored in a storage device (e.g., an onboard storage unit of a processing device or an external storage device) in the form of a program or instructions that, when executed, may implement the flow 200. The flow 200 may include the following operations.

Step 202, obtaining a plurality of features of two or more objects to be identified, respectively.

The objects may include accounts, people, businesses, stores, checkout codes, devices, and so forth. Relevant descriptions about objects can also be found elsewhere in this specification.

A feature may refer to a characteristic that distinguishes an object from other things. In some embodiments, a feature may be an abstract description of certain characteristics of an object. For example, a concept may be abstracted from some similar characteristics of an object, and the concept can be used as the feature corresponding to those characteristics.

In some embodiments, a knowledge graph may be constructed based on a plurality of objects and relationships, and features of the objects may be obtained based on the knowledge graph. The objects correspond to nodes in the graph, and the relationships between the objects correspond to edges in the graph. In some embodiments, the plurality of features of the object to be identified may include graph characteristic features as well as graph structure features.

Graph characteristic features may be used to reflect characteristic information of the corresponding node of an object in a graph. In some embodiments, the processing device may obtain the graph characteristic features of nodes based on graph representation learning. For example, the processing device may perform representation learning on the graph, and take the vector representation of the corresponding node of the object in the graph after representation learning as the graph characteristic feature of the object.

Representation learning may refer to the process of converting a graph into vectors. In some embodiments, the processing device may perform representation learning on the graph using a decomposition-based approach, a random walk-based approach, or a deep learning-based approach (e.g., one based on a graph neural network). Through representation learning, each node and/or edge of the graph can be represented by a vector. For example, the processing device may assign an initial vector to each node (or edge) in the graph, input the initial vectors into a graph neural network model (e.g., a GCN), and perform one or more rounds of iterative updating on the vector corresponding to each node/edge based on a preset prediction task, thereby obtaining the graph after representation learning. The vectors corresponding to the nodes in this graph represent abstract information that more accurately and comprehensively reflects the attributes of the corresponding objects and their relationships with other objects. Then, according to the position in the graph of the node corresponding to the object to be identified, the corresponding vector representation can be extracted from the graph after representation learning to obtain the graph characteristic feature of the object.
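As a rough illustration of how node vectors can absorb information from connected nodes, the following sketch performs untrained mean-aggregation message passing in plain Python. It is not the graph neural network of the embodiments: there is no prediction task, no learned weights, and the function name `propagate`, the toy graph, and the two-dimensional vectors are all assumptions made for illustration.

```python
def propagate(vectors, edges, rounds=1):
    """Average each node's vector with its neighbors' vectors, `rounds` times."""
    # Build an undirected adjacency structure from the edge list.
    neighbors = {node: set() for node in vectors}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    current = {node: list(vec) for node, vec in vectors.items()}
    for _ in range(rounds):
        updated = {}
        for node, vec in current.items():
            # The node's own vector plus the vectors of its direct neighbors.
            group = [vec] + [current[n] for n in sorted(neighbors[node])]
            updated[node] = [sum(col) / len(group) for col in zip(*group)]
        current = updated
    return current


# Hypothetical initial vectors for three nodes on a path A - B - C.
initial = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.0, 0.0]}
embeddings = propagate(initial, [("A", "B"), ("B", "C")], rounds=1)
```

After one round, each node's vector already mixes in its neighbors' vectors, which is the intuition behind extracting a node's vector as its graph characteristic feature once a real model has been trained.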

Graph structure features may be used to reflect information of at least some associated nodes of the corresponding node of an object in the graph. Since an associated node is a node connected by an edge to the node of the object to be identified, and edges determine the structure information of the graph, the information of the associated nodes can reflect the graph structure information related to the object to be identified, and can therefore be referred to as the graph structure feature of the object to be identified. In some embodiments, the at least some associated nodes may include neighbor nodes within an N-hop subgraph of the corresponding node of the object to be identified in the knowledge graph (e.g., the nodes within the N-hop subgraph other than the corresponding node of the object), where N is a positive integer. When N is 1, the associated nodes are the neighbor nodes connected to the node to be identified through one edge; when N is 2, the associated nodes further include the nodes connected to those neighbor nodes through one edge, and so on. The at least some associated nodes can also be regarded as nodes having a direct or indirect connection with the node corresponding to the object to be identified. In some embodiments, the associated nodes may be all neighbor nodes in the N-hop subgraph, or only some of them.
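The N-hop neighborhood described above can be collected with a breadth-first search. The sketch below is a minimal illustration that assumes the graph is available as an adjacency dictionary; the function name `n_hop_neighbors` and the toy graph are hypothetical.

```python
from collections import deque


def n_hop_neighbors(adjacency, start, n):
    """Return the nodes reachable from `start` within `n` edges, excluding `start`."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == n:
            continue  # do not expand beyond the N-hop boundary
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    seen.discard(start)
    return seen


# Hypothetical graph: C is the node of the object to be identified.
graph = {
    "C": ["A", "B"],
    "A": ["C", "D"],
    "B": ["C"],
    "D": ["A", "E"],
    "E": ["D"],
}
one_hop = n_hop_neighbors(graph, "C", 1)
two_hop = n_hop_neighbors(graph, "C", 2)
```

With N = 1 only the directly connected nodes A and B are returned; with N = 2 the result also includes D, which is reached through A.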

The information of an associated node may specifically refer to description information of the associated node. The description information may include attribute information of the associated node and relationship information between the associated node and the corresponding node of the object to be identified. The attribute information may include identity information, account login information, wifi connection information, location information, and the like. The relationship information may include transaction relationships, social relationships, kinship relationships, and the like. In some embodiments, the description information may be more concrete or explicit semantic data used to represent nodes or the relationships (edges) between them, including but not limited to descriptive text or encoded data (e.g., one-hot encoded data). Illustratively, when the relationship between node A corresponding to user A and node B corresponding to user B is a transaction relationship, the transaction relationship may be represented by the code 2, and the relationship information may further include more fine-grained information, such as transaction amount and transaction time, in text form. The attribute information of node A and node B may further include basic information of the business (such as name and address) in text form, industry type, and the like. In some alternative embodiments, the description information may not include relationship information.

In some embodiments, the processing device may obtain the graph structure feature of the object to be identified based on the description information of the associated nodes.

Illustratively, the processing device may obtain description information of at least a portion of associated nodes of corresponding nodes of the object in the graph. In some embodiments, the processing device may obtain the description information by reading data or calling a related data interface.

The processing device may splice the description information of each associated node to obtain spliced description information. The splicing may be a combination of the description information. For example, if the object C to be identified has two associated nodes A and B, the description information of node A and the description information of node B may be spliced end to end to obtain a longer piece of spliced description information.

The processing device may process the splicing description information based on a probabilistic collision algorithm to obtain a vector of a preset dimension, and the vector is used as a graph structure feature of the object. The vector of the preset dimension may be a preset vector of a certain length, for example, 128 dimensions, 256 dimensions, etc. In some embodiments, the length of the preset dimension is smaller than the length of the splicing description information.

For more description of the probabilistic collision algorithm, reference may be made to the related description of fig. 3, which is not described herein again.
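The splicing step and the reduction to a vector of preset dimension can be sketched as follows. The probabilistic collision algorithm itself is described with FIG. 3 and is not reproduced here; as a stand-in, this sketch swaps in the hashing trick (feature hashing), a different and simpler technique, purely to show a long description being mapped to a short fixed-length vector. The function names, the space separator, the sample descriptions, and the dimension of 8 (instead of 128 or 256) are all assumptions.

```python
import hashlib


def splice(descriptions):
    """Join the description strings of the associated nodes end to end."""
    # A space separator is an assumption so that tokens stay distinguishable.
    return " ".join(descriptions)


def hashed_vector(text, dim=8):
    """Map whitespace-separated tokens into a fixed-length count vector."""
    vector = [0] * dim
    for token in text.split():
        # md5 is used only as a stable, deterministic hash, not for security.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vector[bucket] += 1
    return vector


# Hypothetical description information of two associated nodes.
spliced = splice(["name=shop-a addr=road-1", "name=user-b type=transaction"])
feature = hashed_vector(spliced, dim=8)
```

Whatever the length of the spliced description, the resulting vector always has `dim` entries, so the features of different objects can be compared directly.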

In some embodiments, the plurality of features of the object may further comprise semantic features and/or spatial features in addition to the graph-related features.

Semantic features may refer to abstract generalizations of semantically relevant text information of the object to be identified, such as its name and address. Semantic features describe the properties of the object at the semantic level.

In some embodiments, the processing device may obtain a description text of the object to be identified. The description text may be text describing information such as the name, address, and account name of the object. In some embodiments, the processing device may obtain the description text by reading data, calling a data interface, and the like.

The processing device may generate a corresponding semantic vector based on the description text and take it as a semantic feature of the object. In some embodiments, the processing device may process the description text by natural language processing techniques to obtain a semantic vector. For example, the processing device may input the description text into a natural language processing model, which outputs the semantic vector. In some embodiments, the natural language processing model may be a pre-trained model, for example, a BERT model. The embodiments of the present specification do not limit the natural language processing technique, as long as the text can be processed to obtain its corresponding vector.

The spatial feature may reflect location information of the object to be identified, such as location-based service (LBS) information.

In some embodiments, the spatial features may comprise a single location point of the object to be identified. The processing device may directly take a single location point of the object as a spatial feature.

In some embodiments, the spatial features may include two or more location points of the object to be identified. Two or more location points may form a position sequence, for example, by ordering the location points by time. A location point can be represented by a street name, a house number, a longitude and latitude value obtained by a positioning technology, or a coordinate value in spherical coordinates. The processing device may obtain the position sequence of the object, process the position sequence based on a probabilistic collision algorithm to obtain a vector of a preset dimension, and take the vector as the spatial feature of the object. For the probabilistic collision algorithm, reference may be made to the related description of fig. 3, which is not described herein again.
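Forming a position sequence by time ordering can be sketched in a few lines. The record layout `(timestamp, latitude, longitude)`, the function name `position_sequence`, and the sample values are assumptions for illustration; the subsequent reduction to a vector of preset dimension is not shown.

```python
def position_sequence(points):
    """Order (timestamp, latitude, longitude) records by time and drop the timestamp."""
    return [(lat, lon) for _, lat, lon in sorted(points, key=lambda p: p[0])]


# Hypothetical location records collected out of order.
records = [
    (1700000300, 30.27, 120.15),
    (1700000100, 30.25, 120.16),
    (1700000200, 30.26, 120.17),
]
sequence = position_sequence(records)
```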

In some embodiments, the processing device may obtain, by the above-described method, a plurality of features of each of the two or more objects to be identified (e.g., an object to be identified and the objects to be compared with it).

In some embodiments, the processing device may also directly obtain a plurality of features of the object to be recognized by reading from a database, a storage device, invoking a related data interface, and the like. A plurality of features may be obtained in advance and stored.

Step 204, determining joint similarity of a plurality of features of the two or more objects.

The joint similarity may be an overall similarity of a plurality of features of two or more objects to be recognized, or an overall similarity of a plurality of features of a target object and the rest of the objects in the two or more objects. Taking two of the objects as an example, the joint similarity may be a weighted sum of similarities between different features of the two objects. Specifically, the processing device may determine similarity of each of a plurality of features of two objects to be recognized, and then perform weighted summation on the plurality of similarities to obtain a joint similarity.
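The weighted sum described above can be sketched as follows. Cosine similarity is used as the per-feature similarity only as an assumption, since the specification discusses the similarity distance with FIG. 4; the feature names, vectors, and weights are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def joint_similarity(features_a, features_b, weights):
    """Weighted sum of per-feature similarities between two objects."""
    return sum(
        weight * cosine_similarity(features_a[name], features_b[name])
        for name, weight in weights.items()
    )


# Hypothetical features of two objects and hypothetical weights.
object_1 = {"semantic": [1.0, 0.0], "graph_structure": [1.0, 1.0]}
object_2 = {"semantic": [1.0, 0.0], "graph_structure": [1.0, -1.0]}
weights = {"semantic": 0.4, "graph_structure": 0.6}
score = joint_similarity(object_1, object_2, weights)
```

In this toy case the semantic vectors are identical (similarity 1) and the graph structure vectors are orthogonal (similarity 0), so the joint similarity equals the semantic weight, 0.4.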

In some embodiments, the processing device may calculate a similarity distance between a plurality of features of two objects to be recognized, and determine a joint similarity according to the similarity distance. For more description on determining the joint similarity, reference may be made to the description of fig. 4, which is not repeated herein.

Step 206, determining whether the two or more objects correspond to the same entity based on the joint similarity.

In some embodiments, the processing device may determine whether two or more objects correspond to the same entity based on the magnitude of the joint similarity. "Correspond to" here may also be understood as "belong to", for example, whether two registered accounts belong to the same company. Specifically, the processing device may compare the joint similarity of two objects to be identified with a preset joint similarity threshold, and when the joint similarity is greater than the preset threshold, the two objects may be considered to correspond to the same entity. As another example, the processing device may sort, from large to small, the joint similarities between a target object and a plurality of other objects, and determine that the object with the highest joint similarity, or the objects ranked in the top N, correspond to the same entity as the target object.
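Both decision rules, the threshold comparison and the top-N ranking, can be sketched as below. The threshold value of 0.8, the function names, and the candidate scores are assumptions chosen for illustration.

```python
def same_entity(joint_similarity, threshold=0.8):
    """Threshold rule: the same entity if the joint similarity exceeds the threshold."""
    return joint_similarity > threshold


def top_n_candidates(scores, n=1):
    """Ranking rule: the n candidates with the highest joint similarity."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


# Hypothetical joint similarities between a target object and three candidates.
scores = {"code_2": 0.91, "code_3": 0.42, "code_4": 0.85}
matches = [name for name, value in scores.items() if same_entity(value)]
best_two = top_n_candidates(scores, n=2)
```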

In some embodiments of the present description, multiple features of the objects to be identified, such as semantic features, graph characteristic features, graph structure features, and spatial features, are obtained, and a joint similarity is computed by weighted combination to identify the same entity. The multiple features reflect the similarity between objects from different angles, which avoids the limitations of exact attribute matching and text search. Meanwhile, compared with recalling and ranking on each feature individually, this approach reduces the invocation complexity and improves the stability of the system. For example, with 4 features, per-feature recall requires ranking by feature distance 4 times and then determining the final similarity from the 4 ranking results, whereas with the joint similarity a single ranking suffices to determine whether two objects correspond to the same entity.

FIG. 3 is an exemplary flow diagram illustrating obtaining graph structure features according to further embodiments of the present description. In some embodiments, flow 300 may be performed by a processing device. For example, the flow 300 may be stored in a storage device (e.g., an onboard storage unit of a processing device or an external storage device) in the form of a program or instructions that, when executed, may implement the flow 300. The flow 300 may include the following operations.

Step 302, obtaining the description information of at least part of the associated nodes of the corresponding nodes of the object in the graph.

For the description of step 302, reference may be made to the related description in step 202, and details are not repeated here.

Step 304, grouping the associated nodes based on the relationship types of the corresponding nodes and the associated nodes.

In some embodiments, the relationship types of different associated nodes and corresponding nodes may be different. Relationship types may include transaction types, business types, login types, social relationship types, and the like. In this way, the associated nodes of different relation types can be processed respectively, so as to further improve the calculation accuracy.

In some embodiments, the relationship type may be determined based on the edge information between the node corresponding to the object and the associated node, or may be determined based on the description information, specifically according to the relationship information in the description information.

Grouping the associated nodes of different relationship types can further improve the calculation accuracy. Grouping may refer to bringing together associated nodes of the same relationship type. For example, assume that the object to be identified is an account, and the relationship types with the associated nodes include a business relationship, a consumption relationship, and a login relationship. The processing device may put the associated nodes corresponding to the business relationship (for example, the end nodes of edges representing the business relationship) together to obtain a business relationship group; put the associated nodes corresponding to the consumption relationship together to obtain a consumption relationship group; and put the associated nodes corresponding to the login relationship together to obtain a login relationship group.
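The grouping of step 304 can be illustrated with a short sketch. It assumes, purely for illustration, that the edges of the object's node are available as (relationship type, associated node id) pairs; the identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_relationship(edges: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group associated node ids by the relationship type of the connecting edge.

    Each element of `edges` is an assumed (relationship_type, associated_node_id) pair.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for relationship_type, node_id in edges:
        groups[relationship_type].append(node_id)
    return dict(groups)


if __name__ == "__main__":
    account_edges = [
        ("business", "shop_1"),
        ("consumption", "merchant_9"),
        ("login", "device_7"),
        ("business", "shop_2"),
    ]
    print(group_by_relationship(account_edges))
```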

Step 306, for each group in the grouping result: splicing the description information of the associated nodes of the group to obtain splicing description information.

For example, the processing device may splice the description information of the associated nodes in the business relationship group in the above example to obtain the splicing description information corresponding to the business relationship group. For a detailed description of the splicing manner, reference may be made to the related description of fig. 2, which is not repeated here. Similarly, the splicing description information corresponding to the consumption relationship group and the splicing description information corresponding to the login relationship group can be obtained.
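A possible shape for the per-group splicing is sketched below. The splicing manner itself is defined in the description of fig. 2, so this sketch makes a simplifying assumption: each node's description information is a list of string tokens, and splicing is plain concatenation in group order. All names are hypothetical.

```python
from typing import Dict, List


def splice_group_descriptions(
    groups: Dict[str, List[str]],
    descriptions: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Concatenate the description tokens of the associated nodes of each group.

    `groups` maps a relationship type to associated node ids; `descriptions`
    maps a node id to its description tokens (an assumed representation).
    """
    spliced: Dict[str, List[str]] = {}
    for relationship_type, node_ids in groups.items():
        tokens: List[str] = []
        for node_id in node_ids:
            tokens.extend(descriptions.get(node_id, []))
        spliced[relationship_type] = tokens
    return spliced


if __name__ == "__main__":
    groups = {"business": ["shop_1", "shop_2"], "login": ["device_7"]}
    descriptions = {
        "shop_1": ["name=tea_house", "city=hangzhou"],
        "shop_2": ["name=noodle_bar"],
        "device_7": ["os=android"],
    }
    print(splice_group_descriptions(groups, descriptions))
```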

Step 308, processing the splicing description information based on a probability collision algorithm to obtain a vector of a preset dimension.

In some embodiments, the probability collision algorithm may be an algorithm that maps the splicing description information into a vector of a preset dimension while keeping the collision probability consistent before and after the mapping. Here, consistent may be understood as being the same or being positively correlated.

For example, assume that the first splicing description information of a certain object has ten million dimensions, and that its collision probability with the second splicing description information of another object (which also has ten million dimensions) is 10%. The first splicing description information and the second splicing description information are each processed by the probability collision algorithm to obtain a first vector and a second vector of the preset dimension, and the collision probability between the first vector and the second vector is consistent with the collision probability between the first splicing description information and the second splicing description information (for example, it is also 10%). The collision probability can also be understood as the proportion of the number of elements in the intersection of two vectors or arrays to the total number of elements of a single vector or array.

Illustratively, the probabilistic collision algorithm may include a samehash algorithm.

The splicing description information corresponding to each group can be processed based on a probability collision algorithm, and vectors of preset dimensions corresponding to each group can be obtained. The high-dimensional array or vector can be compressed into a short vector with preset dimensions through a probability collision algorithm, and meanwhile, the collision probability is kept consistent, so that the calculation time and the storage space are effectively saved. The probabilistic collision algorithm can also be used for processing the position sequence to obtain the spatial features of the preset dimensionality.
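To make the idea of a collision-preserving compression concrete, the sketch below uses MinHash. This is a substitution made here for illustration: the specification does not state that MinHash is the algorithm it uses. MinHash is chosen because the fraction of matching positions between two signatures approximates the overlap (Jaccard operator) of the original token sets. The hashing scheme, function names, and the 128-position default are assumptions.

```python
import hashlib
from typing import Iterable, List, Sequence


def _seeded_hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens: Iterable[str], num_hashes: int = 128) -> List[int]:
    """Compress a token collection of any size into `num_hashes` integers."""
    token_set = set(tokens)
    if not token_set:
        raise ValueError("cannot compute a signature for an empty token set")
    return [
        min(_seeded_hash(token, seed) for token in token_set)
        for seed in range(num_hashes)
    ]


def estimated_collision_probability(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Fraction of signature positions on which the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    first = minhash_signature(f"token_{i}" for i in range(0, 1000))
    second = minhash_signature(f"token_{i}" for i in range(500, 1500))
    # The true overlap of the two sets is 500 / 1500; the estimate should be close to it.
    print(estimated_collision_probability(first, second))
```

The signature length is fixed regardless of how many tokens the spliced description contains, which is what allows the later distance computation to run on short, fixed-length vectors.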

Step 310, fusing the vectors corresponding to the groups to be used as the graph structure features of the object.

In some embodiments, fusing the vectors corresponding to each group may refer to splicing the vectors corresponding to each group. For example, after the three vectors with 128 dimensions are spliced, a vector with 384 dimensions can be obtained. In some examples, vectors corresponding to each group may also be weighted and summed to obtain the graph structure feature of the object.
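Both fusion options mentioned above, splicing and weighted summation, can be sketched with plain lists. The function names and the example weights are assumptions for illustration only.

```python
from typing import List, Sequence


def fuse_by_concatenation(group_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Splice the per-group vectors end to end."""
    fused: List[float] = []
    for vector in group_vectors:
        fused.extend(vector)
    return fused


def fuse_by_weighted_sum(
    group_vectors: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Element-wise weighted sum of per-group vectors of equal dimension."""
    if len(group_vectors) != len(weights):
        raise ValueError("one weight is required per group vector")
    if not group_vectors:
        return []
    dimension = len(group_vectors[0])
    if any(len(vector) != dimension for vector in group_vectors):
        raise ValueError("all group vectors must have the same dimension")
    return [
        sum(weight * vector[i] for vector, weight in zip(group_vectors, weights))
        for i in range(dimension)
    ]


if __name__ == "__main__":
    business, consumption, login = [0.0] * 128, [1.0] * 128, [2.0] * 128
    print(len(fuse_by_concatenation([business, consumption, login])))
    print(fuse_by_weighted_sum([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5]))
```

Concatenation keeps the groups separable at the cost of a longer vector, while the weighted sum keeps the preset dimension unchanged.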

In the embodiment, the characteristic length of the object to be identified can be effectively reduced through the probability collision algorithm, the characteristic length is controlled within a reasonable range, and the calculation speed in the subsequent similarity determination is effectively improved.

FIG. 4 is an exemplary flow diagram illustrating the determination of joint similarity according to some embodiments of the present description. In some embodiments, flow 400 may be performed by a processing device. For example, the process 400 may be stored in a storage device (e.g., an onboard storage unit of a processing device or an external storage device) in the form of a program or instructions that, when executed, may implement the process 400. The flow 400 may include the following operations.

Step 402, for each of the plurality of features, determining a similar distance of the feature of the two or more objects.

In some embodiments, the processing device may perform step 402 for each of the plurality of features to calculate the similar distance of that feature, thereby obtaining the similar distances respectively corresponding to the plurality of features.

The similar distance may be used to reflect the similarity between two objects. As with the joint similarity, for each of the plurality of features, the similar distance between every two of the objects to be identified under the feature may be calculated, or the similar distance between the target object and each of the remaining objects under the feature may be calculated. In some embodiments, the smaller the similar distance, the greater the similarity.

In some embodiments, the similar distances may include cosine distances, Euclidean distances, spherical distances, Jaccard distances, and the like.

In some embodiments, the processing device may calculate the similar distance between features based on a distance function. Taking the Jaccard distance as an example, the processing device may calculate the similar distance between graph structure features, or between spatial features, based on the Jaccard operator. In some embodiments, the Jaccard operator may be calculated as shown in equation (1) below.

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (1)$$

where A represents a first feature, for example, a graph structure feature of one of the objects to be identified; B represents a second feature, for example, a graph structure feature of another object to be identified; and $|\cdot|$ denotes the number of elements. Further, the Jaccard distance can be obtained as $1 - J(A, B)$.

For example, taking a spatial feature that includes multiple position points, the more elements two trajectories share and the higher the proportion of shared elements in the total number of elements, the more similar the trajectories are, the larger the value calculated by the Jaccard operator is, and the smaller the corresponding Jaccard distance is. The same applies to graph structure features and is not described in detail here.
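The relationship between the Jaccard operator and the Jaccard distance can be sketched as follows, assuming the usual definition of the operator as the size of the intersection divided by the size of the union. The treatment of two empty inputs is a convention chosen here, not something stated in the specification.

```python
from typing import Hashable, Iterable


def jaccard_operator(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Share of elements common to both inputs among all distinct elements."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty inputs are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


def jaccard_distance(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Distance derived from the operator: more overlap means a smaller distance."""
    return 1.0 - jaccard_operator(a, b)


if __name__ == "__main__":
    track_a = ["cell_1", "cell_2", "cell_3"]
    track_b = ["cell_2", "cell_3", "cell_4"]
    print(jaccard_operator(track_a, track_b))
    print(jaccard_distance(track_a, track_b))
```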

In some embodiments, the processing device may calculate similar distances using different distance functions for different features of the object to be identified. For example, for the graph structure features, or for spatial features that include a plurality of location points, the Jaccard distance between the features of the two objects may be calculated; when the spatial feature is a single location point, the spherical distance between the spatial features of the two objects may be calculated; and the similar distance between the semantic features of the two objects may be calculated using the cosine distance or the Euclidean distance.
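The vector and single-point distance functions named above can be sketched with the standard library. The spherical distance is computed here with the haversine formula on (latitude, longitude) pairs in degrees and a mean Earth radius of 6371 km; both the formula choice and the coordinate convention are assumptions, since the specification does not fix them.

```python
import math
from typing import Sequence, Tuple


def cosine_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_x * norm_y)


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Straight-line distance between two vectors of equal dimension."""
    return math.dist(x, y)


def spherical_distance_km(
    p: Tuple[float, float],
    q: Tuple[float, float],
    radius_km: float = 6371.0,
) -> float:
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    lat1, lon1 = math.radians(p[0]), math.radians(p[1])
    lat2, lon2 = math.radians(q[0]), math.radians(q[1])
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(h))


if __name__ == "__main__":
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
    print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
    print(spherical_distance_km((30.27, 120.15), (31.23, 121.47)))
```

Note that these distances live on different scales (a spherical distance in kilometres is unbounded, a cosine distance is not), so in practice the weights or a normalization step would need to account for that.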

Step 404, performing weighted summation on the plurality of similar distances to obtain the joint similarity.

For two objects, after obtaining the similar distances of the two objects under the different features based on the foregoing steps, the processing device may perform weighted summation on the multiple similar distances based on the weight coefficients corresponding to the respective features, and obtain the joint similarity based on the result of the weighted summation.

Illustratively, taking the weighted summation of the semantic features and the single-point spatial features as an example, the calculation process can be shown as the following formula (2).

$$dist = w_1 \cdot f_{cos}(X_{bert}, Y_{bert}) + w_2 \cdot f_{sphere}(X_{lbs}, Y_{lbs}) \qquad (2)$$

where $dist$ denotes the result of the weighted summation; $w_1$ and $w_2$ represent the corresponding weights, which can be set by the user; $f_{cos}$ represents the cosine distance function; $X_{bert}$ represents the semantic feature of one of the objects to be identified, and $Y_{bert}$ represents the semantic feature of the other object to be identified; $f_{sphere}$ represents the spherical distance function; $X_{lbs}$ represents the spatial feature of one of the objects to be identified, and $Y_{lbs}$ represents the spatial feature of the other object to be identified.

In this embodiment, a user may customize the feature types used for the weighted summation. For example, as in the above example, the user may select the semantic feature and the spatial feature for weighted summation, or select the graph characteristic feature and the graph structure feature, or select the semantic feature, the graph characteristic feature, the graph structure feature, and the spatial feature at the same time. The user may also customize the weights used for the weighted summation. The technical scheme disclosed in this embodiment of the specification can therefore be applied in a targeted manner to different types of application scenarios, for example, scenarios such as security contract, same-shop identification, international business same-shop, and business graph construction.
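The weighted summation of step 404 with user-selected features and weights can be sketched as follows. The feature names and weights are hypothetical. The specification only says that the joint similarity is obtained based on the weighted sum, so the conversion `1 / (1 + dist)` used here is an assumption chosen so that a smaller distance gives a larger similarity.

```python
from typing import Dict


def weighted_distance(distances: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum over the features the user selected (the keys of `weights`)."""
    missing = set(weights) - set(distances)
    if missing:
        raise KeyError(f"no distance available for features: {sorted(missing)}")
    return sum(weight * distances[name] for name, weight in weights.items())


def joint_similarity(dist: float) -> float:
    """Assumed mapping from a non-negative weighted distance to a similarity in (0, 1]."""
    if dist < 0.0:
        raise ValueError("the weighted distance is expected to be non-negative")
    return 1.0 / (1.0 + dist)


if __name__ == "__main__":
    per_feature = {"semantic": 0.2, "spatial": 0.4, "graph_structure": 0.1}
    # Only the semantic and spatial features are selected in this scenario.
    selected = {"semantic": 0.5, "spatial": 0.5}
    dist = weighted_distance(per_feature, selected)
    print(dist, joint_similarity(dist))
```

Because features are selected simply by which keys appear in `weights`, a different application scenario can reuse the same per-feature distances with a different weight dictionary.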

It should be noted that the above description of the respective flows is only for illustration and description, and does not limit the applicable scope of the present specification. Various modifications and alterations to the flow may occur to those skilled in the art, given the benefit of this description. However, such modifications and variations are intended to be within the scope of the present description. For example, changes to the flow steps described herein, such as the addition of pre-processing steps and storage steps, may be made.

FIG. 5 is an exemplary block diagram of a knowledge-graph based same entity identification system in accordance with some embodiments of the present description. As shown in fig. 5, the system 500 may include a feature acquisition module 510, a joint similarity determination module 520, and a same entity identification module 530.

The feature acquisition module 510 may be configured to acquire a plurality of features of two or more objects to be identified, respectively.

The plurality of features comprise graph characteristic features and graph structure features, wherein the graph characteristic features reflect the characteristic information of corresponding nodes of the object in the graph, and the graph structure features reflect the information of at least part of the associated nodes of the corresponding nodes of the object in the graph.

In some embodiments, the plurality of features may also include semantic features and/or spatial features.

The joint similarity determination module 520 may be used to determine joint similarity of multiple features of the two or more objects.

The same entity identification module 530 may be configured to determine whether the two or more objects correspond to the same entity based on the joint similarity.

With regard to the specific description of the modules of the system shown above, reference may be made to the flow chart portion of this specification, e.g., the associated description of fig. 2-4.
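How the three modules could be composed in software is sketched below. The class, its field names, and the threshold-based decision are assumptions for illustration; the specification leaves the concrete module implementation open.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

Features = Dict[str, Sequence[float]]


@dataclass
class SameEntityIdentificationSystem:
    # Role of the feature acquisition module 510: object id -> features.
    acquire_features: Callable[[str], Features]
    # Role of the joint similarity determination module 520.
    joint_similarity: Callable[[Features, Features], float]
    # Preset joint similarity threshold used by the identification step.
    threshold: float

    def is_same_entity(self, object_a: str, object_b: str) -> bool:
        """Role of the same entity identification module 530."""
        features_a = self.acquire_features(object_a)
        features_b = self.acquire_features(object_b)
        return self.joint_similarity(features_a, features_b) > self.threshold


if __name__ == "__main__":
    store = {
        "account_1": {"semantic": [1.0, 0.0]},
        "account_2": {"semantic": [1.0, 0.0]},
        "account_3": {"semantic": [0.0, 1.0]},
    }
    system = SameEntityIdentificationSystem(
        acquire_features=lambda object_id: store[object_id],
        joint_similarity=lambda fa, fb: 1.0 if fa == fb else 0.0,
        threshold=0.5,
    )
    print(system.is_same_entity("account_1", "account_2"))
    print(system.is_same_entity("account_1", "account_3"))
```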

It should be understood that the system and its modules shown in FIG. 5 may be implemented in a variety of ways. For example, in some embodiments, the system and its modules may be implemented in hardware, software, or a combination of software and hardware. Wherein the hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory for execution by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the methods and systems described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a diskette, CD-or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The system and its modules in this specification may be implemented not only by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., but also by software executed by various types of processors, for example, or by a combination of the above hardware circuits and software (e.g., firmware).

It should be noted that the above descriptions of the knowledge-graph based same entity identification system and its modules are only for convenience of description, and should not be construed as limiting the scope of the present disclosure to the illustrated embodiments. It will be appreciated by those skilled in the art that, given the teachings of the present system, the modules may be combined arbitrarily, or a sub-system may be configured to connect to other modules, without departing from such teachings. For example, in some embodiments, the feature acquisition module 510, the joint similarity determination module 520, and the same entity identification module 530 may be different modules in one system, or a single module may implement the functions of two or more of the modules described above. For example, the feature acquisition module 510 and the joint similarity determination module 520 may be two modules, or one module may have both the acquisition and determination functions. As another example, the modules may share one storage module, or each module may have its own storage module. Such variations are within the scope of the present disclosure.

The beneficial effects that may be brought by the embodiments of the present description include, but are not limited to: (1) the technical scheme disclosed by the specification identifies the same entity through weighted combination of multiple features such as semantic features, graph structure features, spatial features and graph characteristic features, thereby improving the same entity identification capability; (2) based on the probability collision principle, features of excessively long dimension are mapped to a preset dimension (such as 128 or 256) while the collision probability between the features of two entities is kept as consistent as possible before and after compression, so that distance calculation on the excessively long features is equivalent to distance calculation on fixed-length features, which effectively improves the calculation speed; (3) different distance functions or operators are provided for different types of features, weighted distance combination is realized, and both the weights and the selection of features can be customized by the user. Each type of feature may carry a different weight in different scenarios; the weighted combination not only reduces the time consumed by computing multiple vector paths separately, but also provides more flexibility for the application.

It is to be noted that different embodiments may produce different advantages, and in different embodiments, any one or combination of the above advantages may be produced, or any other advantages may be obtained.

Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be considered as illustrative only and not limiting, of the present invention. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.

Also, the description uses specific words to describe embodiments of the description. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the specification is included. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, certain features, structures, or characteristics may be combined as suitable in one or more embodiments of the specification.

Moreover, those skilled in the art will appreciate that aspects of the present description may be illustrated and described in terms of several patentable categories or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful modification thereof. Accordingly, aspects of this description may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the present description may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.

The computer storage medium may comprise a propagated data signal with the computer program code embodied therewith, for example, on baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, etc., or any suitable combination. A computer storage medium may be any computer-readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated over any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or any combination of the preceding.

Computer program code required for the operation of various portions of this specification may be written in any one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python, and the like, a conventional programming language such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, a dynamic programming language such as Python, Ruby, and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet), or in a cloud computing environment, or as a service, such as software as a service (SaaS).

Additionally, the order in which the elements and sequences of the process are recited in the specification, the use of alphanumeric characters, or other designations, is not intended to limit the order in which the processes and methods of the specification occur, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.

Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to imply that the claimed subject matter requires more features than are expressly recited in a claim. Indeed, the embodiments may be characterized as having less than all of the features of a single embodiment disclosed above.

Numerals describing the number of components, attributes, etc. are used in some embodiments, it being understood that such numerals used in the description of the embodiments are modified in some instances by the use of the modifier "about", "approximately" or "substantially". Unless otherwise indicated, "about", "approximately" or "substantially" indicates that the number allows a variation of ± 20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending upon the desired properties of the individual embodiments. In some embodiments, the numerical parameter should take into account the specified significant digits and employ a general digit preserving approach. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the range are approximations, in the specific examples, such numerical values are set forth as precisely as possible within the scope of the application.

For each patent, patent application publication, and other material, such as articles, books, specifications, publications, documents, etc., cited in this specification, the entire contents are hereby incorporated by reference into this specification, except for any application history document that is inconsistent with or conflicts with the contents of this specification, and any document (currently or later attached to this specification) that limits the broadest scope of the claims of this specification. It is to be understood that if the descriptions, definitions and/or uses of terms in the materials accompanying this specification are inconsistent with or contrary to those in this specification, the descriptions, definitions and/or uses of terms in this specification shall control.

Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present disclosure. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the present specification can be seen as consistent with the teachings of the present specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.

Claims

1. A method of identical entity identification based on a knowledge-graph, the method comprising:

respectively acquiring a plurality of characteristics of two or more objects to be identified; the plurality of features comprise graph characteristic features and graph structure features, wherein the graph characteristic features reflect the characteristic information of corresponding nodes of the object in the graph, and the graph structure features reflect the information of at least part of associated nodes of the corresponding nodes of the object in the graph; wherein, obtaining the graph structure characteristics of the object comprises: acquiring description information of at least part of associated nodes of corresponding nodes of the object in the graph; splicing the description information of each associated node to obtain splicing description information; processing the splicing description information based on a probability collision algorithm to obtain a vector of a preset dimension, and taking the vector as a graph structure characteristic of the object;

determining a joint similarity of a plurality of features of the two or more objects; wherein the joint similarity reflects an overall similarity of a plurality of features of the object to be identified;

determining whether the two or more objects correspond to the same entity based on the joint similarity.

2. The method of claim 1, obtaining a graph characteristic feature of an object, comprising:

performing representation learning on the graph;

and taking the vector representation, obtained by the representation learning, of the node corresponding to the object in the graph as the graph characteristic feature of the object.

3. The method according to claim 1, wherein the information of the associated node comprises description information, and the description information further comprises attribute information of the associated node or comprises attribute information of the associated node and relationship information of the associated node and the corresponding node of the object.

4. The method of claim 3, obtaining graph structure features of an object, comprising:

acquiring description information of at least part of associated nodes of corresponding nodes of the object in the graph;

grouping the associated nodes based on the relationship types of the corresponding nodes and the associated nodes;

for each group of grouping results: splicing the description information of the associated nodes of the group to obtain splicing description information; processing the splicing description information based on a probability collision algorithm to obtain a vector of a preset dimension;

and fusing the vectors corresponding to the groups to be used as the graph structure characteristics of the object.

5. The method of claim 1, the at least some associated nodes including neighbor nodes of the object within an N-hop subgraph of the corresponding node in the graph; wherein N is a positive integer.

6. The method of claim 1, the plurality of features further comprising semantic features and/or spatial features.

7. The method of claim 6, obtaining semantic features of an object, comprising:

acquiring a description text of the object;

and generating a corresponding semantic vector based on the description text, and taking the semantic vector as the semantic feature of the object.

8. The method of claim 6, wherein the spatial feature of the object comprises a single location point of the object, or the spatial feature of the object is obtained as follows:

acquiring a position sequence of the object; the sequence of positions includes two or more position points of the object;

and processing the position sequence based on a probability collision algorithm to obtain a vector of a preset dimension, and taking the vector as the spatial feature of the object.

9. The method of claim 1 or 6, the determining a joint similarity of features of the two or more objects, comprising:

for each of the plurality of features: determining a similar distance of the feature of the two or more objects; further obtaining similar distances corresponding to the plurality of characteristics respectively;

and carrying out weighted summation on the plurality of similar distances to further obtain the joint similarity.

10. The method of claim 9, the similar distances comprising one or more of the following types: cosine distance, Euclidean distance, spherical distance, and Jaccard distance.

11. A system for identical entity recognition based on a knowledge-graph, wherein the system comprises:

a feature acquisition module for respectively acquiring a plurality of features of two or more objects to be identified; the plurality of features comprise graph characteristic features and graph structure features, wherein the graph characteristic features reflect the characteristic information of corresponding nodes of the object in the graph, and the graph structure features reflect the information of at least part of associated nodes of the corresponding nodes of the object in the graph; wherein, obtaining the graph structure characteristics of the object comprises: acquiring description information of at least part of associated nodes of corresponding nodes of the object in the graph; splicing the description information of each associated node to obtain splicing description information; processing the splicing description information based on a probability collision algorithm to obtain a vector of a preset dimension, and taking the vector as a graph structure feature of the object;

a joint similarity determination module for determining joint similarities of features of the two or more objects; wherein the joint similarity reflects an overall similarity of a plurality of features of the object to be identified;

a same entity identification module to determine whether the two or more objects correspond to a same entity based on the joint similarity.

12. A knowledge-graph based same entity identification apparatus comprising at least one storage medium and at least one processor, the at least one storage medium for storing computer instructions; the at least one processor is configured to execute the computer instructions to implement the method of any of claims 1-10.