CN114359610B - Entity classification method, device, equipment and storage medium

Info

Publication number
CN114359610B
CN114359610B (application CN202210183949.4A)
Authority
CN
China
Prior art keywords
entity
block
entities
target
identifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210183949.4A
Other languages
Chinese (zh)
Other versions
CN114359610A (en)
Inventor
吕继根
王维煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210183949.4A
Publication of CN114359610A
Application granted
Publication of CN114359610B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides an entity classification method, apparatus, device, storage medium, and program product, relating to the technical field of artificial intelligence, and in particular to the field of knowledge graphs. A specific implementation scheme is as follows: determining a block identifier of each entity according to the attributes of each of a plurality of entities; obtaining a pre-blocking result, the pre-blocking result being obtained by performing a pre-blocking operation on the plurality of entities; determining target block identifiers among the block identifiers of the plurality of entities according to the pre-blocking result; splitting each target block identifier to obtain a plurality of block sub-identifiers; performing a clustering operation on the entities corresponding to each of the plurality of block sub-identifiers to obtain a clustering result; and classifying the plurality of entities according to the clustering result.

Description

Entity classification method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly to the field of knowledge graphs.
Background
Graph disambiguation refers to entity disambiguation in the process of building a knowledge graph, also called entity normalization or entity unification. The main purpose of entity disambiguation is to determine whether entities from multiple different information sources point to the same real-world object, so that the information they contain can then be fused and aggregated.
Different knowledge graphs or information sources describe the same entity with some differences, and duplicate entities need to be complementarily fused to form a comprehensive, accurate, and complete entity description. Entity disambiguation is therefore required when constructing a graph from multiple sources.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, storage medium, and program product for entity classification.
According to an aspect of the present disclosure, there is provided a method of entity classification, including: determining a block identifier of each entity according to the attribute of each entity in a plurality of entities; obtaining a pre-blocking result, wherein the pre-blocking result is obtained by performing a pre-blocking operation on the plurality of entities; determining target block identifiers in the block identifiers of the multiple entities according to the pre-blocking result; splitting the target block identifier to obtain a plurality of block sub-identifiers; respectively carrying out clustering operation on the entity corresponding to each block sub-identifier in the plurality of block sub-identifiers to obtain a clustering result; and classifying the plurality of entities according to the clustering result.
According to another aspect of the present disclosure, there is provided an apparatus for entity classification, including: a first determining module, configured to determine a block identifier of each entity according to the attributes of each of a plurality of entities; an obtaining module, configured to obtain a pre-blocking result, where the pre-blocking result is obtained by performing a pre-blocking operation on the plurality of entities; a second determining module, configured to determine target block identifiers among the block identifiers of the plurality of entities according to the pre-blocking result; a splitting module, configured to split each target block identifier to obtain a plurality of block sub-identifiers; a clustering module, configured to perform a clustering operation on the entities corresponding to each of the plurality of block sub-identifiers to obtain a clustering result; and a classification module, configured to classify the plurality of entities according to the clustering result.
Another aspect of the present disclosure provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of the embodiments of the present disclosure.
According to another aspect of the disclosed embodiments, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method shown in the disclosed embodiments.
According to another aspect of the embodiments of the present disclosure, there is provided a computer program product comprising a computer program/instructions, wherein the computer program/instructions, when executed by a processor, implement the steps of the method shown in the embodiments of the present disclosure.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an application scenario of a method, an apparatus, an electronic device and a storage medium for entity classification according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flow diagram of a method of entity classification according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow diagram of a method of pre-blocking entities according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow diagram of a method of splitting a target block identifier according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow diagram of a method of clustering the entities corresponding to each block sub-identifier according to an embodiment of the present disclosure;
FIG. 6 schematically shows a flow chart of a method of classifying a plurality of entities according to a clustering result according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a method of entity classification according to another embodiment of the disclosure;
FIG. 8 schematically illustrates a block diagram of an apparatus for entity classification according to an embodiment of the present disclosure; and
FIG. 9 schematically shows a block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
An application scenario of the entity classification method and apparatus provided by the present disclosure will be described below with reference to fig. 1.
The construction of a knowledge graph involves a large amount of entity data, which may come from a number of different information sources. Entity data from different information sources describe the same object with some differences, so entities pointing to the same object need to be complementarily fused to form a comprehensive, accurate, and complete entity description; that is, entity disambiguation is performed on the entity data. According to embodiments of the present disclosure, a similarity between these entities may be calculated. If the similarity between two entities is higher than a similarity threshold, the two entities point to the same real-world object. The similarity threshold can be set according to actual needs. On this basis, the entities can be classified according to their similarities, and entities whose similarity is higher than the similarity threshold are divided into the same entity set. Entities in the same entity set all correspond to the same object. The entities in each entity set may then be fused to obtain a fused entity.
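As a concrete illustration of the threshold-based grouping described above, the following Python sketch groups entities by pairwise similarity. The Jaccard overlap of attribute-value sets used here is only a stand-in for whatever similarity function an implementation actually uses, and all entity names and attributes are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    # Stand-in similarity: overlap of two entities' attribute-value sets.
    return len(a & b) / len(a | b) if (a | b) else 0.0

def group_entities(entities: dict, threshold: float) -> list:
    # Greedy grouping: an entity joins the first group whose seed member
    # it resembles more than the threshold; otherwise it starts a new group.
    groups = []
    for name, attrs in entities.items():
        for group in groups:
            if jaccard(entities[group[0]], attrs) > threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups
```

Each resulting group corresponds to one entity set whose members would subsequently be fused.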
Fig. 1 is a schematic view of an application scenario of a method, an apparatus, an electronic device, and a storage medium for entity classification according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of an application scenario in which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, but does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the application scenario 100 includes a plurality of entities to be classified, such as entities 101, 102, 103, and 104. According to an embodiment of the present disclosure, it may be determined that the similarity between the entities 101, 102, 103, and 104 is higher than the similarity threshold by calculating the similarity between the entities 101, 102, 103, and 104, i.e., the entities 101, 102, 103, and 104 point to the same object. Thus, entities 101, 102, 103, and 104 may be divided into the same set of entities 110. Next, the entities in the entity collection 110 may be fused, generating a fused entity 120.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other processing of entity data and other data all comply with the relevant laws and regulations and do not violate public order and good customs.
Fig. 2 schematically shows a flow chart of a method of entity classification according to an embodiment of the present disclosure.
As shown in fig. 2, the method 200 of entity classification includes determining a blocking identifier of each entity according to an attribute of each entity in a plurality of entities in operation S210.
According to embodiments of the present disclosure, an entity may include, for example, data representing objects such as people, objects, concepts, events, and the like. Attributes may include, for example, name, gender, weight, height, and the like. Each entity may have one or more attributes. In the case where the entity has multiple attributes, one or more of the multiple attributes may be selected, and the blocking identification may be determined based on the selected attributes.
According to embodiments of the present disclosure, the block identification may be a character string generated according to an attribute of the entity. For example, in this embodiment, for example, hash calculation may be performed on an attribute of an entity to obtain a hash value, which is used as a block identifier. It should be noted that, according to other embodiments of the present disclosure, the block identifier may also be generated according to other manners, and this disclosure is not limited in this regard.
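The hash-based block identifier described above can be sketched as follows. The attribute names are hypothetical, and MD5 is used as an illustrative stable hash (Python's built-in `hash` is salted per process, so it would not give consistent identifiers across runs).

```python
import hashlib

def block_key(entity: dict, attrs=("name", "birth_year")) -> str:
    # Join the selected attribute values and hash them; entities that
    # agree on these attributes receive the same block identifier.
    raw = "|".join(str(entity.get(a, "")) for a in attrs)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()
```

Entities that share the selected attribute values land in the same block regardless of their other attributes.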
Then, in operation S220, a pre-blocking result is acquired.
According to the embodiment of the present disclosure, the plurality of entities can be pre-blocked in advance to obtain the pre-blocking result, and the pre-blocking result is stored. The stored pre-blocking result may then be obtained directly when needed.
According to an embodiment of the present disclosure, the pre-blocking result may include, for example, the block identifiers corresponding to entity blocks with a high degree of data skew in the pre-blocking operation. For example, in this embodiment, if the number of entities contained in an entity block is greater than the first block capacity, the entity block is considered to have a high degree of data skew. The first block capacity can be determined according to actual needs.
In operation S230, a target blocking flag is determined among the blocking flags of the plurality of entities according to the pre-blocking result.
According to an embodiment of the present disclosure, for the block identifier of each entity, it may be determined whether the pre-blocking result contains that block identifier. If the pre-blocking result contains the block identifier of the entity, the block identifier is determined to be a target block identifier. If the pre-blocking result does not contain the block identifier of the entity, the block identifier need not be split.
In operation S240, the target block identifier is split to obtain a plurality of block sub-identifiers.
According to the embodiment of the disclosure, the split number can be determined according to actual needs.
For example, if the entities a1, a2, a3, a4, a5, a6, a7, and a8 have the same target block identifier b and the predetermined split number is 2, the target block identifier b may be split into two block sub-identifiers, b1 and b2, where b1 corresponds to a1, a3, a5, and a7, and b2 corresponds to a2, a4, a6, and a8.
In operation S250, for each of the plurality of block sub-identifiers, a clustering operation is performed on the entity corresponding to each block sub-identifier, so as to obtain a clustering result.
According to embodiments of the present disclosure, entities may be clustered, for example, according to similarities between the entities.
For example, the block sub-identifier b1 corresponds to the entities a1, a3, a5, and a7, and the block sub-identifier b2 corresponds to the entities a2, a4, a6, and a8. The clustering operation can be performed separately on the entities a1, a3, a5, and a7 and on the entities a2, a4, a6, and a8. Illustratively, in this embodiment, if a1 and a3 are similar, a1 and a3 are determined to be one entity set; if a5 and a7 are similar, a5 and a7 are determined to be one entity set; if a2 and a4 are similar, a2 and a4 are determined to be one entity set; and if a6 and a8 are similar, a6 and a8 are determined to be one entity set.
In operation S260, a plurality of entities are classified according to the clustering result.
According to the embodiment of the disclosure, the entities in the same entity set are the same type of entity.
According to the embodiment of the present disclosure, the pre-blocking result is predetermined, and during entity classification the blocks with a high degree of data skew are further split according to the pre-blocking result, so that memory usage can be reduced and processing efficiency improved.
The method for pre-blocking an entity provided by the present disclosure will be described below with reference to fig. 3.
Fig. 3 schematically shows a flow diagram of a method of pre-blocking an entity according to an embodiment of the present disclosure.
As shown in fig. 3, the method 300 for performing a pre-blocking operation on an entity includes generating a blocking identifier according to an attribute of an entity for each entity in a plurality of entities in operation S310.
According to the embodiment of the present disclosure, the operation of generating the block identifier according to the attribute of the entity is the same as that described above, and reference may be made to the above, which is not described herein again.
In operation S320, the entities having the same block identifier among the multiple entities are divided into one entity block, so as to obtain a plurality of entity blocks.
In operation S330, target entity blocks whose entity count is greater than the first block capacity are determined among the plurality of entity blocks, and the target block identifiers corresponding to the target entity blocks are determined as the pre-blocking result.
According to the embodiment of the present disclosure, the first block capacity may be determined according to actual needs; for example, it may be flexibly configured according to statistical data and machine resources. The pre-blocking result may include, for example, the target block identifiers corresponding to blocks with a high degree of data skew. In this embodiment, if the number of entities in a block is greater than the first block capacity, the block is considered to have a high degree of data skew. According to further embodiments of the present disclosure, the pre-blocking result may include, in addition to the target block identifiers, the entity identifiers corresponding to each target block identifier.
For example, the entities include c1, c2, c3, c4, c5, c6, c7, and c8, and the first block capacity is 4. The entities c1, c2, c3, c4, and c5 have the same block identifier d1, and the entities c6, c7, and c8 have the same block identifier d2. On this basis, c1, c2, c3, c4, and c5 may be divided into one entity block, block1, and c6, c7, and c8 may be divided into another entity block, block2. The number of entities in block1 is 5, which is greater than the first block capacity of 4, while the number of entities in block2 is 3, which is less than the first block capacity. Therefore, block1 can be determined to be a target entity block, and the block identifier d1 corresponding to block1 can be determined as the pre-blocking result.
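The pre-blocking step just described can be sketched as follows. Note that it retains only the identifiers of oversized blocks, not the full entities; the function and variable names are illustrative.

```python
from collections import defaultdict

def pre_block(entity_keys: dict, capacity: int) -> dict:
    # entity_keys maps entity id -> block identifier. Group the ids by
    # identifier, then keep only the blocks whose size exceeds the
    # capacity; these identifiers form the pre-blocking result.
    blocks = defaultdict(list)
    for eid, key in entity_keys.items():
        blocks[key].append(eid)
    return {k: ids for k, ids in blocks.items() if len(ids) > capacity}
```

Run on the c1..c8 example with a first block capacity of 4, only d1 (with its five entity ids) survives into the pre-blocking result.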
According to the embodiment of the present disclosure, unlike the blocking step in entity classification, the pre-blocking stage outputs only block identifiers and the entity identifiers corresponding to them, whereas the blocking step outputs the full entities together with their block identifiers.
The method for splitting the target partition identification provided by the present disclosure will be described below with reference to fig. 4.
Fig. 4 schematically illustrates a flow chart of a method of splitting a target partition identification according to an embodiment of the present disclosure.
As shown in fig. 4, the method 440 of splitting the target block identifier includes, in operation S441, determining, for each target block identifier, a split parameter according to the number of entities corresponding to the target block identifier and the second block capacity.
According to an embodiment of the present disclosure, the split parameter may be calculated, for example, according to the following formula:
s = num / n + 1
where s is the split parameter, num is the number of entities corresponding to the target block identifier, n is the second block capacity, and the division is integer (floor) division.
According to the embodiment of the present disclosure, the second block capacity may be determined according to actual needs; for example, it may be flexibly configured according to statistical data and machine resources. The second block capacity may be the same as or different from the first block capacity.
In operation S442, a plurality of block sub-identifiers are determined according to the entity identifiers corresponding to the target block identifier and the split parameter.
According to embodiments of the present disclosure, an additional character string may be determined from each entity identifier and the split parameter. The target block identifier is then combined with the additional character string to obtain a block sub-identifier.
For example, the additional character string may be appended as a suffix to the end of the target block identifier to obtain the block sub-identifier.
According to an embodiment of the present disclosure, the additional character string may be determined, for example, by the following splitting function:
fun = hash(id) % s
where fun is the splitting function, whose value is the additional character string; hash() denotes a hash operation, % denotes the modulo (remainder) operation, id is the entity identifier, and s is the split parameter.
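Put together, the split parameter and splitting function above can be sketched as follows. MD5 stands in for the unspecified hash() so that results are stable across runs; the function names are illustrative.

```python
import hashlib

def split_param(num: int, n: int) -> int:
    # s = num / n + 1, with integer division as in the formula above.
    return num // n + 1

def stable_hash(entity_id: str) -> int:
    # Deterministic stand-in for hash(id); Python's built-in hash is salted.
    return int(hashlib.md5(entity_id.encode("utf-8")).hexdigest(), 16)

def sub_suffix(entity_id: str, s: int) -> int:
    # fun = hash(id) % s: assigns each entity to one of s sub-blocks.
    return stable_hash(entity_id) % s
```

With 5 entities and a second block capacity of 4, the block splits into s = 2 sub-blocks, and every entity's suffix falls in the range 0..s-1.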
According to the embodiment of the present disclosure, splitting the target block identifier reduces memory usage and avoids the out-of-memory problems caused by data skew.
The method for clustering the entities corresponding to each sub-identifier of the block provided by the present disclosure will be described below with reference to fig. 5.
Fig. 5 schematically illustrates a flow chart of a method of clustering entities corresponding to each chunking sub-identity according to an embodiment of the present disclosure.
As shown in fig. 5, the method 550 of clustering the entities corresponding to each of the sub-identifiers includes determining the entities corresponding to each of the sub-identifiers as a set of entities to be processed in operation S551.
According to an embodiment of the present disclosure, a block sub-identifier may correspond to a plurality of entities, each of which is assigned that block sub-identifier. On this basis, the plurality of entities may be treated as a set of entities to be processed.
In operation S552, a central entity is determined from the set of entities to be processed.
According to an embodiment of the present disclosure, the central entity may be any one of a set of entities to be processed. For example, in this embodiment, one entity in the set of entities to be processed may be randomly determined as a central entity.
In operation S553, a first similarity between each entity in the set of entities to be processed and the central entity is calculated.
According to an embodiment of the present disclosure, the first similarity may be determined, for example, by calculating the cosine similarity or the Euclidean distance between each entity and the central entity. In practical applications, the similarity may also be calculated in other ways, which is not specifically limited by this disclosure.
In operation S554, an entity with a first similarity greater than a similarity threshold in the set of entities to be processed is determined as a target entity set.
According to the embodiment of the present disclosure, the similarity threshold may be set according to actual needs, and the present disclosure is not particularly limited thereto.
In operation S555, an entity with a first similarity smaller than or equal to the similarity threshold in the set of entities to be processed is determined as a new set of entities to be processed, and operation S552 is returned for the new set of entities to be processed.
If no entity in the set of entities to be processed has a first similarity smaller than or equal to the similarity threshold, all entities have been divided into corresponding target entity sets, and the clustering operation ends.
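Operations S551 through S555 amount to the following loop. This is a sketch: the similarity function is passed in as a parameter, and the random center pick of S552 is replaced by a deterministic first-element pick so the behavior is reproducible.

```python
def cluster(pending: dict, sim, threshold: float) -> list:
    # pending: entity id -> representation. Pick a center, collect every
    # entity more similar to it than the threshold into one target entity
    # set, then repeat on whatever remains (S552-S555).
    pending = dict(pending)
    clusters = []
    while pending:
        center = next(iter(pending))
        center_rep = pending[center]
        members = [eid for eid, rep in pending.items()
                   if eid == center or sim(center_rep, rep) > threshold]
        clusters.append((center, members))
        for eid in members:
            del pending[eid]
    return clusters
```

Each iteration peels one target entity set off the set of entities to be processed, so the loop terminates once every entity has been assigned.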
A method for classifying a plurality of entities according to a clustering result provided by the present disclosure will be described below with reference to fig. 6.
Fig. 6 schematically shows a flow chart of a method of classifying a plurality of entities according to clustering results according to an embodiment of the present disclosure.
As shown in fig. 6, the method 650 of classifying a plurality of entities according to the clustering result includes, in operation S661, taking as objects, among the plurality of target entity sets, the corresponding target entity sets that correspond to the plurality of block sub-identifiers split from the same target block identifier.
In operation S662, a second similarity between the central entities of each pair of the corresponding target entity sets is calculated.
According to the embodiment of the present disclosure, the method for calculating the second similarity is the same as the method for calculating the first similarity, and reference may be made to the above description, which is not repeated herein.
In operation S663, the plurality of corresponding target entity sets are merged according to the second similarity.
According to the embodiment of the present disclosure, if two central entities are similar, the entity sets corresponding to them are also similar. On this basis, if the second similarity between two central entities is greater than the similarity threshold, the target entity sets corresponding to the two central entities may be merged into one.
According to the embodiment of the present disclosure, only the similarities between the central entities of the entity sets need to be compared, with no pairwise comparison of all entities in the entity sets, which reduces the number of calculations and improves computational efficiency.
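The center-only merge of operations S661 through S663 can be sketched like this. Clusters carry their center's representation, and the similarity function is passed in as a parameter, as in the earlier sketches; both are assumptions about an unspecified implementation.

```python
def merge_clusters(clusters: list, sim, threshold: float) -> list:
    # clusters: list of (center_representation, member_ids). Two target
    # entity sets are merged when their centers alone are similar enough;
    # no pairwise comparison over all members is performed (S662-S663).
    merged = []
    for center, members in clusters:
        for other in merged:
            if sim(other[0], center) > threshold:
                other[1].extend(members)
                break
        else:
            merged.append((center, list(members)))
    return merged
```

For k clusters this costs O(k^2) center comparisons instead of comparing every pair of entities across sets.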
The method for classifying data shown above is further described with reference to fig. 7 in conjunction with specific embodiments. It will be appreciated by those skilled in the art that the following example embodiments are only for the understanding of the present disclosure, and the present disclosure is not limited thereto.
Illustratively, in this embodiment, a pipeline based on Spark parallel computing may be adopted to implement the data classification method of the embodiments of the present disclosure. For example, the map-reduce framework of Spark may be employed.
Based on this, fig. 7 schematically shows a method schematic of entity classification according to another embodiment of the present disclosure.
In fig. 7, it is shown that, in operation S701, raw entity data is preprocessed.
According to embodiments of the present disclosure, preprocessing may include, for example, normalized representation of raw entity data, data cleansing, data verification, and so forth.
In operation S702, a pre-blocking operation is performed on the original entity data to obtain a pre-blocking result.
According to the embodiment of the present disclosure, in the map stage, for each entity, a block identifier key is generated according to the value of an attribute, or a combination of attribute values, of the entity, so that similar entities are divided into the same block as far as possible. The map output of the pre-blocking contains only keys and entity identifiers (ids). In the reduce stage, the number of entity ids corresponding to the same key is counted. In this embodiment, a block capacity threshold n is preset; if the number of entity ids is greater than n, the key and its entity ids are output.
In operation S703, blocking and blocking splitting are performed according to the pre-blocking result.
According to an embodiment of the present disclosure, the pre-blocking result d may be loaded. In this embodiment, a Spark broadcast variable may be used, thereby reducing network transmission and memory usage. Then, in the map stage, the block identifier key0 is generated in the same way as in pre-blocking. Entities corresponding to the same key0 are divided into one block. If key0 is in d, the block represented by key0 is further split.
More specifically, for each block to be split, the split number s = num / n + 1 may be calculated for the key0 of each entity in the block, where num is the number of entities in the block represented by key0, n is the block capacity, and s is the split parameter. Then, the splitting function fun = hash(id) % s is calculated, and key0 and fun(id) are combined to obtain the block sub-identifier key1 as key0@@fun(id), where @@ is a delimiter. Entities corresponding to the same key1 are then grouped into the same entity set, list.
According to an embodiment of the present disclosure, if key0 is not in d, the block represented by key0 is not split.
In operation S704, a clustering operation is performed within each block.
According to the embodiment of the disclosure, for each block sub-identifier key1, one entity p _ i is randomly selected from the entity set list corresponding to the block sub-identifier key1, and the similarity score between each entity and p _ i in the list is calculated through a custom similarity function m.
In this embodiment, a similarity threshold t is preset. A set of similar entities P (P _ i, [ P _1, P _2,.., P _ n ]) centered at P _ i is generated for a set of entities with score greater than t. And forming a new entity set list for the entity sets with score smaller than t.
And then, randomly selecting an entity p _ i from the new entity set list, and returning to the operation of calculating the similarity score between each entity in the list and the p _ i through the self-defined similarity function m until the list is empty.
When the list is empty, the plurality of similar entity sets P centered at the respective p_i are output.
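The per-block clustering loop above can be sketched as below. The similarity function m and threshold t are supplied by the caller, as in the description; everything else (function name, return shape) is an illustrative assumption.

```python
import random


def cluster_block(entity_list, m, t):
    """Cluster one block's entities around randomly chosen centers.

    entity_list: entities under one block sub-identifier key1;
    m: similarity function m(a, b) -> score; t: similarity threshold.
    Returns a list of (center p_i, similar entities) sets.
    """
    results = []
    remaining = list(entity_list)
    while remaining:
        # Randomly select a center p_i from the current list.
        p_i = random.choice(remaining)
        similar, rest = [], []
        for p in remaining:
            if p is p_i:
                continue
            # Entities with score greater than t join the set around p_i;
            # the rest form a new list for the next iteration.
            (similar if m(p, p_i) > t else rest).append(p)
        results.append((p_i, similar))
        remaining = rest  # repeat until the list is empty
    return results
```

Every entity ends up in exactly one set, either as a center or as a member, so the sets partition the block.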
In operation S705, the clustering results of the split blocks are merged. Blocks that were not split skip this step.
According to an embodiment of the present disclosure, a reduce operation may be performed on the split blocks by the original key0 of each block, so as to obtain all similar entity sets P_1, P_2, ..., P_n under the original block represented by key0.
For example, for each split block, a pairwise similarity calculation is performed on the central points p_i of the similar entity sets corresponding to the block. If a central point p_x is not similar to p_y, none of the entities in the similar entity sets represented by p_x and p_y are similar; if p_x is similar to p_y, the similar entity sets represented by p_x and p_y are merged.
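The center-based merge can be sketched as follows. It reuses the same similarity function m and threshold t as the clustering step, which is an assumption of this sketch (the description only says the centers are compared pairwise).

```python
def merge_clusters(cluster_sets, m, t):
    """Merge the similar entity sets gathered for one split block.

    cluster_sets: list of (center, members) pairs collected by the reduce
    step under the original key0; m, t: similarity function and threshold.
    Only the centers are compared, so the cost is one comparison per pair
    of sets rather than per pair of entities.
    """
    merged = []
    for center, members in cluster_sets:
        for i, (c, ms) in enumerate(merged):
            # If center p_x is similar to p_y, merge their entity sets.
            if m(center, c) > t:
                merged[i] = (c, ms + [center] + members)
                break
        else:
            # Dissimilar centers: the sets stay separate.
            merged.append((center, list(members)))
    return merged
```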
In operation S706, all similar entity sets are output.
Fig. 8 schematically shows a block diagram of an apparatus for entity classification according to an embodiment of the present disclosure.
As shown in fig. 8, the apparatus 800 for entity classification includes a first determining module 810, an obtaining module 820, a second determining module 830, a splitting module 840, a clustering module 850, and a classifying module 860.
A first determining module 810, configured to determine a blocking identifier of each entity according to an attribute of each entity in the multiple entities.
An obtaining module 820, configured to obtain a pre-blocking result, where the pre-blocking result is obtained by performing a pre-blocking operation on the plurality of entities.
A second determining module 830, configured to determine a target block identifier among the block identifiers of the multiple entities according to the pre-blocking result.
The splitting module 840 is configured to split the target block identifier to obtain a plurality of block sub identifiers.
A clustering module 850, configured to perform a clustering operation on the entities corresponding to each block sub-identifier in the plurality of block sub-identifiers, respectively, to obtain a clustering result.
A classification module 860, configured to classify the plurality of entities according to the clustering result.
According to an embodiment of the present disclosure, the apparatus may further include: a generating module, configured to generate a block identifier for each entity in the plurality of entities according to the attribute of the entity; a dividing module, configured to divide entities with the same block identifier among the plurality of entities into one entity block, obtaining a plurality of entity blocks; and a third determining module, configured to determine the target entity blocks, among the plurality of entity blocks, whose entity number is greater than the first block capacity, and to determine the target block identifiers corresponding to the target entity blocks as the pre-blocking result.
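The pre-blocking performed by the generating, dividing and third determining modules can be sketched as below. The function name and the `block_key` attribute-to-identifier mapping are illustrative assumptions; only the count-and-threshold logic comes from the description.

```python
from collections import Counter


def pre_block(entities, block_key, capacity):
    """Pre-blocking operation: find the oversized blocks.

    entities: iterable of entity attribute records; block_key: function
    mapping an entity's attributes to its block identifier; capacity: the
    first block capacity. Returns the set of target block identifiers
    whose blocks hold more entities than the capacity, i.e. the
    pre-blocking result.
    """
    # Generate a block identifier per entity and count entities per block.
    counts = Counter(block_key(e) for e in entities)
    # Blocks whose entity number exceeds the capacity are the targets.
    return {key, for_ in ()} if False else {key for key, num in counts.items() if num > capacity}
```

The returned set is exactly the result d that is later broadcast to decide which blocks to split.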
According to an embodiment of the present disclosure, the second determining module may include: a target block identifier determining submodule, configured to determine, for the block identifier of each entity, the block identifier as the target block identifier in the case that the pre-blocking result contains the block identifier of the entity.
According to an embodiment of the present disclosure, the splitting module may include: a parameter determining submodule, configured to determine, for each target block identifier, a splitting parameter according to the number of entities corresponding to the target block identifier and the second block capacity; and an identifier determining submodule, configured to determine the plurality of block sub-identifiers according to the entity identifiers corresponding to the target block identifier and the splitting parameter.
According to an embodiment of the present disclosure, the clustering result includes a plurality of target entity sets, and the clustering module may include: a first set determining submodule, configured to determine the entities corresponding to each block sub-identifier as a set of entities to be processed; a center determining submodule, configured to determine a central entity from the set of entities to be processed; a first calculating submodule, configured to calculate a first similarity between each entity in the set of entities to be processed and the central entity; a second set determining submodule, configured to determine the entities in the set of entities to be processed whose first similarity is greater than a similarity threshold as a target entity set; and a third set determining submodule, configured to determine the entities in the set of entities to be processed whose first similarity is less than or equal to the similarity threshold as a new set of entities to be processed, and to return to the operation of determining the central entity for the new set of entities to be processed.
According to an embodiment of the present disclosure, the classification module may include: a second calculating submodule, configured to calculate, for the corresponding target entity sets among the plurality of target entity sets that correspond to the plurality of block sub-identifiers, a second similarity between every two central entities of the corresponding target entity sets; and a merging submodule, configured to merge the corresponding target entity sets according to the second similarity.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 schematically shows a block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the respective methods and processes described above, such as the method of entity classification. For example, in some embodiments, the method of entity classification may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the method of entity classification described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g., by means of firmware) to perform the method of entity classification.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service extensibility in traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (12)

1. A method of entity classification, comprising:
determining a block identifier of each entity according to the attribute of each entity in a plurality of entities;
obtaining a pre-blocking result, wherein the pre-blocking result is obtained by performing a pre-blocking operation on the plurality of entities;
determining a target block identifier in the block identifiers of the plurality of entities according to the pre-blocking result;
splitting the target block identifier to obtain a plurality of block sub-identifiers;
respectively carrying out clustering operation on the entity corresponding to each block sub-identifier in the plurality of block sub-identifiers to obtain a clustering result; and
classifying the plurality of entities according to the clustering result,
wherein the clustering result comprises a plurality of target entity sets; the clustering operation is performed on the entity corresponding to each block sub-identifier to obtain a clustering result, and the clustering result comprises the following steps:
determining an entity corresponding to each block sub-identifier as a set of entities to be processed;
determining a central entity from the set of entities to be processed;
calculating a first similarity between each entity in the set of entities to be processed and the central entity;
determining entities with first similarity larger than a similarity threshold in the entity set to be processed as a target entity set; and
determining entities with the first similarity smaller than or equal to the similarity threshold in the entity set to be processed as a new entity set to be processed, and returning to the operation of determining the central entity for the new entity set to be processed.
2. The method of claim 1, wherein the pre-blocking operation comprises:
generating a block identifier for each entity in the plurality of entities according to the attribute of the entity; and
dividing entities with the same block identification in the plurality of entities into one entity block to obtain a plurality of entity blocks;
determining a target entity block with the entity number larger than the first block capacity in the plurality of entity blocks, and determining a target block identifier corresponding to the target entity block as the pre-blocking result.
3. The method of claim 2, wherein the determining a target blocking identifier among the blocking identifiers of the plurality of entities according to the pre-blocking result comprises:
for the block identifier of each entity, determining the block identifier as the target block identifier in the case that the pre-blocking result contains the block identifier of the entity.
4. The method according to any one of claims 1 to 3, wherein the splitting the target blocking identifier to obtain a plurality of blocking sub-identifiers comprises:
for each of the target block identifications,
determining splitting parameters according to the entity number corresponding to the target block identification and the second block capacity; and
determining the plurality of block sub-identifiers according to the entity identifiers corresponding to the target block identifier and the splitting parameter.
5. The method of claim 1, wherein the classifying the plurality of entities according to the clustering result comprises:
for a plurality of corresponding target entity sets of the plurality of target entity sets corresponding to the plurality of partition sub-identifications,
calculating a second similarity between every two central entities of the plurality of corresponding target entity sets; and
merging the plurality of corresponding target entity sets according to the second similarity.
6. An apparatus for entity classification, comprising:
the device comprises a first determining module, a second determining module and a judging module, wherein the first determining module is used for determining the block identifier of each entity according to the attribute of each entity in a plurality of entities;
an obtaining module, configured to obtain a pre-blocking result, where the pre-blocking result is obtained by performing a pre-blocking operation on the multiple entities;
a second determining module, configured to determine a target blocking identifier among the blocking identifiers of the multiple entities according to the pre-blocking result;
the splitting module is used for splitting the target block identifier to obtain a plurality of block sub identifiers;
the clustering module is used for clustering an entity corresponding to each block sub-identifier in the plurality of block sub-identifiers respectively to obtain a clustering result; and
a classification module for classifying the plurality of entities according to the clustering result,
wherein the clustering result comprises a plurality of target entity sets; the clustering module comprises:
a first set determining submodule, configured to determine an entity corresponding to each of the block sub-identifiers as a set of entities to be processed;
the center determining submodule is used for determining a center entity from the entity set to be processed;
the first calculation sub-module is used for calculating a first similarity between each entity in the entity set to be processed and the central entity;
a second set determining submodule, configured to determine, as a target entity set, an entity in the to-be-processed entity set, for which the first similarity is greater than a similarity threshold; and
a third set determining submodule, used for determining entities of which the first similarity is smaller than or equal to the similarity threshold in the entity set to be processed as a new entity set to be processed, and returning to the operation of determining the central entity for the new entity set to be processed.
7. The apparatus of claim 6, further comprising:
a generating module, configured to generate a blocking identifier for each entity in the plurality of entities according to an attribute of the entity; and
the dividing module is used for dividing the entities with the same block identification in the plurality of entities into one entity block to obtain a plurality of entity blocks;
a third determining module, configured to determine a target entity block of the multiple entity blocks, where the number of entities is greater than the capacity of the first block, and determine a target block identifier corresponding to the target entity block as the pre-blocking result.
8. The apparatus of claim 7, wherein the second determining means comprises:
and the target block identifier determining submodule is used for determining the block identifier as the target block identifier under the condition that the pre-block result contains the block identifier of the entity aiming at the block identifier of each entity.
9. The apparatus of any of claims 6-8, wherein the splitting module comprises:
a parameter determining submodule, used for determining, for each target block identifier, a splitting parameter according to the number of entities corresponding to the target block identifier and the second block capacity; and
an identifier determining submodule, used for determining the plurality of block sub-identifiers according to the entity identifiers corresponding to the target block identifier and the splitting parameter.
10. The apparatus of claim 6, wherein the classification module comprises:
a second calculating submodule, used for calculating, for the corresponding target entity sets among the plurality of target entity sets that correspond to the plurality of block sub-identifiers, a second similarity between every two central entities of the corresponding target entity sets; and
a merging submodule, used for merging the corresponding target entity sets according to the second similarity.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.
CN202210183949.4A 2022-02-25 2022-02-25 Entity classification method, device, equipment and storage medium Active CN114359610B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210183949.4A CN114359610B (en) 2022-02-25 2022-02-25 Entity classification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210183949.4A CN114359610B (en) 2022-02-25 2022-02-25 Entity classification method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114359610A CN114359610A (en) 2022-04-15
CN114359610B true CN114359610B (en) 2023-04-07

Family

ID=81093044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210183949.4A Active CN114359610B (en) 2022-02-25 2022-02-25 Entity classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114359610B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10013637B2 (en) * 2015-01-22 2018-07-03 Microsoft Technology Licensing, Llc Optimizing multi-class image classification using patch features
CN110570312B (en) * 2019-09-17 2021-05-28 深圳追一科技有限公司 Sample data acquisition method and device, computer equipment and readable storage medium
CN111858820B (en) * 2020-07-24 2024-02-09 北京百度网讯科技有限公司 Land property identification method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114359610A (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN112749344B (en) Information recommendation method, device, electronic equipment, storage medium and program product
CN114817651B (en) Data storage method, data query method, device and equipment
CN113326420A (en) Question retrieval method, device, electronic equipment and medium
CN112559631A (en) Data processing method and device of distributed graph database and electronic equipment
CN112925859A (en) Data storage method and device
CN112860993A (en) Method, device, equipment, storage medium and program product for classifying points of interest
CN113904943A (en) Account detection method and device, electronic equipment and storage medium
CN113742332A (en) Data storage method, device, equipment and storage medium
CN114359610B (en) Entity classification method, device, equipment and storage medium
CN114897666B (en) Graph data storage, access, processing method, training method, device and medium
CN108694205B (en) Method and device for matching target field
CN112887426B (en) Information stream pushing method and device, electronic equipment and storage medium
CN113868434A (en) Data processing method, device and storage medium for graph database
CN114579311A (en) Method, apparatus, device and storage medium for executing distributed computing task
CN113961797A (en) Resource recommendation method and device, electronic equipment and readable storage medium
CN111125362A (en) Abnormal text determination method and device, electronic equipment and medium
CN112861034B (en) Method, device, equipment and storage medium for detecting information
CN114461407B (en) Data processing method, data processing device, distribution server, data processing system, and storage medium
CN113569027B (en) Document title processing method and device and electronic equipment
CN111046894A (en) Method and device for identifying vest account
CN116431523B (en) Test data management method, device, equipment and storage medium
CN110096504B (en) Streaming event feature matching method and device
CN115757869A (en) Video processing method, video processing device, electronic equipment and medium
CN116502095A (en) Calculation method of model similarity and model updating method
CN114297486A (en) Information recommendation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant