CN116578668A - Data processing method and related device

Info

Publication number: CN116578668A
Application number: CN202210079995.XA
Authority: CN
Inventors: 李晨曦; 荆宁; 梁海金; 罗雨
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-01-24
Filing date: 2022-01-24
Publication date: 2023-08-11

Abstract

The embodiments of the application disclose a data processing method and a related device. When determining whether a first entity and a second entity correspond to the same virtual main body, first entity information and second entity information can be obtained, where the first entity information reflects the virtual main body corresponding to the first entity and the second entity information reflects the virtual main body corresponding to the second entity. A similarity determined from the first entity information and the second entity information therefore reflects more prominently how similar the virtual main bodies of the two entities are, and whether the first entity and the second entity correspond to the same virtual main body can be determined more accurately from that similarity. With accurate virtual main body matching, a user who searches for an entity can be provided with more related entity content, which ultimately improves the user's entity browsing experience.

Description

Data processing method and related device

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data processing method and a related device.

Background

Intellectual property (IP) is a popular concept in the multimedia field and can refer to film and television literature, games, animation, and the like that are suitable for secondary or repeated adaptation and development. An IP may be a storyline, a character, a comic, or even just a name or symbol. With IP now prevalent, almost every kind of multimedia entity has a corresponding IP. When a user who browses multimedia entities is interested in one multimedia entity under a certain IP, the user is also likely to be interested in other multimedia entities under that IP. It follows that analyzing whether multimedia entities correspond to the same IP facilitates more effective content recommendation to the user.

In the related art, whether multimedia entities belong to the same IP is generally judged only on the basis of the relevance between the multimedia entities. The information of the multimedia entities themselves is not reasonably analyzed, so the judgment accuracy is low.

Disclosure of Invention

In order to solve the above technical problem, the application provides a data processing method in which a processing device can determine whether different entities correspond to the same virtual main body based on the similarity between pieces of entity information that represent the virtual main bodies corresponding to the entities. The similarity between entity contents can thus be determined in the dimension of the corresponding virtual main body, yielding a more accurate entity matching result based on the virtual main body.

The embodiment of the application discloses the following technical scheme:

In a first aspect, an embodiment of the present application discloses a data processing method, where the method includes:

acquiring first entity information and second entity information, wherein the first entity information is used for reflecting a virtual main body corresponding to a first entity, the second entity information is used for reflecting a virtual main body corresponding to a second entity, and the virtual main body is a main body related to entity content in the entity;

determining a first similarity between the first entity information and the second entity information;

and determining whether the first entity and the second entity correspond to the same virtual main body according to the first similarity.

In a second aspect, an embodiment of the present application discloses a data processing apparatus, including a first obtaining unit, a first determining unit, and a second determining unit:

the first obtaining unit is configured to obtain first entity information and second entity information, where the first entity information is used to embody a virtual main body corresponding to the first entity, the second entity information is used to embody a virtual main body corresponding to the second entity, and the virtual main body is a main body related to entity content in the entity;

the first determining unit is configured to determine a first similarity between the first entity information and the second entity information;

the second determining unit is configured to determine, according to the first similarity, whether the first entity and the second entity correspond to the same virtual main body.

In one possible implementation, the first entity information includes any one or more of first entity trunk information, first entity role information, first entity alias information, and first entity quaternary part number information.

In one possible implementation manner, the first entity information includes first entity trunk information, and the first obtaining unit is specifically configured to:

acquiring entity name information corresponding to the first entity;

and determining first entity trunk information corresponding to the entity name information according to an entity name rule, wherein the entity name rule is used for identifying an information structure corresponding to the entity name information, and the information structure is used for reflecting the position corresponding to the first entity trunk information in the entity name information.

In one possible implementation manner, the first entity information includes first entity trunk information, and the first obtaining unit is specifically configured to:

acquiring entity name information corresponding to the first entity;

according to a trunk extraction model, trunk starting parameters, trunk ending parameters and trunk probability parameters corresponding to the entity name information are determined, wherein the trunk starting parameters are used for identifying the probability that each character in the entity name information is a starting character of the first entity trunk information, the trunk ending parameters are used for identifying the probability that each character in the entity name information is an ending character of the first entity trunk information, the trunk probability parameters are used for identifying the probability that a plurality of undetermined entity trunk information is the first entity trunk information, and the plurality of undetermined entity trunk information is determined from the entity name information based on the trunk starting parameters and the trunk ending parameters;

and determining the first entity trunk information corresponding to the first entity from the entity name information according to the trunk starting parameter, the trunk ending parameter and the trunk probability parameter.

In a possible implementation manner, the first entity information includes the first entity alias information and the first entity trunk information, and the first obtaining unit is specifically configured to:

acquiring search data corresponding to the first entity and the first entity trunk information, wherein the search data comprises a plurality of undetermined alias information corresponding to the first entity when the first entity is searched;

determining a second similarity between each of the plurality of undetermined alias information and the first entity trunk information;

and determining the undetermined alias information with the second similarity larger than a preset threshold value as first entity alias information corresponding to the first entity.

In one possible implementation manner, the first obtaining unit is specifically configured to:

acquiring undetermined alias information that corresponds to the same search result content as the first entity trunk information, and/or acquiring undetermined alias information that corresponds to the same search keyword as the first entity trunk information.

In a possible implementation manner, the apparatus further includes a second obtaining unit and a third determining unit:

the second obtaining unit is configured to obtain first entity embedded information and second entity embedded information, where the first entity embedded information is used to identify an association relationship between the first entity and other entities, and the second entity embedded information is used to identify an association relationship between the second entity and other entities;

the third determining unit is configured to determine a third similarity between the first entity embedded information and the second entity embedded information;

the second determining unit is specifically configured to:

determining the corresponding comprehensive similarity between the first entity and the second entity according to the first similarity and the third similarity;

and determining whether the first entity and the second entity correspond to the same virtual main body according to the comprehensive similarity.

In a possible implementation manner, the second determining unit is specifically configured to:

determining the comprehensive similarity corresponding to the first entity and the second entity according to the first similarity and first weight information corresponding to the first similarity, and the third similarity and second weight information corresponding to the third similarity.
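The combination of the first similarity and the third similarity can be sketched as follows. This is only an illustration under stated assumptions: the application does not give the combination formula here, so the weighted sum, the cosine measure for the embedded information, the weight values and the function names `cosine` and `comprehensive_similarity` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two embedding vectors (assumed third-similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def comprehensive_similarity(first_sim, third_sim, first_weight=0.5, second_weight=0.5):
    """Assumed weighted sum of the first and third similarities."""
    return first_weight * first_sim + second_weight * third_sim


# Hypothetical embeddings and first similarity, for illustration only.
third = cosine([1.0, 0.0], [1.0, 0.0])
score = comprehensive_similarity(0.8, third, first_weight=0.75, second_weight=0.25)
print(round(score, 3))
```

The resulting score would then be compared with a threshold to decide whether the two entities correspond to the same virtual main body.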

In a third aspect, embodiments of the present application disclose a computer device comprising a processor and a memory:

the memory is used for storing program codes and transmitting the program codes to the processor;

the processor is configured to execute the data processing method described in the first aspect according to instructions in the program code.

In a fourth aspect, embodiments of the present application disclose a computer readable storage medium for storing a computer program for executing the data processing method described in the first aspect.

In a fifth aspect, an embodiment of the application discloses a computer program product comprising instructions which, when run on a computer, cause the computer to perform the data processing method described in the first aspect.

According to the technical scheme, when determining whether the first entity and the second entity correspond to the same virtual main body, the first entity information and the second entity information can be obtained, the first entity information is used for reflecting the virtual main body corresponding to the first entity, the second entity information is used for reflecting the virtual main body corresponding to the second entity, and the virtual main body is a main body related to entity content of the entity.

Drawings

In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a data processing method in a practical application scenario according to an embodiment of the present application;

FIG. 2 is a flowchart of a data processing method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a data processing method in a practical application scenario according to an embodiment of the present application;

FIG. 4 is a block diagram of a data processing apparatus according to an embodiment of the present application;

FIG. 5 is a block diagram of a computer device according to an embodiment of the present application;

FIG. 6 is a block diagram of a server according to an embodiment of the present application.

Detailed Description

Embodiments of the present application are described below with reference to the accompanying drawings.

At the present time when the IP concept prevails, almost every multimedia entity has its own corresponding IP, where an entity refers to a specific form of multimedia content, and may for example include novels, movies, animations, comics, etc.

In the related art, in order to be able to recommend entities having the same IP to a user, it is common to determine a degree of association between entities based on only entity embedded information corresponding to the entities and then determine whether different entities correspond to the same IP based on the degree of association. However, this approach only considers the degree of association between entities, and does not consider some IP-related features of the entities themselves, which are instead key to representing the IP corresponding to the entities. Therefore, the analysis method in the related art has difficulty in obtaining an accurate IP relationship between entities.

It will be appreciated that the method may be applied to a processing device which is capable of data processing, for example a terminal device or a server having data processing functionality. The method can be independently executed by the terminal equipment or the server, can also be applied to a network scene of communication between the terminal equipment and the server, and is executed by the cooperation of the terminal equipment and the server. The terminal equipment can be a computer, a mobile phone and other equipment. The server can be understood as an application server or a Web server, and can be an independent server or a cluster server in actual deployment.

In addition, the application also relates to artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.

Artificial intelligence is a comprehensive discipline that covers a wide range of fields and includes both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and other directions. The present application mainly relates to the machine learning and natural language processing technologies among them.

Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e. the language that people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.

Machine learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.

The application can identify the entity trunk information, entity alias information and the like in the entity name information by utilizing a natural language processing technology, and can train to obtain a plurality of models for automatically acquiring the entity information by utilizing a machine learning technology, such as a trunk extraction model and the like for automatically identifying the entity trunk information.

In order to facilitate understanding of the technical scheme provided by the application, a data processing method provided by the embodiment of the application will be described below in conjunction with an actual application scenario.

Referring to fig. 1, fig. 1 is a schematic diagram of a data processing method in an actual application scenario provided in an embodiment of the present application, where a processing device may be a server 101.

When determining whether the first entity and the second entity correspond to the same virtual main body, the server 101 may acquire first entity information corresponding to the first entity and second entity information corresponding to the second entity, where the first entity information is used to represent the virtual main body A corresponding to the first entity, and the second entity information is used to represent the virtual main body B corresponding to the second entity. Through the first entity information and the second entity information, the server 101 can thus start from the perspective of the virtual main bodies corresponding to the two entities and analyze whether the two entities correspond to the same virtual main body. The server 101 may determine a first similarity between the first entity information and the second entity information. The first similarity represents the similarity between the first entity and the second entity in the dimension of the corresponding virtual main bodies, so whether the first entity and the second entity correspond to the same virtual main body can be determined more accurately based on the first similarity.

Next, a data processing method provided by an embodiment of the present application will be described with reference to the accompanying drawings.

Referring to fig. 2, fig. 2 is a flowchart of a data processing method according to an embodiment of the present application, where the method includes:

S201: Acquire first entity information and second entity information.

It can be appreciated that when an entity has a corresponding virtual main body, various information about the entity reflects that virtual main body to some extent. The virtual main body refers to a main body related to the entity content of the entity, for example the IP corresponding to the entity; when the entity contents of two entities are similar, their corresponding virtual main bodies are generally also close. For example, the entity names of entities such as novels and movies may share a unified name corresponding to the virtual main body, and entities corresponding to the same virtual main body typically have the same or similar roles.

Based on this, in the embodiment of the present application, the processing device may analyze whether different entities correspond to the same virtual main body from entity information that reflects the virtual main body. The processing device may first obtain first entity information and second entity information, where the first entity information is used to represent the virtual main body corresponding to the first entity, and the second entity information is used to represent the virtual main body corresponding to the second entity. The first entity and the second entity may be any entities that have a corresponding virtual main body. Their entity forms, i.e. the forms in which the entities are presented, such as novels or movies, may be the same or different; for example, the first entity may be a novel and the second entity may be a movie. For example, the three movies "xx legend", "xx legend 2" and "xx legend 3" each correspond to the virtual main body "xx legend". The entity information acquired by the processing device may be the movie names of the three movies, and since all three movie names include "xx legend", it can be inferred to some extent that the three entities correspond to the same virtual main body.

S202: Determine a first similarity between the first entity information and the second entity information.

Through the first similarity, the processing device can determine the similarity between the entities in the dimension of their corresponding virtual main bodies. The entity information capable of representing the virtual main body corresponding to an entity may be of various types. For example, in one possible implementation manner, the first entity information may include any one or more of first entity trunk information, first entity role information, first entity alias information and first entity quaternary part number information. The first entity trunk information is the core part of the entity name information and has the closest relationship with the virtual main body corresponding to the entity. The first entity role information is used for identifying the role information associated with the first entity; when different entities correspond to the same virtual main body, the entities often have similar sets of roles. The first entity alias information reflects other common names of the first entity besides the entity trunk information. The first entity quaternary part number information represents the quaternary part number (i.e., the season or part number) corresponding to the first entity, which marks whether the entity is the first installment or a sequel under a certain virtual main body. All of this information reflects the virtual main body corresponding to the first entity to a certain extent.

S203: Determine whether the first entity and the second entity correspond to the same virtual main body according to the first similarity.

As described above, the first similarity represents the similarity between the entities in the dimension of the virtual main bodies corresponding to the entities. Based on the first similarity, the processing device can therefore determine more accurately whether the first entity and the second entity correspond to the same virtual main body.

According to the above technical scheme, when determining whether the first entity and the second entity correspond to the same virtual main body, the first entity information and the second entity information can be obtained, where the first entity information reflects the virtual main body corresponding to the first entity and the second entity information reflects the virtual main body corresponding to the second entity. The similarity determined from the first entity information and the second entity information therefore reflects more prominently how similar the virtual main bodies of the two entities are, and whether the first entity and the second entity correspond to the same virtual main body can be determined more accurately from that similarity. With accurate virtual main body matching, more related entities can be provided to a user who searches for an entity, which ultimately improves the user's entity browsing experience.
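The flow of S201 to S203 can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the `EntityInfo` structure, the exact-match and Jaccard measures, the equal averaging of the per-field scores and the threshold of 0.5 are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class EntityInfo:
    """Hypothetical container for information reflecting an entity's virtual main body."""
    trunk: str
    roles: set = field(default_factory=set)
    aliases: set = field(default_factory=set)


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def first_similarity(x: EntityInfo, y: EntityInfo) -> float:
    """S202: average of per-field similarities (an assumed aggregation)."""
    trunk_sim = 1.0 if x.trunk == y.trunk else 0.0
    role_sim = jaccard(x.roles, y.roles)
    alias_sim = jaccard(x.aliases | {x.trunk}, y.aliases | {y.trunk})
    return (trunk_sim + role_sim + alias_sim) / 3


def same_virtual_body(x: EntityInfo, y: EntityInfo, threshold: float = 0.5) -> bool:
    """S203: compare the first similarity with an assumed threshold."""
    return first_similarity(x, y) >= threshold


# S201: hypothetical entity information for a movie, a novel and an unrelated work.
movie = EntityInfo("xx legend", roles={"A", "B"}, aliases={"xx"})
novel = EntityInfo("xx legend", roles={"A", "B", "C"})
other = EntityInfo("yy story", roles={"D"})
print(same_virtual_body(movie, novel), same_virtual_body(movie, other))
```

In this toy data the movie and the novel share a trunk and most roles, so they are judged to correspond to the same virtual main body, while the unrelated work is not.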

As described above, the entity information may include multiple types. The manner in which the processing device acquires the entity information may differ for different types of entity information, and the acquisition manners are described in detail below.

In one possible implementation, the first entity information may include first entity trunk information. It will be appreciated that entities are generally named according to certain name rules, which may include, for example, "[D: entity trunk information][D: quaternary part number][D: suffix characters]", "[D: entity trunk information][D: separator][D: suffix characters]", and "[D: entity trunk information][D: quaternary part number]". Here, the quaternary part number refers to the season or part number of the entity, and the separator refers to a symbol, common in the names of adapted or follow-up works, that separates the entity trunk information from the suffix characters, for example "," or "-". For entities that follow such rules, the processing device may acquire the entity name information corresponding to the first entity and then determine the information structure corresponding to the entity name information according to the entity name rule. The information structure reflects the position of the first entity trunk information in the entity name information, so the processing device can locate the first entity trunk information within the entity name information and thereby determine it.
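A rule of this kind can be sketched with a regular expression. The pattern below is an assumption made for illustration: it treats the part number as Arabic digits and accepts ",", "-" and ":" as separators, which the application does not specify, and `extract_trunk` is a hypothetical helper name.

```python
import re

# Assumed name rule: [trunk][optional part number][optional separator + suffix].
NAME_RULE = re.compile(
    r"^(?P<trunk>.+?)"                     # entity trunk information (shortest match)
    r"\s*(?P<part>\d+)?"                   # optional season/part number
    r"\s*(?:[,\-:]\s*(?P<suffix>.+))?$"    # optional separator and suffix characters
)


def extract_trunk(name: str):
    """Return the trunk, part number and suffix found by the rule, or None."""
    match = NAME_RULE.match(name.strip())
    return match.groupdict() if match else None


print(extract_trunk("xx legend 2"))
print(extract_trunk("xx legend - special"))
```

A name whose trunk itself contains a separator or ends in digits would be split incorrectly by such a rule, which is the kind of case the model-based extraction is meant to cover.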

It will be appreciated that the entity names of some entities may not conform to conventional entity name rules. For these entities, in order to accurately determine the entity trunk information, the processing device may also extract the trunk information by means of a model.

In one possible implementation manner, the processing device may acquire the entity name information corresponding to the first entity, and then determine, according to a trunk extraction model, a trunk start parameter, a trunk end parameter, and a trunk probability parameter corresponding to the entity name information. The trunk start parameter is used to identify the probability that each character in the entity name information is the start character of the first entity trunk information, and the trunk end parameter is used to identify the probability that each character in the entity name information is the end character of the first entity trunk information. Through the trunk start parameter and the trunk end parameter, the processing device can therefore determine one or more groups of start characters and end characters in the entity name information, and from them a plurality of pending entity trunk information; for example, each group of a start character and an end character determines one piece of pending entity trunk information. The trunk probability parameter is used to identify the probability that each piece of pending entity trunk information is the first entity trunk information. According to the trunk start parameter, the trunk end parameter and the trunk probability parameter, the processing device can thus determine the first entity trunk information from the entity name information; for example, the pending entity trunk information with the highest probability may be determined as the first entity trunk information.

For example, the processing device may formalize the entity name information as a character sequence $w = w_1 w_2 \dots w_M$, where $M$ represents the length of the entity name information. The entity features of the character sequence are then extracted using a pre-trained RoBERTa model and formalized as $h = h_1 h_2 \dots h_M$, $h_i \in \mathbb{R}^d$, where $d$ represents the feature dimension. Finally, the entity trunk information can be intercepted from the entity name information using a pointer network comprising three parts: 1) a start position probability sequence (i.e., the trunk start parameter), each element of which represents whether the current character is the start position of the entity trunk information, formalized as $s = s_1 s_2 \dots s_M$, $s_i \in \{0, 1\}$, where 1 identifies the character as the start position of the entity trunk information and 0 identifies the character as not the start position; 2) an end position probability sequence (i.e., the trunk end parameter), each element of which represents whether the current character is the end position of the entity trunk information, formalized as $e = e_1 e_2 \dots e_M$, $e_i \in \{0, 1\}$, where 1 identifies the character as the end position of the entity trunk information and 0 identifies the character as not the end position; 3) a trunk probability matrix (i.e., the trunk probability parameter), formalized as $P \in \mathbb{R}^{M \times M}$, where $P_{ij} \in \{0, 1\}$ represents the probability that the segment from $i$ to $j$ is the entity trunk information.
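The selection of the trunk from the three pointer-network outputs can be sketched as follows. The sketch covers only the decoding step and assumes the start probabilities, end probabilities and span probability matrix have already been produced by a model; the threshold of 0.5, the function name `decode_trunk` and the numbers in the example are hypothetical.

```python
def decode_trunk(name, start_prob, end_prob, span_prob, threshold=0.5):
    """Pick the most probable trunk span of `name` from pointer-network style outputs."""
    starts = [i for i, p in enumerate(start_prob) if p >= threshold]
    ends = [j for j, p in enumerate(end_prob) if p >= threshold]
    # Each (start, end) pair with start <= end is one pending trunk candidate.
    candidates = [(i, j) for i in starts for j in ends if i <= j]
    if not candidates:
        return None
    best_i, best_j = max(candidates, key=lambda ij: span_prob[ij[0]][ij[1]])
    return name[best_i:best_j + 1]


# Hypothetical model outputs for a 7-character name.
name = "legend2"
start_prob = [0.9, 0.1, 0.6, 0.1, 0.1, 0.1, 0.1]
end_prob = [0.1, 0.1, 0.1, 0.1, 0.1, 0.8, 0.7]
span_prob = [[0.0] * len(name) for _ in name]
span_prob[0][5], span_prob[0][6] = 0.9, 0.3
span_prob[2][5], span_prob[2][6] = 0.4, 0.2
print(decode_trunk(name, start_prob, end_prob, span_prob))
```

Here four candidate spans are formed from the two start positions and two end positions, and the span from position 0 to position 5 is kept because it has the highest entry in the span probability matrix.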

The method for determining the entity trunk information based on the entity name rule is simple and effective, and has high accuracy; the model-based approach is more versatile. The two are complementary, and entity trunk information corresponding to the entity can be better extracted.

It can be appreciated that when searching for an entity, for the same entity, different users may input different keywords to search for the entity, as the entity may have multiple names; and, when a plurality of users search for an entity with the same keyword, since the entity may have various names, the search results determined by the users may also be different. Therefore, through the searching information of the user when searching the entity, various names of the same entity can be reflected to a certain extent, and then the entity alias information corresponding to the entity can be reflected.

Based on this, in one possible implementation manner, the first entity information may include first entity alias information and first entity trunk information. When acquiring the first entity information, the processing device may acquire search data corresponding to the first entity and the first entity trunk information, where the search data includes a plurality of pending alias information corresponding to searches for the first entity. For example, in one possible implementation, the processing device may obtain the pending alias information that corresponds to the same search result content as the first entity trunk information, i.e., a user obtains the same search result when inputting the first entity trunk information and when inputting the pending alias information. And/or, the processing device may acquire the pending alias information that corresponds to the same search keyword as the first entity trunk information, i.e., the same search keyword is input when users search for the first entity trunk information and for the pending alias information. Thus, through the dimension of the search result on the one hand and the dimension of the search input on the other, the processing device can acquire accurate and comprehensive pending alias information.

The processing device may determine second similarity between the plurality of pending alias information and the first entity backbone information, and then determine the pending alias information with the second similarity greater than the preset threshold as first entity alias information corresponding to the first entity, so that the processing device may determine entity alias information with high similarity to the entity backbone information, and enable the entity alias information to more accurately represent the virtual body corresponding to the entity.

For example, first, the processing device may acquire search click table data, which records the search words entered by users when searching and the titles clicked by users after searching. The data is recorded as a set $\{(q_i, t_i) \mid i = 1, 2, …, N\}$, where $(q_i, t_i)$ represents the search word (query) entered in one user search behavior and the title of the corresponding clicked result, and $N$ represents the size of the click table, i.e., the number of recorded user search-and-click events.

The processing device may then construct similar query/title sets based on the click table. Specifically, it may aggregate the queries that led to a click on the same title, and aggregate the clicked titles that correspond to the same query. Formally, for a title $t_j$, its corresponding queries are aggregated into a set $Q^j$, i.e., every search term $q_i$ in the set corresponds to the title $t_j$. Similarly, a title set is built for each search term, i.e., every title $t_j$ in that set was clicked by users after entering the same search term.
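The aggregation of a click table into query sets and title sets can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function name `build_similar_sets` and the sample click records are hypothetical.

```python
from collections import defaultdict


def build_similar_sets(click_table):
    """Group queries by clicked title and clicked titles by query.

    click_table is an iterable of (query, title) pairs, one per recorded
    search-and-click event.
    """
    queries_by_title = defaultdict(set)  # title t_j -> set Q^j of queries
    titles_by_query = defaultdict(set)   # query q_i -> set of clicked titles
    for query, title in click_table:
        queries_by_title[title].add(query)
        titles_by_query[query].add(title)
    return dict(queries_by_title), dict(titles_by_query)


# Hypothetical click records (q_i, t_i).
click_table = [
    ("lotr", "The Lord of the Rings"),
    ("lord of the rings", "The Lord of the Rings"),
    ("lotr", "The Lord of the Rings: The Two Towers"),
]
queries_by_title, titles_by_query = build_similar_sets(click_table)
```

Sets are used so that repeated search-and-click events for the same pair do not produce duplicate candidates.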

The processing device may perform entity name information extraction on the query/title data in the same set, here using a pre-trained named entity recognition model. For a given $q_i$, the entity name information extraction result is expressed as a set. For the set $Q^j$, similarity is calculated between the entity names extracted from all of its queries as follows: given an entity pair $e_i$, $e_j$ whose edit distance is denoted $lev_{ij}$, the similarity is expressed by the following formula:

$$\mathrm{sim}(e_i, e_j) = 1 - \frac{lev_{ij}}{\max(\mathrm{len}(e_i), \mathrm{len}(e_j))}$$

where $\mathrm{len}(e_i)$ represents the entity name length and $\max(\mathrm{len}(e_i), \mathrm{len}(e_j))$ is the larger of the two entity name lengths. The processing device may set a similarity threshold (set here to 0.5) and determine the pending entity aliases in the entity pairs that satisfy the threshold as entity alias information corresponding to the entities.
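An edit-distance-based name similarity with a threshold filter can be sketched as below. The normalization (one minus the edit distance divided by the longer name length) is the form suggested by the definitions of $lev_{ij}$ and $\max(\mathrm{len}(e_i), \mathrm{len}(e_j))$ and should be read as an assumption; the function names and the strict "greater than" comparison are likewise illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """Assumed form: 1 - lev / max(len(a), len(b)); two empty names count as identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def select_aliases(trunk: str, pending: list, threshold: float = 0.5) -> list:
    """Keep the pending aliases whose similarity to the trunk exceeds the threshold."""
    return [p for p in pending if name_similarity(trunk, p) > threshold]
```

For example, `select_aliases("kitten", ["sitting", "dog"])` keeps only `"sitting"`, whose edit distance of 3 over a maximum length of 7 gives a similarity of about 0.57.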

In addition, for the entity quaternary part number information (that is, the season or part number of a work), the processing device may proceed in a manner similar to the extraction of entity trunk information: on the one hand, it may analyze the information structure in the entity name information through entity name rules, so as to determine the entity quaternary part number information contained in the entity name information; on the other hand, it may also perform the extraction by means of a model.

For entity role information, the processing device may obtain it from structured data corresponding to the entity, where the structured data includes various information related to the entity, such as roles, developers, authors, scenarios, and the like.

In addition to determining whether multiple entities correspond to the same virtual body based on information that embodies the virtual bodies corresponding to the entities, in one possible implementation manner, the processing device may also perform entity analysis in combination with information that embodies the association relationships between entities. It will be appreciated that when two entities correspond to the same virtual body, the similarity between the two entities is generally higher, and therefore their association relationships with other entities will also be similar.

Based on this, the processing device may acquire first entity embedded information, which identifies the association relationship between the first entity and other entities, and second entity embedded information, which identifies the association relationship between the second entity and other entities. When the association relationships of the first entity and of the second entity with other entities are relatively close, the two entities are more similar to each other, so the probability that they correspond to the same virtual body is also higher. The processing device may therefore determine a third similarity between the first entity embedded information and the second entity embedded information, and then determine a corresponding integrated similarity between the first entity and the second entity based on the first similarity and the third similarity.

The processing device may determine, according to the integrated similarity, whether the first entity and the second entity correspond to the same virtual body. In this way, both the dimension of the virtual body corresponding to each entity and the dimension of the association relationships between entities are taken into account when analyzing whether different entities correspond to the same virtual body, which improves the accuracy of entity analysis. It can be understood that the other entities whose association relationships are identified by the first entity embedded information and by the second entity embedded information may be the same batch of entities, so as to improve the accuracy of the similarity calculation.

For example, in one possible implementation, the processing device may build an embedding model based on a graph convolutional network (GCN) to obtain entity embedded information. The model is formally described as follows:

First, the processing device may acquire a knowledge graph composed of a plurality of entities and use the knowledge graph as the input of the model. For the input knowledge graph, an adjacency matrix of the entities can be constructed, formally expressed as $A \in \mathbb{R}^{N \times N}$, where $A_{ij} \in \{0, 1\}$ represents whether an edge exists between entities $e_i$ and $e_j$, i.e., whether an association relationship exists.
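Constructing the 0/1 adjacency matrix from a list of entity relations can be sketched as follows. The function name `build_adjacency`, the sample entities, and the choice to treat relations as undirected are assumptions made for illustration.

```python
def build_adjacency(entities, edges):
    """Return an N x N 0/1 matrix where entry [i][j] marks an association."""
    index = {name: i for i, name in enumerate(entities)}
    n = len(entities)
    adj = [[0] * n for _ in range(n)]
    for a, b in edges:
        i, j = index[a], index[b]
        adj[i][j] = 1
        adj[j][i] = 1  # assumption: association relations are symmetric
    return adj


# Hypothetical knowledge graph with three entities and one relation.
entities = ["movie", "novel", "game"]
adjacency = build_adjacency(entities, [("movie", "novel")])
```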

The graph convolution formula is defined as follows:

$$H_k = M(A; H_{k-1}; W_{k-1})$$

where $H_k$ represents the entity embedding matrix after the $k$-th convolution and $W_{k-1}$ represents the trainable parameters. For the first graph convolution operation, $H_0$ is the initialized entity embedding matrix. After the matrix $H_k$ is obtained through the model, the matrix may include an entity embedding vector $x$ corresponding to each entity; this vector may be used as the entity embedded information corresponding to the entity and may represent the association relationships between the entity and the other entities in the matrix.
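The text does not specify the mapping $M$. As one possible instance, the sketch below assumes the commonly used normalized propagation rule $H_k = \mathrm{ReLU}(\hat{D}^{-1/2}(A + I)\hat{D}^{-1/2} H_{k-1} W_{k-1})$, written in plain Python for small matrices. All helper names are hypothetical, and the weights are fixed here rather than trained.

```python
import math


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def normalize_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # at least 1 because of the self-loop
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj_norm, h_prev, w_prev):
    """One propagation step: ReLU(adj_norm @ h_prev @ w_prev)."""
    z = matmul(matmul(adj_norm, h_prev), w_prev)
    return [[max(0.0, v) for v in row] for row in z]


# Two connected entities, one-hot initial embeddings H_0, identity weights W_0.
adj_norm = normalize_adjacency([[0, 1], [1, 0]])
h0 = [[1.0, 0.0], [0.0, 1.0]]
w0 = [[1.0, 0.0], [0.0, 1.0]]
h1 = gcn_layer(adj_norm, h0, w0)
```

In this toy graph the two connected entities end up with identical embedding rows, which illustrates how entities with the same neighborhood receive similar embedded information.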

Because the application can determine whether the first entity and the second entity correspond to the same virtual body based on various information, in one possible implementation manner, when the processing device determines the comprehensive similarity, different similarity weights can be set for different information based on manual experience or analysis of historical data, so as to determine more reasonable comprehensive similarity.

The processing device may determine the comprehensive similarity corresponding to the first entity and the second entity according to the first similarity and its corresponding first weight information, together with the third similarity and its corresponding second weight information. The first weight information identifies the weight of the first similarity when determining the comprehensive similarity, and the second weight information identifies the weight of the third similarity when determining the comprehensive similarity.

In addition to setting different weights for the first similarity and the third similarity, when the entity information includes multiple types of information, the processing device may also calculate the similarity for each type of information separately, and then combine these similarities according to the weights corresponding to the types of information to obtain a comprehensive similarity, which is used to determine whether the first entity and the second entity correspond to the same virtual body.

In order to facilitate understanding of the technical solution provided by the embodiments of the present application, a data processing method provided by the embodiments of the present application will be described below with reference to an actual application scenario.

Referring to fig. 3, fig. 3 is a schematic diagram of a data processing method in an actual application scenario provided in an embodiment of the present application, where a processing device may be a server capable of performing entity analysis.

As shown in fig. 3, the server may first perform feature extraction for a plurality of entities to determine the entity trunk information, entity quaternary part number information, entity alias information, and entity role information corresponding to each entity, and then perform entity embedding processing to obtain the entity embedding vector corresponding to each entity.

The manner in which the server determines the similarity between entity information may differ for different types of entity information. The entity trunk information contains the core content of the entity name information, and its data format is usually very short text, so the embodiment of the application can determine the similarity between entity trunk information using an edit-distance-based similarity calculation. First, the server may calculate the edit distance between the entity trunk information of two entities, denoted $lev_{ij}$. Based on the edit distance, the similarity calculation formula is as follows:

$$\mathrm{sim}_{trunk}(e_i, e_j) = 1 - \frac{lev_{ij}}{\max(\mathrm{len}(e_i), \mathrm{len}(e_j))}$$

where $\mathrm{len}(\cdot)$ denotes the length of the entity trunk information.

For the entity quaternary part number information, unlike the entity trunk information, the quaternary part number takes various forms of expression, such as Chinese numerals, Arabic numerals, text, and the like. The embodiment of the application therefore designs a similarity calculation based on the type of the entity quaternary part number information. First, the entity quaternary part number information is classified; the quaternary part number information types defined by the method include, but are not limited to, the following: number series, separator series, text series, etc. For given entities $e_i$ and $e_j$, with their quaternary part number information types written as $\mathrm{type}(e_i)$ and $\mathrm{type}(e_j)$, the similarity calculation formula is as follows:

$$\mathrm{sim}_{part}(e_i, e_j) = \begin{cases} 1, & \mathrm{type}(e_i) = \mathrm{type}(e_j) \\ 0, & \text{otherwise} \end{cases}$$

That is, the similarity between the entity quaternary information of the same quaternary information type may be determined to be 1, otherwise, 0.
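The type-based comparison can be sketched as below. The text names the types (number series, separator series, text series) but not the rules for assigning them, so the classification rules, the numeral list, and the separator characters here are purely hypothetical.

```python
import re

# Hypothetical set of Chinese numerals treated as the "number series" type.
CHINESE_NUMERALS = "零一二三四五六七八九十百"


def part_number_type(value: str) -> str:
    """Classify a season/part number string into an assumed type label."""
    v = value.strip()
    if re.fullmatch(r"\d+", v) or (v and all(ch in CHINESE_NUMERALS for ch in v)):
        return "number"
    if re.search(r"[-:：·_/]", v):  # assumed separator characters
        return "separator"
    return "text"


def part_number_similarity(a: str, b: str) -> int:
    """Return 1 when both values have the same type, otherwise 0."""
    return 1 if part_number_type(a) == part_number_type(b) else 0
```

Under these assumed rules, `"2"` and `"三"` are both of the number type and therefore compare as 1, while `"2"` and `"final season"` compare as 0.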

For entity alias information, when each entity has a plurality of entity alias information, the server can determine the entity alias information sets corresponding to the two entities, calculate the alias similarities pairwise, and select the highest similarity as the similarity of the entity alias information between the two entities. Formally: given an entity alias $a$ of entity $e_i$ and an entity alias $b$ of entity $e_j$, their similarity, denoted $\mathrm{sim}(a, b)$, is calculated in the same way as for the entity trunk information. The highest similarity is selected as the similarity of the entity alias information between the two entities, and the calculation formula is as follows:

$$\mathrm{sim}_{alias}(e_i, e_j) = \max_{a \in Alias_i,\; b \in Alias_j} \mathrm{sim}(a, b)$$

where $Alias_i$ and $Alias_j$ denote the entity alias information sets of $e_i$ and $e_j$.
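Selecting the highest pairwise alias similarity can be sketched as follows. The pairwise similarity is passed in as a function; the example uses `difflib.SequenceMatcher` from the Python standard library only as a stand-in for the edit-distance-based similarity, and returning 0.0 for an empty alias set is an assumption.

```python
from difflib import SequenceMatcher
from itertools import product


def alias_similarity(aliases_i, aliases_j, sim):
    """Return the highest pairwise similarity between two alias sets."""
    if not aliases_i or not aliases_j:
        return 0.0  # assumption: no aliases means no alias evidence
    return max(sim(a, b) for a, b in product(aliases_i, aliases_j))


def stand_in_similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; not the edit-distance formula."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical alias sets that share one alias.
score = alias_similarity(["kitten", "cat"], ["sitting", "cat"], stand_in_similarity)
```

Passing the similarity as a parameter keeps the maximum-selection step independent of how individual alias pairs are scored.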

For entity role information, the server can acquire the entity role sets corresponding to the two entities and measure the similarity of their entity role information using the Jaccard coefficient. Formally: given the entity role set $Role_i$ corresponding to entity $e_i$ and the entity role set $Role_j$ corresponding to entity $e_j$, the entity role information similarity calculation formula is as follows:

$$\mathrm{sim}_{role}(e_i, e_j) = \frac{|Role_i \cap Role_j|}{|Role_i \cup Role_j|}$$
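A set-overlap measure for role sets can be sketched as below, using the Jaccard coefficient (intersection size over union size) as the similarity. The function name and the handling of two empty role sets are assumptions.

```python
def role_similarity(roles_i, roles_j) -> float:
    """Jaccard coefficient between two collections of role names."""
    a, b = set(roles_i), set(roles_j)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty role sets carry no evidence
    return len(a & b) / len(union)


# Hypothetical role sets for two entities: two shared roles out of four.
score = role_similarity({"Frodo", "Sam", "Gandalf"}, {"Frodo", "Gandalf", "Aragorn"})
```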

For the entity embedding vectors, the embodiment of the application can adopt an embedding vector similarity calculation scheme based on supervised learning, in which a model is trained on labeled data to determine the similarity between entity embedding vectors. In the similarity calculation formula corresponding to the model, the parameters comprise a parameter matrix and a trainable bias vector $b$, and $\sigma$ is an activation function.

After the multiple similarities are determined, the similarity of each type of information represents the similarity of the entities on a different feature, and fusing the multiple similarities makes the calculation of entity similarity more accurate. The embodiment of the application can fuse the various similarities by weighted summation, and the weights of the different features can be defined by manually observing the data characteristics. The multi-channel fused similarity calculation is expressed as follows:

$$\mathrm{sim}(e_i, e_j) = \sum_{c} w_c \cdot \mathrm{sim}_c(e_i, e_j)$$

where $c$ ranges over the types of information (entity trunk, quaternary part number, alias, role, and embedding), $\mathrm{sim}_c$ is the similarity on that type of information, and $w_c$ is the corresponding weight.
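The weighted-summation fusion can be sketched as follows. The channel names, the weight values, and the decision threshold are hypothetical placeholders; the text only states that the weights may be set from manual observation of the data.

```python
def fused_similarity(similarities: dict, weights: dict) -> float:
    """Weighted sum of per-channel similarities; every channel needs a weight."""
    missing = set(similarities) - set(weights)
    if missing:
        raise KeyError(f"no weight defined for channels: {sorted(missing)}")
    return sum(weights[name] * value for name, value in similarities.items())


# Hypothetical weights and per-channel similarities for one entity pair.
weights = {"trunk": 0.4, "part": 0.1, "alias": 0.2, "role": 0.2, "embedding": 0.1}
similarities = {"trunk": 1.0, "part": 1, "alias": 0.5, "role": 0.5, "embedding": 0.8}

score = fused_similarity(similarities, weights)
same_virtual_body = score >= 0.6  # hypothetical decision threshold
```

Raising an error for a channel without a weight avoids silently dropping a feature from the fused score.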

Based on the data processing method provided by the foregoing embodiment, the embodiment of the present application further provides a data processing apparatus. Referring to fig. 4, fig. 4 is a block diagram of a data processing apparatus 400 provided by the embodiment of the present application, where the apparatus 400 includes a first obtaining unit 401, a first determining unit 402, and a second determining unit 403:

the first obtaining unit 401 is configured to obtain first entity information and second entity information, where the first entity information is used to embody a virtual body corresponding to a first entity, the second entity information is used to embody a virtual body corresponding to a second entity, and the virtual body is a subject related to the entity content of the entity;

the first determining unit 402 is configured to determine a first similarity between the first entity information and the second entity information;

The second determining unit 403 is configured to determine, according to the first similarity, whether the first entity and the second entity correspond to the same virtual body.

In one possible implementation manner, the first entity information includes first entity backbone information, and the first obtaining unit 401 is specifically configured to:

acquiring entity name information corresponding to the first entity;

In a possible implementation manner, the first entity information includes the first entity alias information and the first entity backbone information, and the first obtaining unit 401 is specifically configured to:

In one possible implementation manner, the first obtaining unit 401 is specifically configured to:

the second determining unit 403 is specifically configured to:

In a possible implementation manner, the second determining unit 403 is specifically configured to:

The embodiment of the application also provides a computer device, which is described below with reference to the accompanying drawings. Referring to fig. 5, an embodiment of the present application provides a device, which may be a terminal device. The terminal device may be any intelligent terminal, including a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA for short), a point of sales terminal (Point of Sales, POS for short), a vehicle-mounted computer, and the like. The following takes a mobile phone as an example of the terminal device:

Fig. 5 is a block diagram showing a part of the structure of a mobile phone related to a terminal device provided by an embodiment of the present application. Referring to fig. 5, the mobile phone includes: radio frequency (Radio Frequency, RF) circuit 710, memory 720, input unit 730, display unit 740, sensor 750, audio circuit 760, wireless fidelity (Wireless Fidelity, WiFi) module 770, processor 780, and power supply 790. Those skilled in the art will appreciate that the handset configuration shown in fig. 5 does not limit the handset, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.

The following describes the components of the mobile phone in detail with reference to fig. 5:

The RF circuit 710 may be configured to receive and transmit signals during messaging or a call. Specifically, it receives downlink information from a base station and passes it to the processor 780 for processing; in addition, it sends uplink data to the base station. Generally, RF circuitry 710 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (Low Noise Amplifier, LNA for short), a duplexer, and the like. In addition, the RF circuitry 710 may also communicate with networks and other devices through wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (Global System of Mobile communication, GSM for short), general packet radio service (General Packet Radio Service, GPRS for short), code division multiple access (Code Division Multiple Access, CDMA for short), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA for short), long term evolution (Long Term Evolution, LTE for short), email, short message service (Short Messaging Service, SMS for short), and the like.

The memory 720 may be used to store software programs and modules, and the processor 780 performs various functional applications and data processing of the handset by running the software programs and modules stored in the memory 720. The memory 720 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, phonebook, etc.) created according to the use of the handset, etc. In addition, memory 720 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.

The input unit 730 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the handset. In particular, the input unit 730 may include a touch panel 731 and other input devices 732. The touch panel 731, also referred to as a touch screen, may collect touch operations thereon or thereabout by a user (e.g., operations of the user on or thereabout the touch panel 731 using any suitable object or accessory such as a finger, a stylus, etc.), and drive the corresponding connection device according to a predetermined program. Alternatively, the touch panel 731 may include two parts of a touch detection device and a touch controller. The touch detection device detects the touch azimuth of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device and converts it into touch point coordinates, which are then sent to the processor 780, and can receive commands from the processor 780 and execute them. In addition, the touch panel 731 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 730 may include other input devices 732 in addition to the touch panel 731. In particular, the other input devices 732 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, mouse, joystick, etc.

The display unit 740 may be used to display information input by a user or information provided to the user and various menus of the mobile phone. The display unit 740 may include a display panel 741, and optionally, the display panel 741 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD) or an Organic Light-Emitting Diode (OLED) or the like. Further, the touch panel 731 may cover the display panel 741, and when the touch panel 731 detects a touch operation thereon or thereabout, the touch operation is transferred to the processor 780 to determine the type of touch event, and then the processor 780 provides a corresponding visual output on the display panel 741 according to the type of touch event. Although in fig. 5, the touch panel 731 and the display panel 741 are two separate components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 731 and the display panel 741 may be integrated to implement the input and output functions of the mobile phone.

The handset may also include at least one sensor 750, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 741 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 741 and/or the backlight when the mobile phone moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the acceleration in all directions (generally three axes), and can detect the gravity and direction when stationary, and can be used for applications of recognizing the gesture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometer and knocking), and the like; other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc. that may also be configured with the handset are not described in detail herein.

Audio circuitry 760, speaker 761, and microphone 762 may provide an audio interface between a user and the mobile phone. The audio circuit 760 may transmit the electrical signal converted from received audio data to the speaker 761, which converts it into a sound signal for output; on the other hand, the microphone 762 converts collected sound signals into electrical signals, which are received by the audio circuit 760 and converted into audio data. The audio data is then output to the processor 780 for processing and sent via the RF circuit 710 to, for example, another mobile phone, or output to the memory 720 for further processing.

WiFi belongs to a short-distance wireless transmission technology, and a mobile phone can help a user to send and receive emails, browse webpages, access streaming media and the like through a WiFi module 770, so that wireless broadband Internet access is provided for the user. Although fig. 5 shows a WiFi module 770, it is understood that it does not belong to the necessary constitution of the mobile phone, and can be omitted entirely as required within the scope of not changing the essence of the invention.

The processor 780 is a control center of the handset, connects various parts of the entire handset using various interfaces and lines, and performs various functions of the handset and processes data by running or executing software programs and/or modules stored in the memory 720, and invoking data stored in the memory 720. Optionally, the processor 780 may include one or more processing units; preferably, the processor 780 may integrate an application processor that primarily processes operating systems, user interfaces, applications, etc., with a modem processor that primarily processes wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 780.

The handset further includes a power supply 790 (e.g., a battery) for powering the various components, which may preferably be logically connected to the processor 780 through a power management system, such as to provide for managing charging, discharging, and power consumption by the power management system.

Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which will not be described herein.

In this embodiment, the processor 780 included in the terminal device further has the following functions:

Referring to fig. 6, fig. 6 is a schematic diagram of a server 800 according to an embodiment of the present application, where the server 800 may have a relatively large difference due to different configurations or performances, and may include one or more central processing units (Central Processing Units, abbreviated as CPUs) 822 (e.g., one or more processors) and a memory 832, and one or more storage media 830 (e.g., one or more mass storage devices) storing application programs 842 or data 844. Wherein the memory 832 and the storage medium 830 may be transitory or persistent. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations on a server. Still further, the central processor 822 may be configured to communicate with the storage medium 830 to execute a series of instruction operations in the storage medium 830 on the server 800.

The server 800 may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input/output interfaces 858, and/or one or more operating systems 841, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 6.

The embodiments of the present application also provide a computer readable storage medium storing a computer program for executing any one of the data processing methods described in the foregoing embodiments.

The embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the data processing method provided by any of the embodiments described above.

Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware related to program instructions, where the above program may be stored in a computer readable storage medium, and when the program is executed, the program performs steps including the above method embodiments; and the aforementioned storage medium may be at least one of the following media: read-only memory (ROM), RAM, magnetic disk or optical disk, etc., which can store program codes.

It should be noted that the embodiments in this specification are described in a progressive manner: for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus and system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present application without creative effort.

The foregoing is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the technical scope of the present application should be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims

1. A method of data processing, the method comprising:

2. The method of claim 1, wherein the first entity information comprises any one or more of first entity backbone information, first entity role information, first entity alias information, first entity quaternary number information.

3. The method of claim 2, wherein the first entity information comprises first entity backbone information, and wherein the obtaining the first entity information comprises:

acquiring entity name information corresponding to the first entity;

4. The method of claim 2, wherein the first entity information comprises first entity backbone information, and wherein the obtaining the first entity information comprises:

acquiring entity name information corresponding to the first entity;

5. The method of claim 2, wherein the first entity information includes the first entity alias information and the first entity backbone information, and wherein the obtaining the first entity information includes:

6. The method of claim 5, wherein the obtaining the search data corresponding to the first entity comprises:

7. The method according to claim 1, wherein the method further comprises:

acquiring first entity embedded information and second entity embedded information, wherein the first entity embedded information is used for identifying the association relationship between the first entity and other entities, and the second entity embedded information is used for identifying the association relationship between the second entity and other entities;

determining a third similarity between the first entity embedded information and the second entity embedded information;

the determining whether the first entity and the second entity correspond to the same virtual body according to the first similarity includes:

8. The method of claim 7, wherein the determining the corresponding integrated similarity between the first entity and the second entity based on the first similarity and the second similarity comprises:

9. A data processing apparatus, characterized in that the apparatus comprises a first acquisition unit, a first determination unit and a second determination unit:

10. A computer device, the device comprising a processor and a memory:

the processor is configured to perform the data processing method of any of claims 1-8 according to instructions in the program code.

11. A computer readable storage medium, characterized in that the computer readable storage medium is for storing a computer program for executing the data processing method according to any one of claims 1-8.

12. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the data processing method of any of claims 1-8.