CN116304738A - Data processing method, device and equipment - Google Patents

Data processing method, device and equipment

Info

Publication number
CN116304738A
CN116304738A (Application CN202310264152.1A)
Authority
CN
China
Prior art keywords
entity pair
data
matching degree
matching
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310264152.1A
Other languages
Chinese (zh)
Inventor
孙清清
邹泊滔
张晨景
张天翼
王爱凌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310264152.1A priority Critical patent/CN116304738A/en
Publication of CN116304738A publication Critical patent/CN116304738A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the specification provides a data processing method, a device and equipment, wherein the method comprises the following steps: acquiring a first entity pair to be detected; generating a matching model to be trained based on a preset model search space, and inputting the first entity pair into the matching model to obtain the prediction matching degree of the first entity pair; selecting a target entity pair from the first entity pair based on a preset matching degree threshold and the predicted matching degree of the first entity pair, and acquiring the labeling matching degree of the target entity pair; and carrying out iterative training on the matching model based on the labeling matching degree of the target entity pair and the prediction matching degree of the target entity pair to obtain a trained matching model, wherein the trained matching model is used for determining whether the data in the entity pair represent the same entity.

Description

Data processing method, device and equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, and device.
Background
Different data sources each describe a user's behavior information from different dimensions; if these data sources are associated in a unified way, the user can be understood more accurately and the value of the data can be exploited to a greater extent. At present, for example, whether the same entity exists in different data sources may be determined manually.
However, because the amount of data to be matched is large and the data contain many features, entity matching by manual judgment has low efficiency and low accuracy, so a scheme capable of improving the efficiency and accuracy of entity matching is needed.
Disclosure of Invention
The embodiment of the specification aims to provide a data processing method, device and equipment so as to provide a scheme capable of improving the efficiency and accuracy of entity matching.
In order to achieve the above objective, the embodiments of the present specification are implemented as follows:
in a first aspect, a data processing method includes: acquiring a first entity pair to be detected; generating a matching model to be trained based on a preset model search space, and inputting the first entity pair into the matching model to obtain the prediction matching degree of the first entity pair; selecting a target entity pair from the first entity pair based on a preset matching degree threshold and the predicted matching degree of the first entity pair, and acquiring the labeling matching degree of the target entity pair; and carrying out iterative training on the matching model based on the labeling matching degree of the target entity pair and the prediction matching degree of the target entity pair to obtain a trained matching model, wherein the trained matching model is used for determining whether the data in the entity pair represent the same entity.
In a second aspect, embodiments of the present disclosure provide a data processing apparatus, the apparatus comprising: the first acquisition module is used for acquiring a first entity pair to be detected; the model generation module is used for generating a matching model to be trained based on a preset model search space, and inputting the first entity pair into the matching model to obtain the prediction matching degree of the first entity pair; the data selection module is used for selecting a target entity pair from the first entity pair based on a preset matching degree threshold and the predicted matching degree of the first entity pair, and acquiring the labeling matching degree of the target entity pair; the model training module is used for carrying out iterative training on the matching model based on the labeling matching degree of the target entity pair and the prediction matching degree of the target entity pair to obtain a trained matching model, and the trained matching model is used for determining whether the data in the entity pair represent the same entity.
In a third aspect, embodiments of the present specification provide a data processing apparatus, the data processing apparatus comprising: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to: acquiring a first entity pair to be detected; generating a matching model to be trained based on a preset model search space, and inputting the first entity pair into the matching model to obtain the prediction matching degree of the first entity pair; selecting a target entity pair from the first entity pair based on a preset matching degree threshold and the predicted matching degree of the first entity pair, and acquiring the labeling matching degree of the target entity pair; and carrying out iterative training on the matching model based on the labeling matching degree of the target entity pair and the prediction matching degree of the target entity pair to obtain a trained matching model, wherein the trained matching model is used for determining whether the data in the entity pair represent the same entity.
In a fourth aspect, embodiments of the present description provide a storage medium for storing computer-executable instructions that, when executed, implement the following: acquiring a first entity pair to be detected; generating a matching model to be trained based on a preset model search space, and inputting the first entity pair into the matching model to obtain the prediction matching degree of the first entity pair; selecting a target entity pair from the first entity pair based on a preset matching degree threshold and the predicted matching degree of the first entity pair, and acquiring the labeling matching degree of the target entity pair; and carrying out iterative training on the matching model based on the labeling matching degree of the target entity pair and the prediction matching degree of the target entity pair to obtain a trained matching model, wherein the trained matching model is used for determining whether the data in the entity pair represent the same entity.
Drawings
In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some of the embodiments described in the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1A is a flowchart illustrating an embodiment of a data processing method according to the present disclosure;
FIG. 1B is a schematic diagram illustrating a data processing method according to the present disclosure;
FIG. 2 is a schematic diagram illustrating a processing procedure of another data processing method according to the present disclosure;
FIG. 3 is a schematic diagram of a data processing process according to the present disclosure;
FIG. 4 is a schematic diagram of an embodiment of a data processing apparatus according to the present disclosure;
FIG. 5 is a schematic diagram of a data processing device according to the present specification.
Detailed Description
The embodiment of the specification provides a data processing method, a device and equipment.
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
Example 1
As shown in fig. 1A and fig. 1B, the embodiment of the present disclosure provides a data processing method, where an execution body of the method may be a server, and the server may be an independent server or a server cluster formed by a plurality of servers. The method specifically comprises the following steps:
In S102, a first entity pair to be detected is acquired.
The first entity pair may include a plurality of entities that are screened out from the entity information provided by different data sources and that may be the same entity. The entities may be, for example, users, services or institutions. The first entity pair may include the entity information corresponding to each entity, and the entity information may include feature information describing the entity, such as an identification, a type and contact information. The entity information of the plurality of entities in the first entity pair may differ. For example, the first entity pair may include entity 1 provided by data source 1 and entity 2 provided by data source 2, where the entity information of entity 1 may include the identification and type of entity 1, and the entity information of entity 2 may include the type, address and contact information of entity 2.
In practice, different data sources each describe a user's behavior information from different dimensions; if these data sources are associated in a unified way, the user can be understood more accurately and the value of the data can be exploited to a greater extent. For example, whether the same entity exists in different data sources may be determined manually. However, because the amount of data to be matched is large and the data contain many features, entity matching by manual judgment has low efficiency and low accuracy, so a scheme capable of improving the efficiency and accuracy of entity matching is needed. To this end, the embodiments of the present specification provide a technical solution that can solve the above problems; for details, reference may be made to the following.
Because there may be a difference between the description information (i.e., entity information) of different data sources for the same entity, in order to improve the data utilization rate, entity matching may be performed on different entities, so as to perform subsequent data processing (such as data mining processing, risk detection processing, etc.) through the matched entities.
The server may perform a preliminary screening process on entity information provided by different data sources to obtain a plurality of entities (i.e., a first entity pair) that are preliminarily matched. The preliminary screening process may be a matching process for the entity based on the entity information.
For example, assuming that the entity information 1 provided by the data source 1 includes entity information as shown in table 1 below and the entity information 2 provided by the data source 2 includes entity information as shown in table 2 below, the server may perform a preliminary screening process on the entity provided by the data source 1 and the entity provided by the data source 2 based on the entity information in tables 1 and 2, to obtain a first entity pair to be detected.
TABLE 1
Entity     Number   Name   Address    Type
Entity 1   xx1      aa     Adress-1   person
Entity 2   xx2      bb     Adress-2   person
Entity 3   xx3      cc     Adress-3   person
TABLE 2
Entity     Number   Contact    Description
Entity 4   xx1      Adress-1   aaaa
Entity 5   yy1      Adress-4   bbbb
Entity 6   yy2      Adress-3   cccc
In table 1 and table 2, each row may be used to represent an entity, each column may be used to represent feature information of an entity, and the server may determine, when performing the preliminary screening process, a plurality of entities that may be the same entity according to the feature information, and determine the plurality of entities as a first entity pair.
For example, the server may determine a plurality of entities having the same characteristic information in the entity information 1 and the entity information 2 as the first entity pair, specifically, for example, the number of the entity 1 in the above table 1 is the same as the number of the entity 4 in the above table 2, and then the server may determine the entity 1 and the entity 4 as the first entity pair.
In addition, since the identification rules of different data sources for entities may differ, the server may determine the first entity pair to be detected based on a plurality of pieces of feature information. For example, if the number of entity 1 in Table 1 is the same as the number of entity 4 in Table 2, and the address of entity 1 is the same as the contact of entity 4, the server may determine entity 1 and entity 4 as the first entity pair.
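As a purely illustrative, non-limiting sketch of the preliminary screening described above, the following code pairs up entities from two data sources that share a value on any of the configured feature combinations. The field names (number, address, contact) mirror Tables 1 and 2, but the function name and data layout are assumptions made for this example only and are not prescribed by this specification.

```python
# Illustrative sketch of the preliminary screening step; field names and data
# layout are assumptions made for this example only.
def preliminary_screen(source1, source2, field_pairs):
    """Pair up entities from two sources that share a value on any of the
    given (field_in_source1, field_in_source2) combinations."""
    candidate_pairs = []
    for e1 in source1:
        for e2 in source2:
            if any(e1.get(f1) and e1.get(f1) == e2.get(f2)
                   for f1, f2 in field_pairs):
                candidate_pairs.append((e1, e2))
    return candidate_pairs


source1 = [{"id": "Entity 1", "number": "xx1", "address": "Adress-1"},
           {"id": "Entity 2", "number": "xx2", "address": "Adress-2"}]
source2 = [{"id": "Entity 4", "number": "xx1", "contact": "Adress-1"},
           {"id": "Entity 5", "number": "yy1", "contact": "Adress-4"}]

# Match on equal numbers, or on address versus contact, as in the example above.
pairs = preliminary_screen(source1, source2,
                           [("number", "number"), ("address", "contact")])
print([(a["id"], b["id"]) for a, b in pairs])  # [('Entity 1', 'Entity 4')]
```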
Alternatively, the first entity pair may be a plurality of entities that may be the same entity and are screened manually based on entity information provided by different data sources, and in addition, the obtaining manner of the first entity pair may be various and may be different according to different practical application scenarios, which is not limited in the embodiment of the present disclosure.
In S104, a matching model to be trained is generated based on a preset model search space, and the first entity pair is input into the matching model to obtain a predicted matching degree of the first entity pair.
The preset model search space may include a plurality of deep learning algorithms determined based on a preset service scenario; for example, the preset model search space may include a decision tree algorithm, a random forest algorithm, a support vector machine (SVM) algorithm, a logistic regression algorithm, a DeepMatcher algorithm, and the like. Different preset model search spaces may be configured according to the actual service scenario, which is not specifically limited in the embodiments of the present disclosure.
In an implementation, the server may select a deep learning algorithm from the preset model search space based on automated machine learning (AutoML) to construct a matching model to be trained, and may then input the first entity pair into the matching model to be trained to obtain the predicted matching degree of the first entity pair. The matching model may be used to determine the degree to which the plurality of entities in the first entity pair match the same entity.
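A minimal sketch of generating a matching model from a preset model search space is given below, using scikit-learn classifiers as stand-ins for the candidate algorithms and cross-validation as the selection criterion. This specification does not prescribe a concrete AutoML framework, so the selection strategy, the candidate list and the variable names here are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Preset model search space: candidate matching algorithms (illustrative).
MODEL_SEARCH_SPACE = {
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
    "svm": SVC(probability=True),            # probability=True exposes a matching degree
    "logistic_regression": LogisticRegression(max_iter=1000),
}


def generate_matching_model(X, y):
    """Pick the candidate with the best cross-validated score and fit it."""
    best_name = max(
        MODEL_SEARCH_SPACE,
        key=lambda name: cross_val_score(MODEL_SEARCH_SPACE[name], X, y, cv=3).mean(),
    )
    return MODEL_SEARCH_SPACE[best_name].fit(X, y)


# X: feature vectors of first entity pairs, y: initial labels (illustrative data).
X = np.random.rand(60, 8)
y = np.array([0, 1] * 30)
matching_model = generate_matching_model(X, y)
predicted_matching_degree = matching_model.predict_proba(X)[:, 1]  # values in [0, 1]
```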
In S106, a target entity pair is selected from the first entity pair based on a preset matching degree threshold and a predicted matching degree of the first entity pair, and a labeling matching degree of the target entity pair is obtained.
The preset matching degree threshold may be a threshold set based on a preset service scene, or the preset matching degree threshold may be determined according to a predicted matching degree of the first entity pair, and may be different according to different service scenes.
In implementation, a first entity pair with a predicted matching degree smaller than a preset matching degree threshold may be determined as a target entity pair, and a labeling matching degree of the target entity pair may be obtained. In addition, to improve the model training efficiency, the ratio of the number of the selected target entity pairs to the number of the first entity pairs may be less than a preset ratio threshold.
For example, assuming that the first entity pairs to be detected include 100 entity pairs, target entity pairs may be selected from the first entity pairs based on the predicted matching degree of each first entity pair and the preset matching degree threshold. Specifically, a first entity pair whose predicted matching degree is smaller than the preset matching degree threshold may be determined as a target entity pair; for example, there may be 60 such target entity pairs.
In addition, if the ratio of the number of the selected target entity pairs to the number of the first entity pairs is not less than the preset ratio threshold, for improving the training efficiency of the model, the first entity pairs with the predicted matching degree less than the preset matching degree threshold may be screened based on the preset ratio threshold, and the screened first entity pairs are determined to be the target entity pairs. For example, assuming that there are 100 first entity pairs and 60 first entity pairs with a predicted matching degree smaller than a preset matching degree threshold, if the preset ratio threshold is 0.3, then a filtering process may be performed on the 60 first entity pairs with a predicted matching degree smaller than the preset matching degree threshold based on the preset ratio threshold, for example, 100×0.3=30 first entity pairs may be randomly selected from the 60 first entity pairs, and the 30 first entity pairs may be determined as target entity pairs, or the 60 first entity pairs may be ranked based on the predicted matching degree, and 30 first entity pairs may be selected and determined as target entity pairs based on the ranked first entity pairs.
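The selection just described can be sketched as follows; the matching degree threshold of 0.8 and the ratio threshold of 0.3 are illustrative values, and both the ranking direction and the random-sampling fallback are assumptions rather than requirements of this specification.

```python
import random


def select_target_pairs(pairs, predicted, match_threshold=0.8,
                        ratio_threshold=0.3, by_ranking=True):
    """Select pairs whose predicted matching degree is below the preset
    threshold, then cap the selection at ratio_threshold * len(pairs)."""
    below = [(p, d) for p, d in zip(pairs, predicted) if d < match_threshold]
    cap = int(len(pairs) * ratio_threshold)
    if len(below) > cap:
        if by_ranking:
            # One possible ranking: keep the pairs with the lowest predicted degree.
            below = sorted(below, key=lambda item: item[1])[:cap]
        else:
            below = random.sample(below, cap)
    return [p for p, _ in below]

# With 100 first entity pairs and 60 of them under the threshold,
# a ratio threshold of 0.3 keeps 30 target entity pairs.
```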
The above method for determining the target entity pair is an optional and implementable method, and in an actual application scenario, there may be a plurality of different determining methods, and may be different according to the actual application scenario, which is not specifically limited in the embodiment of the present disclosure.
After the target entity pair is determined, the labeling matching degree of the target entity pair can be obtained. For example, the labeling matching degree of the target entity pair can be determined by manual annotation; in other words, the predicted matching degree of the target entity pair is updated in an active-learning manner, that is, the labeling matching degree of the target entity pair is determined.
In addition, there may be a plurality of methods for determining the labeling matching degree of the target entity pair; for example, the labeling matching degree of the target entity pair may be determined based on a matching degree determining rule corresponding to a preset service scenario. Different determination methods may be selected according to the actual application scenario, which is not specifically limited in the embodiments of the present disclosure.
In S108, based on the labeled matching degree of the target entity pair and the predicted matching degree of the target entity pair, iterative training is performed on the matching model, so as to obtain a trained matching model.
The trained matching model is used for determining whether the data in the entity pair represents the same entity.
In implementation, the server may determine whether the matching model converges based on the labeling matching degree of the target entity pair and the prediction matching degree of the target entity pair. If it is determined that the matching model does not converge, the server may continue to execute S106 to S108 until the matching model converges, so as to obtain a trained matching model. That is, in each round of iterative training of the matching model, the prediction matching degree of the selected target entity pair is updated to obtain the labeling matching degree of the target entity pair, and whether the matching model converges is determined based on the labeling matching degree and the prediction matching degree of the target entity pair.
On the one hand, the target entity pairs are selected from the first entity pairs, that is, the number of target entity pairs is smaller than the number of first entity pairs. Determining whether the matching model converges based on this smaller number of target entity pairs avoids the low efficiency that would result from making this determination on the larger number of first entity pairs, saves iterations and data processing resources, and improves model training efficiency.
On the other hand, in each iteration the server updates the predicted matching degree of the target entity pair, that is, determines the labeling matching degree of the target entity pair, so that the training accuracy of the matching model can be improved while data processing resources are saved.
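One possible, non-limiting sketch of the iterative training in S106 to S108 is the active-learning loop below. It assumes the matching model has already been fitted once (for example, by the model generation step above), uses the change in log loss between the labeling matching degrees and the prediction matching degrees of the target entity pairs as the convergence test, and represents manual annotation by a caller-supplied query_labels function; all of these concrete choices are assumptions.

```python
import numpy as np
from sklearn.metrics import log_loss


def train_matching_model(model, X_pairs, query_labels,
                         match_threshold=0.8, max_rounds=20, tol=1e-3):
    """Iteratively refine an already-fitted matching model on target entity pairs.

    query_labels(indices) stands in for manual annotation / active learning and
    returns the labeling matching degree (0 or 1) of the selected target pairs.
    """
    labeled_X, labeled_y = [], []
    prev_loss = None
    for _ in range(max_rounds):
        predicted = model.predict_proba(X_pairs)[:, 1]
        # S106: select target pairs whose predicted matching degree is below the threshold.
        target_idx = np.where(predicted < match_threshold)[0]
        if len(target_idx) == 0:
            break
        labels = np.asarray(query_labels(target_idx))       # labeling matching degree
        # S108: convergence test on the target pairs only.
        loss = log_loss(labels, predicted[target_idx], labels=[0, 1])
        if prev_loss is not None and abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
        # Retrain on everything labeled so far (requires both classes to be present).
        labeled_X.append(X_pairs[target_idx])
        labeled_y.append(labels)
        X_train, y_train = np.concatenate(labeled_X), np.concatenate(labeled_y)
        if len(np.unique(y_train)) > 1:
            model.fit(X_train, y_train)
    return model
```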
In addition, the generated matching model may use default hyper-parameter settings when training the matching model. The server can also generate a matching model according to the evaluation function selected by the user, and optimize the parameters of the matching model in an automatic tuning mode under an AutoML framework.
The embodiment of the specification provides a data processing method, which comprises the steps of obtaining a first entity pair to be detected, generating a matching model to be trained based on a preset model search space, inputting the first entity pair into the matching model to obtain the predicted matching degree of the first entity pair, selecting a target entity pair from the first entity pair based on a preset matching degree threshold and the predicted matching degree of the first entity pair, obtaining the labeling matching degree of the target entity pair, performing iterative training on the matching model based on the labeling matching degree of the target entity pair and the predicted matching degree of the target entity pair to obtain a trained matching model, and determining whether data in the entity pair represent the same entity or not by the trained matching model. In addition, in each iteration process, the server needs to update the predicted matching degree of the target entity pair, namely, determine the labeling matching degree of the target entity pair, so as to iteratively train the matching model based on the labeling matching degree and the predicted matching degree of the target entity pair, and improve the training accuracy of the matching model while saving data processing resources.
Example two
As shown in fig. 2, the embodiment of the present disclosure provides a data processing method, where an execution body of the method may be a server, where the server may be an independent server or may be a server cluster formed by a plurality of servers. The method specifically comprises the following steps:
In S202, a first data set and a second data set to be matched are acquired.
The first data set and the second data set may be data sets from different data sources, for example, the first data set may be a data set generated based on a resource transfer service, the second data set may be a data set generated based on an instant messaging service, i.e., the first data set may include resource transfer data, and the second data set may include instant messaging data.
In S204, the first data set is divided into a plurality of first sub-data and the second data set is divided into a plurality of second sub-data based on a preset data dividing algorithm.
In implementation, there may be various preset data dividing algorithms. For example, the server may determine each piece of data included in the first data set as one piece of first sub-data and each piece of data included in the second data set as one piece of second sub-data, where each piece of sub-data (i.e., first sub-data and second sub-data) is entity information that can be used to characterize an entity.
Alternatively, the server may cluster the data in the first data set based on the entity information in the first data set and determine a plurality of pieces of first sub-data based on the clustering result. For example, assuming that the first data set includes entity information 1, entity information 2 and entity information 3, the three pieces of entity information may be clustered based on a preset clustering algorithm; if the clustering result is that entity information 1 and entity information 2 fall into one class and entity information 3 falls into another class, the server may determine entity information 1 and entity information 2 as first sub-data 1 and determine entity information 3 as first sub-data 2. Similarly, the server may divide the second data set into a plurality of pieces of second sub-data.
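A minimal sketch of the clustering-based splitting described above, assuming the entity information has already been turned into numeric vectors (the vectorization itself is not prescribed by the text) and using k-means as the preset clustering algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans


def split_dataset(records, vectors, n_clusters=2):
    """Split a data set into sub-data by clustering its entity information.

    records: list of entity-information records; vectors: their numeric
    representations (how they are built is left open by the text).
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.asarray(vectors))
    sub_data = {}
    for record, label in zip(records, labels):
        sub_data.setdefault(label, []).append(record)
    return list(sub_data.values())  # each element is one piece of sub-data
```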
The above-mentioned segmentation method is an optional and realizable segmentation method, and in the actual application scenario, there may be a plurality of different segmentation methods, and they may be different according to the actual application scenario, which is not specifically limited in the embodiment of the present disclosure.
In addition, before the data is divided, the server may further perform preprocessing on the entity information in the first data set and the second data set based on a preprocessing rule, and divide the preprocessed first data set into a plurality of first sub-data, and divide the preprocessed second data set into a plurality of second sub-data.
For example, the server may adjust the data formats of the entity information in the first data set and the second data set, such as adjusting the formats of the data corresponding to the dates in the first data set and the second data set to a uniform format, or the server may perform desensitization processing on the privacy data in the first data set and the second data set, or the like.
In S206, sample alignment processing is performed on the first sub-data and the second sub-data to obtain a sample alignment result, and a first entity pair is determined based on the sample alignment result.
The first entity pair comprises first sub-data and second sub-data with corresponding relations.
In an implementation, the server may perform sample alignment processing on the first sub-data and the second sub-data based on the feature information of the entity in the first sub-data and the feature information of the entity in the second sub-data, to obtain a sample alignment result.
For example, the sample alignment processing may be performed on the first sub-data and the second sub-data based on the feature information that the first sub-data and the second sub-data have in common, to obtain a sample alignment result. For example, assuming that the feature information of the entity in the first sub-data includes a number, a name, an address and a type, and the feature information of the entity in the second sub-data includes a number, a contact and a description, the server may perform sample alignment processing on the first sub-data and the second sub-data based on the address and the contact to obtain a sample alignment result; that is, entities whose address and contact are the same may be aligned to obtain the sample alignment result.
In addition, there may be multiple sample alignment methods, and different methods may be selected according to different practical application scenarios, which are not specifically limited in the embodiment of the present disclosure.
In addition, before the sample alignment processing, data preprocessing may be performed on the first sub-data and the second sub-data; that is, missing values and unbalanced data in the first sub-data and the second sub-data may be handled (for example, missing values may be filled in). Alternatively, if the language forms of the first sub-data and the second sub-data are different, translation processing may be performed on the first sub-data and/or the second sub-data so that the language forms of the first sub-data and the second sub-data are the same.
In addition, in a practical scenario, there are various methods for determining the first entity pair, and the following provides an alternative implementation manner, which can be specifically referred to the following steps one to three:
step one, based on a sample alignment result, first sub-data and second sub-data with an alignment relation are determined.
In implementation, first sub-data and second sub-data that share one or several pieces of the same feature information may be determined as the first sub-data and the second sub-data having an alignment relationship.
And step two, obtaining the similarity between the first sub data and the second sub data with the alignment relation.
In an implementation, the server may determine the similarity between the first sub-data and the second sub-data having an alignment relationship based on a preset similarity algorithm, where the preset similarity algorithm may be, for example, a MinHash algorithm, a blocking algorithm based on similarity and distance, or the like.
And thirdly, determining first sub-data and second sub-data with corresponding relations based on the similarity, and determining the first sub-data and the second sub-data with the corresponding relations as a first entity pair.
In an implementation, the first entity pair (i.e., the candidate entity pair) may be determined based on the similarity and the preset similarity threshold, that is, the first sub-data and the second sub-data having a correspondence relationship with the similarity not smaller than the preset similarity threshold may be determined as the first entity pair.
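The following sketch illustrates steps one to three with a character 3-gram Jaccard similarity, which is the quantity that a MinHash algorithm approximates; the key names and the similarity threshold of 0.6 are assumptions for illustration only.

```python
def char_ngrams(text, n=3):
    """Character n-grams of a string (whole string if it is shorter than n)."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two strings over their character 3-grams."""
    sa, sb = char_ngrams(a), char_ngrams(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


def candidate_pairs(first_sub_data, second_sub_data, key1, key2,
                    sim_threshold=0.6):
    """Keep aligned sub-data whose similarity is not below the threshold."""
    pairs = []
    for d1 in first_sub_data:
        for d2 in second_sub_data:
            if jaccard(str(d1.get(key1, "")), str(d2.get(key2, ""))) >= sim_threshold:
                pairs.append((d1, d2))
    return pairs
```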
In S104, a matching model to be trained is generated based on the preset model search space.
In S208, a feature vector of the first entity pair is determined based on a preset feature extraction algorithm and a preset feature selection rule, and the feature vector of the first entity pair is input into the matching model to obtain the predicted matching degree of the first entity pair.
The preset feature extraction algorithm may be any algorithm that can be used for performing feature extraction, for example, the preset feature extraction algorithm may be a neural network algorithm, etc.
In implementation, because different service scenarios have different data processing requirements, a preset feature selection rule corresponding to the current service scenario can be obtained, the feature item information corresponding to the entities in the first entity pair is selected based on the preset feature selection rule, and the feature vector of the first entity pair is determined based on the preset feature extraction algorithm.
For example, assuming that the feature information of the entity in the first sub-data includes a number, a name, an address and a type, and the feature information of the entity in the second sub-data includes a number, a contact way and a description, if it is determined that the feature to be detected is the name and the contact way based on a preset feature selection rule corresponding to the current service scene, then the feature information corresponding to the name in the first sub-data and the feature information corresponding to the contact way in the second sub-data may be selected, and then the server may perform feature extraction processing on the feature information corresponding to the name in the first sub-data and the feature information corresponding to the contact way in the second sub-data based on a preset feature extraction algorithm, so as to obtain a feature vector of the first entity pair.
Alternatively, the preset feature selection rule may be to select the feature information that the first sub-data and the second sub-data in the first entity pair have in common, and to determine the predicted matching degree of the first entity pair based on the similarity between the shared feature information and on the preset feature extraction algorithm. For example, the similarity between the feature information corresponding to the address in the first sub-data and the feature information corresponding to the contact in the second sub-data may be obtained; a feature vector corresponding to this similarity may then be determined based on the preset feature extraction algorithm and determined as the feature vector of the first entity pair. That is, the feature vector may be used to characterize the similarity between the multiple entities in the first entity pair.
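A minimal sketch of building the feature vector of an entity pair under a preset feature selection rule is shown below; it uses pairwise string similarities of the selected fields, and the concrete fields and the choice of difflib.SequenceMatcher are assumptions, not requirements of this specification.

```python
from difflib import SequenceMatcher


def pair_feature_vector(entity1, entity2, selected_fields):
    """Feature vector of an entity pair: one similarity score per selected
    (field_of_entity1, field_of_entity2) combination."""
    vector = []
    for f1, f2 in selected_fields:
        v1, v2 = str(entity1.get(f1, "")), str(entity2.get(f2, ""))
        vector.append(SequenceMatcher(None, v1, v2).ratio())
    return vector


# Preset feature selection rule for the current scenario (illustrative):
# compare name against contact and address against contact.
rule = [("name", "contact"), ("address", "contact")]
vec = pair_feature_vector({"name": "aa", "address": "Adress-1"},
                          {"contact": "Adress-1"}, rule)
```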
The above method for determining the feature vector of the first entity pair is an optional and implementable determination method; in an actual application scenario, there may also be a plurality of different determination methods, which may be selected according to the actual application scenario and are not specifically limited in the embodiments of the present disclosure.
In S210, a difference between the preset matching degree threshold and the predicted matching degree of each first entity pair is obtained.
In S212, a first entity pair corresponding to a difference value smaller than a preset difference threshold is determined as a target entity pair.
In implementation, the proportion of target entity pairs among the first entity pairs can be adjusted by adjusting the preset difference threshold, and the preset difference threshold can be set differently for different actual service scenarios to meet their training requirements.
For example, since a risk detection scenario has a higher requirement on model accuracy and a lower requirement on model training timeliness, the preset difference threshold may be set to a larger value to increase the proportion of target entity pairs among the first entity pairs, that is, to increase the number of iterations on the matching model and improve the accuracy of the matching model. Conversely, for a recommendation scenario with a lower requirement on model accuracy and a higher requirement on model training timeliness, the preset difference threshold may be set to a smaller value to decrease the proportion of target entity pairs among the first entity pairs, that is, to decrease the number of iterations on the matching model and improve the training efficiency of the matching model.
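A sketch of the selection in S210 to S212 follows. It reads the "difference" as the absolute difference between the preset matching degree threshold and the predicted matching degree (an assumption), so that the pairs the current model is least certain about are selected; the threshold values are illustrative.

```python
def select_by_difference(pairs, predicted, match_threshold=0.8,
                         diff_threshold=0.2):
    """Keep pairs whose predicted matching degree lies close to the preset
    matching degree threshold, i.e. the pairs the model is least certain about."""
    return [p for p, d in zip(pairs, predicted)
            if abs(match_threshold - d) < diff_threshold]

# A risk-detection scenario might raise diff_threshold (more target pairs and
# iterations); a recommendation scenario might lower it.
```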
In S106, the labeling matching degree of the target entity pair is obtained.
In S108, whether the matching model converges is determined based on the labeling matching degree of the target entity pair and the prediction matching degree of the target entity pair; if it is determined that the matching model does not converge, the model parameters of the matching model are adjusted based on a preset parameter search space to obtain an updated matching model, and the updated matching model continues to be iteratively trained based on the first entity pair, so as to obtain a trained matching model.
In implementation, the preset parameter search space may include a random search algorithm, a grid search algorithm, a Bayesian optimization algorithm, and the like. Parameter search may be optimized through the preset parameter search space (for example, by means of reinforcement learning), and the hyper-parameters of the matching model (such as the number of trees in the random forest algorithm or the kernel in the support vector machine algorithm) may be adjusted to obtain better model performance.
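As one concrete, non-limiting realization of the preset parameter search space, the sketch below uses scikit-learn's RandomizedSearchCV over a random forest; the text equally allows grid search or Bayesian optimization, and the parameter grid shown is an assumption for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Preset parameter search space (illustrative values).
PARAM_SEARCH_SPACE = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}


def tune_matching_model(X, y, n_iter=10):
    """Adjust the hyper-parameters of the matching model and refit it."""
    search = RandomizedSearchCV(RandomForestClassifier(), PARAM_SEARCH_SPACE,
                                n_iter=n_iter, cv=3, random_state=0)
    search.fit(X, y)
    return search.best_estimator_  # the updated matching model
```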
As shown in fig. 3, entity blocking may first be performed on the first data set and the second data set; that is, a subset D×D′ may be selected from the first data set and the second data set, where D may be the number of columns of the subset (i.e., the features corresponding to an entity) and D′ may be the number of rows of the subset (i.e., the number of entities included in the subset), so as to obtain the first entity pair. In this way it can be ensured that most of the data pairs among the entity pairs corresponding to the predicted matching degrees output by the matching model are in the subset.
Then, the server may determine a matching model through an automatic learning model development framework constructed with an automated machine learning algorithm, and train the matching model to obtain a trained matching model. Specifically, a feature vector of the first entity pair may be determined based on a preset feature extraction algorithm and a preset feature selection rule (i.e., feature engineering), a matching model to be trained may then be generated based on a preset model search space (i.e., model selection), and the predicted matching degree of the target entity pair may be generated based on the matching model. The server may determine whether the matching model converges based on the predicted matching degree and the labeling matching degree of the target entity pair; if the matching model does not converge, the server may adjust the model parameters of the matching model through a preset parameter search space (i.e., parameter tuning) and continue training the matching model to obtain the trained matching model.
In S214, a target entity pair to be detected is obtained, and a feature vector of the target entity pair is determined based on the preset feature extraction algorithm and the preset feature selection rule.
In an implementation, the target entity pair may include a plurality of entities, feature selection may be performed on entity information corresponding to the plurality of entities based on a preset feature extraction algorithm and a preset feature selection rule, and feature extraction processing is performed based on the selected features, so as to obtain a feature vector of the target entity pair.
In S216, the feature vector of the target entity pair is input into the trained matching model, so as to obtain the predicted matching degree of the target entity pair.
In S218, it is determined whether the data in the target entity pair characterizes the same entity based on the predicted match of the target entity pair.
In an implementation, taking an account identity detection scenario as an example, in order to prevent the same user from maliciously registering multiple times in the same application program, the server may perform entity matching on a plurality of pieces of account information in the users' input data. For example, a plurality of pieces of account information with the same login address may be determined as a target entity pair, the predicted matching degree of the target entity pair may then be determined based on the trained matching model, and whether the data in the target entity pair represent the same entity may be determined based on the predicted matching degree of the target entity pair. If it is determined that the data in the target entity pair represent the same entity, it may be determined that the user has malicious registration behavior, and preset alarm information may be output for the plurality of pieces of account information.
In addition, in the case that there are a plurality of account information, any two account information can be determined as one target entity pair, and the prediction matching degree of each target entity pair is determined based on the trained matching model.
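A sketch of S214 to S218 applied to the account identity detection example is given below. It forms a target entity pair from any two account records, scores each pair with the trained matching model, and flags pairs above the matching degree threshold; the pair_feature_vector helper refers to the earlier sketch, and the account_id field and threshold value are assumptions.

```python
from itertools import combinations


def detect_duplicate_accounts(accounts, model, feature_rule, match_threshold=0.8):
    """Flag account pairs that the trained matching model judges to represent
    the same entity (possible malicious repeat registration)."""
    alerts = []
    for acc1, acc2 in combinations(accounts, 2):    # any two pieces of account info
        vec = pair_feature_vector(acc1, acc2, feature_rule)
        degree = model.predict_proba([vec])[0, 1]   # predicted matching degree
        if degree >= match_threshold:
            alerts.append((acc1["account_id"], acc2["account_id"], degree))
    return alerts  # e.g. used to drive the preset alarm output
```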
The embodiment of the specification provides a data processing method, which comprises the steps of obtaining a first entity pair to be detected, generating a matching model to be trained based on a preset model search space, inputting the first entity pair into the matching model to obtain the predicted matching degree of the first entity pair, selecting a target entity pair from the first entity pair based on a preset matching degree threshold and the predicted matching degree of the first entity pair, obtaining the labeling matching degree of the target entity pair, performing iterative training on the matching model based on the labeling matching degree of the target entity pair and the predicted matching degree of the target entity pair to obtain a trained matching model, and determining whether data in the entity pair represent the same entity or not by the trained matching model. In addition, in each iteration process, the server needs to update the predicted matching degree of the target entity pair, namely, determine the labeling matching degree of the target entity pair, so as to iteratively train the matching model based on the labeling matching degree and the predicted matching degree of the target entity pair, and improve the training accuracy of the matching model while saving data processing resources.
Example III
Based on the same concept as the data processing method provided in the embodiments of the present disclosure, the embodiments of the present disclosure further provide a data processing device, as shown in fig. 4.
The data processing apparatus includes: a first acquisition module 401, a model generation module 402, a data selection module 403, and a model training module 404, wherein:
a first obtaining module 401, configured to obtain a first entity pair to be detected;
the model generating module 402 is configured to generate a matching model to be trained based on a preset model search space, and input the first entity pair into the matching model to obtain a predicted matching degree of the first entity pair;
a data selection module 403, configured to select a target entity pair from the first entity pair based on a preset matching degree threshold and a predicted matching degree of the first entity pair, and obtain a labeling matching degree of the target entity pair;
the model training module 404 is configured to iteratively train the matching model based on the labeling matching degree of the target entity pair and the prediction matching degree of the target entity pair, to obtain a trained matching model, where the trained matching model is used to determine whether the data in the entity pair represents the same entity.
In the embodiment of the present disclosure, the model training module 404 is configured to:
determining whether the matching model is converged based on the labeling matching degree of the target entity pair and the prediction matching degree of the target entity pair, adjusting model parameters of the matching model based on a preset parameter search space under the condition that the matching model is not converged, obtaining an updated matching model, and continuing to iteratively train the updated matching model based on the first entity pair to obtain the trained matching model.
In this embodiment of the present disclosure, the first obtaining module 401 is configured to:
acquiring a first data set and a second data set to be matched;
dividing the first data set into a plurality of first sub-data and dividing the second data set into a plurality of second sub-data based on a preset data dividing algorithm;
and carrying out sample alignment processing on the first sub-data and the second sub-data to obtain a sample alignment result, and determining the first entity pair based on the sample alignment result, wherein the first entity pair comprises the first sub-data and the second sub-data with corresponding relations.
In this embodiment of the present disclosure, the first obtaining module 401 is configured to:
determining the first sub-data and the second sub-data having an alignment relationship based on the sample alignment result;
acquiring similarity between the first sub data and the second sub data with alignment relation;
and determining the first sub-data and the second sub-data with the corresponding relation based on the similarity, and determining the first sub-data and the second sub-data with the corresponding relation as the first entity pair.
In an embodiment of the present disclosure, the data selecting module is configured to:
acquiring a difference value between the preset matching degree threshold value and the predicted matching degree of each first entity pair;
and determining the first entity pair corresponding to the difference value, of which the difference value is smaller than a preset difference value threshold, as the target entity pair.
In the embodiment of the present specification, the model generating module 402 is configured to:
and determining the characterization vector of the first entity pair based on a preset characterization extraction algorithm and a preset feature selection rule, and inputting the characterization vector of the first entity pair into the matching model to obtain the prediction matching degree of the first entity pair.
In an embodiment of the present disclosure, the apparatus further includes:
the second acquisition module is used for acquiring a target entity pair to be detected and determining a characterization vector of the target entity pair based on the preset characterization extraction algorithm and the preset feature selection rule;
the matching degree determining module is used for inputting the characterization vector of the target entity pair into the trained matching model to obtain the prediction matching degree of the target entity pair;
and the entity matching module is used for determining whether the data in the target entity pair represents the same entity or not based on the predicted matching degree of the target entity pair.
The embodiment of the specification provides a data processing device, which is used for acquiring a first entity pair to be detected, generating a matching model to be trained based on a preset model search space, inputting the first entity pair into the matching model to obtain the predicted matching degree of the first entity pair, selecting a target entity pair from the first entity pair based on a preset matching degree threshold and the predicted matching degree of the first entity pair, acquiring the labeling matching degree of the target entity pair, performing iterative training on the matching model based on the labeling matching degree of the target entity pair and the predicted matching degree of the target entity pair to obtain a trained matching model, and determining whether data in the entity pair represent the same entity or not by the trained matching model. In addition, in each iteration process, the server needs to update the predicted matching degree of the target entity pair, namely, determine the labeling matching degree of the target entity pair, so as to iteratively train the matching model based on the labeling matching degree and the predicted matching degree of the target entity pair, and improve the training accuracy of the matching model while saving data processing resources.
Example IV
Based on the same idea, the embodiment of the present disclosure further provides a data processing device, as shown in fig. 5.
The data processing apparatus may vary considerably in configuration or performance and may include one or more processors 501 and memory 502, in which memory 502 may store one or more stored applications or data. Wherein the memory 502 may be transient storage or persistent storage. The application programs stored in memory 502 may include one or more modules (not shown) each of which may include a series of computer executable instructions for use in a data processing apparatus. Still further, the processor 501 may be arranged to communicate with the memory 502 and execute a series of computer executable instructions in the memory 502 on a data processing apparatus. The data processing device may also include one or more power supplies 503, one or more wired or wireless network interfaces 504, one or more input/output interfaces 505, and one or more keyboards 506.
In particular, in this embodiment, the data processing apparatus includes a memory and one or more programs, where the one or more programs are stored in the memory, the one or more programs may include one or more modules, each module may include a series of computer-executable instructions for the data processing apparatus, and the one or more programs configured to be executed by the one or more processors comprise instructions for: acquiring a first entity pair to be detected; generating a matching model to be trained based on a preset model search space, and inputting the first entity pair into the matching model to obtain the prediction matching degree of the first entity pair; selecting a target entity pair from the first entity pair based on a preset matching degree threshold and the predicted matching degree of the first entity pair, and acquiring the labeling matching degree of the target entity pair; and carrying out iterative training on the matching model based on the labeling matching degree of the target entity pair and the prediction matching degree of the target entity pair to obtain a trained matching model, wherein the trained matching model is used for determining whether the data in the entity pair represent the same entity.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts of the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the data processing device embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant points, reference may be made to the partial description of the method embodiments.
The embodiment of the specification provides data processing equipment, which is used for acquiring a first entity pair to be detected, generating a matching model to be trained based on a preset model search space, inputting the first entity pair into the matching model to obtain the predicted matching degree of the first entity pair, selecting a target entity pair from the first entity pair based on a preset matching degree threshold and the predicted matching degree of the first entity pair, acquiring the labeling matching degree of the target entity pair, performing iterative training on the matching model based on the labeling matching degree of the target entity pair and the predicted matching degree of the target entity pair to obtain a trained matching model, and determining whether data in the entity pair represent the same entity or not by the trained matching model. In addition, in each iteration process, the server needs to update the predicted matching degree of the target entity pair, namely, determine the labeling matching degree of the target entity pair, so as to iteratively train the matching model based on the labeling matching degree and the predicted matching degree of the target entity pair, and improve the training accuracy of the matching model while saving data processing resources.
Example five
The embodiments of the present disclosure further provide a computer readable storage medium storing a computer program which, when executed by a processor, implements each process of the above data processing method embodiments and can achieve the same technical effects; to avoid repetition, details are not described here again. The computer readable storage medium may be, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the specification provides a computer readable storage medium, which is used for acquiring a first entity pair to be detected, generating a matching model to be trained based on a preset model search space, inputting the first entity pair into the matching model to obtain a predicted matching degree of the first entity pair, selecting a target entity pair from the first entity pair based on a preset matching degree threshold and the predicted matching degree of the first entity pair, acquiring a labeling matching degree of the target entity pair, performing iterative training on the matching model based on the labeling matching degree of the target entity pair and the predicted matching degree of the target entity pair to obtain a trained matching model, and determining whether data in the entity pair represent the same entity or not by the trained matching model. In addition, in each iteration process, the server needs to update the predicted matching degree of the target entity pair, namely, determine the labeling matching degree of the target entity pair, so as to iteratively train the matching model based on the labeling matching degree and the predicted matching degree of the target entity pair, and improve the training accuracy of the matching model while saving data processing resources.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor or a switch) or an improvement in software (an improvement to a method flow). With the development of technology, however, many improvements to method flows can now be regarded as direct improvements to hardware circuit structures: designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit, so it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD) (such as a field programmable gate array (FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually manufacturing integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must be written in a specific programming language called a hardware description language (HDL). There is not just one HDL but many kinds, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can easily be obtained merely by slightly programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer readable program code, it is entirely possible to implement the same functionality by logically programming the method steps such that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or even the means for performing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function. Of course, when implementing one or more embodiments of the present description, the functionality of the units may be implemented in one or more pieces of software and/or hardware.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present description are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
One or more embodiments of the present specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the present description may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, reference may be made to the corresponding parts of the description of the method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.

Claims (15)

1. A data processing method, comprising:
acquiring a first entity pair to be detected;
generating a matching model to be trained based on a preset model search space, and inputting the first entity pair into the matching model to obtain the prediction matching degree of the first entity pair;
selecting a target entity pair from the first entity pair based on a preset matching degree threshold and the predicted matching degree of the first entity pair, and acquiring the labeling matching degree of the target entity pair;
and carrying out iterative training on the matching model based on the labeling matching degree of the target entity pair and the prediction matching degree of the target entity pair to obtain a trained matching model, wherein the trained matching model is used for determining whether the data in the entity pair represent the same entity.
2. The method of claim 1, wherein the iteratively training the matching model based on the labeling matching degree of the target entity pair and the predicted matching degree of the target entity pair to obtain a trained matching model comprises:
determining whether the matching model is converged based on the labeling matching degree of the target entity pair and the prediction matching degree of the target entity pair, adjusting model parameters of the matching model based on a preset parameter search space under the condition that the matching model is not converged, obtaining an updated matching model, and continuing to iteratively train the updated matching model based on the first entity pair to obtain the trained matching model.
3. The method of claim 2, wherein the acquiring a first entity pair to be detected comprises:
acquiring a first data set and a second data set to be matched;
dividing the first data set into a plurality of first sub-data and dividing the second data set into a plurality of second sub-data based on a preset data dividing algorithm;
and carrying out sample alignment processing on the first sub-data and the second sub-data to obtain a sample alignment result, and determining the first entity pair based on the sample alignment result, wherein the first entity pair comprises the first sub-data and the second sub-data with corresponding relations.
4. The method of claim 3, wherein the determining the first entity pair based on the sample alignment result comprises:
determining the first sub-data and the second sub-data having an alignment relationship based on the sample alignment result;
acquiring similarity between the first sub data and the second sub data with alignment relation;
and determining the first sub-data and the second sub-data with the corresponding relation based on the similarity, and determining the first sub-data and the second sub-data with the corresponding relation as the first entity pair.
5. The method of claim 4, wherein the selecting a target entity pair from the first entity pair based on a preset matching degree threshold and the predicted matching degree of the first entity pair comprises:
acquiring a difference value between the preset matching degree threshold value and the predicted matching degree of each first entity pair;
and determining the first entity pair corresponding to the difference value, of which the difference value is smaller than a preset difference value threshold, as the target entity pair.
6. The method of claim 5, wherein the inputting the first entity pair into the matching model to obtain the predicted matching degree of the first entity pair comprises:
and determining the characterization vector of the first entity pair based on a preset characterization extraction algorithm and a preset feature selection rule, and inputting the characterization vector of the first entity pair into the matching model to obtain the prediction matching degree of the first entity pair.
7. The method of claim 6, the method further comprising:
acquiring a target entity pair to be detected, and determining a characterization vector of the target entity pair based on the preset characterization extraction algorithm and the preset feature selection rule;
inputting the characterization vector of the target entity pair into the trained matching model to obtain the predicted matching degree of the target entity pair;
and determining, based on the predicted matching degree of the target entity pair, whether the data in the target entity pair represent the same entity.
8. A data processing apparatus comprising:
the first acquisition module is used for acquiring a first entity pair to be detected;
the model generation module is used for generating a matching model to be trained based on a preset model search space, and inputting the first entity pair into the matching model to obtain the prediction matching degree of the first entity pair;
the data selection module is used for selecting a target entity pair from the first entity pair based on a preset matching degree threshold and the predicted matching degree of the first entity pair, and acquiring the labeling matching degree of the target entity pair;
the model training module is used for carrying out iterative training on the matching model based on the labeling matching degree of the target entity pair and the prediction matching degree of the target entity pair to obtain a trained matching model, and the trained matching model is used for determining whether the data in the entity pair represent the same entity.
9. The apparatus of claim 8, wherein the model training module 604 is configured to:
determining whether the matching model is converged based on the labeling matching degree of the target entity pair and the prediction matching degree of the target entity pair, adjusting model parameters of the matching model based on a preset parameter search space under the condition that the matching model is not converged, obtaining an updated matching model, and continuing to iteratively train the updated matching model based on the first entity pair to obtain the trained matching model.
10. The apparatus of claim 9, wherein the first acquisition module 601 is configured to:
acquiring a first data set and a second data set to be matched;
dividing the first data set into a plurality of first sub-data and dividing the second data set into a plurality of second sub-data based on a preset data dividing algorithm;
and carrying out sample alignment processing on the first sub-data and the second sub-data to obtain a sample alignment result, and determining the first entity pair based on the sample alignment result, wherein the first entity pair comprises the first sub-data and the second sub-data with corresponding relations.
11. The apparatus of claim 10, wherein the first acquisition module 601 is configured to:
determining the first sub-data and the second sub-data having an alignment relationship based on the sample alignment result;
acquiring similarity between the first sub data and the second sub data with alignment relation;
and determining the first sub-data and the second sub-data with the corresponding relation based on the similarity, and determining the first sub-data and the second sub-data with the corresponding relation as the first entity pair.
12. The apparatus of claim 11, wherein the data selection module is configured to:
acquiring a difference value between the preset matching degree threshold value and the predicted matching degree of each first entity pair;
and determining the first entity pair corresponding to the difference value, of which the difference value is smaller than a preset difference value threshold, as the target entity pair.
13. The apparatus of claim 12, wherein the model generation module 602 is configured to:
and determining the characterization vector of the first entity pair based on a preset characterization extraction algorithm and a preset feature selection rule, and inputting the characterization vector of the first entity pair into the matching model to obtain the prediction matching degree of the first entity pair.
14. The apparatus of claim 13, the apparatus further comprising:
the second acquisition module is used for acquiring a target entity pair to be detected and determining a characterization vector of the target entity pair based on the preset characterization extraction algorithm and the preset feature selection rule;
the matching degree determining module is used for inputting the characterization vector of the target entity pair into the trained matching model to obtain the prediction matching degree of the target entity pair;
and the entity matching module is used for determining whether the data in the target entity pair represents the same entity or not based on the predicted matching degree of the target entity pair.
15. A data processing apparatus, the data processing apparatus comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring a first entity pair to be detected;
generating a matching model to be trained based on a preset model search space, and inputting the first entity pair into the matching model to obtain the prediction matching degree of the first entity pair;
selecting a target entity pair from the first entity pair based on a preset matching degree threshold and the predicted matching degree of the first entity pair, and acquiring the labeling matching degree of the target entity pair;
and carrying out iterative training on the matching model based on the labeling matching degree of the target entity pair and the prediction matching degree of the target entity pair to obtain a trained matching model, wherein the trained matching model is used for determining whether the data in the entity pair represent the same entity.
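For illustration, the pair-construction steps recited in claims 3 to 5 can be sketched in self-contained Python: each data set is partitioned into sub-data blocks, blocks with the same key are aligned, and aligned records with sufficient similarity are treated as having a correspondence. The blocking key (the first letter of the name) and the 0.4 similarity floor are assumptions chosen for the example, not values taken from the specification.

from collections import defaultdict
from difflib import SequenceMatcher

def partition(dataset):
    # Split a data set into sub-data blocks keyed by the first letter of the name.
    blocks = defaultdict(list)
    for record in dataset:
        blocks[record["name"][:1].lower()].append(record)
    return blocks

def build_first_entity_pairs(dataset_a, dataset_b, sim_floor=0.4):
    blocks_a, blocks_b = partition(dataset_a), partition(dataset_b)
    pairs = []
    # Sample alignment: only records falling into blocks with the same key are compared.
    for key in blocks_a.keys() & blocks_b.keys():
        for rec_a in blocks_a[key]:
            for rec_b in blocks_b[key]:
                sim = SequenceMatcher(None, rec_a["name"].lower(),
                                      rec_b["name"].lower()).ratio()
                if sim >= sim_floor:   # treat the two records as corresponding
                    pairs.append((rec_a, rec_b, sim))
    return pairs

ds_a = [{"name": "Acme Corp"}, {"name": "Beta LLC"}]
ds_b = [{"name": "ACME Corporation"}, {"name": "Gamma Inc"}]
for rec_a, rec_b, sim in build_first_entity_pairs(ds_a, ds_b):
    print(rec_a["name"], "<->", rec_b["name"], round(sim, 2))

Pairs produced this way would then be scored and filtered by a matching model such as the one sketched after the storage-medium embodiment above.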
CN202310264152.1A 2023-03-10 2023-03-10 Data processing method, device and equipment Pending CN116304738A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310264152.1A CN116304738A (en) 2023-03-10 2023-03-10 Data processing method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310264152.1A CN116304738A (en) 2023-03-10 2023-03-10 Data processing method, device and equipment

Publications (1)

Publication Number Publication Date
CN116304738A true CN116304738A (en) 2023-06-23

Family

ID=86802753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310264152.1A Pending CN116304738A (en) 2023-03-10 2023-03-10 Data processing method, device and equipment

Country Status (1)

Country Link
CN (1) CN116304738A (en)

Similar Documents

Publication Publication Date Title
TW201933232A (en) Shop information recommendation method, device and client
CN112200132B (en) Data processing method, device and equipment based on privacy protection
CN110428137B (en) Updating method and device of risk prevention and control strategy
CN110020427B (en) Policy determination method and device
CN115712866B (en) Data processing method, device and equipment
CN116049761A (en) Data processing method, device and equipment
CN114819614A (en) Data processing method, device, system and equipment
CN112199416A (en) Data rule generation method and device
CN108229564B (en) Data processing method, device and equipment
CN113992429B (en) Event processing method, device and equipment
CN108595395B (en) Nickname generation method, device and equipment
CN116304738A (en) Data processing method, device and equipment
CN115204395A (en) Data processing method, device and equipment
CN116308375A (en) Data processing method, device and equipment
CN111242195B (en) Model, insurance wind control model training method and device and electronic equipment
CN110321433B (en) Method and device for determining text category
CN113344197A (en) Training method of recognition model, service execution method and device
CN111539520A (en) Method and device for enhancing robustness of deep learning model
CN111598092A (en) Method for determining target area in image, method and device for identifying target
CN115423485B (en) Data processing method, device and equipment
CN115827935B (en) Data processing method, device and equipment
CN115510927B (en) Fault detection method, device and equipment
CN112115952B (en) Image classification method, device and medium based on full convolution neural network
CN115952859B (en) Data processing method, device and equipment
CN117252183B (en) Semantic-based multi-source table automatic matching method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination