CN116432106A - Data processing method, device, equipment and medium based on model self-distillation - Google Patents

Data processing method, device, equipment and medium based on model self-distillation

Info

Publication number
CN116432106A
CN116432106A (application CN202310479166.5A)
Authority
CN
China
Prior art keywords
sub
neural network
training
network
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310479166.5A
Other languages
Chinese (zh)
Inventor
高睿哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310479166.5A priority Critical patent/CN116432106A/en
Publication of CN116432106A publication Critical patent/CN116432106A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Resources & Organizations (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Educational Administration (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The specification discloses a data processing method, device, equipment and medium based on model self-distillation. The method comprises the following steps: inputting a training sample marked with a classification label into a target classification model to obtain a classification prediction result of the target classification model for the training sample, wherein the training sample is a relationship graph of a sample merchant group, the target classification model comprises a graph neural network composed of a plurality of sub-networks, and the sub-networks are used for encoding the training sample; determining a training gradient of the target classification model based on a loss function between the classification prediction result and the classification label and a preset constraint term, wherein the preset constraint term is used for constraining the non-smoothness of a later-encoded sub-network in the graph neural network to be greater than the non-smoothness of an earlier-encoded sub-network; and adjusting the graph neural network based on the training gradient to improve the coding capability of the graph neural network for the relationship graph.

Description

Data processing method, device, equipment and medium based on model self-distillation
Technical Field
This document belongs to the technical field of artificial intelligence, and particularly relates to a data processing method, device, equipment and medium based on model self-distillation.
Background
With the development of artificial intelligence technology, model-based risk control applications have become increasingly widespread. At present, most merchant-oriented risk identification schemes mainly predict risk from feature information of individual merchants. Although such prediction approaches can also incorporate some merchant-to-merchant relationship features, they essentially treat each merchant as an independent sample for stand-alone calculation, and lack further analysis of risk from the perspective of the relationship network and risk conduction.
A graph algorithm is clearly a more suitable model choice for capturing the relationship-network and risk-conduction characteristics among merchants. A graph algorithm can map the relationship data among merchants into a relationship graph, not only extracting the attribute information of each merchant as a subject but also mining the risk conductivity of the association relationships among merchants, and it can be combined with traditional approaches for complementary advantages. Therefore, how to achieve accurate feature coding of the relationship graph of a merchant group, for application to downstream risk identification tasks, is a technical problem that currently needs to be solved.
Disclosure of Invention
The embodiments of the specification provide a data processing method, device, equipment and medium based on model self-distillation, which can achieve accurate feature coding of the relationship graph of a merchant group for application to downstream risk identification tasks.
To achieve the above purpose, the embodiments of the present specification are implemented as follows:
In a first aspect, a data processing method based on model self-distillation is provided, including:
inputting a training sample marked with a classification label into a target classification model to obtain a classification prediction result of the target classification model for the training sample, wherein the training sample is a relationship graph of a sample merchant group, the target classification model comprises a graph neural network composed of a plurality of sub-networks, and the plurality of sub-networks are used for encoding the training sample;
determining a training gradient of the target classification model based on a loss function between the classification prediction result and the classification label and a preset constraint term, wherein the preset constraint term is used for constraining the non-smoothness of a later-encoded sub-network in the graph neural network to be greater than the non-smoothness of an earlier-encoded sub-network;
and adjusting the graph neural network based on the training gradient to improve the coding capability of the graph neural network for the relationship graph.
In a second aspect, a data processing apparatus based on model self-distillation is provided, comprising:
the prediction module, which inputs a training sample marked with a classification label into a target classification model to obtain a classification prediction result of the target classification model for the training sample, wherein the training sample is a relationship graph of a sample merchant group, the target classification model comprises a graph neural network composed of a plurality of sub-networks, and the plurality of sub-networks are used for encoding the training sample;
the training determining module, which determines a training gradient of the target classification model based on a loss function between the classification prediction result and the classification label and a preset constraint term, wherein the preset constraint term is used for constraining the non-smoothness of a later-encoded sub-network in the graph neural network to be greater than the non-smoothness of an earlier-encoded sub-network;
and the training execution module, which adjusts the graph neural network based on the training gradient to improve the coding capability of the graph neural network for the relationship graph.
In a third aspect, an electronic device is provided, comprising: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to:
inputting a training sample marked with a classification label into a target classification model to obtain a classification prediction result of the target classification model for the training sample, wherein the training sample is a relationship graph of a sample merchant group, the target classification model comprises a graph neural network composed of a plurality of sub-networks, and the plurality of sub-networks are used for encoding the training sample;
determining a training gradient of the target classification model based on a loss function between the classification prediction result and the classification label and a preset constraint term, wherein the preset constraint term is used for constraining the non-smoothness of a later-encoded sub-network in the graph neural network to be greater than the non-smoothness of an earlier-encoded sub-network;
and adjusting the graph neural network based on the training gradient to improve the coding capability of the graph neural network for the relationship graph.
In a fourth aspect, a computer-readable storage medium is presented for storing computer-executable instructions that, when executed by a processor, perform the following:
inputting a training sample marked with a classification label into a target classification model to obtain a classification prediction result of the target classification model for the training sample, wherein the training sample is a relationship graph of a sample merchant group, the target classification model comprises a graph neural network composed of a plurality of sub-networks, and the plurality of sub-networks are used for encoding the training sample;
determining a training gradient of the target classification model based on a loss function between the classification prediction result and the classification label and a preset constraint term, wherein the preset constraint term is used for constraining the non-smoothness of a later-encoded sub-network in the graph neural network to be greater than the non-smoothness of an earlier-encoded sub-network;
and adjusting the graph neural network based on the training gradient to improve the coding capability of the graph neural network for the relationship graph.
According to this scheme, the relationship graph of a sample merchant group is used as a training sample for supervised training of the target classification model. The target classification model comprises a graph neural network composed of a plurality of sub-networks, each of which is responsible for encoding the training sample. During supervised training, the training gradient is determined from the loss between the classification prediction result of the training sample and the classification label, combined with the constraint term that the non-smoothness of a later-encoded sub-network is greater than that of an earlier-encoded sub-network. After the graph neural network is adjusted according to this training gradient, it acquires a coding capability that retains the topological structure information in the relationship graph while avoiding the over-smoothing problem. The adjusted graph neural network can then encode the relationship graph of a merchant group in practical applications, so that risk identification is performed on the encoded features, achieving risk determination from the perspective of the relationship network and risk conduction.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate exemplary embodiments of the present specification and, together with the description, serve to explain the specification without unduly limiting it. In the drawings:
FIG. 1 is a schematic flow chart of a data processing method based on model self-distillation according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a first application of the model self-distillation-based data processing method according to the embodiment of the present specification.
Fig. 3 is a schematic diagram of a second application of the model self-distillation based data processing method according to the embodiment of the present disclosure.
Fig. 4 is a schematic diagram of a third application of the model self-distillation-based data processing method according to the embodiment of the present disclosure.
Fig. 5 is a schematic structural diagram of a data processing apparatus based on model self-distillation according to an embodiment of the present disclosure.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
For the purposes, technical solutions and advantages of this document to be clearer, the technical solutions of this specification will be clearly and completely described below with reference to specific embodiments of this specification and the corresponding drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of this document. All other embodiments obtained by one of ordinary skill in the art from the embodiments herein without creative effort shall fall within the protection scope of this document.
As previously mentioned, a graph algorithm is clearly a more suitable model choice for merchant risk identification scenarios. A graph algorithm can map the relationship data among merchants into a relationship graph, not only extracting the attribute information of each merchant as a subject but also mining the risk conductivity of the association relationships among merchants, and it can be combined with traditional approaches for complementary advantages.
Therefore, this document aims to provide a technical scheme that can accurately encode the relationship graph of a merchant group and execute a risk identification task based on the encoded features.
In one aspect, an embodiment of the present specification provides a data processing method based on model self-distillation. Fig. 1 is a schematic flow chart of the method, which specifically includes the following steps:
s102, inputting a training sample marked with a classification label into a target classification model to obtain a classification prediction result of the target classification model for the training sample, wherein the training sample is a relationship map of a sample commercial tenant group, and the target classification model comprises a graphic neural network composed of a plurality of sub-networks, and the sub-networks are used for coding the training sample.
In this embodiment, the information of the relationship graph follows the basic "entity-relation-entity" triplet structure. An entity is a node in the relationship graph and represents a merchant; the node's presentation information may contain the attribute information of the merchant as a subject. A relation is an edge in the relationship graph and represents an association relationship between merchants; the edge's presentation information may include a transaction relationship, a transaction-execution-medium relationship, a social relationship, and the like.
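As a non-limiting illustration, the following Python sketch (the merchant names, relation types, and feature dimension are all hypothetical) shows how such triplets could be assembled into an adjacency matrix and initial node features:

```python
import torch

# Hypothetical "entity-relation-entity" triplets for a small merchant group.
triplets = [
    ("merchant_A", "transaction",   "merchant_B"),
    ("merchant_B", "shared_device", "merchant_C"),  # a transaction-execution-medium relation
    ("merchant_A", "social",        "merchant_C"),
]

nodes = sorted({h for h, _, t in triplets} | {t for h, _, t in triplets})
idx = {name: i for i, name in enumerate(nodes)}

A = torch.zeros(len(nodes), len(nodes))             # adjacency matrix of the relationship graph
for head, _, tail in triplets:
    A[idx[head], idx[tail]] = 1.0
    A[idx[tail], idx[head]] = 1.0                   # relations treated as undirected here

X0 = torch.randn(len(nodes), 16)                    # placeholder merchant attribute features
```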
In the embodiment, the target classification model is subjected to supervised training through the training samples marked with the classification labels.
The target classification model consists of a graph neural network (Graph Neural Network, GNN) and a classifier. The graph neural network is responsible for encoding the relationship graph of the training sample, the encoding result being the feature data extracted from the relationship graph; the classifier is responsible for performing classification calculation on the encoding result of the graph neural network to obtain the classification prediction result for the training sample. For supervised training, the classification label marked on a training sample indicates the ground-truth classification of that sample. The principle of supervised training is to calculate the loss between the classification prediction result and the classification label, and to adjust the graph neural network and/or the classifier along a training gradient that reduces this loss, so that the classification prediction result approaches the ground-truth result marked by the classification label.
Here, the present embodiment trains the graph neural network in a self-distillation manner. Self-distillation is a category of knowledge distillation. Traditional knowledge distillation provides a trained teacher model and an untrained student model that is simplified relative to the teacher model; the knowledge of the teacher model is migrated into the student model at the expense of a slight performance penalty, thereby completing the training of the student model. Self-distillation, by contrast, migrates knowledge of the deep network structure to the shallow network structure within the same model.
Based on the principle of self-distillation, the present embodiment sets the graph neural network as a structure composed of a plurality of sub-networks, each of which encodes the training sample. The present embodiment further subdivides the plurality of sub-networks into shallow sub-networks and deep sub-networks. A shallow sub-network, as the name implies, encodes the training sample before a deep sub-network does, and a deep sub-network encodes the encoding result of a shallow sub-network. In subsequent steps, the present embodiment migrates the knowledge of one part of the sub-networks of the graph neural network to another part, as illustrated by the sketch below.
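As an illustrative sketch only (the layer type and dimensions are assumptions, not the specification's prescribed architecture), a graph neural network composed of serial sub-networks, each re-encoding the previous encoding result, might look like this:

```python
import torch

class GCNSubNetwork(torch.nn.Module):
    """One sub-network: a single mean-aggregation graph convolution layer."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.lin = torch.nn.Linear(dim_in, dim_out)

    def forward(self, X, A):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1.0)  # node degrees D
        H = (A @ X) / deg                                # mean-aggregate neighbors: D^-1 A X
        return torch.relu(self.lin(H))                   # encode the aggregated expression

class GraphEncoder(torch.nn.Module):
    """L serial sub-networks; shallow ones encode before deep ones."""
    def __init__(self, dims):                            # e.g. dims = [16, 32, 32, 32, 32] gives L = 4
        super().__init__()
        self.subnets = torch.nn.ModuleList(
            GCNSubNetwork(d_in, d_out) for d_in, d_out in zip(dims, dims[1:]))

    def forward(self, X, A):
        encodings = []                                   # X^(1), ..., X^(L), one per sub-network
        for subnet in self.subnets:
            X = subnet(X, A)
            encodings.append(X)
        return encodings
```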
S104, determining a training gradient of the target classification model based on a loss function between the classification prediction result and the classification label and a preset constraint term, wherein the preset constraint term is used for constraining the non-smoothness of a later-encoded sub-network in the graph neural network to be greater than the non-smoothness of an earlier-encoded sub-network.
It will be appreciated that self-distillation is subject to an over-smoothing problem. Over-smoothing refers to the situation in which, as the sub-networks deepen during encoding by the graph neural network, the learned node representations become more and more similar, which greatly degrades the model effect; this is the over-smoothing problem of graph neural networks.
The present embodiment characterizes the over-smoothing problem of a sub-network by its non-smoothness. That is, the greater the non-smoothness of a sub-network, the smaller its over-smoothing problem; conversely, the smaller the non-smoothness of a sub-network, the greater its over-smoothing problem. For this reason, a regularized preset constraint term needs to be introduced during supervised training to ensure that each sub-network maintains a certain degree of non-smoothness. That is, in determining the training gradient, the constraint of the preset constraint term must be considered in addition to the loss between the classification prediction result and the classification label.
In particular, practice shows that in the process of encoding the relationship graph, the deeper the sub-network, the less likely its encoding result is to retain topology information. For this reason, unlike the conventional self-distillation approach, the present embodiment reverses the direction and migrates the knowledge of the shallow sub-networks into the deep sub-networks. Correspondingly, the preset constraint term in this embodiment is specifically used to constrain the non-smoothness of a later-encoded sub-network in the graph neural network to be greater than that of an earlier-encoded sub-network, which conforms to the direction of migrating shallow sub-network knowledge to the deep sub-networks.
As an exemplary introduction, the preset constraint term in this embodiment calculates the non-smoothness differences between all adjacent sub-networks in the graph neural network that conform to an adaptive discrepancy retaining policy, and sums the calculation results. The adaptive discrepancy retaining policy covers the case in which the non-smoothness of the later-encoded sub-network is smaller than that of the corresponding earlier-encoded sub-network; that is, only adjacent sub-network pairs whose non-smoothness exhibits a sliding-down trend are constrained. Correspondingly, during supervised training, the training gradient of the target classification model can be determined by performing a minimized solution on the objective function.
S106, adjusting the graph neural network based on the training gradient to improve the coding capability of the graph neural network for the relationship graph.
As previously described, after the training gradient is determined, the graph neural network can be adjusted according to it. It should be understood that, by iterating this adjustment over multiple rounds, the classification prediction result of the training sample can approach the ground-truth result of the classification label, while the non-smoothness of a later-encoded sub-network in the graph neural network is kept from falling below that of an earlier-encoded sub-network, thereby avoiding the over-smoothing problem.
It should be understood that the adjusted graph neural network's encoding result for the relationship graph of a merchant population better preserves the association relationships between merchants; that is, when this encoding result is used for risk prediction of a subject merchant, risk determination can be performed from the perspective of the relationship network and risk conduction.
For example, if risk identification needs to be performed for a target merchant, the present embodiment prepares a relationship graph of a merchant population containing the target merchant, inputs the relationship graph into the graph neural network for encoding to obtain the corresponding target encoding result, and then performs risk identification for the target merchant based on the target encoding result. The identified risks may include, but are not limited to, interconnected-merchant risk, industry merchant risk, violating-transaction risk, and the like. Taking violating-transaction risk as an example, the present embodiment only needs to further feed the target encoding result output by the graph neural network into a classifier for violating-transaction risk to determine whether the target merchant carries that risk, as sketched below.
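Continuing the illustrative sketches above, a hypothetical usage might look as follows; the classifier head and its dimensions are assumptions, and the untrained encoder merely stands in for the adjusted graph neural network:

```python
encoder = GraphEncoder([16, 32, 32, 32, 32])        # stands in for the adjusted GNN
violation_classifier = torch.nn.Linear(32, 2)       # assumed violating-transaction-risk head

encodings = encoder(X0, A)                          # encode the merchant relationship graph
target_code = encodings[-1][idx["merchant_A"]]      # target merchant's coding result
risk_logits = violation_classifier(target_code)     # classify for violating-transaction risk
```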
In addition, if the structure of the graph neural network needs to be simplified, the present embodiment can also arrange the plurality of sub-networks as a serial structure; since the last sub-network has acquired the knowledge of the preceding sub-networks, the last sub-network can serve as an independent encoder for the relationship graph.
Based on the above, it can be seen that the method of the embodiment of the specification takes the relationship graph of a sample merchant group as a training sample for supervised training of the target classification model. The target classification model comprises a graph neural network composed of a plurality of sub-networks, each of which is responsible for encoding the training sample. During supervised training, the training gradient is determined from the loss between the classification prediction result of the training sample and the classification label, combined with the constraint term that the non-smoothness of a later-encoded sub-network is greater than that of an earlier-encoded sub-network. After the graph neural network is adjusted according to this training gradient, it acquires a coding capability that retains the topological structure information in the relationship graph while avoiding the over-smoothing problem. The adjusted graph neural network can then encode the relationship graph of a merchant group in practical applications, so that risk identification is performed on the encoded features, achieving risk determination from the perspective of the relationship network and risk conduction.
The following describes the method of this embodiment in detail.
As shown in fig. 1, the target classification model of the present embodiment is composed of a graph neural network and a classifier. The graph neural network includes L serial sub-networks, i.e. sub-networks 1, 2, 3, ..., L (fig. 1 illustrates L = 4).
Here, a neighborhood discrepancy rate (Neighborhood Discrepancy Rate, NDR) is employed to quantify the non-smoothness of the sub-network at each layer. That is, after the training sample $X^{(0)}$ of the relationship graph is input into the target classification model, the L layers of sub-networks encode it in sequence, and the corresponding encoding results are $X^{(1)}, X^{(2)}, X^{(3)}, \ldots, X^{(L)}$.
The neighborhood discrepancy rate reflects the degree of discrepancy between the encoding result of each node in the training sample and the encoding results of its neighboring nodes, and can be determined based on the data distance between the encoding result of each node and the encoding result of a virtual node representing that node's whole neighborhood. The encoding result of the whole-neighborhood virtual node is obtained by aggregating the encoding results of all neighboring nodes.
For ease of understanding, FIG. 3 illustrates the structure of a relationship graph in a training sample. Take node $v_3$ in FIG. 3 as the target node: its neighboring nodes $v_1, v_4, v_5$ are first acquired and aggregated to create a virtual node $v_*$ that represents the whole neighborhood of the target node. Define

$$\bar{X}_v^{(l)} = \mathrm{AGG}\big(\{X_u^{(l)} : u \in \mathcal{N}(v)\}\big)$$

as the aggregated expression that the l-th layer sub-network produces for the neighboring nodes of node v. With mean aggregation, the virtual-node expressions of all nodes can be written in matrix form as

$$\bar{X}^{(l)} = D^{-1} A X^{(l)}$$

where D is the node degree matrix, A is the adjacency matrix, $X^{(l)}$ is the node expression at layer l, and $D^{-1}$ can be understood as a normalization factor.

After the expression of the neighborhood virtual node is obtained, the cosine similarity between the central node (here $v_3$) and its virtual node is calculated and converted into a distance measure, which can also be regarded as a degree of discrepancy:

$$s_v^{(l)} = 1 - \cos\big(X_v^{(l)}, \bar{X}_v^{(l)}\big) = 1 - \frac{X_v^{(l)} \cdot \bar{X}_v^{(l)}}{\big\|X_v^{(l)}\big\|\,\big\|\bar{X}_v^{(l)}\big\|}$$

The larger this discrepancy, the stronger the expressive capacity of the node and the weaker the over-smoothing problem during training. The vector $s^{(l)}$, which collects the NDR of every node at layer l, represents the neighborhood discrepancy expression of the whole l-th layer, i.e. the non-smoothness description of the l-th layer of the GNN.
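As a minimal Python sketch of the NDR computation just described (mean aggregation via $D^{-1}AX^{(l)}$ is taken from the formulation above; the function name is an assumption):

```python
import torch
import torch.nn.functional as F

def ndr(X_l, A):
    """Per-node neighborhood discrepancy rate s^(l) for one layer's encoding X^(l)."""
    deg = A.sum(dim=1, keepdim=True).clamp(min=1.0)       # node degrees, avoiding division by zero
    X_bar = (A @ X_l) / deg                               # whole-neighborhood virtual nodes: D^-1 A X^(l)
    return 1.0 - F.cosine_similarity(X_l, X_bar, dim=1)  # cosine similarity turned into a distance
```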
It should be understood that the self-distillation idea of the present embodiment is embodied in the adaptive discrepancy retaining constraint term $L_{ADR}$, which is designed from the non-smoothness information that the neighborhood discrepancy rate NDR extracts in the shallow layers of the GNN.
The purpose of the constraint $L_{ADR}$ is to migrate non-smoothness knowledge from the shallow network to the deep network, for which the following is considered:
First, noise in the original data may make the NDR calculated for the first few layers of the network inaccurate, so the position where supervision starts needs to be determined adaptively from the calculated NDR; that is, it must be determined from which layer onward the calculated NDR becomes accurate enough to be extracted as teacher information. Since the non-smoothness in a GNN decreases as the number of sub-network layers increases, the layer with the largest NDR can be taken as the layer from which the initial supervision information starts, and this layer is calculated as follows:
$$l^* = \arg\max_k \big\{\, \big\|s^{(k)}\big\| \;\big|\; k \in \{1, \ldots, L-1\} \,\big\}$$
Furthermore, the distillation transfer of knowledge is progressive, and the design of the ADR constraint requires comparing the NDR of a deep layer against that of the layer preceding it. The choice of teacher knowledge therefore needs to be adaptive: a constraint is generated only when the NDR of the current layer is greater than that of the deeper layer; when the NDR of the later layer is greater than that of the earlier layer, the differential expression of the nodes has already improved and no constraint is needed. Meanwhile, to account for the variation in node degrees, the normalized node degree is added to weight the matching loss. Based on this, the ADR constraint is calculated as follows:
$$L_{ADR} = \sum_{l=l^*}^{L-1} \mathbb{I}\big(\big\|s^{(l)}\big\| > \big\|s^{(l+1)}\big\|\big) \cdot d^2\big(\mathrm{SG}\big(s^{(l)}\big),\, s^{(l+1)}\big)$$
where $\mathbb{I}(\cdot)$ is an indicator function, $d^2(\cdot,\cdot)$ is a mean-square-error function weighted by node degree that computes the knowledge-matching loss between different network layers, and $\mathrm{SG}(\cdot)$ denotes the stop-gradient operation, i.e. the gradient of that tensor does not need to be calculated during back propagation.
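Following the description above, a sketch of the ADR constraint is given below; the per-node reduction and the exact degree weighting are assumptions, and SG is realized with .detach():

```python
def adr_constraint(encodings, A):
    """Sum degree-weighted matching losses over adjacent layers whose NDR slides downward."""
    s = [ndr(X_l, A) for X_l in encodings]                   # s^(1) ... s^(L), one vector per layer
    norms = [float(v.norm()) for v in s]
    l_star = max(range(len(s) - 1), key=lambda k: norms[k])  # initial supervision layer l*
    deg = A.sum(dim=1)
    w = deg / deg.sum()                                      # normalized node degrees as weights
    loss = A.new_zeros(())
    for l in range(l_star, len(s) - 1):
        if norms[l] > norms[l + 1]:                          # indicator: only a sliding-down pair
            loss = loss + (w * (s[l].detach() - s[l + 1]) ** 2).sum()  # d^2 with stop-gradient
    return loss
```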
Assuming that the initial supervision position is the layer-2 sub-network, referring to fig. 4, the present embodiment first determines whether a layer number in the graph neural network is greater than 2. If so, the NDRs of the two adjacent layers of sub-networks are compared, i.e. whether $\|s^{(l)}\| > \|s^{(l+1)}\|$. If the NDR of the shallow sub-network is greater than the NDR of the deep sub-network, the constraint condition is satisfied and $L_{ADR}$ is added to the training optimization; conversely, if the shallow NDR is less than the deep NDR, $L_{ADR}$ is not added.
The training optimization target of the overall model is the cross-entropy loss function $L_{ce} = CE\big(g\big(X^{(L)}\big), y\big)$ plus the $L_{ADR}$ constraint:

$$L_{Total} = CE\big(g\big(X^{(L)}\big),\, y\big) + \alpha L_{ADR}$$
In supervised training, performing a minimized solution on $L_{Total}$ optimizes the graph neural network and achieves a better-performing coding capability; a sketch of one such training step follows.
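Putting the pieces together, one supervised training step might be sketched as follows; the classifier g, the placeholder labels y, and the constraint weight alpha are illustrative assumptions:

```python
g = torch.nn.Linear(32, 2)                           # classifier head on the final encoding X^(L)
opt = torch.optim.Adam(list(encoder.parameters()) + list(g.parameters()), lr=1e-3)
y = torch.randint(0, 2, (A.size(0),))                # placeholder classification labels
alpha = 0.1                                          # assumed constraint weight

encodings = encoder(X0, A)                           # X^(1) ... X^(L) from the sub-networks
L_ce = F.cross_entropy(g(encodings[-1]), y)          # loss between prediction and label
L_total = L_ce + alpha * adr_constraint(encodings, A)
opt.zero_grad()
L_total.backward()                                   # the training gradient from L_Total
opt.step()                                           # adjust the graph neural network
```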
It can be seen that the method of this embodiment is a merchant risk characterization scheme based on a self-distilled graph neural network: it does not depend on an independent teacher model, and it constrains the learning of the deep network by extracting non-smoothness information from the shallow network as a supervision signal. Compared with the traditional knowledge distillation scheme, the self-distillation approach of this embodiment does not depend on a pre-trained teacher model, so the training cost is low and the learning efficiency can be improved; meanwhile, adding the adaptive discrepancy retaining constraint can alleviate the over-smoothing problem during deep-GNN training. Compared with the traditional deep-to-shallow self-distillation of CNNs, the method of this embodiment constrains the learning of the deep network with the knowledge of the shallow network, and can better retain the topological structure information in the relationship graph.
Corresponding to the method shown in fig. 1, one embodiment of the present specification provides a data processing apparatus based on model self-distillation. FIG. 5 is a schematic structural diagram of the data processing apparatus, which includes:
the prediction module 510 inputs a training sample marked with a classification label into a target classification model to obtain a classification prediction result of the target classification model for the training sample, wherein the training sample is a relationship map of a sample merchant group, the target classification model comprises a graph neural network composed of a plurality of sub-networks, and the plurality of sub-networks are used for encoding the training sample;
the training determining module 520 determines a training gradient of the target classification model based on a loss function between a classification prediction result and the classification label and a preset constraint term, wherein the preset constraint term is used for constraining non-smoothness of a sub-network encoded later in the graph neural network to be greater than non-smoothness of a sub-network encoded earlier;
the training execution module 530 adjusts the graph neural network based on the training gradient to promote the coding capability of the graph neural network on the relationship map.
The apparatus of the embodiment of the specification takes the relationship graph of a sample merchant group as a training sample for supervised training of the target classification model. The target classification model comprises a graph neural network composed of a plurality of sub-networks, each of which is responsible for encoding the training sample. During supervised training, the training gradient is determined from the loss between the classification prediction result of the training sample and the classification label, combined with the constraint term that the non-smoothness of a later-encoded sub-network is greater than that of an earlier-encoded sub-network. After the graph neural network is adjusted according to this training gradient, it acquires a coding capability that retains the topological structure information in the relationship graph while avoiding the over-smoothing problem. The adjusted graph neural network can then encode the relationship graph of a merchant group in practical applications, so that risk identification is performed on the encoded features, achieving risk determination from the perspective of the relationship network and risk conduction.
Optionally, the nodes in the relationship graph of the sample merchant group represent individuals, and the edges represent the association relationships between individuals. The non-smoothness of any target sub-network in the graph neural network is determined based on a neighborhood discrepancy rate of the target sub-network's encoding result for the training sample, where the neighborhood discrepancy rate reflects the degree of discrepancy between the encoding result of each node in the training sample and the encoding results of the corresponding neighboring nodes.
Optionally, the neighborhood discrepancy rate is determined based on the data distance between the encoding result of each node in the training sample and the encoding result of the corresponding whole-neighborhood virtual node, where the encoding result of the whole-neighborhood virtual node is obtained by aggregating the encoding results of all neighboring nodes.
Optionally, the preset constraint term starts from the sub-network with the largest neighborhood discrepancy rate in the graph neural network and constrains the non-smoothness of a later-encoded sub-network to be greater than that of an earlier-encoded sub-network.
Optionally, the training determining module 520 determines the training gradient of the target classification model by performing a minimized solution on an objective function, where the preset constraint term calculates the non-smoothness differences between all adjacent sub-networks in the graph neural network that conform to an adaptive discrepancy retaining policy and sums the calculation results, the adaptive discrepancy retaining policy covering the case in which the non-smoothness of the later-encoded sub-network is smaller than that of the corresponding earlier-encoded sub-network.
Optionally, the association relationship includes at least one of a transaction relationship between merchants, a transaction-execution-medium relationship, and a social relationship.
Optionally, the apparatus of this embodiment further includes:
the risk identification module is used for encoding a relationship graph of a commercial tenant group containing target commercial tenants based on the graph neural network after the graph neural network is adjusted based on the training gradient, so as to obtain a target encoding result; and carrying out risk identification on the target merchant based on the target coding result.
Optionally, the plurality of sub-networks form a serial structure in the graph neural network; the risk identification module specifically encodes a relationship graph of a merchant group containing a target merchant based on the last sub-network in the graph neural network to obtain a target encoding result, and performs risk identification for the target merchant based on the target encoding result.
It should be understood that the data processing apparatus of the embodiments of the present disclosure may serve as the execution body of the method shown in fig. 1 and can therefore implement the corresponding steps and functions of that method, which are not described again here.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. Referring to fig. 6, at the hardware level, the electronic device includes a processor and, optionally, an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a Random-Access Memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include the hardware required for other services.
The processor, the network interface, and the memory may be interconnected by an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, or an EISA (Extended Industry Standard Architecture) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bi-directional arrow is shown in fig. 6, but this does not mean that there is only one bus or one type of bus.
The memory is used for storing programs. Specifically, a program may include program code comprising computer operation instructions. The memory may include internal memory and non-volatile storage, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into the internal memory and runs it, forming the data processing apparatus based on model self-distillation at the logic level. The processor executes the programs stored in the memory and is specifically configured to perform the following operations:
inputting a training sample marked with a classification label into a target classification model to obtain a classification prediction result of the target classification model for the training sample, wherein the training sample is a relationship graph of a sample merchant group, the target classification model comprises a graph neural network composed of a plurality of sub-networks, and the plurality of sub-networks are used for encoding the training sample;
determining a training gradient of the target classification model based on a loss function between the classification prediction result and the classification label and a preset constraint term, wherein the preset constraint term is used for constraining the non-smoothness of a later-encoded sub-network in the graph neural network to be greater than the non-smoothness of an earlier-encoded sub-network;
and adjusting the graph neural network based on the training gradient to improve the coding capability of the graph neural network for the relationship graph.
The method suggested by the embodiment shown in fig. 1 of the present specification can be applied to a processor or implemented by the processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. The various methods, steps, and logic blocks disclosed in one or more embodiments of the present description may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with one or more embodiments of the present disclosure may be embodied directly in a hardware decoding processor or in a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
The electronic device may also perform the method described in fig. 1, which is not described in detail herein.
Of course, in addition to the software implementation, the electronic device in this specification does not exclude other implementations, such as a logic device or a combination of software and hardware, that is, the execution subject of the following process is not limited to each logic unit, but may also be hardware or a logic device.
The present specification embodiment also proposes a computer-readable storage medium storing one or more programs.
Wherein the one or more programs include instructions, which when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the method of the embodiment of fig. 1, and in particular to:
inputting a training sample marked with a classification label into a target classification model to obtain a classification prediction result of the target classification model for the training sample, wherein the training sample is a relationship graph of a sample merchant group, the target classification model comprises a graph neural network composed of a plurality of sub-networks, and the plurality of sub-networks are used for encoding the training sample;
determining a training gradient of the target classification model based on a loss function between the classification prediction result and the classification label and a preset constraint term, wherein the preset constraint term is used for constraining the non-smoothness of a later-encoded sub-network in the graph neural network to be greater than the non-smoothness of an earlier-encoded sub-network;
and adjusting the graph neural network based on the training gradient to improve the coding capability of the graph neural network for the relationship graph.
In summary, the foregoing description is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, or the like, which is within the spirit and principles of one or more embodiments of the present disclosure, is intended to be included within the scope of one or more embodiments of the present disclosure.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include both persistent and non-persistent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.

Claims (11)

1. A data processing method based on model self-distillation, comprising:
inputting a training sample marked with a classification label into a target classification model to obtain a classification prediction result of the target classification model for the training sample, wherein the training sample is a relationship graph of a sample merchant group, the target classification model comprises a graph neural network composed of a plurality of sub-networks, and the plurality of sub-networks are used for encoding the training sample;
determining a training gradient of the target classification model based on a loss function between the classification prediction result and the classification label and a preset constraint term, wherein the preset constraint term is used for constraining the non-smoothness of a later-encoded sub-network in the graph neural network to be greater than the non-smoothness of an earlier-encoded sub-network;
and adjusting the graph neural network based on the training gradient to improve the coding capability of the graph neural network for the relationship graph.
2. The method according to claim 1,
nodes in the relationship graph of the sample merchant group represent individuals, edges in the relationship graph of the sample merchant group represent association relationships between individuals, and the non-smoothness of any target sub-network in the graph neural network is determined based on a neighborhood discrepancy rate of the encoding result of the target sub-network for the training sample, the neighborhood discrepancy rate reflecting the degree of discrepancy between the encoding result of each node in the training sample and the encoding results of the corresponding neighboring nodes.
3. The method according to claim 2,
the neighborhood discrepancy rate is determined based on the data distance between the encoding result of each node in the training sample and the encoding result of the corresponding whole-neighborhood virtual node, wherein the encoding result of the whole-neighborhood virtual node is obtained by aggregating the encoding results of all neighboring nodes.
4. The method according to claim 2,
the preset constraint term is used for, starting from the sub-network with the largest neighborhood discrepancy rate in the graph neural network, constraining the non-smoothness of a later-encoded sub-network to be greater than that of an earlier-encoded sub-network.
5. The method according to claim 1,
the determining a training gradient of the target classification model based on an objective function comprises:
performing a minimized solution on the objective function to determine the training gradient of the target classification model, wherein the preset constraint term is used for calculating the non-smoothness differences between all adjacent sub-networks in the graph neural network that conform to an adaptive discrepancy retaining policy and summing the calculation results, and the adaptive discrepancy retaining policy comprises that the non-smoothness of the later-encoded sub-network is smaller than that of the corresponding earlier-encoded sub-network.
6. The method according to any one of claims 2 to 4,
the association relationship includes at least one of a transaction relationship between merchants, a transaction-execution-medium relationship, and a social relationship.
7. The method according to any one of claims 1 to 5,
after the adjustment of the graph neural network based on the training gradient, the method further comprises:
encoding a relationship graph of a merchant group containing a target merchant based on the graph neural network to obtain a target encoding result; and performing risk identification for the target merchant based on the target encoding result.
8. The method according to any one of claims 1 to 5,
the plurality of sub-networks are in a serial structure in the graph neural network;
after the adjustment of the graph neural network based on the training gradient, the method further comprises:
encoding a relationship graph of a merchant group containing a target merchant based on the last sub-network in the graph neural network to obtain a target encoding result; and performing risk identification for the target merchant based on the target encoding result.
9. A data processing apparatus based on model self-distillation, comprising:
the prediction module, which inputs a training sample marked with a classification label into a target classification model to obtain a classification prediction result of the target classification model for the training sample, wherein the training sample is a relationship graph of a sample merchant group, the target classification model comprises a graph neural network composed of a plurality of sub-networks, and the plurality of sub-networks are used for encoding the training sample;
the training determining module, which determines a training gradient of the target classification model based on a loss function between the classification prediction result and the classification label and a preset constraint term, wherein the preset constraint term is used for constraining the non-smoothness of a later-encoded sub-network in the graph neural network to be greater than the non-smoothness of an earlier-encoded sub-network;
and the training execution module, which adjusts the graph neural network based on the training gradient to improve the coding capability of the graph neural network for the relationship graph.
10. An electronic device, comprising: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to:
inputting a training sample marked with a classification label into a target classification model to obtain a classification prediction result of the target classification model for the training sample, wherein the training sample is a relationship graph of a sample merchant group, the target classification model comprises a graph neural network composed of a plurality of sub-networks, and the plurality of sub-networks are used for encoding the training sample;
determining a training gradient of the target classification model based on a loss function between the classification prediction result and the classification label and a preset constraint term, wherein the preset constraint term is used for constraining the non-smoothness of a later-encoded sub-network in the graph neural network to be greater than the non-smoothness of an earlier-encoded sub-network;
and adjusting the graph neural network based on the training gradient to improve the coding capability of the graph neural network for the relationship graph.
11. A computer-readable storage medium for storing computer-executable instructions that when executed by a processor perform the operations of:
inputting a training sample marked with a classification label into a target classification model to obtain a classification prediction result of the target classification model for the training sample, wherein the training sample is a relationship graph of a sample merchant group, the target classification model comprises a graph neural network composed of a plurality of sub-networks, and the plurality of sub-networks are used for encoding the training sample;
determining a training gradient of the target classification model based on a loss function between the classification prediction result and the classification label and a preset constraint term, wherein the preset constraint term is used for constraining the non-smoothness of a later-encoded sub-network in the graph neural network to be greater than the non-smoothness of an earlier-encoded sub-network;
and adjusting the graph neural network based on the training gradient to improve the coding capability of the graph neural network for the relationship graph.
CN202310479166.5A 2023-04-26 2023-04-26 Data processing method, device, equipment and medium based on model self-distillation Pending CN116432106A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310479166.5A CN116432106A (en) 2023-04-26 2023-04-26 Data processing method, device, equipment and medium based on model self-distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310479166.5A CN116432106A (en) 2023-04-26 2023-04-26 Data processing method, device, equipment and medium based on model self-distillation

Publications (1)

Publication Number Publication Date
CN116432106A 2023-07-14

Family

ID=87087259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310479166.5A Pending CN116432106A (en) 2023-04-26 2023-04-26 Data processing method, device, equipment and medium based on model self-distillation

Country Status (1)

Country Link
CN (1) CN116432106A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination