CN116861345A

CN116861345A - Data processing method, device and equipment

Info

Publication number: CN116861345A
Application number: CN202310503770.7A
Authority: CN
Inventors: 林鑫
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2023-05-04
Filing date: 2023-05-04
Publication date: 2023-10-10

Abstract

The embodiment of the specification provides a data processing method, a device and equipment, wherein the method comprises the following steps: acquiring a first text data sample for training a first model, acquiring a type label corresponding to the first text data sample, constructing target graph structure data corresponding to the first text data sample based on feature data contained in the first text data sample and correlation among the feature data, performing node screening processing on a plurality of nodes contained in a connected subgraph contained in the target graph structure data to obtain screened target graph structure data, inputting the screened target graph structure data into the first model to obtain a prediction label corresponding to the first text data sample, and performing iterative training on the first model based on the prediction label corresponding to the first text data sample and the type label corresponding to the first text data sample until the first model converges to obtain a trained first model.

Description

Data processing method, device and equipment

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, and device.

Background

With the rapid development of computer technology, the types and number of application services that enterprises provide to users keep increasing. Accordingly, the volume of user data keeps growing and its structure becomes more complex, which makes anomaly detection on data to be detected, such as user data or service data, increasingly complex.

When anomaly detection is performed, the data to be detected often contains a large amount of noise feature data and redundant feature data, which degrades both the detection efficiency and the detection accuracy of anomaly detection. A solution capable of improving the detection efficiency and detection accuracy of anomaly detection on data is therefore required.

Disclosure of Invention

An object of the embodiments of the present disclosure is to provide a data processing method, apparatus, and device, so as to provide a solution capable of improving detection efficiency and detection accuracy of anomaly detection on data.

In order to achieve the above technical solution, the embodiments of the present specification are implemented as follows:

In a first aspect, embodiments of the present disclosure provide a data processing method, including: acquiring a first text data sample for training a first model and a type label corresponding to the first text data sample, wherein the first model is constructed based on a preset machine learning algorithm and is used for performing feature screening processing on feature data contained in a text data sample based on model parameters corresponding to the feature data contained in the text data sample; constructing target graph structure data corresponding to the first text data sample based on feature data contained in the first text data sample and correlation among the feature data; acquiring a connected subgraph contained in the target graph structure data, and performing node screening processing on a plurality of nodes contained in the connected subgraph to obtain screened target graph structure data; and inputting the screened target graph structure data into the first model to obtain a prediction label corresponding to the first text data sample, and performing iterative training on the first model based on the prediction label corresponding to the first text data sample and the type label corresponding to the first text data sample until the first model converges to obtain a trained first model.

In a second aspect, embodiments of the present disclosure provide a data processing apparatus, the apparatus comprising: a first acquisition module, used for acquiring a first text data sample for training a first model and a type label corresponding to the first text data sample, wherein the first model is constructed based on a preset machine learning algorithm and is used for performing feature screening processing on feature data contained in a text data sample based on model parameters corresponding to the feature data contained in the text data sample; a graph data construction module, used for constructing target graph structure data corresponding to the first text data sample based on the feature data contained in the first text data sample and the correlation among the feature data; a first screening module, used for acquiring a connected subgraph contained in the target graph structure data, and performing node screening processing on a plurality of nodes contained in the connected subgraph to obtain screened target graph structure data; and a first training module, used for inputting the screened target graph structure data into the first model to obtain a prediction label corresponding to the first text data sample, and performing iterative training on the first model based on the prediction label corresponding to the first text data sample and the type label corresponding to the first text data sample until the first model converges to obtain a trained first model.

In a third aspect, embodiments of the present specification provide a data processing apparatus, the data processing apparatus comprising: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to: acquiring a first text data sample for training a first model and a type label corresponding to the first text data sample, wherein the first model is constructed based on a preset machine learning algorithm and is used for carrying out feature screening processing on feature data contained in the text data sample based on model parameters corresponding to the feature data contained in the text data sample; constructing target graph structure data corresponding to the first text data sample based on feature data contained in the first text data sample and correlation among the feature data; acquiring a connected subgraph contained in the target graph structure data, and performing node screening treatment on a plurality of nodes contained in the connected subgraph to obtain screened target graph structure data; inputting the screened target graph structure data into the first model to obtain a prediction label corresponding to the first text data sample, and performing iterative training on the first model based on the prediction label corresponding to the first text data sample and the type label corresponding to the first text data sample until the first model converges to obtain a trained first model.

In a fourth aspect, embodiments of the present description provide a storage medium for storing computer-executable instructions that, when executed, implement the following: acquiring a first text data sample for training a first model and a type label corresponding to the first text data sample, wherein the first model is constructed based on a preset machine learning algorithm and is used for carrying out feature screening processing on feature data contained in the text data sample based on model parameters corresponding to the feature data contained in the text data sample; constructing target graph structure data corresponding to the first text data sample based on feature data contained in the first text data sample and correlation among the feature data; acquiring a connected subgraph contained in the target graph structure data, and performing node screening treatment on a plurality of nodes contained in the connected subgraph to obtain screened target graph structure data; inputting the screened target graph structure data into the first model to obtain a prediction label corresponding to the first text data sample, and performing iterative training on the first model based on the prediction label corresponding to the first text data sample and the type label corresponding to the first text data sample until the first model converges to obtain a trained first model.

Drawings

In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some of the embodiments described in the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a data processing system of the present specification;

FIG. 2A is a flow chart of an embodiment of a data processing method of the present disclosure;

FIG. 2B is a schematic diagram illustrating a data processing method according to the present disclosure;

FIG. 3 is a schematic diagram of a target graph structure data according to the present disclosure;

FIG. 4 is a schematic diagram of the structure data of a filtered target graph according to the present disclosure;

FIG. 5 is a schematic diagram illustrating a processing procedure of another data processing method according to the present disclosure;

FIG. 6 is a schematic diagram of a model training process of the present disclosure;

FIG. 7 is a schematic diagram illustrating a processing procedure of another data processing method according to the present disclosure;

FIG. 8 is a schematic diagram of an embodiment of a data processing apparatus according to the present disclosure;

FIG. 9 is a schematic diagram of a data processing apparatus of the present specification.

Detailed Description

The embodiment of the specification provides a data processing method, a device and equipment.

In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.

The technical scheme of the specification can be applied to a data processing system, as shown in fig. 1, the data processing system can be provided with terminal equipment and a server, wherein the server can be an independent server or a server cluster formed by a plurality of servers, and the terminal equipment can be equipment such as a personal computer and the like or mobile terminal equipment such as a mobile phone, a tablet personal computer and the like.

The data processing system may include n terminal devices and m servers, where n and m are positive integers greater than or equal to 1. The terminal devices may be configured to collect data samples, and may obtain corresponding data samples for different anomaly detection scenarios. For example, for a data anomaly detection scenario of a question-answering system, the terminal devices may collect a user's feedback information on a session as the data samples; for a data anomaly detection scenario of a preset service, the terminal devices may collect service data corresponding to the preset service (such as data required for executing the preset service) as the data samples, and so on.

The terminal device can send the collected data sample to any server in the data processing system, and the server can preprocess the received data sample and store the preprocessed data sample as a text data sample. The preprocessing operations may include text conversion preprocessing (e.g., converting audio data into text data), text format conversion processing (e.g., converting English text into Chinese text), and the like.

The server may perform anomaly detection based on the preprocessed data, where anomaly detection refers to the process of discovering anomalies by technical means. Anomalies, also known as outliers, are data points that differ significantly from, or do not conform to, the expected normal pattern represented by most of the data points. Anomalies generally have two characteristics: they are few and they are different. That is, anomalies account for a small proportion of the overall sample, and they differ from most samples. Anomaly detection can be applied to scenarios such as fraud detection, account theft detection, and fault investigation.

In addition, the terminal device can also send the collected data samples to the corresponding server based on the application scenario corresponding to the data samples. For example, assuming that, in the data processing system, server 1 and server 2 are used for data anomaly detection of the question-answering system, and server 3 and server 4 are used for data anomaly detection of the preset service, the terminal device may send the collected data samples corresponding to the question-answering system to server 1 and server 2, and send the collected data samples corresponding to the preset service to server 3 and server 4.

In this way, the server side can train the first model by taking the text data sample corresponding to the first model as the first text data sample under the condition that the training instruction for the first model is received.

In addition, there may be a central server (e.g., server 1) in the data processing system, where the central server is configured to train a first model to be trained based on a first text data sample sent by another server (e.g., server 2 and server 3) when a model training period is reached, and return model parameters of the trained first model to the corresponding server after the trained first model is obtained. In this way, other service ends in the data processing system can provide business services for users without interruption, and meanwhile, the center service end can update and upgrade the first model based on the model training period.

The text data samples acquired by the server may contain noise, that is, the acquired first text data sample may contain noise feature data and redundant feature data. When a first model constructed based on a machine learning algorithm performs feature screening on such data, several important features that are strongly correlated with one another may be eliminated during the feature screening process, which degrades the feature screening accuracy of the first model. Therefore, when the first model is trained, the connected subgraphs in the target graph structure data corresponding to the first text data sample can be screened so as to screen the highly correlated feature data. This avoids the loss of feature screening accuracy caused by eliminating several strongly correlated important features. In other words, training the first model on the screened target graph structure data can improve the feature screening accuracy of the first model, and the trained first model can in turn improve the detection efficiency and detection accuracy of anomaly detection on data.

The data processing method in the following embodiments can be implemented based on the above-described data processing system configuration.

Example 1

As shown in FIG. 2A and FIG. 2B, the embodiment of the present disclosure provides a data processing method. An execution body of the method may be a server, and the server may be an independent server or a server cluster formed by a plurality of servers. The method specifically comprises the following steps:

In S202, a first text data sample for training a first model and a type label corresponding to the first text data sample are acquired.

The first model may be a model constructed based on a preset machine learning algorithm and used for performing feature screening processing on feature data included in a text data sample based on model parameters corresponding to feature data included in the text data sample, where the first text data sample may be any text data sample to be detected, for example, the first text data sample may be any user data and/or preset service data acquired in a model training period, and the type label corresponding to the first text data sample may be used to represent an abnormal situation of the first text data sample, for example, the type label corresponding to the first text data sample may be high risk, medium risk, low risk, no risk, and the like.

In implementation, with the rapid development of computer technology, the types and number of application services that enterprises provide to users keep increasing. Accordingly, the volume of user data keeps growing and its structure becomes more complex, which makes anomaly detection on data to be detected, such as user data or service data, increasingly complex.

In the case of performing anomaly detection, since the data to be detected contains more noise feature data and redundant feature data, which results in poor detection efficiency and detection accuracy of anomaly detection, a solution capable of improving the detection efficiency and detection accuracy of anomaly detection of data is required. For this reason, the embodiments of the present specification provide a technical solution that can solve the above-mentioned problems, and specifically, reference may be made to the following.

When the model training period is reached, the server acquires a first text data sample for training the first model, for example, taking the first model as a feature screening model of a data anomaly detection scene for a preset service, and the server can receive service data corresponding to the preset service sent by the terminal device and/or other servers in the data processing system, perform preprocessing on the received service data and determine the text data obtained by preprocessing as the first text data sample for training the first model.

In S204, the target graph structure data corresponding to the first text data sample is constructed based on the feature data included in the first text data sample and the correlation between the feature data.

Wherein the target graph structure data may include nodes determined based on the feature data included in the first text data sample and edges between the nodes determined based on the correlation between the feature data included in the first text data sample, the target graph structure data may be used to characterize the correlation between the feature data.

In implementation, when abnormality detection is performed, feature construction is required, that is, related staff can construct feature data according to own experience and professional knowledge in related fields, and the constructed feature set can be used as an alternative pool for subsequent feature screening. Because the feature set is constructed manually, useless features, noise features, redundant features and the like may exist in the feature set inevitably, and further screening of the feature data is needed to ensure the effect of subsequent anomaly detection.

The server may determine a correlation coefficient between every two feature data included in the first text data sample based on a preset correlation algorithm, and construct an edge between nodes corresponding to the two feature data whose correlation coefficient is greater than a preset correlation threshold, where the preset correlation algorithm may include a pearson correlation coefficient method, and the preset correlation threshold may be determined based on an anomaly detection scenario corresponding to the first model.

The above method for constructing the target graph structure data corresponding to the first text data sample is an optional and implementable method, and in the actual application scenario, there may also be a plurality of different constructing methods, and different constructing methods may be selected according to the different actual application scenarios, which is not specifically limited in this embodiment of the present disclosure.
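The construction described above can be illustrated with a minimal sketch. It assumes that each feature is available as a numeric column, that the Pearson coefficient is compared by absolute value (so that strong negative correlation also produces an edge), and that a constant feature is treated as uncorrelated. The names `pearson`, `build_feature_graph` and the threshold 0.8 are hypothetical and not taken from the specification.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant feature is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def build_feature_graph(features, threshold):
    """Return (nodes, edges): one node per feature, one edge per correlated pair."""
    nodes = list(features)
    edges = set()
    for a, b in combinations(nodes, 2):
        # assumption: the absolute value of the coefficient is compared
        if abs(pearson(features[a], features[b])) > threshold:
            edges.add((a, b))
    return nodes, edges


# Hypothetical feature columns of a text data sample set.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with f1
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly correlated with f1 and f2
}
nodes, edges = build_feature_graph(features, threshold=0.8)
print(nodes, edges)
```

In this sketch only `f1` and `f2` are joined by an edge, so they would later fall into the same connected subgraph, while `f3` remains an isolated node.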

In addition, before constructing the target graph structure data corresponding to the first text data sample, the server may perform a preliminary screening process on the feature data included in the first text data sample, and construct the target graph structure data corresponding to the first text data sample based on the first text data sample obtained by the preliminary screening.

For example, the information value (Information Value, IV) corresponding to the feature data included in the first text data sample may be obtained, where the IV value may be used to measure the importance of the feature data, and the larger the IV value of the feature data, the larger the influence of the feature data on the type tag of the first text data sample, and the IV value may be used to describe the independence of the variable (i.e., the feature data) and the tag (i.e., the type tag of the first text data sample), so that the server may perform the preliminary screening process on the feature data through the IV value of the feature data to screen out the feature data with high importance.
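As a hedged illustration of the preliminary screening, the sketch below computes an IV value for a feature that has already been binned and keeps the features whose IV reaches a cut-off. It assumes a binary type label (1 for abnormal, 0 for normal), whereas the specification also mentions multi-level labels; the additive smoothing term `eps`, the function names and the cut-off `min_iv` are likewise assumptions.

```python
import math
from collections import Counter


def information_value(bins, labels, eps=0.5):
    """IV of one binned feature against a binary label (1 = abnormal, 0 = normal)."""
    pos = Counter(b for b, y in zip(bins, labels) if y == 1)
    neg = Counter(b for b, y in zip(bins, labels) if y == 0)
    all_bins = set(bins)
    # eps is additive smoothing (an assumption) that avoids log(0) for empty cells
    pos_total = sum(pos.values()) + eps * len(all_bins)
    neg_total = sum(neg.values()) + eps * len(all_bins)
    iv = 0.0
    for b in all_bins:
        p = (pos[b] + eps) / pos_total
        q = (neg[b] + eps) / neg_total
        iv += (p - q) * math.log(p / q)
    return iv


def prescreen(binned_features, labels, min_iv):
    """Keep the features whose IV reaches min_iv (a hypothetical cut-off)."""
    return [name for name, bins in binned_features.items()
            if information_value(bins, labels) >= min_iv]


labels = [1, 1, 1, 0, 0, 0, 0, 0]
binned_features = {
    "informative": ["hi", "hi", "hi", "lo", "lo", "lo", "lo", "lo"],
    "uninformative": ["a", "b", "a", "b", "a", "b", "a", "b"],
}
kept = prescreen(binned_features, labels, min_iv=0.5)
print(kept)
```

Only the features that pass this screening would then become nodes of the target graph structure data.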

In S206, a connected subgraph included in the target graph structure data is obtained, and node screening processing is performed on a plurality of nodes included in the connected subgraph, so as to obtain screened target graph structure data.

In implementation, the server may determine the connected subgraphs contained in the target graph structure data based on the edges between nodes in the target graph structure data. Because the edges between the nodes contained in a connected subgraph are constructed based on the correlation between the corresponding feature data, the feature data corresponding to the nodes contained in a connected subgraph can be regarded as a cluster of highly correlated features. To avoid the collinearity problem of the first model during training, the server can screen the highly correlated feature data, that is, the server can perform node screening processing on the plurality of nodes contained in the connected subgraph to obtain the screened target graph structure data.

For example, as shown in FIG. 3, the target graph structure data may include 12 nodes, that is, the first text data sample may include 12 feature data. The target graph structure data may include two connected subgraphs, namely connected subgraph 1 including node 1, node 2, node 3 and node 4, and connected subgraph 2 including node 6, node 7, node 8, node 9 and node 11. The server may perform node screening processing on the nodes in the two connected subgraphs to obtain the screened target graph structure data.
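A minimal sketch of how the connected subgraphs of such a graph might be obtained is given below, using a depth-first traversal over an adjacency map. The node grouping follows the 12-node example, but the specific edge list is hypothetical because the specification does not enumerate the edges, and isolated nodes are assumed not to count as connected subgraphs.

```python
from collections import defaultdict


def connected_subgraphs(nodes, edges):
    """Return the connected components that contain more than one node."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        if len(component) > 1:  # assumption: isolated nodes are not subgraphs
            components.append(component)
    return components


nodes = range(1, 13)
# Hypothetical edges that reproduce the two groups of the 12-node example.
edges = [(1, 2), (2, 3), (3, 4), (1, 3),
         (6, 7), (7, 8), (8, 9), (9, 11), (6, 11)]
components = connected_subgraphs(nodes, edges)
print(components)
```

Nodes 5, 10 and 12 have no edges in this sketch and are therefore left outside both connected subgraphs.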

Specifically, for example, the server may perform node screening processing on the nodes in a connected subgraph based on the number of edges between each node and the other nodes in that connected subgraph. For example, nodes whose number of edges to other nodes is smaller than a preset number threshold may be removed to obtain the screened target graph structure data, which may be as shown in FIG. 4. That is, nodes in the connected subgraph whose number of edges to other nodes is smaller than 1 may be removed, and nodes whose number of edges to other nodes is not smaller than 1 may be retained.

In addition, the above method for performing node screening processing on the plurality of nodes contained in the connected subgraph is one optional and implementable screening method. In an actual application scenario, there may be a plurality of different screening methods, and different methods may be selected according to the actual application scenario, which is not specifically limited in the embodiments of the present disclosure.
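The edge-count screening described above can be sketched as follows. The sketch assumes that the edge counts are computed once on the original connected subgraph and are not recomputed after each removal; the function name `screen_by_degree` and the example threshold of 2 are hypothetical.

```python
def screen_by_degree(component, edges, min_degree):
    """Keep the nodes of one connected subgraph that have at least min_degree edges."""
    degree = {node: 0 for node in component}
    for a, b in edges:
        if a in degree and b in degree:
            degree[a] += 1
            degree[b] += 1
    # assumption: degrees are counted once, not recomputed after each removal
    return {node for node, count in degree.items() if count >= min_degree}


component = {1, 2, 3, 4}
edges = [(1, 2), (2, 3), (3, 4), (1, 3)]  # hypothetical edges
kept = screen_by_degree(component, edges, min_degree=2)
print(kept)
```

Here node 4 has a single edge and is removed, while nodes 1, 2 and 3 are retained.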

In S208, the filtered target graph structure data is input into the first model to obtain a prediction label corresponding to the first text data sample, and the first model is iteratively trained based on the prediction label corresponding to the first text data sample and the type label corresponding to the first text data sample until the first model converges to obtain a trained first model.

In implementation, the server may input the filtered target graph structure data into the first model to obtain a prediction label corresponding to the first text data sample, determine a loss value based on the type label and the prediction label corresponding to the first text data sample, and determine whether the first model has converged based on the loss value. If the first model has not converged, training of the first model may be continued based on the filtered target graph structure data until the first model converges, so as to obtain the trained first model.
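The iterate-until-convergence loop can be sketched as below. This is only an illustration under strong assumptions: a small logistic regression stands in for the first model (the specification does not fix the algorithm), the label is binary, the input rows hold the values of the retained features, and convergence is declared when the loss changes by less than a tolerance `tol`. All names and hyperparameters are hypothetical.

```python
import math


def train_until_converged(rows, labels, lr=0.5, tol=1e-6, max_iters=10_000):
    """Gradient-descent logistic regression that stops when the loss stops improving."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    previous_loss = float("inf")
    for iteration in range(1, max_iters + 1):
        # forward pass: predicted probability of the abnormal label
        preds = [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(weights, row)) + bias)))
                 for row in rows]
        # loss value from the prediction labels and the type labels (cross-entropy)
        loss = -sum(y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
                    for p, y in zip(preds, labels)) / len(rows)
        if abs(previous_loss - loss) < tol:  # convergence check
            break
        previous_loss = loss
        errors = [p - y for p, y in zip(preds, labels)]
        for j in range(len(weights)):
            weights[j] -= lr * sum(e * row[j] for e, row in zip(errors, rows)) / len(rows)
        bias -= lr * sum(errors) / len(rows)
    return weights, bias, loss, iteration


# Hypothetical samples: one retained feature per row, binary type label.
rows = [[0.0], [1.0], [2.0], [3.0], [1.0], [2.0]]
labels = [0, 0, 1, 1, 1, 0]
weights, bias, loss, iteration = train_until_converged(rows, labels)
print(weights, bias, loss, iteration)
```

Other convergence criteria, such as a fixed loss threshold or a validation metric, would fit the same loop structure.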

After the trained first model is obtained, the central server can send the model parameters of the trained first model to the other servers in the data processing system, so that each server in the data processing system can update its local first model based on the model parameters to obtain the trained first model, perform feature screening processing on the data to be detected based on the trained first model, and perform anomaly detection processing on the data to be detected after the feature screening processing.

Take the first model being a tree model as an example, and assume that, in the process of training the first model, there are three feature data of high importance that are also highly correlated with one another. During training, each tree randomly selects one feature data from the three. Once one feature data is selected, the other two are highly correlated with it, so at the next split the remaining two feature data cannot bring a large splitting gain. As a result, the final number of splits or the splitting gain of each of the three important feature data is small, which affects the feature screening effect of the first model and, in turn, the anomaly detection effect.

The embodiment of the specification provides a data processing method, which comprises: acquiring a first text data sample for training a first model and a type label corresponding to the first text data sample, wherein the first model is constructed based on a preset machine learning algorithm and is used for performing feature screening processing on feature data contained in a text data sample based on model parameters corresponding to the feature data contained in the text data sample; constructing target graph structure data corresponding to the first text data sample based on the feature data contained in the first text data sample and the correlation among the feature data; acquiring a connected subgraph contained in the target graph structure data, and performing node screening processing on a plurality of nodes contained in the connected subgraph to obtain screened target graph structure data; and inputting the screened target graph structure data into the first model to obtain a prediction label corresponding to the first text data sample, and performing iterative training on the first model based on the prediction label corresponding to the first text data sample and the type label corresponding to the first text data sample until the first model converges to obtain a trained first model.
In this way, the highly correlated feature data can be screened by screening the connected subgraphs in the target graph structure data corresponding to the first text data sample, which avoids the loss of feature screening accuracy caused by eliminating several strongly correlated important features during the feature screening process. That is, training the first model on the screened target graph structure data can improve the feature screening accuracy of the first model, and the trained first model can in turn improve the detection efficiency and detection accuracy of anomaly detection on data.

Example two

As shown in fig. 5, the embodiment of the present disclosure provides a data processing method, where an execution body of the method may be a server, and the server may be an independent server or may be a server cluster formed by a plurality of servers. The method specifically comprises the following steps:

The first model may be a model constructed based on a preset machine learning algorithm and used for performing feature screening processing on feature data included in the text data sample based on model parameters corresponding to the feature data included in the text data sample.

In implementations, the first model may be a model constructed based on a preset rule learning algorithm and a preset machine learning algorithm, and the preset rule learning algorithm may include a sequential coverage algorithm.

In the sequential coverage algorithm, each time a rule is learned on the training set, the training data samples covered by that rule are deleted, the remaining training data samples form a new training set, and the procedure is repeated. For example, as shown in FIG. 6, the training data may include a plurality of data samples. When the model learns rule 1, the data samples covered by rule 1 may be deleted, and model training may be performed on the remaining training data; after rule 2 is learned, the data samples covered by rule 2 may be deleted, and model training may again be performed on the remaining training data, until the model converges.

The sequential coverage algorithm effectively avoids covering the same training data repeatedly, and can be used to strategically reduce the order of the rules. Thus, the server can screen out duplicate feature data through the sequential coverage algorithm while extracting important feature data.
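A minimal sketch of the learn-a-rule, delete-covered-samples loop is shown below. It assumes categorical features, a binary label, and rules of the simple form "feature equals value", chosen greedily by precision and then by the number of positive samples covered. These choices and all names are illustrative assumptions, not the rule learner of the specification.

```python
def learn_one_rule(samples):
    """Pick the (feature, value) test with the highest precision on positive samples."""
    best, best_score = None, (0.0, 0)
    candidates = {(f, v) for features, _ in samples for f, v in features.items()}
    for feature, value in sorted(candidates):
        covered = [label for features, label in samples
                   if features.get(feature) == value]
        positives = sum(covered)
        score = (positives / len(covered), positives)
        if positives and score > best_score:
            best, best_score = (feature, value), score
    return best


def sequential_covering(samples):
    """Learn rules one at a time, deleting the samples each rule covers."""
    rules, remaining = [], list(samples)
    while any(label == 1 for _, label in remaining):
        rule = learn_one_rule(remaining)
        if rule is None:
            break
        rules.append(rule)
        feature, value = rule
        # delete the covered samples; the rest form the new training set
        remaining = [s for s in remaining if s[0].get(feature) != value]
    return rules


# Hypothetical training data: (features, label) with label 1 = abnormal.
samples = [
    ({"country": "X", "device": "new"}, 1),
    ({"country": "X", "device": "old"}, 1),
    ({"country": "Y", "device": "new"}, 1),
    ({"country": "Y", "device": "old"}, 0),
    ({"country": "Z", "device": "old"}, 0),
]
rules = sequential_covering(samples)
print(rules)
```

Because every learned rule covers at least one positive sample that is then deleted, the loop terminates once no positive samples remain.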

In S502, a plurality of nodes contained in the connected subgraph are screened to obtain a screened connected subgraph.

The number of nodes contained in the screened connected subgraph is smaller than a preset node threshold, and the screened connected subgraph includes one or more of the following: nodes obtained by random sampling, nodes obtained by screening based on the information value of each node in the connected subgraph, and nodes obtained by screening based on the correlation between each node in the connected subgraph and the type label corresponding to the first text data sample.

In implementation, the server may perform random sampling processing on the nodes in the connected subgraph based on the preset node threshold to obtain the screened connected subgraph. For example, assuming that the preset node threshold is 2, the server may perform random sampling processing on the connected subgraph to obtain a screened connected subgraph including one node.

Alternatively, the server may screen the nodes in the connected subgraph based on the information value (IV) of each node in the connected subgraph and a preset node threshold, so as to obtain a screened connected subgraph. For example, the server may rank the nodes by the IV value of each node in the connected subgraph and determine the screened connected subgraph based on the preset node threshold and the ranked nodes. In particular, if the preset node threshold is 2, the server may determine the screened connected subgraph from the node with the largest IV value in each connected subgraph.
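The IV-based screening can be illustrated with the following sketch. It assumes that each node corresponds to one feature that has already been discretised into bins, that the type label is binary, and that IV is computed with the usual definition, the sum over bins of (p - q) * ln(p / q), where p and q are the shares of positive and negative samples falling in a bin. The additive smoothing term `eps` and all function names are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def information_value(bins: Sequence[Hashable], labels: Sequence[int],
                      eps: float = 0.5) -> float:
    """IV of one binned feature against a binary label (1 = positive)."""
    pos = Counter(b for b, y in zip(bins, labels) if y == 1)
    neg = Counter(b for b, y in zip(bins, labels) if y == 0)
    total_pos, total_neg = sum(pos.values()), sum(neg.values())
    distinct = set(bins)
    iv = 0.0
    for b in distinct:
        # eps smoothing (an assumption) avoids log(0) when a bin has no
        # positive or no negative samples.
        p = (pos[b] + eps) / (total_pos + eps * len(distinct))
        q = (neg[b] + eps) / (total_neg + eps * len(distinct))
        iv += (p - q) * math.log(p / q)
    return iv


def screen_by_iv(subgraph_nodes: List[str],
                 binned_features: Dict[str, Sequence[Hashable]],
                 labels: Sequence[int],
                 node_threshold: int) -> List[str]:
    """Keep the highest-IV nodes, strictly fewer than node_threshold of them."""
    ranked = sorted(subgraph_nodes,
                    key=lambda n: information_value(binned_features[n], labels),
                    reverse=True)
    return ranked[: max(node_threshold - 1, 0)]
```

With a node threshold of 2, `screen_by_iv` keeps exactly one node per connected subgraph, which matches the example given above.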

Alternatively, the server may determine the correlation between each node in the connected subgraph and the type label corresponding to the first text data sample based on a preset correlation determining algorithm, which may include the Pearson correlation coefficient method. The server may then screen the nodes in the connected subgraph based on the preset node threshold and the correlation between each node and the type label, so as to obtain a screened connected subgraph. For example, the server may sort the nodes by their correlation with the type label corresponding to the first text data sample and determine the screened connected subgraph based on the preset node threshold and the sorted nodes.
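A minimal sketch of the correlation-based screening is given below. It assumes numeric feature columns and numeric type labels, ranks nodes by the absolute value of the Pearson coefficient (the use of the absolute value is an assumption, since the text only says the nodes are sorted by correlation), and treats a constant column as uncorrelated. All names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / (std_x * std_y)


def screen_by_label_correlation(subgraph_nodes: List[str],
                                features: Dict[str, Sequence[float]],
                                labels: Sequence[float],
                                node_threshold: int) -> List[str]:
    """Keep the nodes most correlated with the label, fewer than node_threshold."""
    ranked = sorted(subgraph_nodes,
                    key=lambda n: abs(pearson(features[n], labels)),
                    reverse=True)
    return ranked[: max(node_threshold - 1, 0)]
```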

In addition, the server can control the number of nodes contained in the filtered connected subgraph based on the preset node threshold, so as to remove correlated features (that is, redundant features).

In S504, the filtered target graph structure data is determined based on the filtered connected subgraph.

In an implementation, the server may obtain the filtered target graph structure data based on the nodes in the filtered connected subgraph and the nodes in the target graph structure data except for the connected subgraph.
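The assembly of the filtered target graph structure data can be sketched as follows. The sketch assumes that the graph is given as a node list plus an edge list (an edge meaning that two features are correlated), that a node with no edges lies outside every connected subgraph and is kept unchanged, and that a per-node score (for example an IV value) is already available. The function names and the `score` callback are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple


def connected_components(nodes: List[str],
                         edges: List[Tuple[str, str]]) -> List[List[str]]:
    """Group nodes into connected components with an iterative depth-first search."""
    adjacency: Dict[str, Set[str]] = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen: Set[str] = set()
    components: List[List[str]] = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


def filtered_graph_nodes(nodes: List[str],
                         edges: List[Tuple[str, str]],
                         score: Callable[[str], float],
                         node_threshold: int) -> List[str]:
    """Nodes of the filtered graph: screened subgraph nodes plus isolated nodes."""
    kept: List[str] = []
    for component in connected_components(nodes, edges):
        if len(component) == 1:
            kept.extend(component)  # not part of a connected subgraph: keep as-is
        else:
            ranked = sorted(component, key=score, reverse=True)
            kept.extend(ranked[: max(node_threshold - 1, 0)])
    return kept
```

For instance, with nodes a, b, c joined in one connected subgraph and an isolated node d, a node threshold of 2 keeps the best-scoring node of the subgraph together with d.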

In S506, a second text data sample for training the second model is acquired, and a type tag corresponding to the second text data sample.

The second model may be a type detection model constructed based on a preset machine learning algorithm. For example, the second model may be a tree model (such as an iForest model), a density-based model (such as a DBSCAN-based or LOF-based model), a model based on a linear algorithm (such as an OCSVM-based or PCA-based model), a model based on a neural network algorithm (such as an Auto Encoder-based or DAGMM-based model), a model based on a statistical distribution algorithm (such as a Gaussian distribution-based or HBOS-based model), and the like. The second text data sample may be a data sample different from the first text data sample.

In an implementation, the second model may be a type detection model locally stored by the server, and may be used for type detection of the data to be detected, and the server may determine whether the data to be detected has an anomaly based on a type detection result.

In S508, based on the trained first model, screening feature data included in the second text data sample, to obtain a screened second text data sample.

In implementation, for example, as shown in fig. 1, in the data processing system, a central server (such as server 1) may be configured to train a first model based on first text data samples sent by other servers and/or terminal devices in the data processing system, to obtain a trained first model, and send model parameters of the trained first model to other servers in the data processing system, where the other servers may update a local first model based on model parameters of the trained first model, to obtain the trained first model.

When the model training period of the second model is reached, the server can screen the feature data contained in the locally stored second text data sample based on the trained first model to obtain a screened second text data sample.

Alternatively, the central server can receive the second text data samples sent by other servers and, based on the trained first model, screen the feature data contained in the second text data samples to obtain screened second text data samples. The central server can then send each screened second text data sample to the corresponding server, which saves data storage resources on the other servers in the data processing system and improves their data processing efficiency.

The feature data contained in the second text data sample may be screened in various ways. Taking the first model being a tree model as an example, the screening may be implemented through the following steps one to three:

Step one, inputting the feature data contained in the second text data sample into the trained first model, and acquiring, from the model parameters of the trained first model, the splitting times and splitting gains corresponding to the feature data contained in the second text data sample.

Step two, determining the importance degree of the feature data contained in the second text data sample based on the splitting times and the splitting gains corresponding to that feature data.

Step three, screening the feature data contained in the second text data sample based on the importance degree of that feature data to obtain a screened second text data sample.
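Steps one to three can be sketched as below. The text does not specify how the splitting times and splitting gains are combined into an importance degree, so the weighted sum of the two normalised quantities, the `weight` parameter, the top-k selection, and the feature names are all assumptions for illustration. The per-feature split counts and gains are taken as already read out of the trained tree model.

```python
from typing import Dict, Mapping


def feature_importance(split_counts: Mapping[str, int],
                       split_gains: Mapping[str, float],
                       weight: float = 0.5) -> Dict[str, float]:
    """Importance = weighted sum of normalised split count and normalised gain."""
    total_count = sum(split_counts.values()) or 1
    total_gain = sum(split_gains.values()) or 1.0
    return {
        name: weight * split_counts[name] / total_count
        + (1.0 - weight) * split_gains.get(name, 0.0) / total_gain
        for name in split_counts
    }


def screen_features(sample: Mapping[str, object],
                    importance: Mapping[str, float],
                    top_k: int) -> Dict[str, object]:
    """Keep only the top_k most important features of one data sample."""
    keep = sorted(importance, key=importance.get, reverse=True)[:top_k]
    return {name: sample[name] for name in keep if name in sample}


# Hypothetical statistics for three features of a resource transfer sample.
counts = {"amount": 6, "hour": 3, "device": 1}
gains = {"amount": 50.0, "hour": 40.0, "device": 10.0}
importance = feature_importance(counts, gains)
screened = screen_features({"amount": 120.0, "hour": 23, "device": "x"}, importance, top_k=2)
```

In this example `screened` retains the `amount` and `hour` features and drops `device`, whose split count and gain are both the lowest.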

In S510, based on the screened second text data sample and the type label corresponding to the second text data sample, iterative training is performed on the second model until the second model converges, so as to obtain a trained second model.

In an implementation, the server may input the screened second text data sample into the second model to obtain a prediction tag corresponding to the second text data sample, determine a loss value based on the type tag and the prediction tag corresponding to the second text data sample, and determine whether the second model has converged based on the loss value. If the second model has not converged, the server may continue training the second model based on the screened second text data sample until the second model converges, so as to obtain a trained second model.
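The loss-driven training loop can be illustrated with the following sketch. A small logistic regression trained by gradient descent stands in for the second model purely for illustration (the embodiment lists other model families), and the convergence test (change in loss below `tol`), the learning rate, and all names are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def train_until_converged(samples: Sequence[Sequence[float]],
                          labels: Sequence[int],
                          lr: float = 0.5,
                          tol: float = 1e-6,
                          max_iters: int = 10000) -> Tuple[List[float], float, float, int]:
    """Repeat predict -> loss -> update until the loss stops changing."""
    n_features = len(samples[0])
    weights, bias = [0.0] * n_features, 0.0
    previous_loss = float("inf")
    loss, iteration = previous_loss, 0
    for iteration in range(1, max_iters + 1):
        # Prediction for every screened sample.
        preds = [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(weights, xs)) + bias)))
                 for xs in samples]
        # Loss value from the type labels and the predictions (cross-entropy).
        loss = -sum(y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
                    for p, y in zip(preds, labels)) / len(labels)
        if abs(previous_loss - loss) < tol:
            break  # treated as converged
        previous_loss = loss
        # Not converged: update the parameters and train again.
        for j in range(n_features):
            grad = sum((p - y) * xs[j] for p, y, xs in zip(preds, labels, samples)) / len(labels)
            weights[j] -= lr * grad
        bias -= lr * sum(p - y for p, y in zip(preds, labels)) / len(labels)
    return weights, bias, loss, iteration
```

The `max_iters` cap is a safeguard so that the loop also ends when the tolerance is never reached.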

In addition, the training process of the second model may also be performed by a central server in the data processing system, that is, the central server may iteratively train the second model based on the screened second text data sample and the type label corresponding to the second text data sample until the second model converges to obtain a trained second model, and after obtaining the trained second model, the central server may store the trained second model corresponding to the server, or the central server may further send the model parameters of the trained second model to the corresponding server, so that the server updates the local second model based on the model parameters of the trained second model to obtain the trained second model.

In S512, target text data to be detected is acquired.

The target text data may include service data corresponding to executing a target service. For example, assuming that the target service is a resource transfer service, the target text data may include data such as interaction data between the user and the resource transfer object, the resource transfer time, and the resource transfer number.

In implementation, when the user triggers the execution of the target service through the terminal device, the terminal device may collect service data corresponding to the execution of the target service and send the collected service data to the corresponding server. The server can preprocess the received service data to obtain target text data to be detected.

Or when the server detects that the target service is abnormal in operation, the server can acquire service data corresponding to the target service, and can preprocess the service data to obtain target text data to be detected.

In S514, the type detection process is performed on the target text data through the trained second model, so as to obtain a prediction tag corresponding to the target text data.

In implementation, the server may directly input the target text data into the trained second model, so as to perform type detection on the target text data through the trained second model, and determine a prediction tag corresponding to the target text data.

Alternatively, the server may first screen the feature data contained in the target text data based on the trained first model to obtain screened target text data, and then perform type detection processing on the screened target text data through the trained second model to obtain a prediction tag corresponding to the target text data.

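The feature data contained in the target text data may be screened in various ways. An optional implementation is provided below, taking the first model being a tree model as an example; for details, refer to the following steps one to three:

Step one, inputting the feature data contained in the target text data into the trained first model, and acquiring, from the model parameters of the trained first model, the splitting times and splitting gains corresponding to the feature data contained in the target text data.

Step two, determining the importance degree of the feature data contained in the target text data based on the splitting times and the splitting gains corresponding to that feature data.

Step three, screening the feature data contained in the target text data based on the importance degree of that feature data to obtain screened target text data.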

In S516, it is determined whether there is a risk in executing the target service based on the predictive label corresponding to the target text data.

In implementation, the central server in the data processing system may perform type detection, based on the trained second model, on the target text data to be detected sent by the other servers, obtain a prediction tag corresponding to the target text data, and determine whether there is a risk in executing the target service based on that prediction tag. If there is a risk in executing the target service, the central server may send preset alarm information to the server corresponding to the target text data.

Alternatively, the central server may send the prediction tag corresponding to the target text data to the server corresponding to the target text data, and that server may determine whether there is a risk in executing the target service based on the prediction tag. If it determines that there is a risk in executing the target service, the server may send preset alarm information to the terminal device corresponding to the target text data.

Example III

As shown in fig. 7, the embodiment of the present disclosure provides a data processing method, where an execution body of the method may be a server, and the server may be an independent server or may be a server cluster formed by a plurality of servers. The method specifically comprises the following steps:

In S702, target text data to be detected is acquired.

In S704, a predictive label corresponding to the target text data is determined based on the trained first model.

In implementation, the server may input the target text data into the trained first model to obtain a prediction tag corresponding to the target text data.

In S706, based on the trained first model, screening processing is performed on the feature data included in the target text data, so as to obtain screened feature data.

In an implementation, taking the first model as a tree model as an example, the server may input feature data included in the target text data into the trained first model, and obtain the number of splitting times and the splitting gain corresponding to the feature data included in the target text data in model parameters of the trained first model. The importance of the feature data contained in the target text data is determined based on the number of splits and the split gain corresponding to the feature data contained in the target text data. And screening the feature data contained in the target text data based on the importance of the feature data contained in the target text data to obtain screened feature data.

In S708, it is determined whether there is a risk in executing the target service based on the predictive label corresponding to the target text data and the filtered feature data.

In implementation, the filtered feature data may be used to assist in determining whether there is a risk in executing the target service. The server may make this determination based on the filtered feature data, the corresponding feature threshold, and the prediction tag corresponding to the target text data.

For example, taking the target text data as the data corresponding to the resource transfer service as an example, assuming that the filtered feature data is the resource transfer number, the feature threshold corresponding to the feature data may be a threshold 1, and if the resource transfer number is not less than the threshold 1 and the prediction label corresponding to the target text data is of a risk type, it may be determined that the risk exists in executing the target service.
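The decision rule in this example can be sketched as follows. The label string `"risk"`, the feature name `transfer_count`, and the choice to require every screened feature that has a threshold to reach it are assumptions for illustration; as noted in the embodiment, other determination methods are possible.

```python
from typing import Mapping


def has_risk(predicted_label: str,
             screened_features: Mapping[str, float],
             thresholds: Mapping[str, float],
             risk_label: str = "risk") -> bool:
    """Risk exists only if the label is the risk type and thresholds are reached."""
    if predicted_label != risk_label:
        return False
    # Only screened features that have a configured threshold are checked.
    return all(value >= thresholds[name]
               for name, value in screened_features.items()
               if name in thresholds)


# "Threshold 1" from the example above is given a hypothetical value of 10.
thresholds = {"transfer_count": 10}
decision = has_risk("risk", {"transfer_count": 12}, thresholds)
```

Note that when none of the screened features has a configured threshold, this sketch falls back to the prediction label alone.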

The above method for determining whether there is a risk in executing the target service is one optional, implementable determination method. In actual application scenarios, there may be a plurality of different determination methods, which may vary with the scenario; this is not specifically limited in the embodiments of the present disclosure.

Example IV

Based on the same concept as the data processing method provided in the foregoing embodiments, the embodiment of the present disclosure further provides a data processing device, as shown in fig. 8.

The data processing apparatus includes: a first acquisition module 801, a graph data construction module 802, a first screening module 803, and a first training module 804, wherein:

a first acquisition module 801, configured to obtain a first text data sample for training a first model, and a type tag corresponding to the first text data sample, where the first model is constructed based on a preset machine learning algorithm, and is used to perform feature screening processing on feature data included in the text data sample based on model parameters corresponding to feature data included in the text data sample;

a graph data construction module 802, configured to construct target graph structure data corresponding to the first text data sample based on feature data included in the first text data sample and correlation between the feature data;

the first screening module 803 is configured to obtain a connected subgraph included in the target graph structure data, and perform node screening processing on a plurality of nodes included in the connected subgraph, so as to obtain screened target graph structure data;

The first training module 804 is configured to input the filtered target graph structure data into the first model to obtain a prediction tag corresponding to the first text data sample, and perform iterative training on the first model based on the prediction tag corresponding to the first text data sample and the type tag corresponding to the first text data sample until the first model converges to obtain a trained first model.

In an embodiment of the present disclosure, the apparatus further includes:

the second acquisition module is used for acquiring a second text data sample for training a second model and a type label corresponding to the second text data sample, wherein the second model is a type detection model constructed based on a preset machine learning algorithm;

the second screening module is used for screening the characteristic data contained in the second text data sample based on the trained first model to obtain a screened second text data sample;

and the second training module is used for carrying out iterative training on the second model based on the screened second text data sample and the type label corresponding to the second text data sample until the second model converges to obtain a trained second model.

In an embodiment of the present disclosure, the apparatus further includes:

the second acquisition module is used for acquiring target text data to be detected, wherein the target text data comprises service data corresponding to an execution target service;

the first determining module is used for performing type detection processing on the target text data through the trained second model to obtain a prediction tag corresponding to the target text data;

and the first detection module is used for determining whether the target business is executed with risk or not based on the prediction label corresponding to the target text data.

In an embodiment of the present disclosure, the first determining module is configured to:

screening feature data contained in the target text data based on the trained first model to obtain screened target text data;

and performing type detection processing on the screened target text data through the trained second model to obtain a prediction tag corresponding to the target text data.

In an embodiment of the present disclosure, the first determining module is configured to:

inputting the characteristic data contained in the target text data into the trained first model, and acquiring the splitting times and splitting gain corresponding to the characteristic data contained in the target text data in model parameters of the trained first model;

Determining the importance of the feature data contained in the target text data based on the splitting times and the splitting gain corresponding to the feature data contained in the target text data;

and screening the feature data contained in the target text data based on the importance of the feature data contained in the target text data to obtain the screened target text data.

In an embodiment of the present disclosure, the apparatus further includes:

the third acquisition module is used for acquiring target text data to be detected, wherein the target text data comprises service data corresponding to an execution target service;

the second determining module is used for determining a prediction tag corresponding to the target text data based on the trained first model;

the third screening module is used for screening the characteristic data contained in the target text data based on the trained first model to obtain screened characteristic data;

and the second detection module is used for determining whether the target service is executed with risk or not based on the prediction label corresponding to the target text data and the screened characteristic data.

In this embodiment of the present disclosure, the first screening module 803 is configured to:

screening a plurality of nodes contained in the connected subgraph to obtain a screened connected subgraph, wherein the number of the nodes contained in the screened connected subgraph is smaller than a preset node threshold;

and determining the screened target graph structure data based on the screened connected subgraph.

In this embodiment of the present disclosure, the filtered connected subgraph includes one or more of the following: a node obtained by random sampling, a node obtained by screening based on the information value of each node in the connected subgraph, and a node obtained by screening based on the correlation between each node in the connected subgraph and the type label corresponding to the first text data sample.

In this embodiment of the present disclosure, the first model is a model constructed based on a preset rule learning algorithm and the preset machine learning algorithm, where the preset rule learning algorithm includes a sequential coverage algorithm.

The embodiment of the specification provides a data processing device. A first text data sample for training a first model and a type label corresponding to the first text data sample are acquired, where the first model is constructed based on a preset machine learning algorithm and is used for performing feature screening processing on the feature data contained in a text data sample based on model parameters corresponding to that feature data. Target graph structure data corresponding to the first text data sample is constructed based on the feature data contained in the first text data sample and the correlation among the feature data. A connected subgraph contained in the target graph structure data is acquired, and node screening processing is performed on the plurality of nodes contained in the connected subgraph to obtain screened target graph structure data. The screened target graph structure data is input into the first model to obtain a prediction label corresponding to the first text data sample, and the first model is iteratively trained based on the prediction label and the type label corresponding to the first text data sample until the first model converges, so as to obtain a trained first model. In this way, highly correlated feature data can be screened by screening the connected subgraphs in the target graph structure data corresponding to the first text data sample, which avoids the problem that the feature screening accuracy of the first model is poor because a plurality of strongly correlated important features are eliminated during feature screening. That is, training the first model on the screened target graph structure data can improve the feature screening accuracy of the first model, and the trained first model can in turn improve the detection efficiency and detection accuracy of anomaly detection on data.

Example V

Based on the same idea, the embodiment of the present disclosure further provides a data processing apparatus, as shown in fig. 9.

The data processing apparatus may differ considerably depending on its configuration or performance. It may include one or more processors 901 and a memory 902, and the memory 902 may store one or more applications or data. The memory 902 may be transient storage or persistent storage. An application program stored in the memory 902 may include one or more modules (not shown), each of which may include a series of computer executable instructions for the data processing apparatus. Still further, the processor 901 may be arranged to communicate with the memory 902 and to execute, on the data processing apparatus, the series of computer executable instructions in the memory 902. The data processing apparatus may also include one or more power supplies 903, one or more wired or wireless network interfaces 904, one or more input output interfaces 905, and one or more keyboards 906.

In particular, in this embodiment, the data processing apparatus includes a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs may include one or more modules, and each module may include a series of computer-executable instructions for the data processing apparatus, and the one or more programs configured to be executed by the one or more processors comprise instructions for:

Acquiring a first text data sample for training a first model and a type label corresponding to the first text data sample, wherein the first model is constructed based on a preset machine learning algorithm and is used for carrying out feature screening processing on feature data contained in the text data sample based on model parameters corresponding to the feature data contained in the text data sample;

constructing target graph structure data corresponding to the first text data sample based on feature data contained in the first text data sample and correlation among the feature data;

acquiring a connected subgraph contained in the target graph structure data, and performing node screening treatment on a plurality of nodes contained in the connected subgraph to obtain screened target graph structure data;

inputting the screened target graph structure data into the first model to obtain a prediction label corresponding to the first text data sample, and performing iterative training on the first model based on the prediction label corresponding to the first text data sample and the type label corresponding to the first text data sample until the first model converges to obtain a trained first model.

In this specification, the embodiments are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the data processing apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiments.

The embodiment of the specification provides data processing equipment. A first text data sample for training a first model and a type label corresponding to the first text data sample are acquired, where the first model is constructed based on a preset machine learning algorithm and is used for performing feature screening processing on the feature data contained in a text data sample based on model parameters corresponding to that feature data. Target graph structure data corresponding to the first text data sample is constructed based on the feature data contained in the first text data sample and the correlation among the feature data. A connected subgraph contained in the target graph structure data is acquired, and node screening processing is performed on the plurality of nodes contained in the connected subgraph to obtain screened target graph structure data. The screened target graph structure data is input into the first model to obtain a prediction label corresponding to the first text data sample, and the first model is iteratively trained based on the prediction label and the type label corresponding to the first text data sample until the first model converges, so as to obtain a trained first model. In this way, highly correlated feature data can be screened by screening the connected subgraphs in the target graph structure data corresponding to the first text data sample, which avoids the problem that the feature screening accuracy of the first model is poor because a plurality of strongly correlated important features are eliminated during feature screening. That is, training the first model on the screened target graph structure data can improve the feature screening accuracy of the first model, and the trained first model can in turn improve the detection efficiency and detection accuracy of anomaly detection on data.

Example VI

The embodiments of the present disclosure further provide a computer readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements each process of the foregoing data processing method embodiments and can achieve the same technical effects; to avoid repetition, details are not described here again. The computer readable storage medium may be, for example, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The embodiment of the specification provides a computer readable storage medium. A first text data sample for training a first model and a type label corresponding to the first text data sample are acquired, where the first model is constructed based on a preset machine learning algorithm and is used for performing feature screening processing on the feature data contained in a text data sample based on model parameters corresponding to that feature data. Target graph structure data corresponding to the first text data sample is constructed based on the feature data contained in the first text data sample and the correlation among the feature data. A connected subgraph contained in the target graph structure data is acquired, and node screening processing is performed on the plurality of nodes contained in the connected subgraph to obtain screened target graph structure data. The screened target graph structure data is input into the first model to obtain a prediction label corresponding to the first text data sample, and the first model is iteratively trained based on the prediction label and the type label corresponding to the first text data sample until the first model converges, so as to obtain a trained first model. In this way, highly correlated feature data can be screened by screening the connected subgraphs in the target graph structure data corresponding to the first text data sample, which avoids the problem that the feature screening accuracy of the first model is poor because a plurality of strongly correlated important features are eliminated during feature screening. That is, training the first model on the screened target graph structure data can improve the feature screening accuracy of the first model, and the trained first model can in turn improve the detection efficiency and detection accuracy of anomaly detection on data.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD), such as a field programmable gate array (Field Programmable Gate Array, FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can be readily obtained by merely lightly logic-programming the method flow into an integrated circuit using the hardware description languages described above.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, or the form of logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functionality in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing various functions may also be regarded as structures within the hardware component. Alternatively, the means for performing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being functionally divided into various units. Of course, when implementing one or more embodiments of the present description, the functionality of the units may be implemented in one or more pieces of software and/or hardware.

It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

Embodiments of the present description are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include non-permanent memory in a computer-readable medium, random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.

One or more embodiments of the present specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the present description may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In this specification, the embodiments are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding parts of the description of the method embodiments.

The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.
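As a purely illustrative sketch, and not the implementation of the embodiments, the graph construction and node screening steps summarized in the abstract might be arranged as follows in Python. The sketch assumes numeric feature data, uses the Pearson coefficient as the measure of correlation among the feature data, connects two feature nodes when the absolute correlation reaches a threshold, and screens each connected subgraph by keeping the node most correlated with the type label. The function names, the threshold value, the choice of Pearson correlation, and the toy data are all hypothetical and are not taken from this specification.

```python
from collections import deque
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def build_graph(features, edge_threshold=0.8):
    """One node per feature; an edge joins two features whose |correlation| >= threshold."""
    names = list(features)
    graph = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(features[a], features[b])) >= edge_threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def connected_subgraphs(graph):
    """Return the connected components of the graph as lists of node names."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:  # breadth-first traversal of one component
            node = queue.popleft()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def screen_nodes(components, features, labels, keep_per_component=1):
    """Keep, per component, the nodes most correlated (in absolute value) with the label."""
    kept = []
    for component in components:
        ranked = sorted(
            component,
            key=lambda name: abs(pearson(features[name], labels)),
            reverse=True,
        )
        kept.extend(ranked[:keep_per_component])
    return kept


# Toy, made-up data (assumption): four numeric features over six samples.
features = {
    "amount": [1, 2, 3, 4, 5, 6],
    "amount_scaled": [2, 4, 6, 8, 10, 12],  # perfectly correlated with "amount"
    "hour": [3, 1, 4, 1, 5, 9],
    "flag": [0, 0, 0, 1, 1, 1],
}
labels = [0, 0, 0, 1, 1, 1]  # hypothetical binary type labels

graph = build_graph(features)
components = connected_subgraphs(graph)
selected = screen_nodes(components, features, labels)
```

On this toy data, `amount`, `amount_scaled`, and `flag` fall into one connected subgraph and `hour` forms its own, so the screening keeps `flag` and `hour`. Random sampling or screening by information value could be substituted inside `screen_nodes`; the screened nodes would then serve as the input for training the first model.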

Claims

1. A data processing method, comprising:

2. The method of claim 1, the method further comprising:

acquiring a second text data sample for training a second model and a type label corresponding to the second text data sample, wherein the second model is a type detection model constructed based on a preset machine learning algorithm;

screening feature data contained in the second text data sample based on the trained first model to obtain a screened second text data sample;

and carrying out iterative training on the second model based on the screened second text data sample and the type label corresponding to the second text data sample until the second model converges to obtain a trained second model.

3. The method of claim 2, the method further comprising:

acquiring target text data to be detected, wherein the target text data comprises service data corresponding to an execution target service;

performing type detection processing on the target text data through the trained second model to obtain a prediction label corresponding to the target text data;

and determining whether the target service is at risk based on the prediction label corresponding to the target text data.

4. The method according to claim 3, wherein the performing type detection processing on the target text data through the trained second model to obtain the prediction tag corresponding to the target text data includes:

5. The method of claim 4, wherein the first model is a tree model, and the filtering the feature data included in the target text data based on the trained first model to obtain filtered target text data includes:

6. The method of claim 1, the method further comprising:

determining a prediction label corresponding to the target text data based on the trained first model;

screening the feature data contained in the target text data based on the trained first model to obtain screened feature data;

and determining whether the target service is at risk based on the prediction label corresponding to the target text data and the screened feature data.

7. The method of claim 1, wherein the node screening processing is performed on the plurality of nodes included in the connected subgraph to obtain screened target graph structure data, and the method comprises:

8. The method of claim 7, wherein the screened connected subgraph includes one or more of: nodes obtained by random sampling, nodes obtained by screening based on the information value of each node in the connected subgraph, and nodes obtained by screening based on the correlation between each node in the connected subgraph and the type label corresponding to the first text data sample.

9. The method of claim 1, the first model being a model constructed based on a preset rule learning algorithm and the preset machine learning algorithm, the preset rule learning algorithm comprising a sequential coverage algorithm.

10. A data processing apparatus comprising:

the first acquisition module is used for acquiring a first text data sample for training a first model and a type label corresponding to the first text data sample, wherein the first model is constructed based on a preset machine learning algorithm and is used for carrying out feature screening processing on feature data contained in the text data sample based on model parameters corresponding to the feature data contained in the text data sample;

the graph data construction module is used for constructing target graph structure data corresponding to the first text data sample based on the feature data contained in the first text data sample and the correlation among the feature data;

the first screening module is used for acquiring a connected subgraph contained in the target graph structure data, and carrying out node screening processing on a plurality of nodes contained in the connected subgraph to obtain screened target graph structure data;

and the first training module is used for inputting the screened target graph structure data into the first model to obtain a prediction label corresponding to the first text data sample, and carrying out iterative training on the first model based on the prediction label corresponding to the first text data sample and the type label corresponding to the first text data sample until the first model converges to obtain a trained first model.

11. A data processing apparatus, the data processing apparatus comprising:

a processor; and

a memory arranged to store computer executable instructions that, when executed, cause the processor to: