CN117271984A

CN117271984A - Target object risk identification method and device

Info

Publication number: CN117271984A
Application number: CN202311293998.4A
Authority: CN
Inventors: 欧阳春; 黄一卿; 高健; 管国亮
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2023-10-08
Filing date: 2023-10-08
Publication date: 2023-12-22

Abstract

The specification relates to the technical field of artificial intelligence and discloses a target object risk identification method and device. The method comprises: receiving a risk detection request; in response to the risk detection request, acquiring target object data corresponding to the target object identifier, the target object data comprising first-type attribute feature data and second-type attribute feature data; where the second-type attribute feature data contains missing index data and/or erroneous index data, filling in the missing index data and/or correcting the erroneous index data in the second-type attribute feature data by using a naive Bayes decision algorithm, to obtain supplemented and/or corrected second-type attribute feature data; and performing cluster analysis on the first-type attribute feature data and the supplemented and/or corrected second-type attribute feature data to obtain a risk category corresponding to the target object identifier. The scheme can improve the accuracy of target object risk identification and can adapt to changes in the service environment.

Description

Target object risk identification method and device

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a target object risk identification method and apparatus.

Background

With the development of internet technology, large amounts of data have accumulated in fields such as e-commerce, social networks, finance, medicine, science, and engineering, and the volume is growing exponentially. This data holds rich and valuable knowledge, so extracting meaningful, valuable, and latent information from complex, large-scale data has become particularly important. By learning from and analyzing input data, and then using the discovered patterns to judge and predict the class labels of unknown data, such techniques can be applied to scenarios such as network attack recognition, customer churn prediction, earthquake prediction, risk management, medical diagnosis, and financial product risk identification.

However, some key index data in the input data may be missing or erroneous. As a result, risk identification may be inaccurate and may fail to adapt to changes in the service environment.

In view of the above problems, no effective solution has been proposed at present.

Disclosure of Invention

The embodiments of this specification provide a target object risk identification method and device to solve the problem in prior-art risk identification methods that risk identification is inaccurate because some key index data in the input data may be missing or erroneous.

The embodiment of the specification provides a target object risk identification method, which comprises the following steps:

receiving a risk detection request; the risk detection request carries a target object identifier;

responding to the risk detection request, and acquiring target object data corresponding to the target object identification; the target object data comprises first-type attribute characteristic data and second-type attribute characteristic data; the first type attribute feature data is used for representing service attribute features associated with the target object; the second type attribute feature data is used for representing transaction attribute features associated with the target object;

where the second-type attribute feature data contains missing index data and/or erroneous index data, filling in the missing index data and/or correcting the erroneous index data in the second-type attribute feature data by using a naive Bayes decision algorithm, to obtain supplemented and/or corrected second-type attribute feature data;

and performing cluster analysis on the first type attribute characteristic data and the supplemented and/or corrected second type attribute characteristic data to obtain a risk category corresponding to the target object identifier.

In one embodiment, filling in the missing index data and/or correcting the erroneous index data in the second-type attribute feature data by using a naive Bayes decision algorithm includes:

preprocessing the first type attribute feature data and the second type attribute feature data to obtain preprocessed first type attribute feature data and preprocessed second type attribute feature data;

detecting the second-type attribute characteristic data to determine whether missing index data and/or error index data exist in the second-type attribute characteristic data;

and, where it is determined that the preprocessed second-type attribute feature data contains missing index data and/or erroneous index data, filling in the missing index data and/or correcting the erroneous index data in the second-type attribute feature data by using the naive Bayes decision algorithm.

In one embodiment,

the preprocessed second-type attribute feature data comprises a plurality of second-type attribute features, wherein the plurality of second-type attribute features comprise second-type attribute features with known and correct values and second-type attribute features with unknown or incorrect values;

correspondingly, filling in the missing index data and/or correcting the erroneous index data in the second-type attribute feature data by using the naive Bayes decision algorithm includes:

calculating, by using the naive Bayes decision algorithm, the conditional probability of each candidate value of a second-type attribute feature whose value is unknown or incorrect, given the second-type attribute features whose values are known and correct;

and filling in the missing index data and/or correcting the erroneous index data in the second-type attribute feature data with the candidate value that has the maximum conditional probability.

In one embodiment, performing cluster analysis on the first-type attribute feature data and the supplemented and/or corrected second-type attribute feature data to obtain the risk category corresponding to the target object identifier includes:

performing feature extraction on the first type attribute feature data and the supplemented and/or corrected second type attribute feature data to obtain a target feature vector corresponding to the target object identifier;

calculating the distance between the target feature vector and the feature vector of the clustering center corresponding to each risk category in the multiple risk categories;

and determining the risk category corresponding to the cluster center at the minimum distance from the target feature vector as the risk category corresponding to the target object identifier.

In one embodiment, the method further comprises:

obtaining a feature vector for each object sample in a large number of object samples, the large number of object samples including object samples of known risk categories;

performing cluster analysis on feature vectors corresponding to all object samples in the large number of object samples to obtain a plurality of cluster centers;

and calculating the distance between the feature vector of the object sample of the known risk category and each cluster center in the plurality of cluster centers to determine the risk category corresponding to each cluster center in the plurality of cluster centers.

In one embodiment, performing cluster analysis on the feature vectors corresponding to the object samples in the large number of object samples to obtain a plurality of cluster centers includes:

randomly selecting, from the large number of object samples, the feature vectors of several object samples as initial cluster centers, to obtain a plurality of cluster centers;

repeating the following steps until the feature vectors of the plurality of cluster centers no longer change: calculating the distance between the feature vector of each object sample other than those serving as cluster centers and the feature vector of each of the plurality of cluster centers, and assigning each such object sample to its nearest cluster center to obtain a plurality of clusters; and recalculating the cluster center of each of the plurality of clusters.

alternatively, calculating the mean shift (average displacement) of each object sample in the plurality of object samples;

translating each object sample in the plurality of object samples by its mean shift;

repeating the above steps until the object samples converge, determining object samples that converge to the same point to be object samples of the same cluster, and thereby obtaining a plurality of clusters; and calculating the cluster center of each of the plurality of clusters to obtain a plurality of cluster centers.
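The two clustering procedures described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiments: the function names, the `seed`, `max_iter`, `tol`, and `bandwidth` parameters, the Euclidean distance, and the flat (uniform) window used for the mean shift are all assumptions, since the text does not specify them. For simplicity the first function assigns every sample, including those chosen as initial centers, to its nearest center.

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(samples, k, seed=0, max_iter=100):
    """Random initial centers, nearest-center assignment, repeat until centers stop changing."""
    rng = random.Random(seed)
    centers = [tuple(s) for s in rng.sample(samples, k)]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for s in samples:
            nearest = min(range(k), key=lambda i: euclidean(s, centers[i]))
            clusters[nearest].append(s)
        # An empty cluster keeps its previous center.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

def mean_shift(samples, bandwidth, tol=1e-6, max_iter=200):
    """Shift each sample toward the mean of the samples within `bandwidth` until it stops moving."""
    def shifted(p):
        neighbors = [s for s in samples if euclidean(p, s) <= bandwidth]
        return mean(neighbors) if neighbors else p

    converged = []
    for s in samples:
        p = tuple(s)
        for _ in range(max_iter):
            q = shifted(p)
            done = euclidean(p, q) < tol
            p = q
            if done:
                break
        converged.append(p)

    # Samples whose converged points (nearly) coincide form one cluster.
    modes, clusters = [], []
    for s, p in zip(samples, converged):
        for i, m in enumerate(modes):
            if euclidean(p, m) < bandwidth / 2:
                clusters[i].append(s)
                break
        else:
            modes.append(p)
            clusters.append([s])
    # Cluster center = mean of the samples in each cluster.
    return [mean(c) for c in clusters], clusters

samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
km_centers, km_clusters = kmeans(samples, k=2)
ms_centers, ms_clusters = mean_shift(samples, bandwidth=3.0)
```

On this toy data both procedures separate the three samples near the origin from the three samples near (10, 10). The first requires the number of clusters to be chosen in advance, whereas the second derives it from the window size.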

In one embodiment, the first type of attribute characteristic data includes at least one of:

customer object data, usage channel data, transaction flow data, transaction transparency data, and transaction property data.

In one embodiment, the second type of attribute characteristic data includes at least one of:

customer quantity data, high risk customer quantity data, transaction amount data.

The embodiment of the specification also provides a target object risk identification device, which comprises:

the receiving module is used for receiving the risk detection request; the risk detection request carries a target object identifier;

the acquisition module is used for responding to the risk detection request and acquiring target object data corresponding to the target object identifier; the target object data comprises first-type attribute feature data and second-type attribute feature data; the first-type attribute feature data is used for representing service attribute features associated with the target object; the second-type attribute feature data is used for representing transaction attribute features associated with the target object;

the filling and correction module is used for, where the second-type attribute feature data contains missing index data and/or erroneous index data, filling in the missing index data and/or correcting the erroneous index data in the second-type attribute feature data by using a naive Bayes decision algorithm, to obtain supplemented and/or corrected second-type attribute feature data;

and the cluster analysis module is used for carrying out cluster analysis on the first type attribute characteristic data and the supplemented and/or corrected second type attribute characteristic data to obtain a risk category corresponding to the target object identifier.

The embodiments of the present specification also provide a computer device, including a processor and a memory for storing instructions executable by the processor, where the processor executes the instructions to implement the steps of the target object risk identification method described in any of the embodiments above.

The present description also provides a computer-readable storage medium having stored thereon computer instructions that, when executed, implement the steps of the target object risk identification method described in any of the above embodiments.

In the embodiments of this specification, a target object risk identification method is provided. A server may receive a risk detection request sent by a client and, in response, obtain target object data corresponding to a target object identifier; the target object data may include first-type attribute feature data and second-type attribute feature data of the target object. The server may use a naive Bayes decision algorithm to fill in missing index data and/or correct erroneous index data in the second-type attribute feature data, obtaining supplemented and/or corrected second-type attribute feature data. The server may then perform cluster analysis on the first-type attribute feature data and the supplemented and/or corrected second-type attribute feature data to obtain a risk category corresponding to the target object identifier. In this scheme, the risk of the target object is evaluated from both the first-type and the second-type attribute feature data, so the features drawn from the target object data are more comprehensive and the accuracy of risk identification can be improved. In addition, filling in or correcting the second-type attribute feature data with the naive Bayes decision algorithm helps ensure the completeness and correctness of the data used for risk identification, and, compared with other algorithms, the naive Bayes decision algorithm is relatively efficient and fast. Once complete and correct object data has been obtained, cluster analysis can be performed on it to obtain the risk category of the target object, which can improve the accuracy of target object risk identification, improve transaction security, protect user interests, and improve the user experience.

Drawings

The accompanying drawings are included to provide a further understanding of the specification, and are incorporated in and constitute a part of this specification. In the drawings:

FIG. 1 is a schematic diagram illustrating an application scenario of a target object risk identification method according to an embodiment of the present disclosure;

FIG. 2 shows a flow chart of a target object risk identification method in an embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of an apparatus for implementing a target object risk recognition method in an embodiment of the present disclosure;

FIG. 4 shows a flow chart of a target object risk identification method in an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a target object risk recognition device according to an embodiment of the present disclosure;

FIG. 6 shows a schematic diagram of a computer device in an embodiment of the present description.

Detailed Description

The principles and spirit of the present specification will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable one skilled in the art to better understand and practice the present description, and are not intended to limit the scope of the present description in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Those skilled in the art will appreciate that the embodiments of the present description may be implemented as a system, apparatus, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: complete hardware, complete software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

The embodiment of the specification provides a target object risk identification method. Fig. 1 is a schematic diagram illustrating an application scenario of a target object risk identification method according to an embodiment of the present disclosure. In one scenario example, as shown in fig. 1, the method in the present embodiment may be applied to a server. The server may receive a risk detection request sent by the client. The risk detection request may carry the target object identifier. In response to the risk detection request, the server may obtain target object data corresponding to the target object identification. The target object data may be a first type of attribute feature data and a second type of attribute feature data of the target object.

The first type attribute characteristic data is used for representing service attribute characteristics associated with the target object. In one embodiment, the first type of attribute characteristic data may include at least one of the following: customer object data, usage channel data, transaction flow data, transaction transparency data, and transaction property data. The second type attribute feature data is used to characterize transaction attribute features associated with the target object. In one embodiment, the second type of attribute characteristic data may include at least one of the following: customer quantity data, high risk customer quantity data, transaction amount data.

The server can use a naive Bayes decision algorithm to fill in missing index data and/or correct erroneous index data in the second-type attribute feature data, obtaining supplemented and/or corrected second-type attribute feature data. The server can then perform cluster analysis on the first-type attribute feature data and the supplemented and/or corrected second-type attribute feature data to obtain a risk category corresponding to the target object identifier.

The server may be a single server, a server cluster, or a cloud server; its specific form is not limited in the present application. The client may be a desktop computer, a notebook computer, a mobile phone, a PDA, or the like; the present application imposes no limitation, provided that the client is a device capable of displaying content to a user or business personnel and receiving operation instructions from them.

Fig. 2 shows a flowchart of a target object risk identification method in an embodiment of the present disclosure. Although the present description provides method steps and apparatus structures as shown in the following embodiments or figures, the methods or apparatus may include more or fewer steps or modular units based on conventional or non-inventive effort. For steps or structures that have no logically necessary causal relationship, the execution order of the steps or the module structure of the apparatus is not limited to that shown in the drawings or described in the embodiments of this specification. When applied in an actual device or end product, the described methods or module structures may be executed sequentially or in parallel (for example, in a parallel-processor or multithreaded environment, or even in a distributed processing environment) according to the embodiments or the connections shown in the figures.

Specifically, as shown in fig. 2, the target object risk identification method provided in an embodiment of the present disclosure may include the following steps.

Step S201, receiving a risk detection request; the risk detection request carries a target object identifier.

Step S202, responding to the risk detection request, and acquiring target object data corresponding to the target object identification; the target object data comprises first-type attribute characteristic data and second-type attribute characteristic data; the first type attribute feature data is used for representing service attribute features associated with the target object; the second type attribute feature data is used to characterize transaction attribute features associated with the target object.

The target object risk identification method in this embodiment can be applied to a server. The server may receive a risk detection request, which may carry the target object identifier. The target object identifier may be identification information of the target object whose risk category is to be determined. The target object may be, for example, a financial product.

In response to the risk detection request, the server may obtain target object data corresponding to the target object identification. In one embodiment, the server may send an acquisition request to the database, where the acquisition request may carry the target object identifier. The server may receive the target object data returned by the database. The target object data may include a first type of attribute feature data and a second type of attribute feature data for the target object.

The first-type attribute feature data characterizes business attribute features associated with the target object and supports a qualitative assessment of the target object's risk.

The second-type attribute feature data characterizes transaction attribute features associated with the target object and supports a quantitative characterization of the target object's risk.

In a financial scenario, the target object may be a financial product. Accordingly, in some embodiments of the present disclosure, the first-type attribute feature data may include at least one of the following: customer object data, usage channel data, transaction flow data, transaction transparency data, and transaction property data. The customer object data may indicate whether the target object is a product developed for a particular customer group. The usage channel data may indicate whether the target object supports face-to-face transactions. The transaction flow data may characterize the complexity of first-time business and of non-first-time business for the target object. The transaction transparency data may characterize the transparency of counterparty information and of the parties related to a transaction. The transaction property data may characterize whether transactions of the target object support cross-bank, cross-border, and foreign-currency operations.

In some embodiments of the present description, the second-type attribute feature data may include at least one of the following: customer quantity data, high-risk customer quantity data, transaction amount data. The customer quantity data may include the number of customers of the target object in each of a plurality of time periods. The high-risk customer quantity data may include the number of high-risk customers of the target object in each of a plurality of time periods. The transaction count data may include the number of transactions of the target object in each of a plurality of time periods. The transaction amount data may include the transaction amount of the target object in each of a plurality of time periods.
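As an illustration of how the target object data described above might be held in memory, the following Python sketch models the second-type data as one series per indicator with one entry per time period. All class and field names are hypothetical and are not taken from the specification; `None` is assumed to mark a missing value.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class SecondTypeFeatures:
    """Transaction attributes: one value per time period, None if missing (assumed convention)."""
    customer_counts: List[Optional[int]]
    high_risk_customer_counts: List[Optional[int]]
    transaction_counts: List[Optional[int]]
    transaction_amounts: List[Optional[float]]

    def missing_positions(self) -> Dict[str, List[int]]:
        """Map each series that has gaps to the indices of its missing periods."""
        return {
            name: [i for i, v in enumerate(values) if v is None]
            for name, values in vars(self).items()
            if any(v is None for v in values)
        }

@dataclass
class TargetObjectData:
    """Data returned for one target object identifier (hypothetical layout)."""
    target_object_id: str
    first_type: Dict[str, bool]       # business attributes, e.g. yes/no indicators
    second_type: SecondTypeFeatures   # transaction attributes per time period

data = TargetObjectData(
    target_object_id="product-001",
    first_type={"for_specific_customer_group": True, "supports_face_to_face": False},
    second_type=SecondTypeFeatures(
        customer_counts=[120, 130, None],
        high_risk_customer_counts=[3, 4, 5],
        transaction_counts=[900, 950, 1000],
        transaction_amounts=[1.2e6, None, 1.5e6],
    ),
)
print(data.second_type.missing_positions())
```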

Step S203, when the second-type attribute feature data contains missing index data and/or erroneous index data, filling in the missing index data and/or correcting the erroneous index data in the second-type attribute feature data by using a naive Bayes decision algorithm, to obtain supplemented and/or corrected second-type attribute feature data.

The server may use the naive Bayes decision algorithm to fill in index data that is missing from the second-type attribute feature data and/or correct index data that is erroneous, so as to obtain the supplemented and/or corrected second-type attribute feature data.

In some embodiments of the present disclosure, filling in the missing index data and/or correcting the erroneous index data in the second-type attribute feature data by using a naive Bayes decision algorithm includes: preprocessing the first-type attribute feature data and the second-type attribute feature data to obtain preprocessed first-type attribute feature data and preprocessed second-type attribute feature data; detecting the second-type attribute feature data to determine whether it contains missing index data and/or erroneous index data; and, where it is determined that the preprocessed second-type attribute feature data contains missing index data and/or erroneous index data, filling in the missing index data and/or correcting the erroneous index data in the second-type attribute feature data by using the naive Bayes decision algorithm.

The server may preprocess the first-type attribute feature data and the second-type attribute feature data to obtain preprocessed first-type attribute feature data and preprocessed second-type attribute feature data. Specifically, the server may perform data cleansing on both types of data, for example converting currencies and unifying the units of measure for quantities and transaction amounts. The server may also screen both types of data, checking whether each item of index data is empty so as to separate complete data from data with missing index data. The server may further determine whether any second-type attribute feature data is erroneous based on the second-type attribute feature data for a plurality of time periods. After the missing and/or erroneous second-type attribute feature data has been identified, the naive Bayes decision algorithm may be used to fill in the missing index data and/or correct the erroneous index data.
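The cleansing and screening steps above can be illustrated with a minimal Python sketch. The specification does not state how erroneous values are recognized from the multi-period data, so the rule used here (a negative value, or a value more than `max_ratio` times the median of the observed periods) is purely an assumption, as are the function names and the unit table.

```python
from statistics import median

# Hypothetical unit table: factors that convert an amount to a single base unit.
UNIT_FACTORS = {"yuan": 1.0, "ten_thousand_yuan": 10_000.0}

def to_base_unit(amount, unit):
    """Unify the measurement unit of a transaction amount (assumed units)."""
    return amount * UNIT_FACTORS[unit]

def screen_series(series, max_ratio=10.0):
    """Return (missing_indices, suspect_indices) for one indicator over several periods.

    None marks a missing value. A value is flagged as suspect when it is negative
    or exceeds max_ratio times the median of the observed values (assumed rule).
    """
    missing = [i for i, v in enumerate(series) if v is None]
    observed = [v for v in series if v is not None]
    suspect = []
    if observed:
        mid = median(observed)
        for i, v in enumerate(series):
            if v is None:
                continue
            if v < 0 or (mid > 0 and v > max_ratio * mid):
                suspect.append(i)
    return missing, suspect

customer_counts = [120, None, 130, -5, 125, 5000]
missing, suspect = screen_series(customer_counts)
print(missing, suspect)  # prints [1] [3, 5]
```

Positions reported by such a screening step are the ones that would then be handed to the filling and correction step.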

In some embodiments of the present disclosure, the preprocessed second-type attribute feature data includes a plurality of second-type attribute features, which include second-type attribute features whose values are known and correct and second-type attribute features whose values are unknown or incorrect. Correspondingly, filling in the missing index data and/or correcting the erroneous index data in the second-type attribute feature data by using the naive Bayes decision algorithm includes: calculating, by using the naive Bayes decision algorithm, the conditional probability of each candidate value of a second-type attribute feature whose value is unknown or incorrect, given the second-type attribute features whose values are known and correct; and filling in the missing index data and/or correcting the erroneous index data in the second-type attribute feature data with the candidate value that has the maximum conditional probability.

In this embodiment, the preprocessed second-type attribute feature data includes a plurality of second-type attribute features, some of whose values are known and correct and some of whose values are unknown or incorrect. The naive Bayes algorithm builds on the Bayesian algorithm and adds simplifying assumptions, which makes it comparatively efficient and fast. Specifically, naive Bayes is a method based on conditional probability that assumes the features are conditionally independent of one another. Given a training set, it first learns the joint probability distribution of input and output under this independence assumption, and then, for an input X, finds the output Y that maximizes the posterior probability. The naive Bayes decision algorithm can thus be used to calculate the conditional probability of each candidate value of a second-type attribute feature whose value is unknown or incorrect, given the second-type attribute features whose values are known and correct. The candidate value with the maximum conditional probability is then used to fill in the missing index data and/or correct the erroneous index data in the second-type attribute feature data.
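The maximum-conditional-probability selection described above can be sketched in Python as follows. This is an illustration only: it assumes the second-type features have been discretized into levels such as "low" and "high", that a set of complete historical records is available as training data, and that Laplace smoothing with parameter `alpha` is applied; none of these details, nor the function and field names, come from the specification.

```python
from collections import Counter

def impute_with_naive_bayes(records, target, missing_key, alpha=1.0):
    """Pick the value of target[missing_key] with the highest naive Bayes score.

    records: complete historical records (dicts of discretized feature values).
    target:  record whose missing_key value is unknown or known to be wrong.
    Score(v) = P(v) * product over known features f of P(f = observed | v).
    """
    known = {k: v for k, v in target.items() if k != missing_key and v is not None}
    value_counts = Counter(r[missing_key] for r in records)
    total = sum(value_counts.values())
    best_value, best_score = None, -1.0
    for value, count in value_counts.items():
        rows = [r for r in records if r[missing_key] == value]
        score = count / total  # prior P(value)
        for key, observed in known.items():
            n_levels = len({r[key] for r in records})
            matches = sum(1 for r in rows if r[key] == observed)
            # Laplace-smoothed likelihood P(key = observed | value)
            score *= (matches + alpha) / (count + alpha * n_levels)
        if score > best_score:
            best_value, best_score = value, score
    return best_value

history = [
    {"customers": "high", "high_risk_customers": "high", "amount": "high"},
    {"customers": "high", "high_risk_customers": "high", "amount": "high"},
    {"customers": "low", "high_risk_customers": "low", "amount": "low"},
    {"customers": "low", "high_risk_customers": "low", "amount": "low"},
    {"customers": "high", "high_risk_customers": "low", "amount": "low"},
]
target = {"customers": "high", "high_risk_customers": "high", "amount": None}
print(impute_with_naive_bayes(history, target, "amount"))  # prints "high"
```

In this example the score for "high" is 0.4 × 0.75 × 0.75 = 0.225 and the score for "low" is 0.6 × 0.4 × 0.2 = 0.048, so "high" is chosen. The same function can be applied to a value flagged as erroneous by treating it as unknown.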

Step S204, performing cluster analysis on the first type attribute feature data and the supplemented and/or corrected second type attribute feature data to obtain a risk category corresponding to the target object identifier.

After the second-class attribute feature data is supplemented and/or corrected, cluster analysis can be performed on the first-class attribute feature data and the supplemented and/or corrected second-class attribute feature data to obtain a risk class corresponding to the target object identifier.

In some embodiments of the present disclosure, performing cluster analysis on the first-type attribute feature data and the supplemented and/or corrected second-type attribute feature data to obtain the risk category corresponding to the target object identifier includes: performing feature extraction on the first-type attribute feature data and the supplemented and/or corrected second-type attribute feature data to obtain a target feature vector corresponding to the target object identifier; calculating the distance between the target feature vector and the feature vector of the cluster center corresponding to each of a plurality of risk categories; and determining the risk category corresponding to the cluster center at the minimum distance from the target feature vector as the risk category corresponding to the target object identifier.

In this embodiment, feature extraction may be performed on the first-type attribute feature data and the supplemented and/or corrected second-type attribute feature data of the target object to obtain a target feature vector corresponding to the target object identifier. In one embodiment, the plurality of cluster centers may be cluster center 1, cluster center 2, cluster center 3, cluster center 4, and cluster center 5, corresponding respectively to the high, medium-high, medium, medium-low, and low risk categories. The distance between the target feature vector and the feature vector of each of the five cluster centers may be calculated, and the risk category of the cluster center at the minimum distance from the target feature vector is determined as the risk category corresponding to the target object identifier. For example, if the distance between the target feature vector and the feature vector of cluster center 3 is the smallest, the risk category of the target object may be determined to be the medium risk category.
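The nearest-center rule above can be sketched in a few lines of Python. The sketch assumes that feature extraction has already produced numeric vectors, that the distance is Euclidean, and that there are five categories ordered from high to low risk; the center coordinates and the function name are invented for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_risk_category(target_vector, centers):
    """centers maps a risk category name to the feature vector of its cluster center."""
    return min(centers, key=lambda name: euclidean(target_vector, centers[name]))

# Illustrative cluster centers (made-up coordinates).
centers = {
    "high": (0.9, 0.9),
    "medium-high": (0.7, 0.7),
    "medium": (0.5, 0.5),
    "medium-low": (0.3, 0.3),
    "low": (0.1, 0.1),
}
print(nearest_risk_category((0.52, 0.45), centers))  # prints "medium"
```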

In the above embodiment, the risk of the target object is evaluated from both the first-type and the second-type attribute feature data, so the features drawn from the target object data are more comprehensive and the accuracy of risk identification can be improved. In addition, filling in or correcting the second-type attribute feature data with the naive Bayes decision algorithm helps ensure the completeness and correctness of the data used for risk identification, and, compared with other algorithms, the naive Bayes decision algorithm is relatively efficient and fast. Once complete and correct object data has been obtained, cluster analysis can be performed on it to obtain the risk category of the target object, which can improve the accuracy of target object risk identification, improve transaction security, protect user interests, and improve the user experience.

In some embodiments of the present description, the method may further comprise: obtaining feature vectors corresponding to all object samples in a large number of object samples; the plurality of object samples comprises object samples of known risk categories; performing cluster analysis on feature vectors corresponding to all object samples in the large number of object samples to obtain a plurality of cluster centers; and calculating the distance between the feature vector of the object sample of the known risk category and each cluster center in the plurality of cluster centers to determine the risk category corresponding to each cluster center in the plurality of cluster centers.

Specifically, the feature vector corresponding to the cluster center of each cluster in the plurality of clusters may be predetermined. Feature vectors corresponding to each object sample in a large number of object samples can be obtained, and cluster analysis can be performed on these feature vectors to obtain a plurality of cluster centers. The risk categories of some of the object samples in the large number of object samples are known. The distance between the feature vector of each object sample of a known risk category and each of the plurality of cluster centers may be calculated. For example, the risk category of object sample A is the high risk category, that of object sample B is the medium-high risk category, that of object sample C is the medium risk category, that of object sample D is the medium-low risk category, and that of object sample E is the low risk category. The plurality of cluster centers may be cluster center 1, cluster center 2, cluster center 3, cluster center 4, and cluster center 5. After the distances are calculated, it can be known that object sample A is closest to cluster center 1, object sample B is closest to cluster center 2, object sample C is closest to cluster center 3, object sample D is closest to cluster center 4, and object sample E is closest to cluster center 5. Then cluster center 1 corresponds to the high risk category, cluster center 2 to the medium-high risk category, cluster center 3 to the medium risk category, cluster center 4 to the medium-low risk category, and cluster center 5 to the low risk category.
By the method, the risk categories corresponding to the plurality of clustering centers can be determined.
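A minimal sketch of this labeling step is given below. The function name `label_centers`, the one-dimensional sample values, and the majority vote used when several known samples fall nearest to the same center are assumptions made for illustration; the specification only describes the case where each known sample maps to a distinct center.

```python
import math
from collections import Counter, defaultdict

def label_centers(centers, labeled_samples):
    """Map each cluster-center index to a risk category using known samples."""
    votes = defaultdict(Counter)
    for vector, risk in labeled_samples:
        # Find the center nearest to this sample of known risk category.
        nearest = min(range(len(centers)), key=lambda j: math.dist(vector, centers[j]))
        votes[nearest][risk] += 1
    # Assumption: if several known samples share a center, take the majority category.
    return {j: counts.most_common(1)[0][0] for j, counts in votes.items()}

# Hypothetical centers and samples A to E (illustrative values only).
centers = [[0.9], [0.7], [0.5], [0.3], [0.1]]
samples = [([0.88], "high"), ([0.72], "medium-high"), ([0.49], "medium"),
           ([0.31], "medium-low"), ([0.12], "low")]
labels = label_centers(centers, samples)
```

Centers that attract no known sample receive no label in this sketch; how to handle them is left open here, as it is in the text.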

In some embodiments of the present disclosure, performing cluster analysis on feature vectors corresponding to each object sample in the large number of object samples to obtain a plurality of cluster centers may include: randomly selecting, from the large number of object samples, the feature vectors of a plurality of object samples as initial cluster centers to obtain a plurality of cluster centers; and repeating the following steps until the feature vectors corresponding to the plurality of cluster centers no longer change: calculating the distance between the feature vector of each object sample, other than the object samples corresponding to the plurality of cluster centers, and the feature vector of each of the plurality of cluster centers, and assigning each object sample to its closest cluster center to obtain a plurality of clusters; and calculating the cluster center of each of the plurality of clusters.

In this embodiment, a K-Means algorithm may be used to perform cluster analysis on the feature vectors corresponding to each object sample in a large number of object samples, so as to obtain a plurality of cluster centers. Specifically, the feature vectors of a plurality of object samples may be randomly selected from the large number of object samples as initial cluster centers. The distance between the feature vector of each remaining object sample and the feature vector of each cluster center is then calculated, and each object sample is assigned to its closest cluster center to obtain a plurality of clusters, after which the cluster center of each cluster is calculated. Each time a sample is assigned, the cluster center of that cluster is recalculated based on the objects currently in the cluster. This process is repeated until no objects are reassigned to different clusters and no cluster center changes.

In some embodiments of the present disclosure, performing cluster analysis on feature vectors corresponding to each object sample in the plurality of object samples to obtain a plurality of cluster centers may include: calculating the average displacement of each object sample in the plurality of object samples; translating each object sample in the plurality of object samples; repeating the steps until the samples are converged, determining the object samples converged to the same point as object samples of the same cluster, and obtaining a plurality of clusters; and calculating a cluster center corresponding to each cluster in the clusters to obtain a plurality of cluster centers.

In this embodiment, a Mean Shift algorithm may be used to perform cluster analysis on the feature vectors corresponding to each object sample in a large number of object samples, so as to obtain a plurality of cluster centers. The average displacement (mean shift vector) of each object sample may be calculated, and each object sample may be translated accordingly. These steps are repeated until the samples converge; object samples that converge to the same point are determined to belong to the same cluster, which yields a plurality of clusters. The cluster center corresponding to each cluster is then calculated to obtain a plurality of cluster centers.
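The Mean Shift procedure above might be sketched as follows. This is a simplified illustration under stated assumptions: a flat kernel with a fixed `bandwidth` is used, samples are merged into one cluster when their converged positions lie within half a bandwidth of each other, and the function name and data are hypothetical. None of these choices is specified in the text.

```python
import math

def mean_shift(samples, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift; returns (cluster_centers, assignment)."""
    shifted = [list(s) for s in samples]
    for _ in range(max_iter):
        largest_move = 0.0
        for idx, p in enumerate(shifted):
            # Neighbours of the current position among the original samples.
            near = [s for s in samples if math.dist(p, s) <= bandwidth]
            if not near:  # defensive guard; leave the point where it is
                continue
            mean = [sum(axis) / len(near) for axis in zip(*near)]
            largest_move = max(largest_move, math.dist(p, mean))
            shifted[idx] = mean  # translate the sample by its mean-shift vector
        if largest_move < tol:
            break
    # Samples that converged to (almost) the same point share a cluster.
    modes, assignment = [], []
    for p in shifted:
        for j, mode in enumerate(modes):
            if math.dist(p, mode) < bandwidth / 2:
                assignment.append(j)
                break
        else:
            modes.append(p)
            assignment.append(len(modes) - 1)
    # Cluster center = mean of the original samples in each cluster.
    centers = []
    for j in range(len(modes)):
        members = [s for s, a in zip(samples, assignment) if a == j]
        centers.append([sum(axis) / len(members) for axis in zip(*members)])
    return centers, assignment

samples = [[1.0], [1.2], [0.8], [5.0], [5.1], [4.9]]
centers, assignment = mean_shift(samples, bandwidth=1.0)
```

Unlike K-Means, this approach does not need the number of clusters in advance; the bandwidth determines how many clusters emerge.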

In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. Specific reference may be made to the foregoing description of related embodiments of the related process, which is not described herein in detail.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The above method is described below in connection with a specific embodiment, however, it should be noted that this specific embodiment is only for better illustrating the present specification and should not be construed as unduly limiting the present specification.

This embodiment provides a product classification grading method. Referring to fig. 3, a schematic structural diagram of an apparatus for implementing the method in this embodiment is shown. As shown in fig. 3, the apparatus may include a qualitative assessment data acquisition module 101, a quantitative index data acquisition module 102, a basic data processing module 103, a product classification module 104, and a product classification data output 105.

The qualitative assessment data acquisition module 101 may generate a qualitative data table of products based on the product client object, usage channel, transaction flow, transaction transparency, transaction properties, etc., as shown in table 1.

TABLE 1

The quantitative index data acquisition module 102 may generate a product quantitative data table composed of four indexes over the time dimension, namely the number of clients, the number of high-risk clients, the number of transactions, and the transaction amount, as shown in table 2.

TABLE 2

Index	Time period 1	Time period 2	Time period 3
Number of clients	a1	a2	a3
Number of high-risk clients	b1	b2	b3
Number of transactions	c1	c2	c3
Transaction amount	d1	d2	d3

The basic data processing module 103 may perform data cleaning on the raw data acquired by the qualitative assessment data acquisition module 101 and the quantitative index data acquisition module 102, unifying the counting units (single units or tens of thousands), the transaction amount units, and the currency conversion. It may then predict missing key product index data through a naive Bayes decision algorithm, filling in missing data and correcting erroneous data, so as to ensure the integrity and correctness of the data passed to the product classification module. Fig. 4 shows a flowchart of the product risk classification method in this embodiment.
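The unit unification part of data cleaning could look like the sketch below. The record layout, the field names, the `RATES_TO_BASE` table, and the exchange rate value are all hypothetical placeholders chosen for illustration; real rates and field names would come from the institution's own data sources.

```python
# Hypothetical conversion rates to a base currency (placeholder values only).
RATES_TO_BASE = {"CNY": 1.0, "USD": 7.0}

def normalize_record(record):
    """Convert counts to single units and amounts to the base currency."""
    # Counts may be reported in single units or in tens of thousands.
    count_factor = 10_000 if record["count_unit"] == "ten_thousand" else 1
    return {
        "client_count": record["client_count"] * count_factor,
        "amount_base": record["amount"] * RATES_TO_BASE[record["currency"]],
    }

raw = {"client_count": 3, "count_unit": "ten_thousand", "amount": 100.0, "currency": "USD"}
clean = normalize_record(raw)
```

Normalizing before imputation and clustering keeps indexes from different products on a comparable scale.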

Logistic regression is a machine learning algorithm widely used in industry to estimate the likelihood of an event and to make classification predictions. Regression is in essence the estimation of the unknown parameters of a known formula, and logistic regression can be viewed as a linear regression normalized by the logistic function. However, it tends to underfit data with complex features, so its performance is only moderate; it is usually applied to relatively simple natural language processing tasks in which the features are clear and few in number. Compared with logistic regression, naive Bayes has more stable classification efficiency and is faster for large-scale training and queries.

Bayesian algorithms were developed based on probabilistic models of Thomas Bayes. The bayesian algorithm calculates a probability model of unknown probability distribution based on the acquired current observation data, and updates the existing result according to the new observation data. The naive Bayesian algorithm is an extension of the Bayesian algorithm, and some assumptions are made on the basis of the Bayesian algorithm, so that the efficiency is higher and the speed is faster.

Naive Bayes is a classification method based on the principle of conditional probability and the assumption that the features are conditionally independent. Given a training set, it first learns the joint probability distribution from input to output under the assumption that the feature values are mutually independent, and then, for an input X, finds the output Y that maximizes the posterior probability.

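General form of the Bayesian formula:

$$P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A), \qquad P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$

where $P(A, B)$ is the joint probability, $P(A \mid B)$ is the conditional probability, and $P(A)$ is the marginal probability.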

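Applying the conditional probability to the Bayesian criterion gives:

$$P(c_i \mid x, y) = \frac{P(x, y \mid c_i)\,P(c_i)}{P(x, y)}$$

If $P(c_1 \mid x, y) > P(c_2 \mid x, y)$, the sample belongs to category $c_1$;

if $P(c_2 \mid x, y) > P(c_1 \mid x, y)$, the sample belongs to category $c_2$.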
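The decision rule above can be illustrated numerically. In the sketch below, the likelihood and prior values are made up for illustration, and the helper name `posterior` is hypothetical; it simply applies Bayes' rule and normalizes by the total evidence.

```python
def posterior(likelihoods, priors):
    """Apply Bayes' rule: P(c | x, y) is proportional to P(x, y | c) * P(c)."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())  # P(x, y), identical for every class
    return {c: joint[c] / evidence for c in joint}

# Illustrative values: P(x, y | c1) = 0.2, P(x, y | c2) = 0.6, P(c1) = 0.7, P(c2) = 0.3.
post = posterior({"c1": 0.2, "c2": 0.6}, {"c1": 0.7, "c2": 0.3})
decision = max(post, key=post.get)
```

Here the unnormalized scores are 0.14 for c1 and 0.18 for c2, so the sample is assigned to c2 even though c1 has the larger prior.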

As shown in fig. 4, the method in this particular embodiment may include the following.

(1) Data input: the data from the qualitative assessment data acquisition module 101 and the quantitative index data acquisition module 102 serve as the raw data input to module 103.

(2) Data cleaning: currency conversion, unification of counting units (tens of thousands), unification of transaction amount units, and the like.

(3) Data screening: complete data and data with missing indexes are separated by checking whether the index data is empty.

(4) Data prediction: the complete transaction counterparty information data is taken as training sample data to construct a training sample data set.

From the above formula, the probability of event B given A can be obtained from the probability of event A, the probability of event B, and the probability of event A given B. All three probabilities on the right side of the equal sign can be estimated from the training set, so the problem is converted into computing P(A|B) and P(B). In practical applications, however, a feature may take a very large number of values, and if the joint probability distribution is used directly, it is not practical to estimate a probability for every combination of the values of every feature.

To solve this problem, the conditional probability distribution may be subjected to the assumption of conditional independence. For independent events, P(A, B) = P(A)P(B), so the original conditional probability can be converted into a product of the conditional probabilities of multiple independent events, where n is the number of features:

$$P(X = x \mid Y = c_k) = \prod_{j=1}^{n} P\big(X^{(j)} = x^{(j)} \mid Y = c_k\big)$$

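Under this assumption, the posterior probability is calculated using Bayes' theorem:

$$P(Y = c_k \mid X = x) = \frac{P(Y = c_k) \prod_{j=1}^{n} P\big(X^{(j)} = x^{(j)} \mid Y = c_k\big)}{\sum_{k'} P(Y = c_{k'}) \prod_{j=1}^{n} P\big(X^{(j)} = x^{(j)} \mid Y = c_{k'}\big)}$$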

Because x is the feature vector of the new input instance, the denominator is the same whichever value Y takes. The class with the highest posterior probability is output as the class of x, and the common denominator can be ignored when comparing. The naive Bayes formula therefore reduces to:

$$y = \arg\max_{c_k} \; P(Y = c_k) \prod_{j=1}^{n} P\big(X^{(j)} = x^{(j)} \mid Y = c_k\big)$$

The prior probability $P(Y = c_k)$ and the conditional probability $P(X^{(j)} = x^{(j)} \mid Y = c_k)$ can be obtained from the training set. The above expression computes, for a given feature vector, the probability that the instance belongs to each class, and the class with the maximum probability is taken as the class of the instance. The problem is thus converted into computing the prior probability and the conditional probability. The prior probability is calculated as:

$$P(Y = c_k) = \frac{1}{N} \sum_{i=1}^{N} I(y_i = c_k), \qquad k = 1, 2, \ldots, K$$

where N is the total number of instances in the training set and I is the indicator function, which takes 1 if the condition in brackets is satisfied and 0 otherwise.

The index data corresponding to the maximum probability value obtained by the method derived above is taken as the predicted value and used to supplement the missing index data.

(5) Data filling: the missing index data is filled in.
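Steps (3) to (5) above can be sketched with a small counting-based naive Bayes. This is an illustrative sketch, not the patented implementation: it assumes the index values have already been discretized into categorical bands such as "low" and "high", it uses plain frequency estimates without smoothing, and the function names and field names are hypothetical.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(rows, target):
    """Estimate P(Y=c) and the counts behind P(X_j=x | Y=c) from complete rows."""
    class_counts = Counter(row[target] for row in rows)
    priors = {c: cnt / len(rows) for c, cnt in class_counts.items()}
    cond = defaultdict(Counter)  # (feature, class) -> Counter of feature values
    for row in rows:
        for feature, value in row.items():
            if feature != target:
                cond[(feature, row[target])][value] += 1
    return priors, cond, class_counts

def predict_missing(row, target, model):
    """Return the value of `target` with the largest (unnormalized) posterior."""
    priors, cond, class_counts = model
    scores = {}
    for c, prior in priors.items():
        score = prior
        for feature, value in row.items():
            if feature != target:
                score *= cond[(feature, c)][value] / class_counts[c]
        scores[c] = score
    return max(scores, key=scores.get)

def impute(rows, target):
    """Screen complete rows, train on them, and fill rows whose target is None."""
    complete = [r for r in rows if r[target] is not None]
    model = fit_naive_bayes(complete, target)
    return [r if r[target] is not None
            else {**r, target: predict_missing(r, target, model)} for r in rows]

rows = [
    {"clients": "high", "transactions": "high", "amount": "high"},
    {"clients": "high", "transactions": "high", "amount": "high"},
    {"clients": "low", "transactions": "low", "amount": "low"},
    {"clients": "low", "transactions": "high", "amount": "low"},
    {"clients": "high", "transactions": "high", "amount": None},
]
filled = impute(rows, "amount")
```

Because no smoothing is applied, a feature value never seen with a class drives that class's score to zero; adding Laplace smoothing would be a common refinement, though the text does not mention it.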

The product classification module 104 obtains complete and valid product data to be classified from the basic data processing module 103 and uses the K-Means clustering algorithm to divide the products into 5 categories: low risk, medium-low risk, medium risk, medium-high risk, and high risk.

The K-Means algorithm is an iterative cluster analysis algorithm. It first selects K objects as initial cluster centers, then calculates the distance between each object and each cluster center, and assigns each object to the cluster center closest to it. Each time a sample is assigned, the cluster center of that cluster is recalculated based on the objects currently in the cluster. This process is repeated until no objects are reassigned to different clusters and no cluster center changes.

Algorithm steps:

The input is a sample set $D = \{x_1, x_2, \ldots, x_m\}$, the number of clusters $k$, and the maximum number of iterations $N$.

The output is the cluster partition $C = \{C_1, C_2, \ldots, C_k\}$.

(1) Randomly select $k$ samples from the data set $D$ as the initial $k$ centroid vectors $\{\mu_1, \mu_2, \ldots, \mu_k\}$.

(2) For $n = 1, 2, \ldots, N$:

a) Initialize the cluster partition $C_t = \varnothing$, $t = 1, 2, \ldots, k$.

b) For $i = 1, 2, \ldots, m$, calculate the distance $d_{ij}$ between sample $x_i$ and each centroid vector $\mu_j$ ($j = 1, 2, \ldots, k$), and mark $x_i$ with the category $\lambda_i$ corresponding to the smallest $d_{ij}$. Then update $C_{\lambda_i} = C_{\lambda_i} \cup \{x_i\}$.

c) For $j = 1, 2, \ldots, k$, recalculate the new centroid from all sample points in $C_j$: $\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$.

d) If none of the $k$ centroid vectors has changed, go to step (3).

(3) Output the cluster partition $C = \{C_1, C_2, \ldots, C_k\}$.
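The algorithm steps above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: Euclidean distance is used for the distance between a sample and a centroid, a fixed `seed` makes the random initialization repeatable, and an empty cluster keeps its previous centroid. The function name `k_means` and the sample data are hypothetical.

```python
import math
import random

def k_means(samples, k, max_iter=100, seed=0):
    """Return (centroids, clusters) following steps (1) to (3)."""
    rng = random.Random(seed)
    # Step (1): pick k samples at random as the initial centroid vectors.
    centroids = [list(s) for s in rng.sample(samples, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):  # Step (2): at most max_iter iterations
        clusters = [[] for _ in range(k)]  # a) reset the partition
        for x in samples:  # b) assign each sample to its nearest centroid
            nearest = min(range(k), key=lambda j: math.dist(x, centroids[j]))
            clusters[nearest].append(x)
        # c) recompute each centroid as the mean of its members
        #    (assumption: an empty cluster keeps its old centroid).
        new_centroids = [
            [sum(axis) / len(members) for axis in zip(*members)] if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # d) stop when no centroid changed
            break
        centroids = new_centroids
    return centroids, clusters  # Step (3): output the partition

samples = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [5.0, 5.0], [5.2, 4.8], [4.8, 5.2]]
centroids, clusters = k_means(samples, k=2)
```

In the product classification setting described in the text, `k` would be 5, one cluster per risk category; the result can depend on the random initialization, so running with several seeds is a common precaution.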

The method in this embodiment can address the problems of missing key product index data and inaccurate product classification grading in existing methods, improve the integrity and accuracy of financial product data, and make anti-money laundering work more targeted and effective. Through the qualitative assessment data acquisition module and the quantitative index data acquisition module, the various data of financial products can be stored effectively for a longer time without relying on manual storage, which reduces the investment of human resources and improves working efficiency. Compared with a product classification method based on a rule engine, the method depends less on manually preset rules and can better adapt to changes in the service environment. Products are classified by money laundering risk more accurately, which guides financial institutions to apply the "risk-based" concept more reasonably and adopt differentiated risk prevention measures, so that the money laundering risks of certain categories of products can be effectively prevented and avoided.

Based on the same inventive concept, the embodiments of the present disclosure also provide a target object risk identification apparatus, as described in the following embodiments. Since the principle by which the target object risk identification apparatus solves the problem is similar to that of the target object risk identification method, the implementation of the apparatus can refer to the implementation of the method, and repeated details are omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated. Fig. 5 is a block diagram of a target object risk identification apparatus according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus includes a receiving module 501, an obtaining module 502, a patch correction module 503, and a cluster analysis module 504, which are described below.

The receiving module 501 is configured to receive a risk detection request; the risk detection request carries a target object identifier.

The obtaining module 502 is configured to obtain, in response to the risk detection request, target object data corresponding to the target object identifier; the target object data comprises first-type attribute characteristic data and second-type attribute characteristic data; the first type attribute feature data is used for representing service attribute features associated with the target object; the second type attribute feature data is used to characterize transaction attribute features associated with the target object.

The patch correction module 503 is configured to, when the second type attribute feature data has missing index data and/or erroneous index data, fill in the missing index data and/or correct the erroneous index data in the second type attribute feature data by using a naive Bayes decision algorithm, so as to obtain the supplemented and/or corrected second type attribute feature data.

The cluster analysis module 504 is configured to perform cluster analysis on the first-type attribute feature data and the supplemented and/or corrected second-type attribute feature data to obtain a risk category corresponding to the target object identifier.

In some embodiments of the present description, the patch correction module is specifically configured to: preprocess the first type attribute feature data and the second type attribute feature data to obtain preprocessed first type attribute feature data and preprocessed second type attribute feature data; detect the preprocessed second-type attribute feature data to determine whether missing index data and/or erroneous index data exist in it; and, when it is determined that missing index data and/or erroneous index data exist in the preprocessed second-type attribute feature data, fill in the missing index data and/or correct the erroneous index data in the second-type attribute feature data by using a naive Bayesian decision algorithm.

In some embodiments of the present disclosure, the preprocessed second-type attribute feature data includes a plurality of second-type attribute features, where the plurality of second-type attribute features includes second-type attribute features whose values are known and correct and a second-type attribute feature whose value is unknown or incorrect; correspondingly, the patch correction module is specifically configured to: calculate, by using a naive Bayes decision algorithm, the conditional probabilities of the second-type attribute feature whose value is unknown or incorrect taking different values, given the second-type attribute features whose values are known and correct; and fill in the missing index data and/or correct the erroneous index data in the second-type attribute feature data based on the value with the maximum conditional probability.

In some embodiments of the present disclosure, the apparatus further includes a clustering module, where the clustering module is specifically configured to: obtaining feature vectors corresponding to all object samples in a large number of object samples; the plurality of object samples comprises object samples of known risk categories; performing cluster analysis on feature vectors corresponding to all object samples in the large number of object samples to obtain a plurality of cluster centers; and calculating the distance between the feature vector of the object sample of the known risk category and each cluster center in the plurality of cluster centers to determine the risk category corresponding to each cluster center in the plurality of cluster centers.

In some embodiments of the present disclosure, performing cluster analysis on feature vectors corresponding to each object sample in the large number of object samples to obtain a plurality of cluster centers includes: randomly selecting, from the large number of object samples, the feature vectors of a plurality of object samples as initial cluster centers to obtain a plurality of cluster centers; and repeating the following steps until the feature vectors corresponding to the plurality of cluster centers no longer change: calculating the distance between the feature vector of each object sample, other than the object samples corresponding to the plurality of cluster centers, and the feature vector of each of the plurality of cluster centers, and assigning each object sample to its closest cluster center to obtain a plurality of clusters; and calculating the cluster center of each of the plurality of clusters.

In some embodiments of the present disclosure, performing cluster analysis on feature vectors corresponding to each object sample in the plurality of object samples to obtain a plurality of cluster centers, including: calculating the average displacement of each object sample in the plurality of object samples; translating each object sample in the plurality of object samples; repeating the steps until the samples are converged, determining the object samples converged to the same point as object samples of the same cluster, and obtaining a plurality of clusters; and calculating a cluster center corresponding to each cluster in the clusters to obtain a plurality of cluster centers.

In some embodiments of the present description, the first type of attribute characteristic data includes at least one of: customer object data, usage channel data, transaction flow data, transaction transparency data, and transaction property data.

In some embodiments of the present description, the second type of attribute characteristic data includes at least one of: customer quantity data, high risk customer quantity data, transaction amount data.

From the above description, it can be seen that the embodiments of the present specification achieve the following technical effects. The risk of the target object is evaluated from both the first type attribute feature data and the second type attribute feature data, so the features contained in the target object data are more comprehensive and the accuracy of risk identification can be improved. In addition, the second type attribute feature data in the target object data is filled or corrected by using a naive Bayesian decision algorithm, which helps ensure the integrity and correctness of the data used in product classification. Compared with other algorithms, the naive Bayesian decision algorithm is more efficient and faster. Once complete and correct product data is obtained, cluster analysis can be performed on it to obtain the risk category corresponding to the target object, which can improve the accuracy of target object risk identification, improve the security of product transactions, protect user rights and interests, and improve user experience.

The embodiment of the present disclosure further provides a computer device. Referring to fig. 6, which is a schematic structural diagram of a computer device based on the target object risk identification method provided by the embodiment of the present disclosure, the computer device may specifically include an input device 61, a memory 62, and a processor 63. The memory 62 is configured to store processor-executable instructions. The processor 63, when executing the instructions, implements the steps of the target object risk identification method described in any of the embodiments above.

In this embodiment, the input device may specifically be one of the main apparatuses for exchanging information between the user and the computer system. The input device may include a keyboard, mouse, camera, scanner, light pen, handwriting input board, voice input device, etc.; the input device is used to input raw data, and the programs that process the data, into the computer. The input device may also acquire and receive data transmitted from other modules, units, and devices. The processor may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, an embedded microcontroller, and the like. The memory may in particular be a memory device for storing information in modern information technology. The memory may comprise a plurality of levels: in a digital system, anything that can store binary data may be a memory; in an integrated circuit, a circuit with a storage function but without a physical form is also called a memory, such as a RAM or a FIFO; in a system, a storage device in physical form is also called a memory, such as a memory bank or a TF card.

In this embodiment, the specific functions and effects of the computer device may be explained in comparison with other embodiments, and will not be described herein.

There is further provided in an embodiment of the present specification a computer storage medium based on a target object risk identification method, the computer storage medium storing computer program instructions which, when executed, implement the steps of the target object risk identification method in any of the embodiments described above.

In the present embodiment, the storage medium includes, but is not limited to, a random access Memory (Random Access Memory, RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.

In this embodiment, the functions and effects of the program instructions stored in the computer storage medium may be explained in comparison with other embodiments, and are not described herein.

It will be apparent to those skilled in the art that the modules or steps of the embodiments described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may alternatively be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module. Thus, embodiments of the present specification are not limited to any specific combination of hardware and software.

It is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and many applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of the disclosure should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the embodiments of the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the protection scope of the present specification.

Claims

1. A target object risk identification method, comprising:

2. The target object risk identification method according to claim 1, wherein correcting the index data of the missing index data and/or the error in the second type attribute feature data by using a naive bayes decision algorithm includes:

3. The target object risk identification method according to claim 2, wherein the preprocessed second-class attribute feature data includes a plurality of second-class attribute features including a second-class attribute feature whose value is known and correct and a second-class attribute feature whose value is unknown or incorrect;

4. The target object risk identification method according to claim 1, wherein performing cluster analysis on the first type attribute feature data and the supplemented and/or corrected second type attribute feature data to obtain a risk category corresponding to the target object identifier, includes:

5. The target object risk identification method of claim 4, further comprising:

6. The target object risk identification method according to claim 5, wherein performing cluster analysis on feature vectors corresponding to each object sample in the plurality of object samples to obtain a plurality of cluster centers, comprises:

7. The target object risk identification method according to claim 5, wherein performing cluster analysis on feature vectors corresponding to each object sample in the plurality of object samples to obtain a plurality of cluster centers, comprises:

translating each object sample in the plurality of object samples;

8. The target object risk identification method of claim 1, wherein the first type of attribute feature data comprises at least one of:

9. The target object risk identification method of claim 1, wherein the second type of attribute feature data comprises at least one of:

10. A target object risk recognition apparatus, comprising:

11. A computer device comprising a processor and a memory for storing processor-executable instructions which when executed by the processor implement the steps of the method of any one of claims 1 to 9.

12. A computer readable storage medium having stored thereon computer instructions, which when executed by a processor, implement the steps of the method of any of claims 1 to 9.