WO2021120845A1

WO2021120845A1 - Homogeneous risk unit feature set generation method, apparatus and device, and medium

Info

Publication number: WO2021120845A1
Application number: PCT/CN2020/123373
Authority: WO
Inventors: 张慧南; 张向阳; 周杭; 沈磊
Original assignee: 支付宝(杭州)信息技术有限公司
Priority date: 2019-12-19
Filing date: 2020-10-23
Publication date: 2021-06-24
Also published as: CN111126476A

Abstract

Disclosed are a homogeneous risk unit feature set generation method and apparatus. A particular embodiment of the method comprises: for a risk unit feature in a target risk unit feature set, inputting the risk unit feature into a pre-trained risk feature decision tree, and determining a leaf node into which the risk unit feature is classified; for a leaf node, in the risk feature decision tree, of a target risk category, constructing a similarity network corresponding to the leaf node; and for each constructed similarity network, generating a community of the similarity network by means of a preset community detection algorithm, and generating, by means of risk unit features corresponding to all nodes in each generated community, a homogeneous risk unit feature set corresponding to the community. By means of the embodiment, the generation of a homogeneous risk unit feature set is realized.

Description

Method, device, equipment and medium for generating feature set of homogeneous risk units

Technical field

This application relates to the field of computer technology, and in particular to a method, device, equipment and medium for generating a feature set of homogeneous risk units.

Background

Currently, financial systems (for example, banking, securities, insurance, and virtual currency) and related industries usually deploy different monitoring rules across multiple dimensions to guard against money laundering. These rules periodically monitor individual risk units, and each monitoring period may generate a large amount of risk warning information.

Summary of the invention

In view of this, the embodiments of this specification provide a method, device, equipment, and medium for generating a feature set of homogeneous risk units, to solve the problem that only a single risk unit can be monitored at a time and risk units that share a homogeneous risk cannot be identified together.

The embodiments of this specification adopt the following technical solutions.

The embodiments of this specification provide a method for generating a homogeneous risk unit feature set, including: for a risk unit feature in a target risk unit feature set, inputting the risk unit feature into a pre-trained risk feature decision tree and determining the leaf node into which the risk unit feature is divided, wherein each leaf node of the risk feature decision tree corresponds to a risk category; for a leaf node of the target risk category in the risk feature decision tree, constructing a similarity network corresponding to the leaf node, wherein each node of the constructed similarity network corresponds to one of the risk unit features divided into the leaf node, the similarity between any two nodes in the constructed similarity network is positively correlated with a first similarity between the two risk unit features corresponding to the two nodes, and the first similarity is the similarity between the two risk unit features with respect to the path vector corresponding to the leaf node; and for each constructed similarity network, generating the communities of the similarity network by means of a preset community discovery algorithm, and, for each generated community, generating a homogeneous risk unit feature set corresponding to the community from the risk unit features corresponding to the nodes in that community.

The embodiments of this specification also provide a device for generating a homogeneous risk unit feature set, including: a risk unit feature division module, configured to, for a risk unit feature in a target risk unit feature set, input the risk unit feature into a pre-trained risk feature decision tree and determine the leaf node into which the risk unit feature is divided, wherein each leaf node of the risk feature decision tree corresponds to a risk category; a similarity network construction module, configured to, for a leaf node of the target risk category in the risk feature decision tree, construct a similarity network corresponding to the leaf node, wherein each node of the constructed similarity network corresponds to one of the risk unit features divided into the leaf node, the similarity between any two nodes in the constructed similarity network is positively correlated with a first similarity between the two risk unit features corresponding to the two nodes, and the first similarity is the similarity between the two risk unit features with respect to the path vector corresponding to the leaf node; and a homogeneous risk generation module, configured to, for each constructed similarity network, generate the communities of the similarity network by means of a preset community discovery algorithm, and, for each generated community, generate a homogeneous risk unit feature set corresponding to the community from the risk unit features corresponding to the nodes in that community.

The embodiments of this specification also provide equipment for generating a homogeneous risk unit feature set, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to: for a risk unit feature in a target risk unit feature set, input the risk unit feature into a pre-trained risk feature decision tree and determine the leaf node into which the risk unit feature is divided, wherein each leaf node of the risk feature decision tree corresponds to a risk category; for a leaf node of the target risk category in the risk feature decision tree, construct a similarity network corresponding to the leaf node, wherein each node of the constructed similarity network corresponds to one of the risk unit features divided into the leaf node, the similarity between any two nodes in the constructed similarity network is positively correlated with a first similarity between the two risk unit features corresponding to the two nodes, and the first similarity is the similarity between the two risk unit features with respect to the path vector corresponding to the leaf node; and for each constructed similarity network, generate the communities of the similarity network by means of a preset community discovery algorithm, and, for each generated community, generate a homogeneous risk unit feature set corresponding to the community from the risk unit features corresponding to the nodes in that community.

The embodiments of this specification also provide a computer-readable storage medium storing computer-executable instructions, wherein, when the computer-executable instructions are executed by a processor, the following steps are implemented: for a risk unit feature in a target risk unit feature set, inputting the risk unit feature into a pre-trained risk feature decision tree and determining the leaf node into which the risk unit feature is divided, wherein each leaf node of the risk feature decision tree corresponds to a risk category; for a leaf node of the target risk category in the risk feature decision tree, constructing a similarity network corresponding to the leaf node, wherein each node of the constructed similarity network corresponds to one of the risk unit features divided into the leaf node, the similarity between any two nodes in the constructed similarity network is positively correlated with a first similarity between the two risk unit features corresponding to the two nodes, and the first similarity is the similarity between the two risk unit features with respect to the path vector corresponding to the leaf node; and for each constructed similarity network, generating the communities of the similarity network by means of a preset community discovery algorithm, and, for each generated community, generating a homogeneous risk unit feature set corresponding to the community from the risk unit features corresponding to the nodes in that community.

The at least one technical solution adopted in the embodiments of this specification can achieve beneficial effects including, but not limited to, the following.

First, the leaf node of each target risk category in the risk feature decision tree has a specific risk meaning. Whereas the prior art only identifies whether a risk unit belongs to the target risk category, this solution can also indicate which specific kind of target risk each risk unit feature exhibits, so the subsequently identified homogeneous risk units carry a clear risk meaning.

Second, because the leaf node of each target risk category has a specific risk meaning, risk unit features from different batches that are divided into the same leaf node can be judged for homogeneity using the same combination of features, that is, the same set of criteria. This makes it possible to track how that type of homogeneous risk changes over time within the target risk.

Third, the risk feature decision tree distributes the feature components of all risk unit features across different feature branch paths, so the leaf nodes of different target risk categories have different feature combinations, that is, different "scales" for measuring homogeneity. This automatically differentiates the types of homogeneous risk features.

Fourth, there is no need to determine in advance how many kinds of homogeneous risk features exist under a leaf node. Instead, the community discovery algorithm determines them automatically from the risk unit features that a processed batch of the target risk unit feature set places in that leaf node, which improves flexibility in determining homogeneous risk features.

Fifth, because risk unit features are processed in batches rather than individually, the number of risk units exhibiting each kind of target risk in a processed batch can be obtained. These counts can be compared longitudinally across the risk unit feature sets of different batches to reveal the trend of risk changes, so that risk control measures can be formulated in a timely manner.

Description of the drawings

To describe the technical solutions in the embodiments of this specification more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below illustrate only some of the embodiments of this specification; those of ordinary skill in the art can obtain other drawings from them without creative work.

Fig. 1 is a flowchart of a method for generating a feature set of homogeneous risk units provided by the first embodiment of this specification.

Fig. 2 is a flowchart of the training steps provided in the second embodiment of this specification.

Fig. 3 is a schematic diagram of the risk feature decision tree provided by the third embodiment of this specification.

Fig. 4 is a schematic diagram of a community provided by the fourth embodiment of this specification.

Fig. 5 is a flowchart of another method for generating a feature set of homogeneous risk units provided by the fifth embodiment of the present specification.

Fig. 6 is a schematic structural diagram of a device for generating a feature set of homogeneous risk units provided by the sixth embodiment of the present specification.

Detailed description

At present, financial systems and related industries (for example, banking, securities, insurance, and virtual currency) usually deploy different monitoring rules across multiple dimensions to guard against violations such as money laundering. These rules monitor individual risk units (for example, users, mobile phone numbers, vehicles, or real estate), and each monitoring period may generate a large amount of risk warning information.

However, the current risk monitoring may have the following problems.

First, because each risk unit is monitored individually, monitoring efficiency is low.

Second, because monitoring targets a single risk unit and there is no longitudinal comparison, it is easy to see the trees but miss the forest, and sensitivity to risk is low. When a risk changes or a certain type of risk grows, the change cannot be captured in time. Risks tend to be discovered only once they have become relatively large, which makes it hard to control them in advance.

Therefore, a method is needed that identifies homogeneous risks and processes them centrally. Such a method can speed up risk response and reveal the scale and development trend of each kind of risk, so that targeted risk control measures can be formulated in a timely manner, which is of great significance for risk management.

To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this specification without creative work shall fall within the protection scope of this application.

The technical solutions provided by the embodiments of this specification are described in detail below with reference to the accompanying drawings.

The first embodiment of this specification proposes a method for generating a feature set of homogeneous risk units.

Please refer to FIG. 1, which shows a process 100 of an embodiment of a method for generating a feature set of homogeneous risk units. This embodiment is described mainly by taking as an example the application of the method to an electronic device with a certain computing capability. The method includes steps 101 to 103.

Step 101: For the risk unit feature in the target risk unit feature set, input the risk unit feature into a pre-trained risk feature decision tree, and determine the leaf node to which the risk unit feature is divided.

In this embodiment, the execution subject of the method for generating a homogeneous risk unit feature set may first obtain the target risk unit feature set.

Here, the risk unit refers to the smallest unit that cannot be reasonably divided in terms of risk. Depending on the risk, the risk unit can also be different accordingly. As an example, when the risk refers to the money laundering risk in life insurance, the risk unit may be a life insurance user. When the risk refers to the money laundering risk in property insurance, the risk unit can also be real estate. The risk unit characteristics are the characteristics obtained after feature extraction of the risk unit information describing the risk unit. For example, when the risk unit is a life insurance user, the risk unit information may include the life insurance user's name, mobile phone number, policy number, insurance amount, insurance liability description, and so on. When the risk unit is a certain real estate, the risk unit information can include the real estate certificate number, real estate area, real estate location, unit area valuation and so on.

It should be noted that the target risk unit feature set can consist of any risk unit features that share the same combination of feature components. The qualifier "target" is used here only for illustration. In practice, the target risk unit feature set may be the set of risk unit features acquired in the same historical time period, in which case the method yields the homogeneous risk features of that time period. In addition, the target risk unit feature set may contain two or more risk unit features.

In addition, it should be noted that the specific feature extraction method may be different according to the information of the risk unit, and the feature extraction method is an existing technology widely researched and applied in the field, and will not be repeated here.

Then, the execution subject can input the risk unit features in the target risk unit feature set into the pre-trained risk feature decision tree and determine the leaf node into which each risk unit feature is divided.

Here, the execution subject may input every risk unit feature in the target risk unit feature set into the pre-trained risk feature decision tree, or only some of them. For example, only the risk unit features in the target risk unit feature set that meet preset conditions may be input into the pre-trained risk feature decision tree.

In this embodiment, the risk feature decision tree is a decision tree that characterizes the correspondence between risk unit features and risk categories, and each leaf node of the risk feature decision tree corresponds to a risk category. That is, inputting a risk unit feature into the risk feature decision tree determines which leaf node the feature is divided into, and the feature is then assigned the risk category corresponding to that leaf node.

In this embodiment, the risk category may be a risk category in a preset risk category set. For example, when the method is applied to an anti-money laundering scenario, the preset risk category set may include two risk categories, "money laundering risk" and "no money laundering risk", or it may include three risk categories, "high money laundering risk", "medium money laundering risk", and "low money laundering risk".

As an example, the risk feature decision tree may be a decision tree built in advance by a technician for the specific scenario in which the method is applied, based on a large number of historical risk unit features and their corresponding risk categories.
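The routing of a risk unit feature to a leaf node can be illustrated with a minimal sketch. The tree below is hand-built and purely hypothetical: the feature names `A` and `B`, the thresholds, the leaf identifiers, and the helper `assign_leaf` are assumptions for illustration and do not come from this specification. The sketch also records the feature components tested along the way, since these make up the path that the later steps rely on.

```python
# Hypothetical hand-built tree: internal nodes test one feature component
# against a threshold; leaves carry a leaf id and a risk category.
TREE = {
    "feature": "A", "threshold": 0.5,
    "left": {"leaf": "L1", "category": "no money laundering risk"},
    "right": {
        "feature": "B", "threshold": 10.0,
        "left": {"leaf": "L2", "category": "no money laundering risk"},
        "right": {"leaf": "L3", "category": "money laundering risk"},
    },
}


def assign_leaf(tree, feature):
    """Walk from the root to a leaf; return (leaf id, risk category, path)."""
    node, path = tree, []
    while "leaf" not in node:
        name = node["feature"]
        path.append(name)  # feature component tested at this split point
        go_left = feature[name] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"], node["category"], path


leaf, category, path = assign_leaf(TREE, {"A": 0.9, "B": 42.0, "C": 1.0})
print(leaf, category, path)
```

Component `C` is present in the input but never tested on the way to the leaf, so it does not appear in the returned path.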

In some optional implementation manners, the aforementioned risk feature decision tree may also be obtained through training in advance according to the training step 200 shown in FIG. 2. Please refer to FIG. 2, which shows a flowchart of the training steps provided in the second embodiment of this specification. The flow 200 of the training steps may include steps 201 to 203.

Here, the execution subject of the training step may be the same as or different from the execution subject of the method for generating the homogeneous risk unit feature set. If they are the same, the executor of the training step can store the relevant information of the trained risk feature decision tree locally after the risk feature decision tree is trained. If they are different, the execution subject of the training step can send relevant information of the trained risk feature decision tree to the execution subject of the method for generating the homogeneous risk unit feature set after the risk feature decision tree is trained.

Step 201: Obtain a set of reference samples.

Here, the reference samples in the acquired reference sample set may include sample risk unit information and corresponding sample risk categories.

In some optional implementations, the sample risk category corresponding to the sample risk unit information can be manually labeled.

In some optional implementations, the sample risk unit information and the corresponding sample risk category can also be obtained from an authoritative financial institution.

Step 202: Perform feature extraction on the sample risk unit information in each reference sample in the reference sample set to obtain the corresponding sample features.

It should be noted that the method for feature extraction of the sample risk unit information here may be the same as the method for feature extraction of the risk unit information in step 101.

Step 203: For the reference samples in the reference sample set, take the sample feature corresponding to the sample risk unit information in the reference sample as the input and the sample risk category in the reference sample as the expected output, and train a decision tree to obtain the risk feature decision tree.

It should be noted that how to train a decision tree is an existing technology that has been extensively researched and applied, and will not be repeated here.

The risk feature decision tree trained by using the training step 200 shown in FIG. 2 is obtained by supervised training of the decision tree based on reference samples. Furthermore, by using the risk feature decision tree obtained through the above training steps to classify the risk unit features, the accuracy of risk classification can be improved.
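Steps 201 to 203 can be sketched as follows. The specification does not prescribe a training algorithm, so this sketch assumes a simple CART-style procedure that picks binary splits by Gini impurity. The function names, the `max_depth` parameter, and the six reference samples are all made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (impurity, feature index, threshold) of the best split, or None."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows})[:-1]:  # largest value cannot split
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


def train(rows, labels, depth=0, max_depth=3):
    """Recursively grow a tree; leaves store the majority sample risk category."""
    split = best_split(rows, labels) if depth < max_depth and gini(labels) > 0 else None
    if split is None:
        return {"category": Counter(labels).most_common(1)[0][0]}
    _, j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {
        "feature": j, "threshold": t,
        "left": train([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": train([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }


def predict(tree, row):
    """Route one sample feature to a leaf and return its risk category."""
    while "category" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["category"]


# Made-up reference samples: sample features (step 202) and sample risk categories.
X = [[1, 200], [2, 150], [8, 30], [9, 20], [7, 900], [8, 800]]
y = ["no risk", "no risk", "no risk", "no risk", "risk", "risk"]
tree = train(X, y)  # step 203: features as input, categories as expected output
```

In practice an existing decision tree library would normally be used; the point here is only that the trained tree maps sample features to sample risk categories through a sequence of feature tests.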

Step 102: For the leaf node of the target risk category in the risk feature decision tree, construct a similarity network corresponding to the leaf node.

In this embodiment, after dividing the risk unit features in the target risk unit feature set into the leaf nodes of the risk feature decision tree, the execution subject of the method can, for each leaf node of the target risk category in the risk feature decision tree, construct a similarity network corresponding to that leaf node. Each node of the constructed similarity network corresponds to one of the risk unit features divided into the leaf node. The similarity between any two nodes in the constructed similarity network is positively correlated with the first similarity between the two risk unit features corresponding to the two nodes, where the first similarity is the similarity between the two risk unit features with respect to the path vector corresponding to the leaf node.

In this embodiment, various implementations can be adopted to realize that the similarity between any two nodes in the constructed similarity network graph is positively correlated with the first similarity between the features of the two risk units corresponding to the two nodes.

For example, the similarity between any two nodes in the constructed similarity network may be linearly positively correlated with the first similarity between the two risk unit features corresponding to the two nodes. In the simplest case, the similarity between two nodes may be equal to the first similarity between the two corresponding risk unit features.

As another example, the similarity between any two nodes in the constructed similarity network may be non-linearly positively correlated with the first similarity between the two risk unit features corresponding to the two nodes.
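The two options can be compared with a short sketch. Both mappings below are assumptions chosen for illustration (the specification names no particular function); the only property that matters is that the edge similarity increases whenever the first similarity increases.

```python
import math


def linear_weight(s, scale=1.0):
    """Linear, positively correlated mapping; scale=1.0 is the identity."""
    return scale * s


def nonlinear_weight(s, sharpness=5.0):
    """Non-linear but strictly increasing mapping that favours high similarities."""
    return math.exp(sharpness * (s - 1.0))


first_similarities = [0.2, 0.5, 0.9]
for s in first_similarities:
    print(s, linear_weight(s), nonlinear_weight(s))
```

Either mapping preserves the ordering of node pairs by first similarity; the non-linear one merely widens the gap between strongly and weakly similar pairs, which can affect how a community discovery algorithm groups them.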

In this embodiment, the target risk category may belong to the risk category set composed of the risk categories corresponding to the leaf nodes of the risk feature decision tree. The qualifier "target" is used here only for illustration and does not restrict the target risk category to any particular risk category.

In some alternative implementations, the target risk category may be a single risk category. For example, when the risk category set composed of the risk categories corresponding to the leaf nodes of the risk feature decision tree includes the two risk categories "money laundering risk" and "no money laundering risk", the target risk category can be the risk category "money laundering risk". Step 102 then constructs a similarity network for each leaf node whose risk category is "money laundering risk" in the risk feature decision tree.

In some alternative implementations, the target risk category may also include more than one risk category. For example, when the risk category set composed of the risk categories corresponding to the leaf nodes of the risk feature decision tree includes the three risk categories "high money laundering risk", "medium money laundering risk", and "low money laundering risk", the target risk category can be the two risk categories "high money laundering risk" and "medium money laundering risk". Step 102 then constructs a similarity network for each leaf node corresponding to either "high money laundering risk" or "medium money laundering risk" in the risk feature decision tree.

In order to facilitate the understanding of the above-mentioned similarity network, an example is given below.

For example, the risk unit features in the target risk unit feature set in step 101 include seven feature components A, B, C, D, E, F, and G. Refer to Figure 3 for the tree structure of the risk feature decision tree. Fig. 3 is a schematic diagram of the risk feature decision tree provided by the third embodiment of this specification.

As shown in Figure 3, the risk feature decision tree 300 has seven split points 301, 302, 303, 304, 305, 306, and 307, which test different values of the seven feature components A, B, C, D, E, F, and G. The risk feature decision tree also includes nine leaf nodes 308, 309, 310, 311, 312, 313, 314, 315, and 316, which correspond to risk category 1, risk category 2, risk category 3, risk category 2, risk category 3, risk category 1, risk category 1, risk category 3, and risk category 2, respectively.

Assuming that there are 100 risk unit characteristics in the target risk unit characteristic set, we can identify these 100 risk unit characteristics with serial numbers 1-100. After step 101, each risk unit feature in the target risk unit feature set is divided into each leaf node of the risk feature decision tree. For clarity, please refer to Table 1, which shows the division of the characteristics of risk units with serial numbers 1-100 after step 101.

Serial numbers of risk unit features	Leaf node	Risk category	Number of risk unit features
1-15	308	1	15
16-20	309	2	5
21-29	310	3	9
30-37	311	2	8
38	312	3	1
39-63	313	1	25
64-85	314	1	22
86-93	315	3	8
94-100	316	2	7

Table 1

Assuming that the target risk category is risk category 1, then in step 102, a similarity network corresponding to the leaf node 308 is constructed for the leaf node 308, a similarity network corresponding to the leaf node 313 is constructed for the leaf node 313, and a similarity network corresponding to the leaf node 314 is constructed for the leaf node 314.

It can be seen from Table 1 that, when constructing the similarity network corresponding to the leaf node 308, the network is first given 15 nodes, which correspond to the risk unit features with serial numbers 1-15. Next, the first similarity is calculated for each pair of risk unit features among those with serial numbers 1-15, that is, the similarity of the pair with respect to the path vector of the leaf node 308, which consists of A, B, and D. In other words, only the three feature components A, B, and D need to be considered when calculating the first similarity between two of the risk unit features with serial numbers 1-15.

In addition, the leaf node 308 corresponds to risk category 1, and the leaf node 313 also corresponds to risk category 1. When constructing the similarity network corresponding to the leaf node 313, the network is first given 25 nodes, which correspond to the risk unit features with serial numbers 39-63. Next, the first similarity is calculated for each pair of risk unit features among those with serial numbers 39-63, that is, the similarity of the pair with respect to the path vector of the leaf node 313, which consists of A, B, and F. In other words, only the three feature components A, B, and F need to be considered when calculating the first similarity between two of the risk unit features with serial numbers 39-63.

It can be seen from the above example that, for the same risk category 1, the different leaf nodes 308 and 313 can correspond to different risk meanings, and the feature combinations used to measure whether risk unit features with different risk meanings belong to a given risk category can differ. That is, different scales are used to judge membership in a risk category.
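The selection of leaf nodes in this example can be reproduced with a short sketch. The dictionary below copies the leaf nodes, risk categories, and serial-number ranges from Table 1; the names `TABLE_1` and `networks_to_build` are hypothetical.

```python
# Leaf node -> (risk category, serial numbers of its risk unit features),
# copied from Table 1. Python ranges exclude their upper bound.
TABLE_1 = {
    308: (1, range(1, 16)),
    309: (2, range(16, 21)),
    310: (3, range(21, 30)),
    311: (2, range(30, 38)),
    312: (3, range(38, 39)),
    313: (1, range(39, 64)),
    314: (1, range(64, 86)),
    315: (3, range(86, 94)),
    316: (2, range(94, 101)),
}


def networks_to_build(table, target_categories):
    """Return {leaf node: [serial numbers]} for leaves of a target risk category."""
    return {
        leaf: list(serials)
        for leaf, (category, serials) in table.items()
        if category in target_categories
    }


selected = networks_to_build(TABLE_1, {1})
print({leaf: len(members) for leaf, members in selected.items()})
# {308: 15, 313: 25, 314: 22}
```

One similarity network is built per selected leaf node, so the three networks here have 15, 25, and 22 nodes respectively.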

In this embodiment, the first similarity may be any measure, currently known or developed in the future, that characterizes the degree of similarity between vectors.

In some optional implementations, the first similarity may be the cosine similarity between the two risk unit features corresponding to the two nodes with respect to the path vector corresponding to the leaf node. It should be noted that how to calculate the cosine similarity between two vectors is a well-known technique in the art, and will not be repeated here.

Step 103: For each constructed similarity network, use a preset community discovery algorithm to generate the communities of the similarity network, and use the risk unit features corresponding to the nodes in each generated community to generate the homogeneous risk unit feature set corresponding to that community.
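A minimal sketch of the cosine variant follows, assuming each risk unit feature is a mapping from feature component name to a numeric value. The path A, B, D is the one given for the leaf node 308 in the example; the two feature vectors and the helper names are made up for illustration.

```python
import math


def path_vector(feature, path):
    """Keep only the feature components that lie on the leaf node's path."""
    return [feature[name] for name in path]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def first_similarity(f1, f2, path):
    """First similarity: cosine similarity restricted to the path components."""
    return cosine(path_vector(f1, path), path_vector(f2, path))


PATH_308 = ["A", "B", "D"]  # path of leaf node 308 in the example
unit_1 = {"A": 1.0, "B": 2.0, "C": 9.0, "D": 2.0}   # made-up values
unit_2 = {"A": 2.0, "B": 4.0, "C": -7.0, "D": 4.0}  # made-up values
print(first_similarity(unit_1, unit_2, PATH_308))
```

The two units differ sharply in component C, but C is not on the path of the leaf node 308, so it has no influence on their first similarity.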

In this embodiment, the executive body of the method for generating a feature set of homogeneous risk units may construct a similarity network corresponding to the leaf nodes of the target risk category in the risk feature decision tree in step 102, and then determine the similarity The network first uses a preset community discovery algorithm to generate a community of the similarity network. The community generated by the community discovery algorithm is a subgraph corresponding to a sub-set of nodes that are closely connected in the similarity network. That is, the feature similarity of the risk unit corresponding to each node included in each subgraph is relatively similar to each other. In other words, the characteristics of risk units corresponding to each node included in each subgraph are generally similar in type, quality, performance, value, etc. and have homogeneous risks. Then the characteristics of risk units corresponding to each node included in each subgraph form a group Homogeneous risk unit characteristics. Therefore, the above-mentioned executive body can generate a homogeneous risk unit feature set corresponding to the community by using the risk unit characteristics corresponding to each node in each generated community after generating the community.

For ease of understanding, an example is given below, continuing the example above with reference to FIG. 3 and Table 1. Assuming that the target risk category is risk category 1, then in step 103, for the leaf node 308, the communities of the similarity network constructed for the leaf node 308 are generated; for the leaf node 313, the communities of the similarity network constructed for the leaf node 313 are generated; and for the leaf node 314, the communities of the similarity network constructed for the leaf node 314 are generated.

Please refer to FIG. 4, which is a schematic diagram of communities provided in the fourth embodiment of this specification. Taking the leaf node 308 as an example, FIG. 4 shows the communities of the similarity network generated through step 103. It can be seen from FIG. 4 that the similarity network has 15 nodes, corresponding to the risk unit features with serial numbers 1-15, and that three communities are generated after step 103. The community 401 includes the similarity network nodes corresponding to the risk unit features with serial numbers 1-5, the community 402 includes the nodes corresponding to serial numbers 6-10, and the community 403 includes the nodes corresponding to serial numbers 11-15. Accordingly, three homogeneous risk unit feature sets are generated: one including the risk unit features with serial numbers 1-5, one including those with serial numbers 6-10, and one including those with serial numbers 11-15.
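The grouping described for FIG. 4 can be sketched as follows. This is a minimal illustration that assumes the community of every node is already known; the function name and the placeholder feature values are hypothetical, and real risk unit features would be numeric vectors rather than strings.

```python
from collections import defaultdict

def homogeneous_feature_sets(community_of_node, feature_of_node):
    """Group risk unit features by the community of their similarity network node."""
    feature_sets = defaultdict(list)
    for node, community in community_of_node.items():
        feature_sets[community].append(feature_of_node[node])
    return dict(feature_sets)

# Mirrors the FIG. 4 example: serial numbers 1-15 split into communities 401, 402 and 403.
community_of_node = {n: 401 + (n - 1) // 5 for n in range(1, 16)}
# Placeholder feature values (hypothetical).
feature_of_node = {n: f"feature_{n}" for n in range(1, 16)}
sets_by_community = homogeneous_feature_sets(community_of_node, feature_of_node)
```

Each value of `sets_by_community` then plays the role of one homogeneous risk unit feature set.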

In this embodiment, the aforementioned preset community discovery algorithm may be any known or future community discovery algorithm, which is not specifically limited in this embodiment.

In some optional implementations, the preset community discovery algorithm may be a label propagation algorithm.
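For readers unfamiliar with label propagation, a minimal sketch on a weighted undirected graph is given below. It is one simple deterministic variant (fixed update order, ties resolved by keeping the current label or else taking the smallest label), chosen here so that the example is reproducible; the specification does not prescribe these details, and the function name, edge weights, and node ids are assumptions.

```python
from collections import defaultdict

def label_propagation(nodes, weighted_edges, max_iters=100):
    """Assign a community label to every node of an undirected weighted graph.

    nodes: iterable of hashable, sortable node ids.
    weighted_edges: iterable of (u, v, weight) tuples.
    Returns a dict mapping node id -> community label.
    """
    neighbors = defaultdict(dict)
    for u, v, w in weighted_edges:
        neighbors[u][v] = w
        neighbors[v][u] = w

    order = sorted(nodes)
    labels = {n: n for n in order}  # every node starts in its own community
    for _ in range(max_iters):
        changed = False
        for n in order:  # asynchronous updates in a fixed order
            if not neighbors[n]:
                continue  # isolated node keeps its own label
            score = defaultdict(float)
            for m, w in neighbors[n].items():
                score[labels[m]] += w
            best = max(score.values())
            candidates = sorted(l for l, s in score.items() if s == best)
            # Tie-break: keep the current label if it is among the best, else the smallest.
            new_label = labels[n] if labels[n] in candidates else candidates[0]
            if new_label != labels[n]:
                labels[n] = new_label
                changed = True
        if not changed:
            break
    return labels

# Hypothetical similarity network: two tightly connected triangles joined by one weak edge.
edges = [(1, 2, 0.9), (1, 3, 0.9), (2, 3, 0.9),
         (4, 5, 0.9), (4, 6, 0.9), (5, 6, 0.9),
         (3, 4, 0.1)]
labels = label_propagation(range(1, 7), edges)
```

In this toy network, nodes 1-3 end up sharing one label and nodes 4-6 another, which corresponds to two communities.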

The method provided by the above embodiments of this specification processes risk unit features in batches. It first uses a pre-trained risk feature decision tree to classify and screen the risk unit features, then constructs a similarity network over the risk unit features classified into each leaf node of the target risk category in the risk feature decision tree, and finally uses a community discovery algorithm to generate communities, thereby generating sets of risk unit features that carry homogeneous risks under the different meanings within the target risk category.

The fifth embodiment of this specification proposes yet another method for generating a feature set of homogeneous risk units.

With further reference to FIG. 5, it shows a process 500 of another embodiment of a method for generating a feature set of homogeneous risk units. The process 500 of the method for generating a feature set of homogeneous risk units includes steps 501 to 504.

Step 501: For the risk unit features in the target risk unit feature set, input the risk unit features into a pre-trained risk feature decision tree, and determine the leaf node to which the risk unit feature is divided.

Step 502: For the leaf node of the target risk category in the risk feature decision tree, construct a similarity network corresponding to the leaf node.

Step 503: For each constructed similarity network, use a preset community discovery algorithm to generate the communities of the similarity network, and use the risk unit features corresponding to the nodes in each generated community to generate the homogeneous risk unit feature set corresponding to that community.

In this embodiment, the specific operations of step 501, step 502, and step 503 are basically the same as the operations of step 101, step 102, and step 103 in the embodiment shown in FIG. 1, and will not be repeated here.
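The first of these steps, determining the leaf node for each risk unit feature and keeping only the leaf nodes of the target risk category, can be sketched roughly as follows. The dictionary-based tree, the feature names, the thresholds, and the leaf ids are all hypothetical and serve only to illustrate the traversal; a trained risk feature decision tree would supply its own structure.

```python
def find_leaf(tree, feature):
    """Walk a binary decision tree and return (leaf_id, risk_category)."""
    node = tree
    while "leaf_id" not in node:
        go_left = feature[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf_id"], node["risk_category"]

# Hypothetical two-level tree; every name and number here is made up.
toy_tree = {
    "feature": "monthly_transfer_count", "threshold": 50,
    "left": {"leaf_id": "A", "risk_category": 0},
    "right": {
        "feature": "cross_border_ratio", "threshold": 0.3,
        "left": {"leaf_id": "B", "risk_category": 0},
        "right": {"leaf_id": "C", "risk_category": 1},
    },
}

features = [
    {"monthly_transfer_count": 10, "cross_border_ratio": 0.9},
    {"monthly_transfer_count": 80, "cross_border_ratio": 0.1},
    {"monthly_transfer_count": 90, "cross_border_ratio": 0.7},
]

# Keep only features that land in a leaf node of the target risk category.
target_category = 1
by_leaf = {}
for f in features:
    leaf_id, category = find_leaf(toy_tree, f)
    if category == target_category:
        by_leaf.setdefault(leaf_id, []).append(f)
```

Each entry of `by_leaf` would then be the input for constructing one similarity network.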

Step 504: For the leaf node of the target risk category in the risk feature decision tree, output at least one of the following items of information: the number of nodes of the similarity network corresponding to the leaf node, and the homogeneous risk unit feature set corresponding to each community of the similarity network corresponding to the leaf node.

After steps 501 to 503, the execution body of the method for generating a feature set of homogeneous risk units has classified each risk unit feature in the risk unit feature set into a risk category, and each risk unit feature classified into the target risk category has also been assigned to one of the leaf nodes of the risk feature decision tree, which carry different meanings. In addition, the risk unit features assigned to the same leaf node of the target risk category have been further divided into different communities to form homogeneous risk unit feature sets.

Then, in step 504, the execution body may output at least one of the following items of information: the number of nodes in the similarity network corresponding to the leaf node of the target risk category in the risk feature decision tree, and the homogeneous risk unit feature set corresponding to each community of the similarity network corresponding to the leaf node.

In other words, if step 504 outputs the number of nodes in the similarity network corresponding to a leaf node of the target risk category in the risk feature decision tree, the risk scale of a specific risk meaning (that is, the number of risk unit features classified into the same leaf node) can be determined for the risk unit feature sets processed in different batches, and these scales can be compared over time to determine the trend of risk changes. If step 504 outputs the homogeneous risk unit feature sets, it becomes clear which risk unit features in the target risk unit feature set carry homogeneous risks and, of course, how many risk unit features each homogeneous risk unit feature set contains. The specific risk unit features that carry homogeneous risks can then be monitored collectively.
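A rough sketch of such an output record, and of comparing the node count of the same leaf node across two batches, is shown below. The record layout, the function name, and the sample identifiers are assumptions for illustration; the sketch also assumes that every network node belongs to exactly one community, so that the node count equals the total size of all communities.

```python
def summarize_leaf(leaf_id, homogeneous_sets):
    """Build a step-504 style output record for one leaf node of the target risk category.

    homogeneous_sets: list of lists, one list of risk unit features per community.
    """
    return {
        "leaf_id": leaf_id,
        # Assumes each network node belongs to exactly one community.
        "node_count": sum(len(s) for s in homogeneous_sets),
        "homogeneous_sets": homogeneous_sets,
    }

# Hypothetical results for the same leaf node in two processing batches.
batch_1 = summarize_leaf(308, [["u1", "u2"], ["u3"]])
batch_2 = summarize_leaf(308, [["u4", "u5", "u6"], ["u7", "u8"]])
change = batch_2["node_count"] - batch_1["node_count"]
```

A positive `change` would indicate that more risk unit features fell into this leaf node in the later batch.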

Here, the information in step 504 may be output in various ways. For example, the execution body may store the information to be output locally, or may present it in various forms (for example, text, graphics, voice, etc.) on its own display terminal. As another example, the execution body may send the information to be output to other electronic devices connected to it over a network, and such an electronic device may store the information or present it in various forms (for example, text, graphics, voice, etc.) on its display terminal.

It can be seen from FIG. 5 that, compared with the embodiment corresponding to FIG. 1, the process 500 of the method for generating a feature set of homogeneous risk units in this embodiment additionally includes an information output step. Therefore, the solution described in this embodiment can output information and thereby support more comprehensive risk monitoring.

Based on the same idea, as shown in FIG. 6, the sixth embodiment of this specification provides a device for generating a feature set of homogeneous risk units, including: a risk unit feature division module 601, configured to, for a risk unit feature in the target risk unit feature set, input the risk unit feature into a pre-trained risk feature decision tree and determine the leaf node into which the risk unit feature is classified, wherein the leaf nodes of the risk feature decision tree correspond to risk categories; a similarity network construction module 602, configured to construct, for a leaf node of the target risk category in the risk feature decision tree, a similarity network corresponding to the leaf node, wherein the nodes of the constructed similarity network correspond to the risk unit features classified into the leaf node, the similarity between any two nodes in the constructed similarity network is positively correlated with the first similarity between the two risk unit features corresponding to the two nodes, and the first similarity is the similarity between the path vectors, corresponding to the leaf node, of the risk unit features corresponding to the two nodes; and a homogeneous risk generation module 603, configured to, for each constructed similarity network, use a preset community discovery algorithm to generate the communities of the similarity network, and use the risk unit features corresponding to the nodes in each generated community to generate the homogeneous risk unit feature set corresponding to that community.

In this embodiment, for the specific processing of the risk unit feature division module 601, the similarity network construction module 602, and the homogeneous risk generation module 603 of the homogeneous risk unit feature set generation device 600, and the technical effects they bring, reference may be made to the descriptions of step 101, step 102, and step 103 in the embodiment corresponding to FIG. 1, respectively, which are not repeated here.

Optionally, the risk feature decision tree may be obtained through the following training steps: obtaining a reference sample set, where each reference sample includes sample risk unit information and a corresponding sample risk category; performing feature extraction on the sample risk unit information in each reference sample in the reference sample set to obtain the corresponding sample features; and, for the reference samples in the reference sample set, using the sample features corresponding to the sample risk unit information in the reference sample as input and the sample risk category in the reference sample as the expected output, training a decision tree to obtain the risk feature decision tree.

Optionally, the device may further include an output module 604, configured to output, for the leaf node of the target risk category in the risk feature decision tree, at least one of the following items of information: the number of nodes of the similarity network corresponding to the leaf node, and the homogeneous risk unit feature set corresponding to each community of the similarity network corresponding to the leaf node.

Optionally, the preset community discovery algorithm may be a label propagation algorithm, and/or the first similarity may be the cosine similarity between the path vectors, corresponding to the leaf node, of the risk unit features corresponding to the two nodes.

It should be noted that, for the implementation details and technical effects of each module in the homogeneous risk unit feature set generation device provided in this embodiment, reference may be made to the descriptions of the other embodiments in this specification, which are not repeated here.

Based on the same idea, the seventh embodiment of this specification provides a device for generating a feature set of homogeneous risk units, including: at least one processor; and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: for a risk unit feature in the target risk unit feature set, input the risk unit feature into a pre-trained risk feature decision tree and determine the leaf node into which the risk unit feature is classified, wherein the leaf nodes of the risk feature decision tree correspond to risk categories; for a leaf node of the target risk category in the risk feature decision tree, construct a similarity network corresponding to the leaf node, wherein the nodes of the constructed similarity network correspond to the risk unit features classified into the leaf node, the similarity between any two nodes in the constructed similarity network is positively correlated with the first similarity between the two risk unit features corresponding to the two nodes, and the first similarity is the similarity between the path vectors, corresponding to the leaf node, of the risk unit features corresponding to the two nodes; and, for each constructed similarity network, use a preset community discovery algorithm to generate the communities of the similarity network, and use the risk unit features corresponding to the nodes in each generated community to generate the homogeneous risk unit feature set corresponding to that community.

Based on the same idea, the eighth embodiment of this specification provides a computer-readable storage medium that stores computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, implement the following steps: for a risk unit feature in the target risk unit feature set, inputting the risk unit feature into a pre-trained risk feature decision tree and determining the leaf node into which the risk unit feature is classified, wherein the leaf nodes of the risk feature decision tree correspond to risk categories; for a leaf node of the target risk category in the risk feature decision tree, constructing a similarity network corresponding to the leaf node, wherein the nodes of the constructed similarity network correspond to the risk unit features classified into the leaf node, the similarity between any two nodes in the constructed similarity network is positively correlated with the first similarity between the two risk unit features corresponding to the two nodes, and the first similarity is the similarity between the path vectors, corresponding to the leaf node, of the risk unit features corresponding to the two nodes; and, for each constructed similarity network, using a preset community discovery algorithm to generate the communities of the similarity network, and using the risk unit features corresponding to the nodes in each generated community to generate the homogeneous risk unit feature set corresponding to that community.

The specific embodiments of this specification have been described above, and other embodiments are within the scope of the appended claims. In some cases, the actions or steps described in the claims can be performed in a different order than in the embodiments and still achieve desired results. In addition, the processes depicted in the drawings do not necessarily have to be in the specific order or sequential order shown in order to achieve the desired result. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The apparatus, equipment, non-volatile computer readable storage medium and method provided in the embodiments of this specification correspond to each other. Therefore, the apparatus, equipment, and non-volatile computer storage medium also have beneficial technical effects similar to the corresponding method. The beneficial technical effects of the method have been described in detail above, therefore, the beneficial technical effects of the corresponding device, equipment, and non-volatile computer storage medium will not be repeated here.

In the 1990s, an improvement to a technology could be clearly classified as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). However, with the development of technology, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented with hardware entity modules. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. Designers program a digital system onto a single PLD themselves, without asking a chip manufacturer to design and manufacture a dedicated integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is nowadays mostly done with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must be written in a specific programming language called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. It should also be clear to those skilled in the art that a hardware circuit implementing a logical method flow can be readily obtained simply by describing the method flow in one of the above hardware description languages and programming it into an integrated circuit.

The controller can be implemented in any suitable manner. For example, the controller can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, or the form of logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include but are not limited to the following microcontrollers: ARC625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of the memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller realizes the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, such a controller can be regarded as a hardware component, and the devices included in it for realizing various functions can also be regarded as structures within the hardware component. Or even, the devices for realizing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.

The systems, devices, modules, or units illustrated in the above embodiments may be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cell phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or any combination of these devices.

For convenience of description, the above device is described with its functions divided into separate units. Of course, when implementing the embodiments of this specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.

Those skilled in the art should understand that the embodiments of the present application can be provided as methods, systems, or computer program products. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.

This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of this application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction device, and the instruction device implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

In a typical configuration, the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of computer-readable media such as non-permanent memory, random access memory (RAM), and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cartridges, magnetic tape storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, commodity, or device. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, commodity, or device that includes the element.

This specification may be described in the general context of computer-executable instructions executed by a computer, such as a program module. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types. This specification can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in local and remote computer storage media including storage devices.

The various embodiments in this specification are described in a progressive manner, and the same or similar parts between the various embodiments can be referred to each other, and each embodiment focuses on the difference from other embodiments. In particular, as for the system embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.

The above descriptions are only examples of this specification, and are not intended to limit this application. For those skilled in the art, this application can have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included in the scope of the claims of this application.

Claims

A method for generating a feature set of homogeneous risk units, including:

For a risk unit feature in the target risk unit feature set, inputting the risk unit feature into a pre-trained risk feature decision tree, and determining the leaf node into which the risk unit feature is classified, wherein the leaf nodes of the risk feature decision tree correspond to risk categories;

For a leaf node of the target risk category in the risk feature decision tree, constructing a similarity network corresponding to the leaf node, wherein the nodes of the constructed similarity network correspond to the risk unit features classified into the leaf node, the similarity between any two nodes in the constructed similarity network is positively correlated with the first similarity between the two risk unit features corresponding to the two nodes, and the first similarity is the similarity between the path vectors, corresponding to the leaf node, of the risk unit features corresponding to the two nodes;

For each constructed similarity network, using a preset community discovery algorithm to generate the communities of the similarity network, and using the risk unit features corresponding to the nodes in each generated community to generate the homogeneous risk unit feature set corresponding to that community.
The method according to claim 1, wherein the risk feature decision tree is obtained by training through the following training steps:

Obtaining a reference sample set, where each reference sample includes sample risk unit information and a corresponding sample risk category;

Performing feature extraction on the sample risk unit information in each reference sample in the reference sample set to obtain the corresponding sample features;

For the reference samples in the reference sample set, using the sample features corresponding to the sample risk unit information in the reference sample as input and the sample risk category in the reference sample as the expected output, training a decision tree to obtain the risk feature decision tree.
The method of claim 2, further comprising:

For the leaf node of the target risk category in the risk feature decision tree, outputting at least one of the following items of information: the number of nodes in the similarity network corresponding to the leaf node, and the homogeneous risk unit feature set corresponding to each community of the similarity network corresponding to the leaf node.
The method according to any one of claims 1 to 3, wherein the preset community discovery algorithm is a label propagation algorithm and/or the first similarity is the cosine similarity between the path vectors, corresponding to the leaf node, of the risk unit features corresponding to the two nodes.
A device for generating a feature set of homogeneous risk units, including:

A risk unit feature division module, configured to, for a risk unit feature in the target risk unit feature set, input the risk unit feature into a pre-trained risk feature decision tree and determine the leaf node into which the risk unit feature is classified, wherein the leaf nodes of the risk feature decision tree correspond to risk categories;

A similarity network construction module, configured to construct, for a leaf node of the target risk category in the risk feature decision tree, a similarity network corresponding to the leaf node, wherein the nodes of the constructed similarity network correspond to the risk unit features classified into the leaf node, the similarity between any two nodes in the constructed similarity network is positively correlated with the first similarity between the two risk unit features corresponding to the two nodes, and the first similarity is the similarity between the path vectors, corresponding to the leaf node, of the risk unit features corresponding to the two nodes;

A homogeneous risk generation module, configured to, for each constructed similarity network, use a preset community discovery algorithm to generate the communities of the similarity network, and use the risk unit features corresponding to the nodes in each generated community to generate the homogeneous risk unit feature set corresponding to that community.
The device according to claim 5, wherein the risk feature decision tree is obtained by training through the following training steps:

Obtaining a reference sample set, where each reference sample includes sample risk unit information and a corresponding sample risk category;

Performing feature extraction on the sample risk unit information in each reference sample in the reference sample set to obtain the corresponding sample features;

For the reference samples in the reference sample set, using the sample features corresponding to the sample risk unit information in the reference sample as input and the sample risk category in the reference sample as the expected output, training a decision tree to obtain the risk feature decision tree.
The device according to claim 6, further comprising:

An output module, configured to output, for the leaf node of the target risk category in the risk feature decision tree, at least one of the following items of information: the number of nodes of the similarity network corresponding to the leaf node, and the homogeneous risk unit feature set corresponding to each community of the similarity network corresponding to the leaf node.
The device according to any one of claims 6-7, wherein the preset community discovery algorithm is a label propagation algorithm and/or the first similarity is the cosine similarity between the path vectors, corresponding to the leaf node, of the risk unit features corresponding to the two nodes.
A device for generating a feature set of homogeneous risk units, including:

At least one processor;

and

A memory connected in communication with the at least one processor;

wherein,

The memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to:

For each risk unit feature in a target risk unit feature set, input the risk unit feature into a pre-trained risk feature decision tree and determine the leaf node into which the risk unit feature is classified, wherein each leaf node of the risk feature decision tree corresponds to a risk category;

For each leaf node of the target risk category in the risk feature decision tree, construct a similarity network corresponding to the leaf node, wherein the nodes of the constructed similarity network correspond to the respective risk unit features classified into the leaf node, the similarity between any two nodes in the constructed similarity network is positively correlated with a first similarity between the two risk unit features corresponding to the two nodes, and the first similarity is the similarity between the path vectors, corresponding to the leaf node, of the risk unit features corresponding to the two nodes;

For each constructed similarity network, generate communities of the similarity network using a preset community discovery algorithm, and generate, from the risk unit features corresponding to the nodes in each generated community, a homogeneous risk unit feature set corresponding to that community.
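The first of the three operations above, classifying each risk unit feature into a leaf node and keeping only leaves of the target risk category, might look like the following sketch. The nested-dictionary tree, the leaf identifiers, and the sample values are hypothetical. In particular, treating the path vector as the sequence of feature values tested on the way from the root to the leaf is an assumption, since this passage does not define the path vector.

```python
def route_to_leaf(tree, feature):
    """Return (leaf_id, risk_category, path_vector) for one risk unit feature."""
    node, leaf_id, path_vector = tree, "root", []
    while "risk_category" not in node:
        index = node["feature"]
        # Assumed path vector: the feature values tested on the way down.
        path_vector.append(feature[index])
        branch = "left" if feature[index] <= node["threshold"] else "right"
        leaf_id += "/" + branch
        node = node[branch]
    return leaf_id, node["risk_category"], tuple(path_vector)


def group_by_target_leaf(tree, features, target_category):
    """Map each leaf of the target risk category to {feature_id: path_vector}."""
    leaves = {}
    for feature_id, feature in features.items():
        leaf_id, category, path_vector = route_to_leaf(tree, feature)
        if category == target_category:
            leaves.setdefault(leaf_id, {})[feature_id] = path_vector
    return leaves


# Hypothetical hand-written tree standing in for the pre-trained one.
tree = {
    "feature": 0, "threshold": 10,
    "left": {"risk_category": "low"},
    "right": {
        "feature": 1, "threshold": 5000.0,
        "left": {"risk_category": "low"},
        "right": {"risk_category": "high"},
    },
}
# Illustrative target risk unit feature set.
target_features = {
    "unit_a": (40, 9000.0),
    "unit_b": (55, 8000.0),
    "unit_c": (2, 100.0),
}
leaves = group_by_target_leaf(tree, target_features, "high")
```

Each entry of `leaves` then holds the path vectors from which one similarity network would be built.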
A computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, implement the following steps:

For each risk unit feature in a target risk unit feature set, inputting the risk unit feature into a pre-trained risk feature decision tree and determining the leaf node into which the risk unit feature is classified, wherein each leaf node of the risk feature decision tree corresponds to a risk category;

For each leaf node of the target risk category in the risk feature decision tree, constructing a similarity network corresponding to the leaf node, wherein the nodes of the constructed similarity network correspond to the respective risk unit features classified into the leaf node, the similarity between any two nodes in the constructed similarity network is positively correlated with a first similarity between the two risk unit features corresponding to the two nodes, and the first similarity is the similarity between the path vectors, corresponding to the leaf node, of the risk unit features corresponding to the two nodes;

For each constructed similarity network, generating communities of the similarity network using a preset community discovery algorithm, and generating, from the risk unit features corresponding to the nodes in each generated community, a homogeneous risk unit feature set corresponding to that community.