WO2020156000A1 - Computer implemented event risk assessment method and device - Google Patents

Computer implemented event risk assessment method and device

Info

Publication number
WO2020156000A1
WO2020156000A1 PCT/CN2019/129863 CN2019129863W WO2020156000A1 WO 2020156000 A1 WO2020156000 A1 WO 2020156000A1 CN 2019129863 W CN2019129863 W CN 2019129863W WO 2020156000 A1 WO2020156000 A1 WO 2020156000A1
Authority
WO
WIPO (PCT)
Prior art keywords
event
node
feature
sample
knowledge graph
Prior art date
Application number
PCT/CN2019/129863
Other languages
French (fr)
Chinese (zh)
Inventor
李彬
张可尊
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2020156000A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities

Definitions

  • One or more embodiments of this specification relate to the field of machine learning, and more particularly to methods and devices for assessing event risk using machine learning.
  • One or more embodiments of this specification describe a computer-implemented event risk assessment method and device, which construct event features by expanding the elements of an event and train a GBDT model to effectively assess the risk of the event, and which can provide a corresponding feature explanation for the estimated risk value.
  • a computer-executed event risk assessment method including:
  • extracting multiple sample events from the content text database, which includes identifying the first sample event and its corresponding first event type, and extracting at least one first event element of the first sample event according to the first event type;
  • the risk assessment of the second event to be analyzed is performed.
  • At least one event element of the first sample event is extracted in the following manner:
  • the at least one first event element includes at least one of the following: event time, event location, implementation subject, event object, fact type, and event level.
  • the related elements are obtained in the following way:
  • the at least one first event element is mapped to the first node in the at least one knowledge graph; the node directly connected to the first node in the at least one knowledge graph is used as the at least one associated element.
  • the above-mentioned knowledge graph may include: enterprise knowledge graph, product knowledge graph, character knowledge graph, information knowledge graph, stock knowledge graph, fund knowledge graph, and institution knowledge graph.
  • performing risk assessment on the second event to be analyzed specifically includes:
  • the event characteristics of the second event are input into the trained GBDT model, and the risk value of the second event is determined according to the model output.
  • the second event element is obtained in the following manner:
  • the at least one second event element is extracted from the input text.
  • the input second event and the at least one second event element may be directly received.
  • the trained GBDT model includes at least one decision tree, the decision tree includes branch nodes and leaf nodes, each branch node corresponds to a feature, and has the risk score and node weight obtained by training.
  • the node weight is determined based on the respective node loss values of the branch node and of the nodes obtained by splitting it, and the node loss value is determined based on the difference between the calibrated risk value of the sample events falling into the node and the risk score of the node.
  • the risk assessment of the second event to be analyzed also includes:
  • the feature weight of the first feature is determined according to the node weight of at least one branch node corresponding to the first feature among the branch nodes, as the importance of the first feature to the risk value.
  • the GBDT model obtained by training includes at least one decision tree, and the decision tree includes branch nodes and leaf nodes; after obtaining such a GBDT model, performing risk assessment on the second event to be analyzed specifically includes:
  • the feature combination corresponding to the branch nodes included in the conditional path is acquired, and the feature combination is used as the influence feature of the second event under the predetermined condition.
  • each leaf node in the decision tree obtains a risk score through training, and each branch node corresponds to a feature and has a risk score and a node weight obtained by training, wherein the node weight is determined based on the node loss values of the branch node and of the nodes after the split, and the node loss value is determined based on the difference between the calibrated risk value of the sample events falling into the node and the risk score of the node; accordingly, in an implementation
  • the risk assessment of the second event to be analyzed also includes one or more of the following:
  • each feature corresponding to each branch node in the feature combination is determined according to the node weight of each branch node in the conditional path.
  • a computer-executed event risk assessment device including:
  • the extraction unit is configured to use a natural language processing model to extract a plurality of sample events from the content text library, the plurality of sample events including a first sample event, wherein extracting the plurality of sample events includes identifying the first sample event and its corresponding first event type, and extracting at least one first event element of the first sample event according to the first event type;
  • An associating unit configured to obtain at least one first associated element associated with the at least one first event element in at least one knowledge graph corresponding to at least one field associated with the first sample event;
  • a feature determining unit configured to determine the event feature of the first sample event according to the first event type, the at least one first event element, and the at least one first associated element;
  • the training unit is configured to train the gradient boosting decision tree (GBDT) model according to the event features of each sample event in the multiple sample events and the calibrated risk value of each sample event, to obtain the trained GBDT model;
  • the evaluation unit is configured to use the trained GBDT model to perform risk evaluation on the second event to be analyzed.
  • a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method of the first aspect.
  • a computing device including a memory and a processor, characterized in that executable code is stored in the memory, and when the processor executes the executable code, the method of the first aspect is implemented.
  • a more comprehensive event feature is constructed by expanding the event elements in the knowledge graph of the related field.
  • a GBDT model including a decision tree can be trained. Using such a GBDT model, not only can a risk value be estimated for an event whose risk is unknown, but a feature-based explanation of the risk value can also be given. In this way, while achieving quantitative prediction, the prediction result also has a clearer logical expression and better interpretability.
  • Figure 1 is a schematic diagram of the implementation process of an embodiment disclosed in this specification.
  • Figure 2 shows a flowchart of an event risk assessment method according to an embodiment.
  • Figure 3 shows a decision tree trained according to an embodiment.
  • Figure 4 shows a flowchart of performing risk assessment on a second event in an embodiment.
  • Figure 5 shows the division process of the second event in the decision tree in one embodiment.
  • Fig. 6 shows a flow of steps for feature interpretation in an embodiment.
  • Fig. 7 shows a flowchart of steps for evaluating a second event according to an embodiment.
  • Fig. 8 shows a schematic block diagram of an event evaluation device according to an embodiment.
  • FIG. 1 is a schematic diagram of the implementation process of an embodiment disclosed in this specification.
  • As shown in Fig. 1, according to the solution of the embodiment, sample events are first extracted, and features are constructed for the sample events.
  • When constructing the features of an event, not only are the elements of the event itself considered, but knowledge graphs of related fields are also used to mine associated elements, which together form the event features; this makes the event features more comprehensive and richer.
  • The gradient boosting decision tree (GBDT) model is trained using the event features and calibrated risk values of multiple sample events, and decision trees are obtained through training.
  • In a decision tree, the path from the root node to a leaf node corresponds to a combination of features.
  • The feature combination corresponding to the decision path in the decision tree can be used to explain the contribution and impact of the various features on the event risk, so that the event analysis has a clearer logical thread and better interpretability.
  • Fig. 2 shows a flowchart of an event risk assessment method according to an embodiment. It can be understood that the method can be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities.
  • the risk assessment method includes at least the following steps: step 21, using a natural language processing model to extract a plurality of sample events from the content text database, the plurality of sample events including a first sample event, where extracting the plurality of sample events includes identifying the first sample event and its corresponding first event type, and extracting at least one first event element of the first sample event according to the first event type; step 22, in at least one knowledge graph corresponding to at least one field associated with the first sample event, acquiring at least one first associated element associated with the at least one first event element; step 23, determining the event feature of the first sample event according to the first event type, the at least one first event element, and the at least one first associated element; step 24, according to the event feature of each sample event in the plurality of sample events and the calibrated risk value of each sample event, train the gradient boosting decision
  • steps 21-24 involve the training process of the GBDT model for event evaluation
  • step 25 involves the process of prediction and evaluation using the trained model. The following describes the implementation of the above steps in conjunction with specific examples.
  • a natural language processing model is used to extract multiple events from the content text database as sample events for model training.
  • the above-mentioned content text library can include financial news, technology news, scientific research articles, and so on. It can be understood that there have been many event extraction models based on natural language processing, and these models can all be used for event extraction in step 21.
  • the event extraction process includes at least the following steps: first, based on natural language processing, perform preprocessing such as word segmentation and stop-word removal on the sentences in the text to obtain a set of word segments; optionally, perform entity recognition on the word segments in the set; then, determine the trigger word of the event from the set of word segments.
  • the trigger word type corresponds to the event type. Once the trigger word and trigger word type are determined, the event type can be determined.
  • the argument words used as arguments and the roles of each argument word are also determined from the word segmentation set.
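The trigger-word step described above can be illustrated with a minimal sketch. The stop-word list, the trigger lexicon, and the function names are hypothetical and chosen only for illustration; the patent does not prescribe a specific extraction model, and a production system would use a trained event extraction model rather than a lookup table.

```python
import re

# Hypothetical stop-word list and trigger lexicon (assumptions for illustration only).
STOP_WORDS = {"the", "a", "an", "of", "in", "was", "is", "by", "for"}
TRIGGER_TYPES = {
    "fraud": "product fraud",
    "leak": "information leakage",
    "leaked": "information leakage",
}

def tokenize(sentence):
    """Lowercase the sentence, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def detect_event_type(sentence):
    """Return (trigger_word, event_type) for the first trigger found, else None."""
    for token in tokenize(sentence):
        if token in TRIGGER_TYPES:
            return token, TRIGGER_TYPES[token]
    return None
```

Because the trigger word type maps directly to the event type, a single lookup is enough once the trigger word has been found.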
  • extracting each event also includes extracting elements of each event.
  • The following takes any one of these events, referred to as the first sample event, as an example to describe the process of extracting event elements. It should be understood that the terms "first" and "second" in this document are used only to distinguish similar objects and have no other limiting meaning.
  • the first sample event can be identified and the event type of the first sample event can be determined.
  • the event elements of the first sample event are extracted from the aforementioned content text library.
  • Event elements can include event time, event location, implementation subject, event object, fact type, event level, and so on.
  • the event elements to be extracted are related to event types, and different event types correspond to different event elements.
  • the first sample event identified from the content text database is "XY company vaccine fraud event", and the event type corresponding to this event is "product fraud”.
  • the event elements that need to be extracted can include implementation subjects, product categories, event levels, and so on.
  • the first sample event identified is "a certain person increasing holdings of AB company stock", and the event type corresponding to this event is "senior management increasing holdings".
  • the event elements that need to be extracted can include event time, persons, fact types, numerical elements (holding ratio), and so on.
  • an element template may be provided for each event type in advance, and the element template may define each element to be extracted under the corresponding event type.
  • the element template can also define the data format of each element. Therefore, for the first sample event, the element template corresponding to the first event type can be determined; the element template is used to extract the event element of the first sample event from the content text library.
  • the first sample event and the corresponding event type are identified from the content text database, and each event element corresponding to the event type is extracted.
  • the event element of the first sample event extracted from the content text library is called the first event element.
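An element template of the kind described above might be represented as follows. This is a sketch under stated assumptions: the template contents, the element names, and the `fill_template` helper are hypothetical, and the data-format converters simply stand in for whatever formats a real template would define.

```python
# Hypothetical element templates: event type -> {element name: converter for its data format}.
ELEMENT_TEMPLATES = {
    "product fraud": {
        "implementation_subject": str,
        "product_category": str,
        "event_level": int,
    },
    "senior management increasing holdings": {
        "event_time": str,
        "person": str,
        "holding_ratio": float,
    },
}

def fill_template(event_type, raw_elements):
    """Keep only the elements the template defines, converted to the template's format."""
    template = ELEMENT_TEMPLATES[event_type]
    filled = {}
    for name, convert in template.items():
        if name in raw_elements:
            filled[name] = convert(raw_elements[name])
    return filled
```

Elements that the template does not define are dropped, which mirrors the idea that different event types correspond to different event elements.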
  • In step 22, in at least one knowledge graph corresponding to the field associated with the first sample event, the associated elements associated with the first event element are obtained.
  • At least one knowledge graph can be selected according to the field associated with the first sample event.
  • the available knowledge graphs of related fields include enterprise knowledge graphs, institutional knowledge graphs, product knowledge graphs, and so on.
  • the available knowledge graphs of related fields may include person knowledge graphs, corporate knowledge graphs, stock knowledge graphs, fund knowledge graphs, and so on.
  • the event elements can be expanded in these knowledge graphs to obtain the associated elements associated with the first event element extracted in step 21.
  • the knowledge graph can be organized into the form of a node connection graph, which includes multiple nodes, each node corresponds to a knowledge point, and the nodes corresponding to the knowledge points with the association relationship are connected by connecting edges.
  • for any node, a node that can be reached through one connecting edge is called a first-degree associated node of that node,
  • and a node that can be reached through k connecting edges is called a k-degree associated node, or k-order neighbor node.
  • In step 22, the first event element extracted in step 21 can be mapped to a node in the above-mentioned knowledge graph, which is called the first node; then, starting from the first node, the nodes in the knowledge graph that are associated with the first node are used as the associated elements of the first sample event.
  • a node directly connected to the first node, that is, a first-degree associated node, can be selected as the associated element.
  • the extracted event elements include the implementation subject: company, product category: medicine, and so on.
  • for the event element "company", its first-degree associated nodes, such as "sector" and "region", can be determined in the enterprise knowledge graph.
  • for the event element "medicine", its first-degree associated nodes, for example "side effects", can be determined in the product knowledge graph. Therefore, the above associated nodes, "sector", "region", "side effects", and so on, can be used as the associated elements of the first sample event.
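The expansion from a mapped node to its associated nodes can be sketched as a breadth-first search over an adjacency list. The graph contents and the function name `associated_nodes` are hypothetical; with `max_degree=1` the function returns only the directly connected (first-degree) nodes used in the example above.

```python
from collections import deque

def associated_nodes(graph, start, max_degree=1):
    """Return {node: degree} for nodes within max_degree edges of start (excluding start)."""
    degree = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if degree[node] == max_degree:
            continue  # do not expand beyond the requested degree
        for neighbor in graph.get(node, ()):
            if neighbor not in degree:
                degree[neighbor] = degree[node] + 1
                queue.append(neighbor)
    del degree[start]
    return degree

# Toy undirected enterprise knowledge graph stored as adjacency lists (hypothetical data).
enterprise_graph = {
    "XY company": ["sector", "region"],
    "sector": ["XY company", "peer company"],
    "region": ["XY company"],
    "peer company": ["sector"],
}
```

Calling `associated_nodes(enterprise_graph, "XY company")` yields the first-degree nodes "sector" and "region"; raising `max_degree` to 2 would also include "peer company".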
  • In step 23, the event features of the first sample event are determined according to the event type of the first sample event, the first event element extracted in step 21, and the associated elements obtained by expansion in step 22.
  • the n features f1 to fn in the feature vector F include the event type of the first sample event, the features corresponding to the first event elements extracted in step 21, and the features corresponding to the associated elements obtained in step 22. These features can be either discrete or continuous. In this way, a comprehensive event feature is constructed for the first sample event.
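One simple way to turn the event type, event elements, and associated elements into a feature vector is a one-hot style encoding over a vocabulary, sketched below. The encoding scheme, the sample layout, and all function names are assumptions for illustration; the text only states that the features may be discrete or continuous, and continuous values could be appended to the vector unchanged.

```python
def sample_tokens(sample):
    """Flatten event type, event elements, and associated elements into string tokens."""
    tokens = ["type=" + sample["event_type"]]
    tokens += [f"element:{k}={v}" for k, v in sorted(sample["elements"].items())]
    tokens += ["assoc=" + a for a in sorted(sample["associated"])]
    return tokens

def build_vocabulary(samples):
    """Assign one vector index per distinct token seen in the samples."""
    vocab = {}
    for sample in samples:
        for token in sample_tokens(sample):
            vocab.setdefault(token, len(vocab))
    return vocab

def to_feature_vector(sample, vocab):
    """Encode a sample as a 0/1 vector; tokens outside the vocabulary are ignored."""
    vector = [0.0] * len(vocab)
    for token in sample_tokens(sample):
        if token in vocab:
            vector[vocab[token]] = 1.0
    return vector
```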
  • the calibrated risk value of the first sample event can also be obtained as the label of the sample, and the calibrated risk value is used to reflect the true degree of event influence in the history of the first sample event.
  • the calibrated risk value is determined by manual labeling, that is, the degree of influence caused by the first sample event is assessed manually, and a grade or score of the degree of influence or risk is given.
  • some existing index values are used as calibrated risk values.
  • the impact of the event can be reflected by the changes in the corresponding company's stock price, and correspondingly, some stock price indicators can be used as the calibrated risk value. More specifically, for example, the cumulative stock price increase or decrease within 3 days after the occurrence of the event can be used as the calibrated risk value, or the maximum drawdown within 5 days after the event occurs can be used as the calibrated risk value.
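The two stock-price labels mentioned above can be computed as in the following sketch. It assumes, purely for illustration, that `prices` is a list of closing prices whose first entry is the price on the event day and whose remaining entries cover the following trading days; the function names are hypothetical.

```python
def cumulative_change(prices):
    """Relative change from the first to the last price in the window."""
    return prices[-1] / prices[0] - 1.0

def max_drawdown(prices):
    """Largest peak-to-trough relative decline within the window (0.0 if none)."""
    peak = prices[0]
    worst = 0.0
    for price in prices:
        peak = max(peak, price)
        worst = max(worst, (peak - price) / peak)
    return worst
```

For a 3-day cumulative change the window would hold four prices (event day plus three days), and for a 5-day maximum drawdown it would hold six.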
  • the calibrated risk value of the first sample event is also obtained as the label of the sample.
  • the event feature and label of the first sample event together constitute a training sample.
  • the first sample event is any one of the aforementioned multiple sample events. Therefore, for each of the above-mentioned multiple sample events, the aforementioned steps 21-23 can be used to determine the event characteristics of each sample event and the calibration risk value of each sample event, so as to obtain multiple training samples.
  • In step 24, the gradient boosting decision tree (GBDT) model is trained according to the event features of each sample event mentioned above and the calibrated risk value of each sample event.
  • the GBDT model includes at least one decision tree, which is trained through the following process.
  • N is the number of sample events.
  • $y^{(i)}$ is the calibrated risk value of the i-th sample event.
  • the N sample events are segmented through the decision tree, the split feature and feature threshold are set at each branch node of the decision tree, and the corresponding feature of the sample event is compared with the feature threshold at the branch node.
  • the sample events are divided into corresponding child nodes.
  • the N sample events are divided among the leaf nodes. Therefore, the score of each leaf node can be obtained, that is, the average of the calibrated risk values (i.e., $y^{(i)}$) of the sample events in the leaf node.
  • the residual $r^{(i)}$ of each sample event is obtained as the difference between the calibrated risk value of the sample event and the score of the leaf node into which it falls in the aforementioned decision tree. The sample events with these residuals as labels form a new training set, which corresponds to the same set of sample events as D1.
  • by training on this new training set, a further decision tree can be obtained.
  • in this decision tree, the N sample events are likewise divided among the leaf nodes, and the score of each leaf node is the mean of the residuals of the sample events in that leaf node.
  • multiple decision trees can be obtained sequentially, and each decision tree is obtained based on the residual of the previous decision tree.
  • a GBDT model including multiple decision trees can be obtained.
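The residual-fitting procedure described above can be sketched in plain Python. To keep the example short, each tree is a depth-1 regression tree (a stump) chosen by least squared error, and no learning rate is applied; both are simplifying assumptions, and all function names are hypothetical. Each leaf score is the mean of the targets that fall into it, and every tree after the first is fit to the residuals of the trees before it.

```python
def mean(values):
    return sum(values) / len(values)

def fit_stump(X, targets):
    """Fit a depth-1 regression tree: pick the (feature, threshold) with least squared error."""
    best = None
    for j in range(len(X[0])):
        # Excluding the largest value keeps the right-hand side non-empty.
        for threshold in sorted({row[j] for row in X})[:-1]:
            left = [t for row, t in zip(X, targets) if row[j] <= threshold]
            right = [t for row, t in zip(X, targets) if row[j] > threshold]
            l_score, r_score = mean(left), mean(right)
            error = (sum((t - l_score) ** 2 for t in left)
                     + sum((t - r_score) ** 2 for t in right))
            if best is None or error < best[0]:
                best = (error, {"feature": j, "threshold": threshold,
                                "left": l_score, "right": r_score})
    if best is None:  # no feature can be split: a single leaf holding the mean
        return {"feature": 0, "threshold": float("inf"),
                "left": mean(targets), "right": 0.0}
    return best[1]

def stump_predict(stump, row):
    return stump["left"] if row[stump["feature"]] <= stump["threshold"] else stump["right"]

def fit_gbdt(X, y, n_trees=3):
    """Each tree is fit to the residuals left by the trees before it."""
    trees, residuals = [], list(y)
    for _ in range(n_trees):
        tree = fit_stump(X, residuals)
        trees.append(tree)
        residuals = [r - stump_predict(tree, row) for r, row in zip(residuals, X)]
    return trees

def gbdt_predict(trees, row):
    """The model output is the sum of the leaf scores over all trees."""
    return sum(stump_predict(tree, row) for tree in trees)
```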
  • Fig. 3 shows a decision tree trained according to an embodiment.
  • the trained decision tree includes branch nodes and leaf nodes.
  • Each branch node is set with a split feature and a feature threshold.
  • For each sample event, the value of the split feature is compared with the feature threshold at the branch node, and the sample event enters the next branch node accordingly, until it is finally divided into a leaf node.
  • the arrow from node 0 to node 1 is marked with "f1≤0.5",
  • the arrow from node 0 to node 2 is marked with "f1>0.5",
  • where f1 represents feature 1 (for example, feature 1 is the "event type"), which is the split feature of node 0, and 0.5 is the split threshold of node 0.
  • the path from the root node to a leaf node passes through several branch nodes, and each branch node corresponds to a split feature, so the path corresponds to a feature combination; the feature combination reflects the features based on which a sample event is classified into the corresponding leaf node.
  • a leaf node in a decision tree will obtain a corresponding score through training.
  • the score is, for example, the average value of the calibration risk value of each sample event in the leaf node, or the average value of the residual.
  • each branch node is also assigned a certain score, and the score is determined based on the score of the leaf node covered by the branch node.
  • the score of a branch node may be determined as the average value of the scores of the leaf nodes covered by the branch node.
  • the score of the branch node is determined based on the following formula:
  • $S_p = \dfrac{N_{c1} S_{c1} + N_{c2} S_{c2}}{N_{c1} + N_{c2}}$
  • where $S_p$ is the score of the branch (parent) node, $S_{c1}$ and $S_{c2}$ are the scores of its child nodes c1 and c2, and $N_{c1}$ and $N_{c2}$ are the numbers of samples that fall into child nodes c1 and c2 during model training. That is, the score of the parent node is the weighted average of the scores of its two child nodes, and the weight of each child node is the number of samples that fall into it during the model training process. In this way, starting from the leaf nodes, the score of each branch node can be determined layer by layer.
  • Figure 3 shows the scores of some nodes under the node, where the scores of the branch nodes are the average of the scores of the covered leaf nodes.
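The layer-by-layer computation of branch node scores as a sample-weighted average of child scores can be sketched recursively. The nested-dictionary tree layout and the function name are assumptions made for illustration.

```python
def assign_scores(node):
    """Fill in 'score' and 'n_samples' for branch nodes bottom-up; return (score, n_samples)."""
    if "children" not in node:  # leaf: score and sample count come from training
        return node["score"], node["n_samples"]
    (s1, n1), (s2, n2) = (assign_scores(child) for child in node["children"])
    node["n_samples"] = n1 + n2
    node["score"] = (n1 * s1 + n2 * s2) / (n1 + n2)
    return node["score"], node["n_samples"]
```

Because each branch node also records the total sample count beneath it, the weighted average propagates correctly through several levels of the tree.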
  • each branch node is also assigned a corresponding score.
  • the above score can also be called the risk score of the node.
  • Node weights are also assigned to each branch node through the training process.
  • For a branch node A, its node weight can be determined based on the respective node loss values of the nodes before and after branch node A is split.
  • the node loss value is determined based on the difference between the calibrated risk values of the sample events falling into the node and the risk score of the node.
  • the weight of node A can be defined as:
  • Here, nodes L and R denote the child nodes obtained by splitting node A. The loss value of node L is determined based on the difference between the calibrated risk values of the sample events falling into node L and the risk score of node L. More specifically, the loss value may be the sum of the squares of the differences between the calibrated risk value of each sample and the risk score of the node. Or, in other examples, it may be the root mean square of the differences. Similarly, the loss value of node R and the loss value of node A can be obtained, and then the weight of node A can be obtained.
  • each branch node is given a node weight. Since each branch node also corresponds to a feature, the node weight can reflect in a certain sense, the role played by the feature during this split, and to a certain extent reflect the contribution of the feature to the decision path.
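The node loss and node weight can be illustrated with the following sketch. The exact weight formula is not reproduced in this text, so the form used here, the loss of node A before the split minus the summed loss of its two child nodes, is an assumption chosen to match the description "before and after the split". The node risk score is taken as the mean of the calibrated risk values in the node, and the function names are hypothetical.

```python
def node_loss(risk_values, node_score):
    """Sum of squared differences between calibrated risk values and the node's risk score."""
    return sum((y - node_score) ** 2 for y in risk_values)

def node_weight(left_values, right_values):
    """ASSUMED form: loss of node A before the split minus the summed loss of its children."""
    def score(values):
        return sum(values) / len(values)
    parent_values = left_values + right_values
    before = node_loss(parent_values, score(parent_values))
    after = (node_loss(left_values, score(left_values))
             + node_loss(right_values, score(right_values)))
    return before - after
```

Under this assumed form, a split that separates low-risk from high-risk samples cleanly receives a large weight, while a split that barely reduces the loss receives a weight near zero.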
  • the risk assessment of events with unknown results can be carried out. Moreover, due to the characteristics of the decision tree in the above GBDT model, the risk assessment results can be better explained.
  • In step 25 of FIG. 2, the GBDT model obtained by training is used to perform risk assessment on the event to be analyzed.
  • the event to be analyzed is called the second event.
  • Fig. 4 shows a flow chart of performing risk assessment on the second event in an embodiment, that is, the sub-steps of step 25 above. It can be understood that in order to evaluate the second event, the event feature of the second event must first be constructed, and the construction process of the event feature corresponds to the construction method of the event feature of the sample event in the GBDT model training phase.
  • In step 251, the event type of the second event and at least one second event element are acquired.
  • the event type and event elements of the second event may be directly input by the user.
  • For example, when a user wants to query or evaluate the risk or impact of an event, the user can directly enter the description of the second event in the query interface, such as "FF Company User Data Leakage", and select the event type "Information Leakage"; then, in the element template provided according to the event type, the user enters the event elements of the event, such as the implementation subject, data category, event level, and so on.
  • the text describing the second event may be input to the evaluation system, and the evaluation system performs event identification and element extraction.
  • the above-mentioned input text can be, for example, news reports such as financial information, or various articles on the Internet.
  • the process of event recognition and element extraction is similar to the aforementioned step 21. That is, the natural language processing model is used to identify the second event and the second event type from the input text; and according to the second event type, the event element of the second event is extracted from the input text.
  • In step 252, in at least one knowledge graph related to the field of the second event, the associated elements associated with the event elements of the second event are obtained. Specifically, in the knowledge graph, the event element of the second event may be mapped to a second node, and then the nodes associated with the second node may be used as the associated elements. This process is similar to the aforementioned step 22 and will not be repeated here.
  • the event characteristics of the second event are determined according to the event type, event elements, and related elements of the second event, which are hereinafter referred to as second event characteristics.
  • the second event feature can be expressed as a feature vector V. In this way, an event feature is constructed for the second event.
  • In step 254, the event feature V of the second event is input to the GBDT model obtained by the aforementioned training, and the risk value of the second event is determined according to the model output.
  • the GBDT model obtained by training includes at least one decision tree, and the branch nodes in the decision tree correspond to split features and feature thresholds.
  • When the second event feature V is input into the GBDT model, at each branch node of the decision tree, the value in the feature vector V of the feature corresponding to the branch node's split feature is compared with the feature threshold, and according to the comparison result, the second event is divided into a node of the next level, until it is divided into a leaf node.
  • FIG. 5 shows the division process of the second event in the decision tree in one embodiment, which is the same as the decision tree shown in FIG. 3.
  • the split feature at node 0 is f1 "event type", and the feature threshold is 0.5; the split feature at node 2 is f3 "implementation subject", and the feature threshold is 0.6.
  • the event feature vector V of the second event is input into the decision tree.
  • the feature value corresponding to the "event type” is 0.8, which is greater than the feature threshold value 0.5 of the split feature, so the second event is divided from node 0 to node 2.
  • At node 2, the split feature is "implementation subject". Assuming that the feature value of the feature "implementation subject" in the second event feature vector V is 0.2, which is smaller than the feature threshold 0.6 of the split feature, the second event is then divided into node 5. This continues until the second event is divided into leaf node 16.
  • the GBDT model can output the score of the leaf node into which the second event is divided. Therefore, in step 254, the leaf node score output by the model can be used as the risk value of the second event.
  • the score 0.062 of the leaf node 16 in FIG. 5 can be used as the risk value of the second event.
  • When the GBDT model includes multiple decision trees, the second event is divided into a corresponding leaf node in each decision tree. In this case, the GBDT model can determine the score of the leaf node where the second event is located in each decision tree, and use the sum of these leaf node scores, that is, the total score, as the output result. Therefore, the total score output by the GBDT model can be used as the risk value of the second event.
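The traversal and summation described above can be sketched as follows. The nested-dictionary tree layout, the convention that a value less than or equal to the threshold goes to the left child, the second toy tree, and every leaf score other than 0.062 are assumptions for illustration; the first toy tree only loosely mirrors the thresholds 0.5 and 0.6 mentioned for Fig. 5.

```python
def leaf_score(tree, features):
    """Walk one tree from the root: go left if the feature value is <= threshold, else right."""
    node = tree
    while "feature" in node:  # branch nodes carry a split feature; leaves carry only a score
        go_left = features[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["score"]

def risk_value(trees, features):
    """Sum the leaf scores reached in every tree to get the total risk value."""
    return sum(leaf_score(tree, features) for tree in trees)

# Two toy trees with hypothetical structure and scores.
tree_1 = {"feature": "f1", "threshold": 0.5,
          "left": {"score": 0.01},
          "right": {"feature": "f3", "threshold": 0.6,
                    "left": {"score": 0.062}, "right": {"score": 0.2}}}
tree_2 = {"feature": "f3", "threshold": 0.4,
          "left": {"score": -0.01}, "right": {"score": 0.03}}
```

With `features = {"f1": 0.8, "f3": 0.2}`, the event goes right at the root of `tree_1` and then left, reaching the leaf with score 0.062, and the total over both trees is the sum of the two leaf scores reached.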
  • the risk value of the second event can be determined according to the model output, so as to perform a quantitative risk assessment of the second event.
  • performing risk assessment on the second event in step 25 may also include, after the risk value of the second event is given in step 254, performing characteristic interpretation on the risk value of the second event.
  • Fig. 6 shows a flow of steps for feature interpretation in an embodiment.
  • the decision path of the second event in the decision tree is determined according to the event characteristics of the second event.
  • the second event is divided into child nodes according to the characteristic value of the corresponding feature of the second event until the leaf node is reached. In this way, the path taken from the root node to the leaf node to which the second event is divided in the decision tree is the decision path.
  • the second event is finally divided into leaf node 16, and the path from root node 0 through node 2, node 5, and node 11 to node 16 is the decision path of the second event.
  • the corresponding decision path can be determined in each decision tree.
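Determining the decision path can be sketched as follows. The node ids 0, 2, 5, 11, and 16 and the 0.6 threshold for "implementation subject" follow the example in the text. Everything else is an assumption made for illustration: the dict-based tree layout, the other split features and thresholds, the event's feature values, and the placeholder leaves that stand in for off-path subtrees.

```python
def decision_path(nodes, features, root=0):
    """Return the node ids visited from the root to the leaf the event falls into."""
    path = [root]
    node = nodes[root]
    while "score" not in node:  # stop once a leaf node is reached
        go_left = features[node["feature"]] < node["threshold"]
        next_id = node["left"] if go_left else node["right"]
        path.append(next_id)
        node = nodes[next_id]
    return path


# Hypothetical tree; nodes 1, 6, 12 and 15 are collapsed into placeholder leaves.
nodes = {
    0: {"feature": "event type", "threshold": 0.5, "left": 1, "right": 2},
    2: {"feature": "implementation subject", "threshold": 0.6, "left": 5, "right": 6},
    5: {"feature": "event level", "threshold": 0.4, "left": 11, "right": 12},
    11: {"feature": "event location", "threshold": 0.5, "left": 15, "right": 16},
    1: {"score": 0.0}, 6: {"score": 0.0}, 12: {"score": 0.0}, 15: {"score": 0.0},
    16: {"score": 0.062},
}

event = {"event type": 0.7, "implementation subject": 0.2,
         "event level": 0.1, "event location": 0.9}
path = decision_path(nodes, event)
print(path)  # [0, 2, 5, 11, 16]
```

The last id in the path is the leaf node; every earlier id is a branch node whose feature and node weight can then be looked up.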
  • In step 62, each branch node through which the decision path passes is determined, and the feature and node weight corresponding to each branch node are obtained.
  • The starting point of the decision path is the root node of the decision tree, the ending point is the leaf node into which the second event is divided, and every node on the path other than the leaf node is a branch node.
  • In this way, each branch node included in the decision path can be determined.
  • When there are multiple decision trees, each branch node included in the multiple decision paths is determined.
  • each branch node in the decision tree is given a certain node weight. In this way, the node weight of each branch node in the decision path can be determined.
  • In step 63, for a first feature included in the event features of the second event, the feature weight of the first feature is determined according to the node weights of the at least one branch node corresponding to the first feature among the above branch nodes, and this feature weight is used as the importance of the first feature to the risk value.
  • Each branch node in a decision tree corresponds to one feature, but a feature can appear in multiple branch nodes of multiple decision trees, or even in multiple branch nodes of the same decision tree. Therefore, for the first feature, at least one branch node corresponding to it can be determined from the branch nodes included in the decision path, the node weight of each such branch node can be obtained, and the feature weight of the first feature can be determined accordingly.
  • the feature weight of the first feature may be an average value of the node weights of the at least one branch node corresponding to the first feature. In this way, the feature weight of the first feature is obtained, and the feature weight can reflect the contribution or importance of the first feature to the risk value of the second event.
  • In the same way, the feature weight of each feature in the event features of the second event can be obtained as its contribution or importance to the risk value of the second event.
  • The features can then be ranked by feature weight, indicating the order of importance of the features that affect the risk value of the second event.
  • For example, suppose the second event is "historical financial fraud by a listed company."
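The averaging and ranking described above can be sketched as follows. The input is assumed to be the list of (feature, node weight) pairs collected from the branch nodes on the decision path or paths. The feature names and weights are made-up values, and averaging is the option the text mentions as one possibility.

```python
from collections import defaultdict


def feature_importance(path_branches):
    """Average the node weights per feature and rank features by that average.

    path_branches: (feature, node_weight) pairs for the branch nodes that the
    decision path(s) pass through; a feature may occur more than once.
    """
    grouped = defaultdict(list)
    for feature, weight in path_branches:
        grouped[feature].append(weight)
    averaged = {f: sum(ws) / len(ws) for f, ws in grouped.items()}
    # Highest feature weight first, i.e. most important feature first.
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical branch nodes; "implementation subject" is split on twice.
branches = [
    ("event type", 0.8),
    ("implementation subject", 0.5),
    ("event level", 0.2),
    ("implementation subject", 0.3),
]
ranking = feature_importance(branches)
for feature, weight in ranking:
    print(feature, round(weight, 3))
```

Here "implementation subject" receives the mean of its two node weights, so it ranks between "event type" and "event level".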
  • the second event is divided into leaf nodes via the decision path, and the risk value of the second event is determined by the score of the leaf node.
  • the decision path passes through multiple branch nodes, and each branch node corresponds to a feature. Therefore, the decision path can correspond to the feature combination of the split features of each branch node passed.
  • In this way, the contribution or importance of each feature to the final risk value can be measured, that is, a feature interpretation of the risk value is performed. The above process therefore not only determines the risk value of the second event through the GBDT model, but also interprets that risk value in terms of features, showing how large a role each feature played in producing it.
  • the above describes the process of obtaining the comprehensive event characteristics of the second event after expanding the event elements through the knowledge graph for the second event to be evaluated, and inputting the event characteristics into the trained GBDT model to obtain the risk value of the second event.
  • the parameters in the GBDT model can also be used to characterize the obtained risk value.
  • the above evaluation process is applicable to the situation where the corresponding elements of the second event can be obtained, and then the event characteristics can be constructed.
  • The GBDT model obtained through the above training can also be applied to conditional prediction for events whose complete event features cannot be obtained. That is, when only a few elements of the event can be obtained, an assessment is given of the event's different risk trends under different conditions or circumstances.
  • In other words, the trained GBDT model can also be used to assess the risk trend of the event in different situations: for example, under which conditions the event will have a large impact on public opinion risk, and under which conditions the impact of the event will be smallest.
  • the evaluation process for such a second event is described below.
  • Fig. 7 shows a flowchart of steps for evaluating a second event according to an embodiment.
  • In step 71, at least one event element of the second event is acquired.
  • This process suits the case where the elements of the second event are incomplete. The event elements acquired in step 71 may therefore be few and incomplete, for example only the implementation subject, or even only the event type.
  • For example, it may be that only the event type, "product fraud", and the implementation subject, a certain company, can be obtained.
  • In step 72, the second event is divided in the decision tree according to the at least one event element, and a subtree of the decision tree is determined based on the stop node of the division.
  • Specifically, the second event is divided in the decision tree according to the acquired elements until a node is reached at which it cannot be divided further. That node is the stop node, and the subtree of the decision tree is determined from it: the subtree is the node area covered by the stop node.
  • For example, suppose node 0 has the split feature "event type". Assuming that the event-type feature value of the second event, "vaccine fraud by a certain company", is 0.3, which is less than the feature threshold of 0.5, the second event is divided into node 1.
  • the split feature at node 1 is f2 "penalty type". However, as described above, because the elements of the second event are incomplete, this feature cannot be obtained, so the second event cannot be divided, and node 1 is the stop node.
  • the node area covered by node 1 is the aforementioned subtree.
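Finding the stop node and the leaves of its subtree can be sketched as follows. The node ids, the "event type" split with threshold 0.5, and the "penalty type" split at node 1 follow the example. The dict layout, the remaining features and thresholds, and all leaf scores are assumptions made for illustration.

```python
def find_stop_node(nodes, known, root=0):
    """Divide the event until a split feature is unknown (or a leaf is reached)."""
    node_id = root
    while True:
        node = nodes[node_id]
        if "score" in node or node["feature"] not in known:
            return node_id  # cannot be divided further: this is the stop node
        go_left = known[node["feature"]] < node["threshold"]
        node_id = node["left"] if go_left else node["right"]


def subtree_leaves(nodes, node_id):
    """Collect the ids of all leaf nodes covered by the given node."""
    node = nodes[node_id]
    if "score" in node:
        return [node_id]
    return subtree_leaves(nodes, node["left"]) + subtree_leaves(nodes, node["right"])


# Hypothetical tree; node 2 is collapsed into a placeholder leaf.
nodes = {
    0: {"feature": "event type", "threshold": 0.5, "left": 1, "right": 2},
    1: {"feature": "penalty type", "threshold": 0.5, "left": 3, "right": 4},
    2: {"score": 0.010},
    3: {"feature": "event level", "threshold": 0.5, "left": 7, "right": 8},
    4: {"feature": "event location", "threshold": 0.5, "left": 9, "right": 10},
    7: {"score": 0.020}, 8: {"score": 0.090},
    9: {"score": -0.030}, 10: {"score": 0.040},
}

known = {"event type": 0.3}  # only the event type could be obtained
stop = find_stop_node(nodes, known)
leaves = subtree_leaves(nodes, stop)
print(stop, leaves)  # 1 [7, 8, 9, 10]
```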
  • In step 73, a first leaf node in the aforementioned subtree that meets a predetermined condition is determined, together with the conditional path from the root node to the first leaf node.
  • the above-mentioned predetermined conditions can be set according to evaluation needs, for example, the risk is the largest, the risk is the smallest, the risk value meets a certain threshold, and so on.
  • the leaf node with the largest score is selected from the leaf nodes included in the subtree as the first leaf node.
  • the path from the root node to the leaf node is the above conditional path.
  • For example, the stop node is node 1, and the determined subtree contains leaf nodes 7, 8, 9, and 10. Assuming that node 8 has the largest score, node 8 can be determined as the leaf node for the maximum-risk condition, and the path from node 0 to node 8, that is, the path containing nodes 0, 1, 3, and 8, is used as the conditional path.
  • the corresponding leaf node is selected as the first leaf node according to the score of each leaf node.
  • In step 74, the feature combination corresponding to the branch nodes included in the conditional path is obtained, and the feature combination is used as the influence features of the second event under the predetermined condition.
  • The conditional path corresponds to the division path the second event would take if the predetermined condition were to occur. The feature combination corresponding to the branch nodes on this path therefore consists of the features that influence the second event so that it meets the predetermined condition. For example, if the predetermined condition is maximum risk, the feature combination corresponding to the conditional path consists of the influence features that lead the second event to its maximum risk. In this way, conditional prediction and interpretation of the second event are performed, and the different influence features under different conditions are given to help predict the subsequent trend of the event.
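Steps 73 and 74 can be sketched together. The node ids and the conditional path 0, 1, 3, 8 follow the example in the text. The dict layout, the split features other than "event type" and "penalty type", and the leaf scores are assumptions, and "maximum risk" is implemented here simply as the largest leaf score.

```python
def root_to_leaf_paths(nodes, node_id, prefix=()):
    """Enumerate every path (as a tuple of node ids) from node_id down to a leaf."""
    path = prefix + (node_id,)
    node = nodes[node_id]
    if "score" in node:
        return [path]
    return (root_to_leaf_paths(nodes, node["left"], path)
            + root_to_leaf_paths(nodes, node["right"], path))


def conditional_path(nodes, path_to_stop, pick=max):
    """Pick the leaf under the stop node by score and return the root-to-leaf path.

    path_to_stop: node ids from the root down to and including the stop node.
    pick: max for the maximum-risk condition, min for the minimum-risk condition.
    """
    *ancestors, stop = path_to_stop
    candidates = root_to_leaf_paths(nodes, stop, tuple(ancestors))
    return pick(candidates, key=lambda p: nodes[p[-1]]["score"])


def feature_combination(nodes, path):
    """Features of the branch nodes on the path (every node except the leaf)."""
    return [nodes[node_id]["feature"] for node_id in path[:-1]]


# Hypothetical subtree under stop node 1; scores are made up.
nodes = {
    0: {"feature": "event type", "threshold": 0.5, "left": 1, "right": 2},
    1: {"feature": "penalty type", "threshold": 0.5, "left": 3, "right": 4},
    2: {"score": 0.010},
    3: {"feature": "event level", "threshold": 0.5, "left": 7, "right": 8},
    4: {"feature": "event location", "threshold": 0.5, "left": 9, "right": 10},
    7: {"score": 0.020}, 8: {"score": 0.090},
    9: {"score": -0.030}, 10: {"score": 0.040},
}

max_path = conditional_path(nodes, (0, 1))
print(max_path)                              # (0, 1, 3, 8)
print(feature_combination(nodes, max_path))  # ['event type', 'penalty type', 'event level']
```

Passing `pick=min` instead selects the minimum-risk leaf, which in this toy tree is node 9.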
  • the following information can also be provided as an assessment of the second event.
  • the score of the above-mentioned first leaf node may be provided as the risk value of the second event under predetermined conditions.
  • the score of node 8 may be provided as the possible maximum risk value of the second event.
  • the importance of each feature in the above-mentioned feature combination may be determined according to the node weight of the branch node in the above-mentioned conditional path. This process is similar to the aforementioned step 63.
  • In this way, a second event with few elements and incomplete features can still be evaluated, and the feature conditions that the second event would meet under different risk outcomes are given, making better use of the characteristics of the GBDT model to explain and predict the future risk of the event.
  • a device for event risk assessment is provided.
  • the device can be deployed in any device, platform or device cluster with computing and processing capabilities.
  • Fig. 8 shows a schematic block diagram of an event evaluation device according to an embodiment. As shown in Figure 8, the evaluation device 800 includes:
  • the extraction unit 81 is configured to use a natural language processing model to extract a plurality of sample events from the content text library, the plurality of sample events including a first sample event, wherein extracting the plurality of sample events includes identifying the first sample event and its corresponding first event type, and extracting at least one first event element of the first sample event according to the first event type;
  • the associating unit 82 is configured to obtain at least one first associated element associated with the at least one first event element in at least one knowledge graph corresponding to at least one field associated with the first sample event;
  • the determining unit 83 is configured to determine the event feature of the first sample event according to the first event type, the at least one first event element, and the at least one first correlation element;
  • the training unit 84 is configured to train the gradient boosting decision tree GBDT model according to the event characteristics of each sample event in the multiple sample events and the calibration risk value of each sample event to obtain the trained GBDT model;
  • the evaluation unit 85 is configured to use the trained GBDT model to perform risk evaluation on the second event to be analyzed.
  • the extraction unit 81 is specifically configured to: determine a first template corresponding to the first event type; and use the first template to extract at least one first event element of the first sample event from the content text library.
  • the aforementioned first event element includes at least one of the following: event time, event location, implementation subject, event object, fact type, and event level.
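  • the associating unit 82 is specifically configured to: map the at least one first event element to a first node in the at least one knowledge graph, and take the nodes directly connected to the first node in the at least one knowledge graph as the at least one first associated element.
  • the above-mentioned knowledge graph may include one or more of the following: enterprise knowledge graph, product knowledge graph, person knowledge graph, information knowledge graph, stock knowledge graph, fund knowledge graph, and institution knowledge graph.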
  • the evaluation unit 85 includes:
  • the element acquisition module 851 is configured to acquire the event type of the second event and at least one second event element;
  • the element association module 852 is configured to obtain at least one second association element associated with the at least one second event element in the at least one knowledge graph;
  • the first determining module 853 is configured to determine the event feature of the second event according to the event type of the second event, the at least one second event element, and the at least one second correlation element;
  • the second determining module 854 is configured to input the event characteristics of the second event into the trained GBDT model, and determine the risk value of the second event according to the model output.
  • the element acquisition module 851 is configured to: identify the second event and the second event type from input text, and extract the at least one second event element from the input text according to the second event type.
  • alternatively, the element acquisition module 851 is configured to: receive the input second event and the at least one second event element.
  • the GBDT model obtained by training includes at least one decision tree, the decision tree includes branch nodes and leaf nodes, and each branch node corresponds to a feature and has a risk score and a node weight obtained by training.
  • the node weight is determined based on the respective node loss values of the branch node and of the nodes after splitting, and the node loss value is determined based on the difference between the calibrated risk values of the sample events falling into the node and the risk score of the node;
  • the evaluation unit 85 further includes (not shown):
  • a decision path determining module configured to determine the decision path of the second event in the decision tree according to the event characteristics of the second event
  • a node weight determination module configured to determine each branch node through which the decision path passes, and obtain the characteristics and node weights corresponding to each branch node;
  • an importance determination module, configured to determine, for a first feature included in the event features of the second event, the feature weight of the first feature according to the node weight of at least one branch node corresponding to the first feature among the branch nodes, as the importance of the first feature to the risk value.
  • the evaluation unit 85 includes (not shown):
  • An element acquisition module configured to acquire at least one second event element of the second event
  • a subtree determining module configured to divide a second event in the decision tree according to the at least one second event element, and determine a subtree of the decision tree based on the divided stop nodes;
  • a conditional path determining module configured to determine a first leaf node in the subtree that meets a predetermined condition and a conditional path from the root node to the first leaf node;
  • the feature determination module is configured to obtain a feature combination corresponding to a branch node included in the conditional path, and use the feature combination as an impact feature of the second event under the predetermined condition.
  • each leaf node in the decision tree obtains a risk score through training, and each branch node corresponds to a feature and has a risk score and a node weight obtained by training, wherein the node weight is determined based on the respective node loss values of the branch node and of the nodes after splitting, and the node loss value is determined based on the difference between the calibrated risk values of the sample events falling into the node and the risk score of the node;
  • the evaluation unit further includes one or more of the following:
  • a third determining module configured to determine the first risk score corresponding to the first leaf node as the risk value of the second event under the predetermined condition
  • the fourth determining module is configured to determine the importance of each feature corresponding to each branch node in the feature combination according to the node weight of each branch node in the conditional path.
  • the training and use of the GBDT model can be realized, and the event risk can be effectively evaluated and explained.
  • a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed in a computer, the computer is caused to execute the method described in conjunction with FIG. 2.
  • a computing device including a memory and a processor, wherein the memory stores executable code, and when the processor executes the executable code, the method described in conjunction with FIGS. 2 and 4 is implemented.

Abstract

A computer implemented event risk assessment method and device. In the method above, first a natural language processing model is used to extract a plurality of sample events from a content text library, which comprises identifying a first sample event and event type corresponding thereto and extracting a first event element of the first sample event according to the event type. Then, in a knowledge graph associated with the first sample event, a first associated element associated with the first event element is obtained. Next, according to the event type, the first event element and the first associated element, event features of the first sample event are determined. A GBDT model may be trained on the basis of the event features of each sample event among the plurality of sample events and a calibrated risk value of each sample event. Therefore, the trained GBDT model may be used to assess the risk value of a second event to be analyzed, and the features of the assessed risk value may also be expounded upon.

Description

Method and device for computer-executed event risk assessment
Technical field
One or more embodiments of this specification relate to the field of machine learning, and in particular to methods and devices for assessing event risk using machine learning.
Background
With the development of computer technology, machine learning has been applied to a wide variety of technical fields to analyze and predict business data. In many application scenarios, business events need to be analyzed and predicted, in particular the risk level of various events, such as public opinion risk and security risk, so that early warnings can be issued and the relevant business personnel can prepare.
Therefore, an improved scheme is desired that can effectively assess the risk level of an event.
Summary of the invention
One or more embodiments of this specification describe a computer-executed event risk assessment method and device, which construct event features by expanding the elements of an event and train a GBDT model, so as to assess the risk level of the event effectively and to provide a corresponding feature interpretation of the assessed risk value.
According to a first aspect, a computer-executed event risk assessment method is provided, including:
using a natural language processing model to extract multiple sample events from a content text library, the multiple sample events including a first sample event, wherein extracting the multiple sample events includes identifying the first sample event and its corresponding first event type, and extracting at least one first event element of the first sample event according to the first event type;
obtaining, in at least one knowledge graph corresponding to at least one field associated with the first sample event, at least one first associated element associated with the at least one first event element;
determining the event features of the first sample event according to the first event type, the at least one first event element, and the at least one first associated element;
training a gradient boosting decision tree (GBDT) model according to the event features of each of the multiple sample events and the calibrated risk value of each sample event, to obtain a trained GBDT model;
using the trained GBDT model to perform risk assessment on a second event to be analyzed.
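The training step can be illustrated with a deliberately small gradient-boosting loop. This is a sketch under stated assumptions, not the implementation described in this specification: it assumes squared-error loss, depth-1 trees (stumps), a fixed learning rate, and made-up two-dimensional event feature vectors with made-up calibrated risk values. All function names are hypothetical.

```python
def fit_stump(X, residuals):
    """Find the single (feature, threshold) split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] < t]
            right = [r for row, r in zip(X, residuals) if row[j] >= t]
            if not left or not right:
                continue  # a split must send samples to both children
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, t, lmean, rmean)
    _, j, t, lmean, rmean = best
    return {"feature": j, "threshold": t, "left": lmean, "right": rmean}


def stump_predict(stump, row):
    """Score of the leaf that the row falls into."""
    return stump["left"] if row[stump["feature"]] < stump["threshold"] else stump["right"]


def train_gbdt(X, y, n_trees=20, lr=0.5):
    """Fit each new stump to the residuals of the current ensemble."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    trees = []
    for _ in range(n_trees):
        # For squared-error loss the negative gradient is the plain residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        trees.append(stump)
        preds = [p + lr * stump_predict(stump, row) for p, row in zip(preds, X)]
    return base, trees, lr


def predict(model, row):
    """Risk value: base score plus the scaled sum of leaf scores over all trees."""
    base, trees, lr = model
    return base + lr * sum(stump_predict(t, row) for t in trees)


# Made-up event feature vectors and calibrated risk values.
X = [[0.1, 0.9], [0.2, 0.1], [0.8, 0.7], [0.9, 0.2]]
y = [0.1, 0.2, 0.8, 0.9]
model = train_gbdt(X, y)

baseline = sum(y) / len(y)
baseline_mse = sum((yi - baseline) ** 2 for yi in y) / len(y)
final_mse = sum((yi - predict(model, row)) ** 2 for row, yi in zip(X, y)) / len(y)
print(final_mse < baseline_mse)  # True
```

A production system would more likely rely on an established GBDT library; the loop above only shows how each tree is fitted to what the previous trees left unexplained, and how the risk value is the sum of per-tree leaf scores.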
In one embodiment, the at least one first event element of the first sample event is extracted as follows:
determining a first template corresponding to the first event type; and using the first template to extract the at least one first event element of the first sample event from the content text library.
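Template-based element extraction can be illustrated as follows. The text does not say what form a template takes, so this sketch assumes, purely for illustration, that a template is a regular expression with one named group per event element. The template, the sample sentence, and the names `TEMPLATES` and `extract_elements` are all hypothetical.

```python
import re

# One hypothetical template per event type; named groups are the event elements.
TEMPLATES = {
    "product fraud": re.compile(
        r"On (?P<event_time>\d{4}-\d{2}-\d{2}), (?P<implementation_subject>.+?) "
        r"was found to have committed (?P<event_object>\w+) fraud "
        r"in (?P<event_location>[\w ]+)\."
    ),
}


def extract_elements(event_type, text):
    """Select the template for the event type and pull the elements out of the text."""
    template = TEMPLATES.get(event_type)
    if template is None:
        return {}  # no template registered for this event type
    match = template.search(text)
    return match.groupdict() if match else {}


text = "On 2018-07-15, Company A was found to have committed vaccine fraud in City B."
elements = extract_elements("product fraud", text)
print(elements["implementation_subject"], elements["event_location"])  # Company A City B
```

Selecting the template by event type is the part that mirrors the text; a real system could replace the regular expression with slot filling in a natural language processing model.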
In one embodiment, the at least one first event element includes at least one of the following: event time, event location, implementation subject, event object, fact type, and event level.
According to one implementation, the associated elements are obtained as follows:
mapping the at least one first event element to a first node in the at least one knowledge graph; and taking the nodes directly connected to the first node in the at least one knowledge graph as the at least one first associated element.
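The expansion through directly connected nodes can be sketched with an adjacency map. The graph contents and the helper name `associated_elements` are made up for illustration, and an exact name match is assumed when mapping an event element to a graph node.

```python
# Hypothetical enterprise knowledge graph as an adjacency map (node -> neighbours).
KNOWLEDGE_GRAPH = {
    "Company A": {"Product X", "Person P", "Stock S"},
    "Product X": {"Company A"},
    "Person P": {"Company A", "Company B"},
    "Stock S": {"Company A"},
    "Company B": {"Person P"},
}


def associated_elements(graph, event_elements):
    """Return the nodes directly connected to the nodes the event elements map to."""
    associated = set()
    for element in event_elements:
        # Elements without a matching node contribute nothing.
        associated |= graph.get(element, set())
    # Do not report an original event element as its own associated element.
    return associated - set(event_elements)


expanded = associated_elements(KNOWLEDGE_GRAPH, ["Company A"])
print(sorted(expanded))  # ['Person P', 'Product X', 'Stock S']
```

Only one-hop neighbours are taken, so "Company B", which is two hops from "Company A", is not included.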
In one embodiment, the knowledge graph may include: an enterprise knowledge graph, a product knowledge graph, a person knowledge graph, an information knowledge graph, a stock knowledge graph, a fund knowledge graph, and an institution knowledge graph.
According to one implementation, after the GBDT model is trained, performing risk assessment on the second event to be analyzed specifically includes:
acquiring the event type of the second event and at least one second event element;
obtaining, in the at least one knowledge graph, at least one second associated element associated with the at least one second event element;
determining the event features of the second event according to the event type of the second event, the at least one second event element, and the at least one second associated element;
inputting the event features of the second event into the trained GBDT model, and determining the risk value of the second event according to the model output.
Further, in one embodiment, the second event elements are obtained as follows:
identifying the second event and the second event type from input text;
extracting the at least one second event element from the input text according to the second event type.
Alternatively, the input second event and the at least one second event element may be received directly.
In one embodiment, the trained GBDT model includes at least one decision tree, the decision tree includes branch nodes and leaf nodes, and each branch node corresponds to a feature and has a risk score and a node weight obtained through training, where the node weight is determined based on the respective node loss values of the branch node and of the nodes after splitting, and the node loss value is determined based on the difference between the calibrated risk values of the sample events falling into the node and the risk score of the node. In this case, performing risk assessment on the second event to be analyzed further includes:
determining the decision path of the second event in the decision tree according to the event features of the second event;
determining each branch node through which the decision path passes, and obtaining the feature and node weight corresponding to each branch node;
for a first feature included in the event features of the second event, determining the feature weight of the first feature according to the node weight of at least one branch node corresponding to the first feature among the branch nodes, as the importance of the first feature to the risk value.
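The text states only that the node weight is based on the node loss values before and after splitting. One plausible reading, assumed here for illustration, is that the node loss is the sum of squared differences between the calibrated risk values and the node's risk score, and that the node weight is the reduction in loss achieved by the split. All numbers and function names below are made up.

```python
def node_loss(calibrated_values, node_score):
    """Assumed loss: squared difference between calibrated risk values and node score."""
    return sum((value - node_score) ** 2 for value in calibrated_values)


def node_weight(parent, left, right):
    """Assumed weight: loss of the branch node minus the loss of its two children.

    Each argument is a (calibrated_values, node_score) pair.
    """
    return node_loss(*parent) - (node_loss(*left) + node_loss(*right))


# Four sample events reach the branch node; the split separates low from high risk.
parent = ([0.1, 0.2, 0.8, 0.9], 0.5)
left = ([0.1, 0.2], 0.15)
right = ([0.8, 0.9], 0.85)

weight = node_weight(parent, left, right)
print(round(weight, 3))  # 0.49
```

Under this reading, a split that separates samples cleanly gets a large node weight, so features used at such splits come out as more important in the feature interpretation.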
According to another implementation, the trained GBDT model includes at least one decision tree, and the decision tree includes branch nodes and leaf nodes; after such a GBDT model is obtained, performing risk assessment on the second event to be analyzed specifically includes:
acquiring at least one second event element of the second event;
dividing the second event in the decision tree according to the at least one second event element, and determining a subtree of the decision tree based on the stop node of the division;
determining a first leaf node in the subtree that meets a predetermined condition, and a conditional path from the root node to the first leaf node;
obtaining the feature combination corresponding to the branch nodes included in the conditional path, and using the feature combination as the influence features of the second event under the predetermined condition.
Further, in one embodiment, each leaf node in the decision tree obtains a risk score through training, and each branch node corresponds to a feature and has a risk score and a node weight obtained through training, where the node weight is determined based on the respective node loss values of the branch node and of the nodes after splitting, and the node loss value is determined based on the difference between the calibrated risk values of the sample events falling into the node and the risk score of the node. Accordingly, in one embodiment, performing risk assessment on the second event to be analyzed further includes one or more of the following:
determining the first risk score corresponding to the first leaf node as the risk value of the second event under the predetermined condition;
determining, according to the node weight of each branch node in the conditional path, the importance of each feature in the feature combination corresponding to each branch node.
According to a second aspect, a computer-executed event risk assessment device is provided, including:
an extraction unit, configured to use a natural language processing model to extract multiple sample events from a content text library, the multiple sample events including a first sample event, wherein extracting the multiple sample events includes identifying the first sample event and its corresponding first event type, and extracting at least one first event element of the first sample event according to the first event type;
an associating unit, configured to obtain, in at least one knowledge graph corresponding to at least one field associated with the first sample event, at least one first associated element associated with the at least one first event element;
a feature determining unit, configured to determine the event features of the first sample event according to the first event type, the at least one first event element, and the at least one first associated element;
a training unit, configured to train a gradient boosting decision tree (GBDT) model according to the event features of each of the multiple sample events and the calibrated risk value of each sample event, to obtain a trained GBDT model;
an evaluation unit, configured to use the trained GBDT model to perform risk assessment on a second event to be analyzed.
According to a third aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to execute the method of the first aspect.
According to a fourth aspect, a computing device is provided, including a memory and a processor, wherein executable code is stored in the memory, and when the processor executes the executable code, the method of the first aspect is implemented.
According to the method and device provided by the embodiments of this specification, more comprehensive event features are constructed by expanding the event elements in knowledge graphs of the related fields. Based on the event features and calibrated risk values of the sample events, a GBDT model containing decision trees can be trained. Such a GBDT model can be used not only to assess the risk value of an event whose risk is unknown, but also to give a feature interpretation of that risk value. In this way, quantitative prediction is achieved while the prediction result gains a clearer logical expression and better interpretability.
附图说明Description of the drawings
为了更清楚地说明本发明实施例的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其它的附图。In order to explain the technical solutions of the embodiments of the present invention more clearly, the following will briefly introduce the drawings used in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments of the present invention. A person of ordinary skill in the art can obtain other drawings based on these drawings without creative work.
Figure 1 is a schematic diagram of the implementation process of an embodiment disclosed in this specification;
Figure 2 shows a flowchart of an event risk assessment method according to an embodiment;
Figure 3 shows a decision tree obtained by training according to an embodiment;
Figure 4 shows a flowchart of performing risk assessment on a second event in an embodiment;
Figure 5 shows the process of dividing a second event in a decision tree in an embodiment;
Figure 6 shows the steps of feature interpretation in an embodiment;
Figure 7 shows a flowchart of the steps of evaluating a second event according to an embodiment;
Figure 8 shows a schematic block diagram of an event evaluation apparatus according to an embodiment.
Detailed Description
The solutions provided in this specification are described below with reference to the drawings.
As mentioned above, in many application scenarios various kinds of events need to be studied and their risks assessed, for example, determining the degree of impact and risk that a user information leakage event at an Internet company has on network security. In general, the methods of analysis in this field of event research fall into two categories: quantitative methods and qualitative methods. Quantitative methods usually mine public opinion factors in a quantified manner and construct quantitative public opinion factors based on AI algorithms. That is, the event is first factorized, and its impact and risk are then measured by quantitative indicators, such as the level of historical investment returns within a predetermined time after the event. However, such schemes often lack a fine-grained division of event types, lose the logical thread of the event, and are not very interpretable. Moreover, the measured impact and risk of an event depend on the granularity at which events are divided during factorization. Because the event definition often fails to distinguish some key attribute of the event, it is difficult to discover truly meaningful factors or features.
Qualitative methods usually rely on manual annotation, with the definition of the event and the analysis of its degree of risk completed by hand. This process demands strong professional expertise and requires each event to be analyzed individually; it is neither systematic nor automated, so analysis efficiency is low. In addition, whether the analysis result is correct depends on whether the subjective experience of the analyst covers the key attributes of the event. Furthermore, the conclusions of qualitative analysis usually go no further than a positive or negative direction; the degree of impact cannot be quantified, and the judgment is highly subjective.
On this basis, the embodiments of this specification provide an improved scheme for assessing event risk, which provides objective, quantitative predictive analysis while also making the prediction result more interpretable. Figure 1 is a schematic diagram of the implementation process of an embodiment disclosed in this specification. As shown in Figure 1, according to the scheme of the embodiment, sample events are first extracted, and features are constructed for the sample events. When constructing the features of an event, not only the elements of the event itself are considered; knowledge graphs of related fields are also used, and related elements are mined from the knowledge graphs. Together these constitute the event features, which makes the event features more comprehensive and richer. On this basis, a gradient boosting decision tree (GBDT) model is trained using the event features and calibrated risk of multiple sample events, and decision trees are obtained through training. In such a decision tree, the path from the root node to a leaf node corresponds to a combination of features. In this way, the trained GBDT model can be used to assess the risk of an event to be analyzed, and the feature combination corresponding to the decision path in the decision tree can be used to explain the contribution and influence of each feature on the event risk, so that event analysis has a stronger logical thread and interpretability. The implementation of the above concept is described in detail below.
Figure 2 shows a flowchart of an event risk assessment method according to an embodiment. It can be understood that the method can be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. As shown in Figure 2, the risk assessment method includes at least the following steps: step 21, using a natural language processing model to extract a plurality of sample events from a content text library, the plurality of sample events including a first sample event, wherein extracting the plurality of sample events includes identifying the first sample event and its corresponding first event type, and extracting at least one first event element of the first sample event according to the first event type; step 22, in at least one knowledge graph corresponding to at least one field associated with the first sample event, acquiring at least one first associated element associated with the at least one first event element; step 23, determining the event features of the first sample event according to the first event type, the at least one first event element, and the at least one first associated element; step 24, training a gradient boosting decision tree (GBDT) model according to the event features of each of the plurality of sample events and the calibrated risk value of each sample event, to obtain a trained GBDT model; step 25, using the trained GBDT model to perform risk assessment on a second event to be analyzed.
It can be understood that, among the above steps, steps 21-24 concern the training of the GBDT model used for event evaluation, and step 25 concerns prediction and evaluation using the trained model. The execution of each step is described below with specific examples.
First, in step 21, a natural language processing model is used to extract multiple events from a content text library as sample events for model training. Depending on the field of the events to be analyzed, the content text library may include financial news, technology news, scientific research articles, and so on. It can be understood that a variety of event extraction models based on natural language processing already exist, and any of them can be used for event extraction in step 21.
Generally, the event extraction process includes at least the following steps. First, based on natural language processing, the sentences in the text are preprocessed by word segmentation, stop-word removal, and the like, to obtain a set of segmented words; optionally, entity recognition is also performed on the words in the set. Then, the trigger word of the event is determined from the set of segmented words. Generally, the type of the trigger word corresponds to the event type; once the trigger word and its type are determined, the event type can be determined. Further, in order to express the event, the argument words serving as arguments, and the role of each argument word, are also determined from the set of segmented words. By extracting and determining the trigger word and argument words, an event can be identified and its event type determined.
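The extraction flow described above can be illustrated with a deliberately simplified sketch. It is not the natural language processing model of the embodiment: the trigger lexicon `TRIGGERS`, the stop-word list, and the whitespace tokenizer are all hypothetical stand-ins, and argument roles are not assigned.

```python
# Hypothetical trigger lexicon: trigger word -> event type (illustrative only).
TRIGGERS = {"fraud": "product fraud", "leak": "information leakage"}
# Hypothetical stop-word list.
STOP_WORDS = {"the", "a", "of", "in", "was", "by"}

def extract_event(sentence):
    """Return trigger word, event type and candidate argument words, or None."""
    # Preprocess: lowercase, split on whitespace, strip punctuation, drop stop words.
    tokens = [t.strip(".,").lower() for t in sentence.split()]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    # The first token found in the lexicon is taken as the trigger word;
    # its type in the lexicon gives the event type.
    for tok in tokens:
        if tok in TRIGGERS:
            arguments = [t for t in tokens if t != tok]
            return {"trigger": tok, "event_type": TRIGGERS[tok], "arguments": arguments}
    return None

event = extract_event("The vaccine fraud of XY company")
```

A production system would replace the lexicon lookup with a trained trigger classifier and would label each argument word with a role.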
According to the embodiments of this specification, extracting each event in step 21 further includes extracting the elements of each event. The process of extracting event elements is described below taking any one of the events, hereinafter called the first sample event, as an example. It should be understood that the terms "first" and "second" herein are used only to distinguish similar objects and have no other limiting meaning.
As mentioned above, by extracting and determining trigger words and argument words from the content text library, the first sample event can be identified and its event type determined at the same time. Accordingly, the event elements of the first sample event are extracted from the content text library according to the event type of the first sample event, hereinafter called the first event type. Event elements may include event time, event location, implementing subject, event object, fact type, event level, and so on. According to one embodiment, the event elements to be extracted depend on the event type, and different event types correspond to different event elements.
For example, in one specific example, the first sample event identified from the content text library is "XY company vaccine fraud event", and the corresponding event type is "product fraud". For this event type, the event elements to be extracted may include implementing subject, product category, event level, and so on.
In another specific example, the identified first sample event is "a certain person is reported to have increased holdings of AB company stock", and the corresponding event type is "executive shareholding increase". For this event type, the event elements to be extracted may include event time, person, fact type, numerical elements (the proportion of the increase), and so on.
According to one embodiment, an element template may be provided in advance for each event type, and the element template may define the elements to be extracted under the corresponding event type. Optionally, the element template may also define the data format of each element. Thus, for the first sample event, the element template corresponding to the first event type can be determined, and the element template is used to extract the event elements of the first sample event from the content text library.
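One possible way to represent such element templates is a mapping from event type to the element names and data formats to be extracted. The sketch below is only an illustration; the dictionary `ELEMENT_TEMPLATES`, the element names, and the chosen formats are assumptions rather than part of the specification.

```python
# Hypothetical templates: event type -> {element name: data format}.
ELEMENT_TEMPLATES = {
    "product fraud": {
        "implementing_subject": str,
        "product_category": str,
        "event_level": int,
    },
    "executive shareholding increase": {
        "event_time": str,
        "person": str,
        "increase_ratio": float,
    },
}

def apply_template(event_type, raw_elements):
    """Keep only the elements the template asks for, cast to the declared format."""
    template = ELEMENT_TEMPLATES[event_type]
    return {
        name: cast(raw_elements[name])
        for name, cast in template.items()
        if name in raw_elements
    }

elements = apply_template(
    "product fraud",
    {"implementing_subject": "XY", "event_level": "3", "unrelated": 1},
)
```

Elements that the template does not list, such as `unrelated` here, are dropped, and elements missing from the raw input are simply absent from the result.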
In this way, the first sample event and its corresponding event type are identified from the content text library, and the event elements corresponding to that event type are extracted. Hereinafter, the event elements of the first sample event extracted from the content text library are called the first event elements.
In order to characterize the first sample event more comprehensively and richly, in step 22, associated elements that are associated with the first event elements are acquired from at least one knowledge graph corresponding to the field associated with the first sample event.
It can be understood that, in the prior art, knowledge graphs of various forms have already been compiled for various fields or topics. These knowledge graphs may include enterprise knowledge graphs, product knowledge graphs, person knowledge graphs, information knowledge graphs, stock knowledge graphs, fund knowledge graphs, institution knowledge graphs, and so on. In step 22, at least one knowledge graph can be selected according to the field associated with the first sample event. For example, when the first sample event is a "product fraud" event, the knowledge graphs of related fields that can be acquired include enterprise knowledge graphs, institution knowledge graphs, product knowledge graphs, and so on. When the first sample event is an "executive shareholding increase" event, the knowledge graphs of related fields that can be acquired may include person knowledge graphs, enterprise knowledge graphs, stock knowledge graphs, fund knowledge graphs, and so on.
In this way, after the knowledge graphs corresponding to the field associated with the first sample event are determined, the event elements can be expanded in these knowledge graphs to obtain the associated elements that are associated with the first event elements extracted in step 21.
Generally, a knowledge graph can be organized in the form of a node connection graph, which includes multiple nodes, each corresponding to one knowledge point; nodes whose knowledge points are related to each other are connected by connecting edges. Starting from a given node, a node that can be reached through one connecting edge is called a first-degree associated node of that node, and a node that can be reached through at least k connecting edges is called a k-degree associated node, or k-th order neighbor node.
Based on this, in step 22, a first event element extracted in step 21 can be mapped to a node in the above knowledge graph, called the first node; then, starting from the first node, the nodes in the knowledge graph that are associated with the first node are taken as associated elements of the first sample event.
Specifically, in one embodiment, the nodes directly connected to the first node, that is, its first-degree associated nodes, can be selected as associated elements. In another embodiment, the nodes associated with the first node within at most k degrees can be selected as associated elements, where the value of k can be preset as needed, for example k=3.
For example, assume that the first sample event is a "product fraud" event, and the extracted event elements include implementing subject: company, product category: medicine, and so on. For the event element "company", its first-degree associated nodes can be determined in the enterprise knowledge graph, for example "sector" and "region"; for the event element "medicine", its first-degree associated nodes can be determined in the product knowledge graph, for example "side effects". The associated nodes "sector", "region", "side effects", and so on can then be taken as associated elements of the first sample event.
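The expansion from a first node to its associated nodes within at most k degrees can be sketched as a breadth-first search. This is a minimal illustration that assumes the knowledge graph is available as a plain adjacency dictionary; the function name `related_nodes` and the sample graph (including the "index" node) are hypothetical.

```python
from collections import deque

def related_nodes(graph, start, k):
    """Return the nodes reachable from `start` through at most k edges (start excluded)."""
    seen = {start}
    queue = deque([(start, 0)])
    result = []
    while queue:
        node, depth = queue.popleft()
        if depth == k:          # do not expand beyond k degrees
            continue
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                result.append(neighbor)
                queue.append((neighbor, depth + 1))
    return result

# Hypothetical fragment of an enterprise knowledge graph (adjacency lists).
graph = {"company": ["sector", "region"], "sector": ["index"]}
first_degree = related_nodes(graph, "company", 1)
within_two = related_nodes(graph, "company", 2)
```

The adjacency dictionary is read as directed; for an undirected knowledge graph, each edge would be listed under both of its endpoints.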
In this way, the element representation of the first sample event is expanded through the knowledge graphs of related fields.
Next, in step 23, the event features of the first sample event are determined according to the event type of the first sample event, the first event elements extracted in step 21, and the associated elements obtained by expansion in step 22.
Specifically, in one embodiment, the event features of the first sample event can be represented by a feature vector F, F=<f1,f2,f3,…,fn>. The n features f1-fn in the feature vector F include the event type of the first sample event, the features corresponding to the first event elements extracted in step 21, and the features corresponding to the associated elements obtained in step 22. These features can be either discrete or continuous. In this way, comprehensive event features are constructed for the first sample event.
In addition, a calibrated risk value of the first sample event can be acquired as the label of the sample; the calibrated risk value reflects the actual degree of impact that the first sample event had historically. In one embodiment, the calibrated risk value is determined by manual annotation, that is, a person assesses the degree of impact caused by the first sample event and gives a grade or score for the degree of impact or risk. In another embodiment, existing indicator values are used as the calibrated risk value. For example, for events in the economic field, the impact of the event can be reflected by changes in the stock price of the corresponding company; accordingly, certain stock price indicators can be used as the calibrated risk value. More specifically, for example, the cumulative stock price rise or fall within 3 days after the event can be used as the calibrated risk value, or the 5-day maximum drawdown after the event can be used as the calibrated risk value.
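The two stock price indicators mentioned as possible calibrated risk values can be computed as in the following sketch. Several details are assumptions not fixed by the text: prices are taken as a list of daily closing prices, "days" are counted as trading days after the event day, and the function names are illustrative.

```python
def cumulative_return(prices, event_idx, days=3):
    """Cumulative price change from the event day's close to `days` trading days later."""
    start = prices[event_idx]
    end = prices[min(event_idx + days, len(prices) - 1)]
    return end / start - 1.0

def max_drawdown(prices, event_idx, days=5):
    """Largest peak-to-trough decline within `days` trading days after the event."""
    window = prices[event_idx: event_idx + days + 1]
    peak, worst = window[0], 0.0
    for price in window:
        peak = max(peak, price)                     # running peak so far
        worst = max(worst, (peak - price) / peak)   # decline from that peak
    return worst

# Hypothetical closing prices; the event occurs on day 0.
prices = [100.0, 90.0, 95.0, 80.0, 88.0, 92.0]
label_return = cumulative_return(prices, 0, days=3)
label_drawdown = max_drawdown(prices, 0, days=5)
```

Either value could serve as the label of a training sample; which indicator and which window length are appropriate depends on the application.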
In this way, the calibrated risk value of the first sample event is also acquired as the label of the sample. The event features and the label of the first sample event together constitute one training sample.
As mentioned above, the first sample event is any one of the aforementioned multiple sample events. Therefore, for each of the multiple sample events, the process of steps 21-23 can be used to determine the event features of the sample event, and its calibrated risk value can be acquired, so that multiple training samples are obtained.
Then, in step 24, a gradient boosting decision tree (GBDT) model is trained according to the event features of the sample events and the calibrated risk values of the sample events.
The GBDT model includes at least one decision tree, and these decision trees are obtained by training through the following process. First, according to the preceding steps, the training sample set

$$D_1 = \{(F^{(i)}, y^{(i)})\}_{i=1}^{N}$$

has been obtained, where $N$ is the number of sample events, $F^{(i)}$ is the feature vector of the i-th sample event, for example an n-dimensional vector $F = (f_1, f_2, \ldots, f_n)$, and $y^{(i)}$ is the calibrated risk value of the i-th sample event. Then, the N sample events are split by a decision tree: a split feature and a feature threshold are set at each branch node of the decision tree, and at each branch node the corresponding feature of a sample event is compared with the feature threshold so that the sample event is assigned to the corresponding child node. Through this process, the N sample events are finally divided among the leaf nodes. The score of each leaf node can then be obtained as the mean of the calibrated risk values $y^{(i)}$ of the sample events in that leaf node.
On this basis, further decision trees can be trained in the direction of decreasing residuals. That is, after the above decision tree is obtained, the residual $r^{(i)}$ of each sample event is obtained by subtracting, from the calibrated risk value of the sample event, the score of the leaf node into which the sample event falls in that decision tree, and

$$D_2 = \{(F^{(i)}, r^{(i)})\}_{i=1}^{N}$$

is taken as a new training set, which corresponds to the same set of sample events as $D_1$. In the same way as above, a further decision tree can be obtained. In this decision tree, the N sample events are likewise divided among the leaf nodes, and the score of each leaf node is the mean of the residuals of the sample events in it. Similarly, multiple decision trees can be obtained in sequence, each based on the residuals of the previous decision tree. A GBDT model including multiple decision trees is thus obtained.
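The residual-based training described above can be sketched in a few lines of plain Python. To stay short, each "tree" here is a one-split regression stump rather than a full decision tree, and no learning rate (shrinkage) is applied, although practical GBDT libraries usually add one. The function names and the toy data are assumptions for illustration only.

```python
def fit_stump(X, y):
    """Fit a one-split regression tree: choose (feature, threshold) minimizing squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[j] <= t]
            right = [y[i] for i, row in enumerate(X) if row[j] > t]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)   # leaf scores are the leaf means
    if best is None:                         # no valid split: a single leaf with the mean
        m = sum(y) / len(y)
        return (0, float("inf"), m, m)
    return best[1:]

def predict_stump(stump, row):
    j, t, ml, mr = stump
    return ml if row[j] <= t else mr

def fit_gbdt(X, y, n_trees=3):
    trees, target = [], list(y)
    for _ in range(n_trees):
        stump = fit_stump(X, target)
        trees.append(stump)
        # The next tree is trained on the residuals left by the current one.
        target = [r - predict_stump(stump, row) for r, row in zip(target, X)]
    return trees

def predict_gbdt(trees, row):
    # The model output is the sum of the leaf scores reached in each tree.
    return sum(predict_stump(s, row) for s in trees)

X = [[0.0], [1.0], [2.0], [3.0]]   # toy feature vectors
y = [1.0, 1.0, 3.0, 3.0]           # toy calibrated risk values
trees = fit_gbdt(X, y)
```

In this toy data the first stump already separates the two groups, so the residuals passed to the later stumps are all zero and they contribute nothing to the prediction.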
Figure 3 shows a decision tree obtained by training according to an embodiment. As shown in Figure 3, the trained decision tree includes branch nodes and leaf nodes. Each branch node is assigned a split feature and a feature threshold; at each branch node the split feature of a sample event is compared with the feature threshold, the sample event proceeds to the next branch node accordingly, and it is finally assigned to a leaf node. For example, the arrow from node 0 to node 1 is marked "f1≤0.5", and the arrow from node 0 to node 2 is marked "f1>0.5". Here f1 denotes feature 1, which is the split feature of node 0 (for example, "event type"), and 0.5 is the split threshold of node 0.
It can be seen that, in the trained decision tree, the path from the root node to a leaf node passes through a combination of several branch nodes, and each branch node has a corresponding split feature. The path therefore corresponds to a combination of features, which reflects the features on the basis of which a sample event is assigned to the corresponding leaf node.
Generally, each leaf node in a decision tree obtains a corresponding score through training, for example the mean of the calibrated risk values, or the mean of the residuals, of the sample events in that leaf node.
According to the embodiments of this specification, each branch node is also assigned a score, which is determined based on the scores of the leaf nodes covered by that branch node. For example, in one embodiment, the score of a branch node can be determined as the average of the scores of the leaf nodes it covers.
In another embodiment, the score of a branch node is determined based on the following formula:

$$S_p = \frac{N_{c1} S_{c1} + N_{c2} S_{c2}}{N_{c1} + N_{c2}}$$

where $S_p$ is the score of the branch node (the parent node), $S_{c1}$ and $S_{c2}$ are the scores of its child nodes c1 and c2, and $N_{c1}$ and $N_{c2}$ are the numbers of samples that fell into child nodes c1 and c2, respectively, during model training. That is, the score of the parent node is the weighted average of the scores of its two child nodes, with the weight of each child node being the number of samples that fell into it during model training. In this way, starting from the leaf nodes, the score of each branch node can be determined layer by layer upward.
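The bottom-up computation of branch node scores as sample-weighted averages of child scores can be sketched as a short recursion. The nested-dictionary tree representation and the key names `score`, `n`, and `children` are assumptions made for this example.

```python
def node_score(node):
    """Return (score, sample_count) of a node, filling in branch node scores bottom-up."""
    if "children" not in node:       # leaf: score and sample count come from training
        return node["score"], node["n"]
    (s1, n1), (s2, n2) = (node_score(child) for child in node["children"])
    node["n"] = n1 + n2
    # Parent score: average of the child scores weighted by their sample counts.
    node["score"] = (n1 * s1 + n2 * s2) / (n1 + n2)
    return node["score"], node["n"]

# Hypothetical tree: a root with one leaf child and one branch child.
tree = {
    "children": [
        {"score": 0.2, "n": 30},
        {"children": [{"score": 0.8, "n": 10}, {"score": 0.4, "n": 10}]},
    ]
}
root_score, root_count = node_score(tree)
```

Here the inner branch node receives (10*0.8 + 10*0.4)/20 = 0.6, and the root receives (30*0.2 + 20*0.6)/50 = 0.36.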
For illustration, Figure 3 marks the scores of some nodes below those nodes, where the score of a branch node is the average of the scores of the leaf nodes it covers.
In this way, each branch node is also assigned a corresponding score. The above scores can also be called the risk scores of the nodes.
On this basis, a node weight can also be assigned to each branch node through the training process. For a given branch node A, the node weight can be determined based on the node loss values of the nodes before and after branch node A is split, where the node loss value of a node is determined based on the differences between the calibrated risk values of the sample events that fall into the node and the risk score of the node.
Specifically, assume that branch node A is split into two child nodes L and R (L and R can be leaf nodes or branch nodes). The weight of node A can then be defined as:
loss value of node L + loss value of node R - loss value of node A.
Here, the loss value of node L is determined based on the differences between the calibrated risk values of the sample events that fall into node L and the risk score of node L. More specifically, the loss value can be the sum of the squares of the differences between the calibrated risk value of each sample and the risk score of the node. Alternatively, in other examples, it can be the root mean square of these differences. The loss value of node R and the loss value of node A can be obtained similarly, and the weight of node A is then obtained.
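The node weight defined above can be illustrated with the sum-of-squares variant of the loss. The sketch assumes that the risk score of a node equals the mean of the calibrated risk values of the samples falling into it, which holds for leaf means combined by sample-weighted averaging; the function names are illustrative.

```python
def node_loss(labels):
    """Sum of squared differences between the calibrated risk values and the node score."""
    score = sum(labels) / len(labels)     # assumed: node score = mean of its labels
    return sum((y - score) ** 2 for y in labels)

def node_weight(left_labels, right_labels):
    """Weight of branch node A as defined in the text: loss(L) + loss(R) - loss(A)."""
    parent_labels = left_labels + right_labels
    return node_loss(left_labels) + node_loss(right_labels) - node_loss(parent_labels)

# Hypothetical calibrated risk values of the samples falling into L and R.
weight = node_weight([1.0, 1.0], [3.0, 3.0])
```

With this sign convention a split that reduces the total loss yields a negative value, here -4.0, and its magnitude is the amount by which the split reduced the loss.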
In the above manner, each branch node is assigned a node weight. Since each branch node also corresponds to one feature, the node weight reflects, in a sense, the role that the feature played in this split, and to a certain extent the contribution of the feature to the decision path.
Based on the GBDT model obtained by the above training, risk assessment can be performed on events whose outcome is unknown. Moreover, owing to the characteristics of the decision trees in the above GBDT model, the risk assessment results can also be better explained.
The process of risk assessment using the GBDT model is described below. That is, in step 25 of Figure 2, the trained GBDT model is used to perform risk assessment on the event to be analyzed. For clarity and simplicity of description, the event to be analyzed is called the second event.
Figure 4 shows a flowchart of performing risk assessment on the second event in an embodiment, that is, the sub-steps of step 25 above. It can be understood that, in order to evaluate the second event, the event features of the second event must first be constructed, and the construction process corresponds to the way the event features of the sample events were constructed in the GBDT model training phase.
Specifically, in step 251, the event type of the second event and at least one second event element are acquired.
In one embodiment, the event type and event elements of the second event can be input directly by the user. For example, when a user wants to query or assess the degree of risk or impact of an event, the user can enter a description of the second event directly in the query interface, for example "FF company user data leakage", and then select the event type "information leakage". Next, in the element template provided according to the event type, the user enters the event elements of the event, for example implementing subject, data category, event level, and so on.
In another embodiment, text describing the second event can be input into the evaluation system, and the evaluation system performs event identification and element extraction. The input text can be, for example, news reports such as financial information, or various articles on the Internet. The process of event identification and element extraction is similar to step 21 above. That is, a natural language processing model is used to identify the second event and the second event type from the input text, and the event elements of the second event are extracted from the input text according to the second event type.
After the event elements of the second event are obtained, in step 252, associated elements that are associated with the event elements of the second event are acquired from at least one knowledge graph related to the field of the second event. Specifically, an event element of the second event can be mapped to a second node in the knowledge graph, and the nodes associated with the second node are then taken as associated elements. This process is similar to step 22 above and is not repeated here.
Then, in step 253, the event feature of the second event, hereinafter referred to as the second event feature, is determined according to the event type, the event elements, and the associated elements of the second event. The second event feature may be expressed as a feature vector V. In this way, an event feature is constructed for the second event.
Next, in step 254, the event feature V of the second event is input into the GBDT model obtained by the training described above, and the risk value of the second event is determined according to the model output.
As described above, the trained GBDT model includes at least one decision tree, and each branch node in the decision tree corresponds to a split feature and a feature threshold. After the second event feature V is input into the GBDT model, at each branch node i of the decision tree, the value in the feature vector V of the feature corresponding to that node's split feature is compared with the feature threshold. According to the comparison result, the second event is assigned to a node at the next level, and this continues until it is assigned to a leaf node.
FIG. 5 shows the process of dividing the second event in a decision tree in one embodiment; the decision tree is the same as the one shown in FIG. 3. Specifically, assume that the split feature at node 0 is f1 "event type" with a feature threshold of 0.5, and that the split feature at node 2 is f3 "implementing subject" with a feature threshold of 0.6. The event feature vector V of the second event is input into the decision tree. At node 0, assume that the feature value corresponding to "event type" in the second event feature V is 0.8, which is greater than the feature threshold 0.5 of that split feature, so the second event is assigned from node 0 to node 2. Next, at node 2, the split feature "implementing subject" is evaluated. Assume that the feature value of "implementing subject" in the second event feature vector V is 0.2, which is less than the feature threshold 0.6 of that split feature, so the second event is then assigned to node 5. This continues until the second event is assigned to leaf node 16.
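The walk through FIG. 5 can be sketched as below. Only the splits at nodes 0 and 2, the path 0, 2, 5, 11, 16, and the score of leaf node 16 come from the text; the splits at nodes 5 and 11, the numbering of the other children, and the handling of a value equal to the threshold are assumptions made for illustration.

```python
# node id -> (split feature, threshold, child if value < threshold, child otherwise)
BRANCHES = {
    0: ("event_type", 0.5, 1, 2),
    2: ("implementing_subject", 0.6, 5, 6),
    5: ("event_level", 0.5, 11, 12),     # hypothetical split
    11: ("data_category", 0.5, 15, 16),  # hypothetical split
}
# Only the leaf score stated in the text; nodes absent from BRANCHES act as leaves here.
LEAF_SCORES = {16: 0.062}


def decision_path(features, root=0):
    """Follow the splits from the root and return the list of visited node ids."""
    path = [root]
    node = root
    while node in BRANCHES:
        name, threshold, low_child, high_child = BRANCHES[node]
        # Assumption: a value equal to the threshold goes to the high child.
        node = low_child if features[name] < threshold else high_child
        path.append(node)
    return path


event = {
    "event_type": 0.8,
    "implementing_subject": 0.2,
    "event_level": 0.1,    # hypothetical value
    "data_category": 0.9,  # hypothetical value
}
path = decision_path(event)
risk = LEAF_SCORES[path[-1]]
print(path, risk)
```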
As described above, each leaf node obtains a corresponding score through training. The GBDT model can therefore output the score of the leaf node to which the second event is assigned, and in step 254 this leaf node score output by the model can be taken as the risk value of the second event. For example, the score 0.062 of leaf node 16 in FIG. 5 can be taken as the risk value of the second event. When the GBDT model includes multiple decision trees, the second event is assigned to a corresponding leaf node in each decision tree. In this case, the GBDT model can determine the score of the leaf node in which the second event falls in each decision tree, and output the sum of those scores, that is, the total score. The total score output by the GBDT model can then be taken as the risk value of the second event.
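The multi-tree case described above can be sketched as follows. The nested-dictionary tree layout and the second tree with its leaf scores are hypothetical; the sketch sums leaf scores exactly as the text describes and does not model any base score or learning-rate scaling that a particular GBDT library might apply.

```python
def leaf_score(tree, features):
    """Descend one tree (nested dicts) and return the score of the leaf that is reached."""
    while "score" not in tree:
        branch = "low" if features[tree["feature"]] < tree["threshold"] else "high"
        tree = tree[branch]
    return tree["score"]


def risk_value(trees, features):
    """Total score: the sum of the reached leaf scores over all trees."""
    return sum(leaf_score(tree, features) for tree in trees)


# Two tiny hypothetical trees with one split each.
trees = [
    {"feature": "event_type", "threshold": 0.5,
     "low": {"score": 0.010}, "high": {"score": 0.062}},
    {"feature": "implementing_subject", "threshold": 0.6,
     "low": {"score": 0.020}, "high": {"score": -0.030}},
]
event = {"event_type": 0.8, "implementing_subject": 0.2}
total = risk_value(trees, event)
print(total)
```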
In the above manner, by inputting the event feature of the second event into the trained GBDT model, the risk value of the second event can be determined according to the model output, thereby performing a quantitative risk assessment of the second event.
In addition, in one embodiment, the risk assessment of the second event in step 25 may further include, after the risk value of the second event is given in step 254, performing feature interpretation of that risk value.
FIG. 6 shows the flow of steps for feature interpretation in one embodiment. As shown in FIG. 6, in step 61, the decision path of the second event in the decision tree is determined according to the event feature of the second event. As described above, to give the risk value of the second event, at each branch node of the decision tree the second event is assigned to a child node according to the value of the corresponding feature, until a leaf node is reached. The path traversed in the decision tree from the root node to the leaf node to which the second event is assigned is the decision path.
For example, as shown in FIG. 5, the second event is finally assigned to leaf node 16; the path from root node 0 through node 2, node 5, and node 11 to node 16 is the decision path of the second event.
It can be understood that when the GBDT model contains multiple decision trees, a corresponding decision path can be determined in each decision tree.
Next, in step 62, the branch nodes through which the decision path passes are determined, and the feature and node weight corresponding to each branch node are obtained.
It can be understood that the decision path starts at the root node of the decision tree and ends at the leaf node to which the second event is assigned, and that the nodes on the path other than the leaf node are branch nodes. In this way, the branch nodes included in the decision path can be determined. When there are multiple decision paths, the branch nodes included in all of the paths are determined.
As described above, according to the embodiments of this specification, each branch node in the decision tree is assigned a node weight. Thus, the node weight of each branch node on the decision path can be determined.
Then, in step 63, for a given feature included in the event feature of the second event, referred to as the first feature, the feature weight of the first feature is determined according to the node weight of at least one branch node, among the branch nodes above, that corresponds to the first feature. This feature weight serves as the importance of the first feature to the risk value.
It should be understood that each branch node in a decision tree corresponds to one feature, but one feature may appear at multiple branch nodes across multiple decision trees, or even at multiple branch nodes of the same decision tree. Therefore, for the first feature, the at least one branch node corresponding to the first feature can first be identified among the branch nodes on the decision path, the node weight of that at least one branch node is obtained, and the feature weight of the feature is determined accordingly. Specifically, in one example, the feature weight of the first feature may be the average of the node weights of the at least one branch node corresponding to the first feature. The feature weight obtained in this way reflects the contribution or importance of the first feature to the risk value of the second event. Likewise, the feature weight of each feature in the event feature of the second event can be obtained as its contribution or importance to the risk value of the second event.
In one embodiment, the features can be ranked according to their feature weights, thereby indicating the order of importance of the features that affect the risk value of the second event.
For example, in one specific example, the second event is "historical financial fraud by a listed company". With the method of the above embodiments, it can be determined that the features affecting the risk value of this event are, in order of importance: "penalty type", "fact type", "stock performance", and "penalizing organization".
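The averaging and ranking of steps 62 and 63 can be sketched as follows. The feature names and node weights are hypothetical values chosen for illustration; the averaging rule is the one given as an example in the text.

```python
from collections import defaultdict


def feature_weights(path_nodes):
    """Average the node weights of the branch nodes that split on each feature.

    path_nodes: (feature name, node weight) pairs for the branch nodes on the
    decision path(s), possibly collected from several trees.
    """
    buckets = defaultdict(list)
    for feature, weight in path_nodes:
        buckets[feature].append(weight)
    return {feature: sum(ws) / len(ws) for feature, ws in buckets.items()}


# Hypothetical branch nodes; "event_type" is split on at two nodes.
nodes_on_paths = [
    ("event_type", 0.5),
    ("implementing_subject", 0.25),
    ("event_type", 0.3),
    ("event_level", 0.1),
]
weights = feature_weights(nodes_on_paths)
# Rank features from most to least important.
ranking = sorted(weights, key=weights.get, reverse=True)
print(ranking)
```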
In short, in a decision tree of the GBDT model, the second event is assigned to a leaf node along the decision path, and the risk value of the second event is determined from the score of that leaf node. In addition, the decision path passes through multiple branch nodes, each corresponding to one feature, so the decision path corresponds to the combination of the split features of the branch nodes it passes through. The node weight of each branch node measures the contribution or importance of the corresponding feature to the final risk value; that is, it provides a feature interpretation of the risk value. Thus, the above process not only determines the risk value of the second event through the GBDT model, but also interprets that risk value in terms of features, indicating how large a role each feature played in producing it.
The above describes the process in which, for a second event to be evaluated, the event elements are expanded through the knowledge graph to obtain a comprehensive event feature of the second event, and the event feature is input into the trained GBDT model to obtain the risk value of the second event. On this basis, the parameters of the GBDT model can also be used to provide a feature interpretation of the resulting risk value. The above evaluation process applies when the relevant elements of the second event can be obtained and the event feature can therefore be constructed.
According to one implementation, the GBDT model trained as above can also be applied to conditional prediction for events whose complete event feature cannot be obtained. That is, when only a small portion of an event's elements is available, the model can give an assessment of how the event's risk may develop under different conditions or circumstances.
For example, suppose the possible impact of the event "vaccine fraud by a certain company" is to be assessed. Assume that only the event type, "product fraud", and the implementing subject, a certain company, can be obtained, and that the other elements are difficult to obtain. In this case, the GBDT model trained as above can still be used to assess how the risk of the event may develop under different circumstances, for example, under what conditions the event would have a very large public opinion risk impact, and under what conditions its impact would be minimized. The evaluation process for such a second event is described below.
FIG. 7 shows a flowchart of the steps for evaluating the second event according to one embodiment.
As shown in FIG. 7, first, in step 71, at least one event element of the second event is obtained. As mentioned above, this flow applies when the elements of the second event are incomplete. The event elements obtained in step 71 may therefore be few and incomplete, for example only the implementing subject, or even only the event type. For example, for the "vaccine fraud by a certain company" event above, assume that only the event type, "product fraud", and the implementing subject, a certain company, can be obtained.
Next, in step 72, the second event is divided in the decision tree according to the at least one event element, and a subtree of the decision tree is determined based on the stop node of the division.
It can be understood that, because the event elements and hence the event feature are incomplete, a complete decision path from the root node to a leaf node often cannot be obtained in the decision tree. In this case, the second event can be divided in the decision tree according to the elements that are available, the stop node at which division halts because it cannot continue is determined, and a subtree of the decision tree is determined based on the stop node. The subtree is the node region covered by the stop node.
This is described with reference to the decision tree diagram of FIG. 3. First, at node 0, the split feature "event type" is evaluated. Assume that the event type feature value of the second event "vaccine fraud by a certain company" is 0.3, which is less than the feature threshold 0.5, so the second event is assigned to node 1. The split feature at node 1 is f2 "penalty type". However, as described above, this feature is unavailable because the elements of the second event are incomplete, so the second event cannot be divided further, and node 1 is the stop node. The node region covered by node 1 is the subtree described above.
Then, in step 73, a first leaf node in the subtree that satisfies a predetermined condition is determined, together with the conditional path from the root node to the first leaf node.
The predetermined condition can be set according to the needs of the evaluation; for example, it may be maximum risk, minimum risk, a risk value meeting a certain threshold, and so on.
If the predetermined condition is maximum risk, the leaf node with the largest score among the leaf nodes contained in the subtree is selected as the first leaf node. The path from the root node to that leaf node is the conditional path.
Continuing the above example with reference to FIG. 3, the stop node is node 1, and the resulting subtree contains leaf nodes 7, 8, 9, and 10. Assuming that node 8 has the largest score, node 8 can be determined as the leaf node under the maximum-risk condition, and the path from node 0 to node 8, that is, the path containing nodes 0, 1, 3, and 8, is taken as the conditional path.
For other predetermined conditions, the corresponding leaf node is selected as the first leaf node according to the scores of the leaf nodes.
Next, in step 74, the feature combination corresponding to the branch nodes included in the conditional path is obtained, and the feature combination is taken as the influencing features of the second event under the predetermined condition.
It can be understood that the conditional path corresponds to the path along which the second event would be divided if the assumed predetermined condition occurred. The feature combination corresponding to the branch nodes on that path therefore consists of the features that influence the second event so that it satisfies the predetermined condition. For example, if the predetermined condition is maximum risk, the feature combination corresponding to the conditional path consists of the influencing features that lead the second event to its maximum risk. In this way, conditional prediction and interpretation are performed for the second event, giving the different influencing features under different conditions and helping to predict how the event will subsequently develop.
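Steps 72 to 74 can be sketched as below for the maximum-risk condition, using the node numbering of the FIG. 3 example. The splits at nodes 3 and 4, the threshold at node 1, and all leaf scores are hypothetical; the subtree under node 2 is omitted, so node 2 acts as a leaf in this sketch.

```python
# node id -> (split feature, threshold, child if value < threshold, child otherwise)
BRANCHES = {
    0: ("event_type", 0.5, 1, 2),
    1: ("penalty_type", 0.5, 3, 4),        # threshold is hypothetical
    3: ("fact_type", 0.5, 7, 8),           # hypothetical split
    4: ("stock_performance", 0.5, 9, 10),  # hypothetical split
}
LEAF_SCORES = {7: 0.01, 8: 0.09, 9: 0.03, 10: 0.05}  # hypothetical scores
PARENT = {child: node
          for node, (_, _, low, high) in BRANCHES.items()
          for child in (low, high)}


def stop_node(features, root=0):
    """Descend until a leaf is reached or the required split feature is unavailable."""
    node = root
    while node in BRANCHES:
        name, threshold, low, high = BRANCHES[node]
        if name not in features:
            break  # cannot divide further: this is the stop node
        node = low if features[name] < threshold else high
    return node


def leaves_under(node):
    """All leaf nodes in the subtree covered by the given node."""
    if node not in BRANCHES:
        return [node]
    _, _, low, high = BRANCHES[node]
    return leaves_under(low) + leaves_under(high)


def path_from_root(node):
    """Node ids from the root down to the given node."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


stop = stop_node({"event_type": 0.3})                 # only the event type is known
first_leaf = max(leaves_under(stop), key=LEAF_SCORES.get)  # maximum-risk condition
conditional_path = path_from_root(first_leaf)
influencing_features = [BRANCHES[n][0] for n in conditional_path[:-1]]
print(stop, first_leaf, conditional_path, influencing_features)
```

Other predetermined conditions would change only the selection of `first_leaf`, for example `min` for minimum risk or a filter on the score for a threshold condition.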
Further, according to one implementation, the following information can also be provided as part of the assessment of the second event. For example, in one embodiment, the score of the first leaf node can be provided as the risk value of the second event under the predetermined condition. For example, when the predetermined condition is maximum risk, the score of node 8 can be provided as the possible maximum risk value of the second event.
In one embodiment, the importance of each feature in the feature combination can be determined according to the node weights of the branch nodes on the conditional path. This process is similar to step 63 described above.
In the above manner, a second event with few elements and an incomplete feature can be evaluated, and the feature conditions that the second event would satisfy for different risk outcomes can be given, thereby making better use of the characteristics of the GBDT model to interpret and predict the future risk of the event.
According to an embodiment of another aspect, an apparatus for event risk assessment is provided. The apparatus can be deployed in any device, platform, or device cluster with computing and processing capabilities. FIG. 8 shows a schematic block diagram of an event evaluation apparatus according to one embodiment. As shown in FIG. 8, the evaluation apparatus 800 includes:
an extraction unit 81, configured to extract a plurality of sample events from a content text library by using a natural language processing model, the plurality of sample events including a first sample event, wherein extracting the plurality of sample events includes identifying the first sample event and its corresponding first event type, and extracting at least one first event element of the first sample event according to the first event type;
an association unit 82, configured to obtain, from at least one knowledge graph corresponding to at least one field associated with the first sample event, at least one first associated element associated with the at least one first event element;
a determination unit 83, configured to determine the event feature of the first sample event according to the first event type, the at least one first event element, and the at least one first associated element;
a training unit 84, configured to train a gradient boosting decision tree (GBDT) model according to the event feature of each sample event in the plurality of sample events and the calibrated risk value of each sample event, to obtain a trained GBDT model;
an evaluation unit 85, configured to perform risk assessment on a second event to be analyzed by using the trained GBDT model.
In one embodiment, the extraction unit 81 is specifically configured to: determine a first template corresponding to the first event type; and use the first template to extract the at least one first event element of the first sample event from the content text library.
According to one embodiment, the first event element includes at least one of the following: event time, event location, implementing subject, event object, fact type, and event level.
In one embodiment, the association unit 82 is specifically configured to:
map the at least one first event element to a first node in the at least one knowledge graph; and take the nodes in the at least one knowledge graph that are directly connected to the first node as the at least one first associated element.
According to one embodiment, the knowledge graph may include one or more of the following: an enterprise knowledge graph, a product knowledge graph, a person knowledge graph, an information knowledge graph, a stock knowledge graph, a fund knowledge graph, and an institution knowledge graph.
According to one implementation, the evaluation unit 85 includes:
an element acquisition module 851, configured to obtain the event type of the second event and at least one second event element;
an element association module 852, configured to obtain, from the at least one knowledge graph, at least one second associated element associated with the at least one second event element;
a first determination module 853, configured to determine the event feature of the second event according to the event type of the second event, the at least one second event element, and the at least one second associated element;
a second determination module 854, configured to input the event feature of the second event into the trained GBDT model and determine the risk value of the second event according to the model output.
Specifically, in one embodiment, the element acquisition module 851 is configured to:
identify the second event and the second event type from input text; and
extract the at least one second event element from the input text according to the second event type.
In another embodiment, the element acquisition module 851 is configured to:
receive the input second event and the at least one second event element.
According to one embodiment, the trained GBDT model includes at least one decision tree, the decision tree includes branch nodes and leaf nodes, and each branch node corresponds to one feature and has a risk score and a node weight obtained through training. The node weight is determined based on the respective node loss values of the branch node and of the nodes resulting from its split, and each node loss value is determined based on the difference between the calibrated risk values of the sample events falling into the node and the risk score of the node.
Correspondingly, in one embodiment, the evaluation unit 85 further includes (not shown):
a decision path determination module, configured to determine the decision path of the second event in the decision tree according to the event feature of the second event;
a node weight determination module, configured to determine the branch nodes through which the decision path passes, and obtain the feature and node weight corresponding to each branch node;
an importance determination module, configured to, for a first feature included in the event feature of the second event, determine the feature weight of the first feature according to the node weight of at least one branch node, among the branch nodes, that corresponds to the first feature, the feature weight serving as the importance of the first feature to the risk value.
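This passage states only that the node weight is based on the loss values of a branch node and of the nodes produced by its split, and that each loss value is based on the difference between calibrated risk values and the node's risk score. One plausible reading, given purely as an assumption, is a squared-error loss with the weight taken as the loss reduction achieved by the split. The function names and sample values below are hypothetical.

```python
def node_loss(calibrated_values, node_score):
    """Assumed loss: sum of squared differences between the calibrated risk
    values of the samples in a node and that node's risk score."""
    return sum((value - node_score) ** 2 for value in calibrated_values)


def node_weight(parent, left, right):
    """Assumed weight: loss of the branch node minus the losses of its two children.

    Each argument is a pair (calibrated risk values in the node, node risk score).
    """
    return node_loss(*parent) - (node_loss(*left) + node_loss(*right))


# Hypothetical samples: the split separates low-risk from high-risk events.
parent = ([0.1, 0.2, 0.8, 0.9], 0.5)
left = ([0.1, 0.2], 0.15)
right = ([0.8, 0.9], 0.85)
weight = node_weight(parent, left, right)
print(weight)
```

Under this reading, a split that separates high-risk from low-risk samples cleanly receives a larger weight than one that leaves the children as mixed as the parent.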
According to another implementation, the evaluation unit 85 includes (not shown):
an element acquisition module, configured to obtain at least one second event element of the second event;
a subtree determination module, configured to divide the second event in the decision tree according to the at least one second event element, and determine a subtree of the decision tree based on the stop node of the division;
a conditional path determination module, configured to determine a first leaf node in the subtree that satisfies a predetermined condition, and the conditional path from the root node to the first leaf node;
a feature determination module, configured to obtain the feature combination corresponding to the branch nodes included in the conditional path, and take the feature combination as the influencing features of the second event under the predetermined condition.
In one embodiment, each leaf node in the decision tree has a risk score obtained through training, and each branch node corresponds to one feature and has a risk score and a node weight obtained through training. The node weight is determined based on the respective node loss values of the branch node and of the nodes resulting from its split, and each node loss value is determined based on the difference between the calibrated risk values of the sample events falling into the node and the risk score of the node.
Correspondingly, the evaluation unit further includes one or more of the following:
a third determination module, configured to determine a first risk score corresponding to the first leaf node as the risk value of the second event under the predetermined condition;
a fourth determination module, configured to determine, according to the node weight of each branch node on the conditional path, the importance of each feature in the feature combination that corresponds to each of the branch nodes.
With the above apparatus, the GBDT model is trained and used, so that event risk can be effectively assessed and interpreted.
According to an embodiment of another aspect, a computer-readable storage medium is also provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method described in conjunction with FIG. 2.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor, wherein executable code is stored in the memory, and when the processor executes the executable code, the method described in conjunction with FIG. 2 and FIG. 4 is implemented.
Those skilled in the art will appreciate that, in one or more of the above examples, the functions described in the present invention can be implemented by hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention shall fall within the scope of protection of the present invention.

Claims (24)

  1. A computer-executed event risk assessment method, comprising:
    extracting a plurality of sample events from a content text library by using a natural language processing model, the plurality of sample events including a first sample event, wherein extracting the plurality of sample events comprises identifying the first sample event and its corresponding first event type, and extracting at least one first event element of the first sample event according to the first event type;
    acquiring, in at least one knowledge graph corresponding to at least one domain associated with the first sample event, at least one first associated element associated with the at least one first event element;
    determining an event feature of the first sample event according to the first event type, the at least one first event element, and the at least one first associated element;
    training a gradient boosting decision tree (GBDT) model according to the event feature of each sample event in the plurality of sample events and the labeled risk value of each sample event, to obtain a trained GBDT model;
    performing risk assessment on a second event to be analyzed by using the trained GBDT model.
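As an illustration of the training and assessment steps of claim 1, the following is a minimal sketch of gradient boosting under squared loss, using depth-1 regression trees (stumps) as the base learners. The function names, the stump learner, the learning rate, and the numeric feature rows are all assumptions made for this sketch; the claims do not specify a particular GBDT implementation.

```python
from statistics import mean


def fit_stump(X, residuals):
    """Find the single-feature threshold split that minimizes squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue  # a split must send samples to both sides
            lm, rm = mean(left), mean(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no valid split: predict the mean residual everywhere
        m = mean(residuals)
        return (0, float("inf"), m, m)
    return best[1:]


def stump_predict(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def train_gbdt(X, y, n_trees=20, learning_rate=0.5):
    """X: event feature rows; y: labeled risk values of the sample events."""
    base = mean(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        # Under squared loss, the negative gradient is the plain residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def predict_risk(model, row):
    """Risk value of a new (second) event given its feature row."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)
```

A model would be obtained with `train_gbdt(X, y)` and then applied to the feature row of the event to be analyzed with `predict_risk(model, row)`. A production system would more likely rely on an established GBDT library than on a hand-written learner like this one.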
  2. The method according to claim 1, wherein extracting at least one first event element of the first sample event according to the first event type comprises:
    determining a first template corresponding to the first event type;
    extracting, by using the first template, the at least one first event element of the first sample event from the content text library.
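The template-based extraction of claim 2 can be pictured as one pattern per event type whose named slots are the event elements. The sketch below assumes regular-expression templates; the `TEMPLATES` table, the `penalty` event type, and the sentence pattern are hypothetical and serve only to show the lookup-then-extract flow.

```python
import re

# Hypothetical templates: one regex with named groups per event type.
TEMPLATES = {
    "penalty": re.compile(
        r"On (?P<event_time>\d{4}-\d{2}-\d{2}), (?P<subject>[\w ]+?) fined "
        r"(?P<object>[\w ]+?) in (?P<location>[\w ]+)\."
    ),
}


def extract_elements(event_type, text):
    """Look up the template for the event type, then pull the event elements."""
    template = TEMPLATES.get(event_type)
    if template is None:
        return {}  # no template registered for this event type
    match = template.search(text)
    return match.groupdict() if match else {}
```

Calling `extract_elements("penalty", text)` returns a dictionary keyed by element name (time, subject, object, location), or an empty dictionary when the type has no template or the text does not match.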
  3. The method according to claim 1 or 2, wherein the at least one first event element includes at least one of the following: event time, event location, implementing subject, event object, fact type, and event level.
  4. The method according to claim 1, wherein acquiring at least one associated element associated with the at least one event element comprises:
    mapping the at least one first event element to a first node in the at least one knowledge graph;
    using the nodes in the at least one knowledge graph that are directly connected to the first node as the at least one first associated element.
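The two steps of claim 4 (map an event element to a node, then take its directly connected nodes) amount to a one-hop neighbor lookup. The sketch below assumes an undirected adjacency-set graph and exact-name matching for the mapping step; the `KnowledgeGraph` class and `associated_elements` function are hypothetical names, and real entity linking would be considerably more involved.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Undirected graph stored as adjacency sets (an assumption of this sketch)."""

    def __init__(self, edges):
        self._adj = defaultdict(set)
        for a, b in edges:
            self._adj[a].add(b)
            self._adj[b].add(a)

    def find_node(self, element):
        # Mapping step: exact name match stands in for entity linking.
        return element if element in self._adj else None

    def neighbors(self, node):
        # Nodes directly connected to the given node.
        return sorted(self._adj.get(node, ()))


def associated_elements(graphs, event_elements):
    """Collect one-hop neighbors of every event element across all graphs."""
    found = set()
    for element in event_elements:
        for graph in graphs:
            node = graph.find_node(element)
            if node is not None:
                found.update(graph.neighbors(node))
    # Do not report the event elements themselves as associated elements.
    return sorted(found - set(event_elements))
```

Passing several graphs (for example an enterprise graph and a product graph) mirrors the "at least one knowledge graph" wording: each graph contributes the neighbors it knows about.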
  5. The method according to claim 1 or 4, wherein the at least one knowledge graph includes an enterprise knowledge graph, a product knowledge graph, a person knowledge graph, an information knowledge graph, a stock knowledge graph, a fund knowledge graph, and an institution knowledge graph.
  6. The method according to claim 1, wherein performing risk assessment on the second event to be analyzed by using the trained GBDT model comprises:
    acquiring an event type of the second event and at least one second event element;
    acquiring, in the at least one knowledge graph, at least one second associated element associated with the at least one second event element;
    determining an event feature of the second event according to the event type of the second event, the at least one second event element, and the at least one second associated element;
    inputting the event feature of the second event into the trained GBDT model, and determining a risk value of the second event according to the model output.
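Claim 6 combines the event type, the event elements, and the associated elements into an event feature before the model is applied, but does not say how they are encoded. One possible encoding, shown purely as an assumption, is a multi-hot vector over tokens seen in the sample events; the dictionary layout of an event and the function names below are hypothetical.

```python
def tokens_of(event):
    """Flatten an event into (kind, value) tokens."""
    tokens = [("type", event["event_type"])]
    tokens += [("element", f"{k}={v}") for k, v in sorted(event["elements"].items())]
    tokens += [("associated", v) for v in sorted(event["associated"])]
    return tokens


def build_vocabulary(samples):
    """Assign a column index to every token seen in the sample events."""
    vocab = {}
    for sample in samples:
        for token in tokens_of(sample):
            vocab.setdefault(token, len(vocab))
    return vocab


def event_features(event, vocab):
    """Multi-hot feature vector; tokens unseen during training are ignored."""
    vector = [0] * len(vocab)
    for token in tokens_of(event):
        index = vocab.get(token)
        if index is not None:
            vector[index] = 1
    return vector
```

The same vocabulary must be used for the sample events and for the event being assessed, so that each column of the feature vector means the same thing at training and at prediction time.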
  7. The method according to claim 6, wherein acquiring the event type of the second event and the at least one second event element comprises:
    identifying the second event and a second event type from input text;
    extracting the at least one second event element from the input text according to the second event type.
  8. The method according to claim 6, wherein acquiring the event type of the second event and the at least one second event element comprises:
    receiving the second event and the at least one second event element as input.
  9. The method according to claim 6, wherein the trained GBDT model includes at least one decision tree, the decision tree includes branch nodes and leaf nodes, and each branch node corresponds to one feature and has a trained risk score and a node weight, wherein the node weight is determined based on the respective node loss values of the branch node and of the nodes it splits into, and each node loss value is determined based on the difference between the labeled risk values of the sample events falling into the node and the risk score of that node;
    performing risk assessment on the second event to be analyzed by using the trained GBDT model further comprises:
    determining a decision path of the second event in the decision tree according to the event feature of the second event;
    determining the branch nodes through which the decision path passes, and acquiring the feature and the node weight corresponding to each of those branch nodes;
    for a first feature included in the event feature of the second event, determining a feature weight of the first feature according to the node weight of at least one branch node, among those branch nodes, that corresponds to the first feature, the feature weight serving as the importance of the first feature to the risk value.
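The per-event explanation in claim 9 can be sketched as follows. The claim only says the node weight is "based on" the loss values of a branch node and its split nodes; this sketch assumes a sum-of-squared-differences node loss and a node weight equal to the loss reduction of the split, and it sums node weights per feature along the decision path. The `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


def node_loss(labeled_risks, score):
    """Assumed loss: squared difference between labeled risks and the node score."""
    return sum((r - score) ** 2 for r in labeled_risks)


@dataclass
class Node:
    score: float                    # trained risk score of this node
    loss: float                     # node loss, e.g. computed with node_loss()
    feature: Optional[str] = None   # None marks a leaf node
    threshold: float = 0.0
    left: Optional["Node"] = None   # taken when feature value <= threshold
    right: Optional["Node"] = None

    @property
    def weight(self):
        # Assumed node weight: loss reduction achieved by this split.
        return self.loss - self.left.loss - self.right.loss


def decision_path(root, features):
    """Return the branch nodes visited for this event and the leaf reached."""
    path, node = [], root
    while node.feature is not None:
        path.append(node)
        node = node.left if features[node.feature] <= node.threshold else node.right
    return path, node


def feature_weights(path):
    """Sum node weights per feature along the decision path."""
    weights = {}
    for node in path:
        weights[node.feature] = weights.get(node.feature, 0.0) + node.weight
    return weights
```

Because only the nodes on the event's own decision path contribute, the resulting weights explain one specific prediction rather than the model as a whole. With several trees, the per-tree weights would have to be combined, for example by summation, which is again an assumption.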
  10. The method according to claim 1, wherein the trained GBDT model includes at least one decision tree, and the decision tree includes branch nodes and leaf nodes;
    performing risk assessment on the second event to be analyzed by using the trained GBDT model comprises:
    acquiring at least one second event element of the second event;
    partitioning the second event in the decision tree according to the at least one second event element, and determining a subtree of the decision tree based on the node at which the partitioning stops;
    determining a first leaf node in the subtree that satisfies a predetermined condition, and a conditional path from the root node to the first leaf node;
    acquiring the feature combination corresponding to the branch nodes included in the conditional path, and using the feature combination as the influencing features of the second event under the predetermined condition.
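Claim 10 handles an event for which only some features are known: the event is routed down the tree until a split needs an unknown feature, and the subtree below that stop node is then searched for a leaf meeting a predetermined condition. The sketch below assumes the condition is "lowest risk score" or "highest risk score", expressed by passing Python's built-in `min` or `max`; the `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    score: float                    # trained risk score
    feature: Optional[str] = None   # None marks a leaf node
    threshold: float = 0.0
    left: Optional["Node"] = None   # taken when feature value <= threshold
    right: Optional["Node"] = None


def stop_node(root, known):
    """Descend while the split feature is known; return the path and stop node."""
    path, node = [], root
    while node.feature is not None and node.feature in known:
        path.append(node)
        node = node.left if known[node.feature] <= node.threshold else node.right
    return path, node


def leaf_paths(node, prefix=()):
    """Yield (branch nodes from `node` downward, leaf) for every leaf below `node`."""
    if node.feature is None:
        yield list(prefix), node
        return
    yield from leaf_paths(node.left, prefix + (node,))
    yield from leaf_paths(node.right, prefix + (node,))


def conditional_path(root, known, condition=min):
    """Pick the leaf in the subtree chosen by `condition` (min or max of score)."""
    head, stop = stop_node(root, known)
    tail, leaf = condition(leaf_paths(stop), key=lambda item: item[1].score)
    # Feature combination along the root-to-leaf path, without repeats.
    features = list(dict.fromkeys(n.feature for n in head + tail))
    return features, leaf
```

With `condition=max`, the returned feature combination names the not-yet-known features that would lead this event to its highest-risk outcome, and the returned leaf's score is the risk value under that condition.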
  11. The method according to claim 10, wherein each leaf node in the decision tree has a trained risk score, and each branch node corresponds to one feature and has a trained risk score and a node weight, wherein the node weight is determined based on the respective node loss values of the branch node and of the nodes it splits into, and each node loss value is determined based on the difference between the labeled risk values of the sample events falling into the node and the risk score of that node;
    performing risk assessment on the second event to be analyzed by using the trained GBDT model further comprises one or more of the following:
    determining a first risk score corresponding to the first leaf node as the risk value of the second event under the predetermined condition;
    determining, according to the node weights of the branch nodes in the conditional path, the importance of each feature in the feature combination that corresponds to those branch nodes.
  12. A computer-executed event risk assessment device, comprising:
    an extraction unit, configured to extract a plurality of sample events from a content text library by using a natural language processing model, the plurality of sample events including a first sample event, wherein extracting the plurality of sample events comprises identifying the first sample event and its corresponding first event type, and extracting at least one first event element of the first sample event according to the first event type;
    an association unit, configured to acquire, in at least one knowledge graph corresponding to at least one domain associated with the first sample event, at least one first associated element associated with the at least one first event element;
    a determining unit, configured to determine an event feature of the first sample event according to the first event type, the at least one first event element, and the at least one first associated element;
    a training unit, configured to train a gradient boosting decision tree (GBDT) model according to the event feature of each sample event in the plurality of sample events and the labeled risk value of each sample event, to obtain a trained GBDT model;
    an evaluation unit, configured to perform risk assessment on a second event to be analyzed by using the trained GBDT model.
  13. The device according to claim 12, wherein the extraction unit is configured to:
    determine a first template corresponding to the first event type;
    extract, by using the first template, the at least one first event element of the first sample event from the content text library.
  14. The device according to claim 12 or 13, wherein the at least one first event element includes at least one of the following: event time, event location, implementing subject, event object, fact type, and event level.
  15. The device according to claim 12, wherein the association unit is configured to:
    map the at least one first event element to a first node in the at least one knowledge graph;
    use the nodes in the at least one knowledge graph that are directly connected to the first node as the at least one first associated element.
  16. The device according to claim 12 or 15, wherein the at least one knowledge graph includes an enterprise knowledge graph, a product knowledge graph, a person knowledge graph, an information knowledge graph, a stock knowledge graph, a fund knowledge graph, and an institution knowledge graph.
  17. The device according to claim 12, wherein the evaluation unit comprises:
    an element acquisition module, configured to acquire an event type of the second event and at least one second event element;
    an element association module, configured to acquire, in the at least one knowledge graph, at least one second associated element associated with the at least one second event element;
    a first determining module, configured to determine an event feature of the second event according to the event type of the second event, the at least one second event element, and the at least one second associated element;
    a second determining module, configured to input the event feature of the second event into the trained GBDT model, and determine a risk value of the second event according to the model output.
  18. The device according to claim 17, wherein the element acquisition module is configured to:
    identify the second event and a second event type from input text;
    extract the at least one second event element from the input text according to the second event type.
  19. The device according to claim 17, wherein the element acquisition module is configured to:
    receive the second event and the at least one second event element as input.
  20. The device according to claim 17, wherein the trained GBDT model includes at least one decision tree, the decision tree includes branch nodes and leaf nodes, and each branch node corresponds to one feature and has a trained risk score and a node weight, wherein the node weight is determined based on the respective node loss values of the branch node and of the nodes it splits into, and each node loss value is determined based on the difference between the labeled risk values of the sample events falling into the node and the risk score of that node;
    the evaluation unit further comprises:
    a decision path determining module, configured to determine a decision path of the second event in the decision tree according to the event feature of the second event;
    a node weight determining module, configured to determine the branch nodes through which the decision path passes, and acquire the feature and the node weight corresponding to each of those branch nodes;
    an importance determining module, configured to, for a first feature included in the event feature of the second event, determine a feature weight of the first feature according to the node weight of at least one branch node, among those branch nodes, that corresponds to the first feature, the feature weight serving as the importance of the first feature to the risk value.
  21. The device according to claim 12, wherein the trained GBDT model includes at least one decision tree, and the decision tree includes branch nodes and leaf nodes;
    the evaluation unit comprises:
    an element acquisition module, configured to acquire at least one second event element of the second event;
    a subtree determining module, configured to partition the second event in the decision tree according to the at least one second event element, and determine a subtree of the decision tree based on the node at which the partitioning stops;
    a conditional path determining module, configured to determine a first leaf node in the subtree that satisfies a predetermined condition, and a conditional path from the root node to the first leaf node;
    a feature determining module, configured to acquire the feature combination corresponding to the branch nodes included in the conditional path, and use the feature combination as the influencing features of the second event under the predetermined condition.
  22. The device according to claim 21, wherein each leaf node in the decision tree has a trained risk score, and each branch node corresponds to one feature and has a trained risk score and a node weight, wherein the node weight is determined based on the respective node loss values of the branch node and of the nodes it splits into, and each node loss value is determined based on the difference between the labeled risk values of the sample events falling into the node and the risk score of that node;
    the evaluation unit further comprises one or more of the following:
    a third determining module, configured to determine a first risk score corresponding to the first leaf node as the risk value of the second event under the predetermined condition;
    a fourth determining module, configured to determine, according to the node weights of the branch nodes in the conditional path, the importance of each feature in the feature combination that corresponds to those branch nodes.
  23. A computer-readable storage medium storing a computer program which, when executed in a computer, causes the computer to perform the method according to any one of claims 1-11.
  24. A computing device, comprising a memory and a processor, characterized in that the memory stores executable code, and when the processor executes the executable code, the method according to any one of claims 1-11 is implemented.
PCT/CN2019/129863 2019-02-01 2019-12-30 Computer implemented event risk assessment method and device WO2020156000A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910105245.3 2019-02-01
CN201910105245.3A CN110008349B (en) 2019-02-01 2019-02-01 Computer-implemented method and apparatus for event risk assessment

Publications (1)

Publication Number Publication Date
WO2020156000A1 true WO2020156000A1 (en) 2020-08-06

Family

ID=67165700

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/129863 WO2020156000A1 (en) 2019-02-01 2019-12-30 Computer implemented event risk assessment method and device

Country Status (3)

Country Link
CN (1) CN110008349B (en)
TW (1) TWI723528B (en)
WO (1) WO2020156000A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115935265A (en) * 2023-03-03 2023-04-07 支付宝(杭州)信息技术有限公司 Method for training risk recognition model, risk recognition method and corresponding device

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008349B (en) * 2019-02-01 2020-11-10 创新先进技术有限公司 Computer-implemented method and apparatus for event risk assessment
CN110516075B (en) * 2019-07-22 2022-04-15 平安科技(深圳)有限公司 Early warning report generation method and device based on machine learning and computer equipment
CN112580916B (en) * 2019-09-30 2024-05-28 深圳无域科技技术有限公司 Data evaluation method, device, computer equipment and storage medium
CN110704742B (en) * 2019-09-30 2021-04-27 北京三快在线科技有限公司 Feature extraction method and device
CN110968700B (en) * 2019-11-01 2023-04-07 数地工场(南京)科技有限公司 Method and device for constructing domain event map integrating multiple types of affairs and entity knowledge
CN111191853B (en) * 2020-01-06 2022-07-15 支付宝(杭州)信息技术有限公司 Risk prediction method and device and risk query method and device
CN111401914B (en) * 2020-04-02 2022-07-22 支付宝(杭州)信息技术有限公司 Risk assessment model training and risk assessment method and device
CN111915207B (en) * 2020-08-11 2021-07-30 中国民航科学技术研究院 Civil aviation safety risk analysis method and device based on knowledge graph
CN113190682B (en) * 2021-06-30 2021-09-28 平安科技(深圳)有限公司 Method and device for acquiring event influence degree based on tree model and computer equipment
WO2023065211A1 (en) * 2021-10-21 2023-04-27 华为技术有限公司 Information acquisition method and apparatus
CN114328907A (en) * 2021-10-22 2022-04-12 浙江嘉兴数字城市实验室有限公司 Natural language processing method for early warning risk upgrade event
CN113992429B (en) * 2021-12-22 2022-04-29 支付宝(杭州)信息技术有限公司 Event processing method, device and equipment
CN117573814B (en) * 2024-01-17 2024-05-10 中电科大数据研究院有限公司 Public opinion situation assessment method, device and system and storage medium
CN118013053A (en) * 2024-04-08 2024-05-10 深圳市规划和自然资源数据管理中心(深圳市空间地理信息中心) Improved three-dimensional text analysis system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009229A (en) * 2017-11-29 2018-05-08 厦门市美亚柏科信息股份有限公司 Method, terminal device and the storage medium that public sentiment event data is found
CN108399509A (en) * 2018-04-12 2018-08-14 阿里巴巴集团控股有限公司 Determine the method and device of the risk probability of service request event
WO2018209254A1 (en) * 2017-05-11 2018-11-15 Hubspot, Inc. Methods and systems for automated generation of personalized messages
CN110008349A (en) * 2019-02-01 2019-07-12 阿里巴巴集团控股有限公司 The method and device for the event risk assessment that computer executes

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6334192B1 (en) * 1998-03-09 2001-12-25 Ronald S. Karpf Computer system and method for a self administered risk assessment
CN107301577A (en) * 2016-04-15 2017-10-27 阿里巴巴集团控股有限公司 Training method, credit estimation method and the device of credit evaluation model
CN108629413B (en) * 2017-03-15 2020-06-16 创新先进技术有限公司 Neural network model training and transaction behavior risk identification method and device
CN107785058A (en) * 2017-07-24 2018-03-09 平安科技(深圳)有限公司 Anti- fraud recognition methods, storage medium and the server for carrying safety brain
CN108596434B (en) * 2018-03-23 2019-08-02 卫盈联信息技术(深圳)有限公司 Fraud detection and methods of risk assessment, system, equipment and storage medium
CN108681750A (en) * 2018-05-21 2018-10-19 阿里巴巴集团控股有限公司 The feature of GBDT models explains method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU XU-YANG, HAN YONG-FENG, SONG WEN-ZHENG: "Overview and Prospect of Event Extraction Technology", JOURNAL OF INFORMATION ENGINEERING UNIVERSITY, vol. 12, no. 1, 28 February 2011 (2011-02-28), CN, pages 113 - 118, XP009522475, ISSN: 1671-0673 *

Also Published As

Publication number Publication date
CN110008349B (en) 2020-11-10
CN110008349A (en) 2019-07-12
TWI723528B (en) 2021-04-01
TW202030685A (en) 2020-08-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19913923

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19913923

Country of ref document: EP

Kind code of ref document: A1