CN110163242B

CN110163242B - Risk identification method and device and server

Info

Publication number: CN110163242B
Application number: CN201910266016.XA
Authority: CN
Inventors: 周绪刚
Original assignee: ANT Financial Hang Zhou Network Technology Co Ltd
Current assignee: ANT Financial Hang Zhou Network Technology Co Ltd
Priority date: 2019-04-03
Filing date: 2019-04-03
Publication date: 2023-04-07
Anticipated expiration: 2039-04-03
Also published as: CN110163242A

Abstract

The embodiment of the specification provides a risk identification method, which includes the steps of obtaining internal information of a sample to be identified, obtaining external information corresponding to the sample to be identified through crawling from a public network based on the internal information, then determining characteristic information based on the internal information and the external information of the sample, finally inputting the characteristic information into a target risk identification model, and performing risk identification on the sample to be identified through the target risk identification model to obtain a risk identification result. Therefore, more comprehensive external information can be mined through the internal information of the sample, so that the user portrait corresponding to the sample is more comprehensive. And the target risk identification model is obtained by training based on the internal information and the external information of the training sample, and the internal information and the external information of the sample are considered in the input of the target risk identification model. Therefore, some potential risks in the external information can be identified, so that the risk prevention and control capability is more comprehensive.

Description

Risk identification method and device and server

Technical Field

The embodiment of the specification relates to the technical field of internet, in particular to a risk identification method, a risk identification device and a server.

Background

With the rapid development of the internet, more and more services can be delivered over the network, such as online payment and online shopping. The internet brings convenience to people's lives, but it also brings risks. Bad actors may commit electronic service fraud, causing losses to other users. Existing risk identification trains risk identification models mainly on information inside the system, such as the information a user has registered with the system and the user's transaction records. The resulting risk prevention and control capability is narrow, and checking for residual risks or exploring new ones depends mainly on compliance experience and manual investigation. Therefore, to improve overall risk control capability, a scheme that can identify the risks of a sample accurately and thoroughly is urgently needed.

Disclosure of Invention

The embodiment of the specification provides a risk identification method, a risk identification device and a server.

In a first aspect, an embodiment of the present specification provides a risk identification method, including: obtaining internal information of a sample to be identified, and based on the internal information, crawling from a public network to obtain external information corresponding to the sample to be identified; determining feature information based on the internal information and the external information; inputting the characteristic information into a target risk identification model, and carrying out risk identification on the sample to be identified through the target risk identification model to obtain a risk identification result.

In a second aspect, an embodiment of the present specification provides a risk identification model training method, including: for each training sample in a training sample set, crawling a public network based on internal information of the training sample to obtain external information of the training sample, wherein the training sample set comprises black samples with calibrated attributes and unknown samples with uncalibrated attributes; for each training sample in the training sample set, determining feature information corresponding to the training sample based on the internal information and the external information of the training sample; and training a risk identification model with a semi-supervised machine learning algorithm based on the feature information of all the training samples in the training sample set to obtain a target risk identification model.

In a third aspect, an embodiment of the present specification provides a risk identification apparatus, including: an acquisition unit configured to obtain internal information of a sample to be identified and, based on the internal information, obtain external information corresponding to the sample to be identified from a public network; a determination unit configured to determine feature information based on the internal information and the external information; and an identification unit configured to input the feature information into a target risk identification model and perform risk identification on the sample to be identified through the target risk identification model to obtain a risk identification result.

In a fourth aspect, an embodiment of the present specification provides a risk identification model training device, including: an acquisition unit configured to, for each training sample in a training sample set, crawl a public network based on internal information of the training sample to obtain external information of the training sample, wherein the training sample set comprises black samples with calibrated attributes and unknown samples with uncalibrated attributes; a determination unit configured to determine the feature information corresponding to each training sample in the training sample set based on the internal information and the external information of that training sample; and a training unit configured to train a risk identification model with a semi-supervised machine learning algorithm based on the feature information of all the training samples in the training sample set to obtain a target risk identification model.

In a fifth aspect, embodiments of the present specification provide a server, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of any one of the risk identification method and the risk identification model training method when executing the program.

In a sixth aspect, the present specification provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the risk identification model training method described in any one of the above.

The embodiment of the specification has the following beneficial effects:

In the embodiment of the specification, internal information of a sample to be identified is first obtained, and external information corresponding to the sample is obtained by crawling a public network based on the internal information. Feature information is then determined based on the internal information and the external information of the sample. Finally, the feature information is input into a target risk identification model, which performs risk identification on the sample to be identified and produces a risk identification result. In this way, more comprehensive external information can be mined from the internal information of the sample, so that the user portrait corresponding to the sample is more complete. In addition, the target risk identification model is trained on both the internal information and the external information of the training samples, and its input takes both kinds of information into account. Some potential risks present in the external information can therefore be identified, which makes risk prevention and control more comprehensive and allows unknown risks to be prevented and controlled effectively.

Drawings

FIG. 1 is a schematic diagram of a risk identification application scenario according to an embodiment of the present disclosure;

FIG. 2 is a flowchart of a risk identification method according to a first aspect of an embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of a risk identification device according to a third aspect of the embodiments of the present disclosure;

FIG. 4 is a schematic structural diagram of a server according to the fifth aspect of the embodiments of the present disclosure.

Detailed Description

For a better understanding of the technical solutions, the technical solutions of the embodiments of the present specification are described in detail below with reference to the drawings and specific embodiments. It should be understood that the embodiments and the specific features in them are detailed descriptions of the technical solutions of the present specification rather than limitations on them, and that the embodiments and their technical features may be combined with each other where no conflict arises.

Please refer to FIG. 1, which is a schematic diagram of an application scenario of risk identification model training in the embodiment of the present specification. The terminal 100 is located on the user side and communicates with the server 200 on the network side. A user may generate real-time events and related service data through an app or website on the terminal 100. The server 200 collects, through the system, the real-time events generated by each terminal, from which training samples can then be selected. The embodiment of the specification can be applied to risk control scenarios that involve identifying risky samples.

In a first aspect, an embodiment of the present specification provides a risk identification method, please refer to fig. 2, which includes steps S201 to S203.

S201: obtaining internal information of a sample to be identified, and based on the internal information, crawling from a public network to obtain external information corresponding to the sample to be identified;

S202: determining feature information based on the internal information and the external information;

S203: inputting the feature information into the target risk identification model, and performing risk identification on the sample to be identified through the target risk identification model to obtain a risk identification result.

The target risk identification model is obtained by training the following steps:

for each training sample in a training sample set, based on internal information of the training sample, external information of the training sample is obtained by crawling from a public network, wherein the training sample set comprises black samples with calibrated attributes and unknown samples with uncalibrated attributes;

determining characteristic information corresponding to each training sample in a training sample set;

and training the risk recognition model by adopting a semi-supervised machine learning algorithm based on the characteristic information of all training samples in the training sample set to obtain a target risk recognition model.

Specifically, in this embodiment, the target risk identification model must be trained before it is used to perform risk identification on a sample to be identified. Training starts by determining the training samples, which may be the service data generated in the system by each terminal as described above. The training samples include black samples with marked attributes and unknown samples whose attributes are unknown. For example, in an anti-money laundering scenario, the training samples are users applying for fund transfers, and the samples corresponding to confirmed money laundering users are black samples. As another example, in an insurance claim settlement scenario, the training samples are users applying for claim settlement, and the samples corresponding to users who commit insurance fraud are black samples. In both scenarios confirmed cases are few, so a large number of black sample labels is lacking, which greatly reduces the accuracy of the trained risk identification model. How to train a model under these conditions is therefore an important problem.

The method introduces internally marked samples and makes full use of external information crawled from the public network for risk mining, which complements internal risk monitoring. After the external information is crawled, the internal information and external information of each sample are converted into corresponding feature information and combined with the black samples marked internally for the specific scenario, and a semi-supervised machine learning algorithm is used to train the model on the training samples. The resulting target risk identification model can defend against some novel, previously unknown risks, its identification accuracy can improve as training samples accumulate, and external data becomes more valuable for internal risk prevention and control. In this embodiment, internal information is information that can be obtained inside the system, and external information is information disclosed on the public network.

First, in the method in this embodiment, in step S201, for each training sample in the training sample set, based on the internal information of the training sample, external information of the training sample is obtained by crawling from the public network. Specifically, the method can be realized by the following steps:

For each training sample in the training sample set, the external information of the training sample is crawled from the public network based on the identity information in the internal information of the training sample. Specifically, for each training sample in the training sample set, the identity information in its internal information is used as key information, the key information is searched for through a preset search engine, and the abstract of each search result is crawled and used as external information of the training sample; and/or, for each training sample in the training sample set, the identity information in its internal information is used as key information, and the information corresponding to the key information that is crawled from a preset portal website is used as external information of the training sample.

Specifically, in this embodiment, the internal information of the training sample is information that exists inside the system, and includes the identity information the user registered in the system, registration-related information, and the user's historical operation information in the system. The system may be a financial system, in which case the historical operation information may be the user's historical financial transaction information. The system may also be a shopping system, in which case the historical operation information may be the user's historical shopping information.

For example, the internal information includes basic attributes of the user, such as age, sex, occupation, and hobbies, as well as comprehensive evaluation indexes, such as the user's account security level, spam registration risk score, and cheating risk score. This portrait data characterizes the user from basic information through to account risk, and the comprehensive scores mainly come from the evaluations of the account made by risk control systems or marketing systems.

Meanwhile, the internal information can also comprise historical transaction information in the system, mainly refers to historical transaction behaviors of the user, and can be mainly divided into two types, wherein one type is the transaction detail of the user within a period of historical time, and the transaction detail comprises the time, the amount, the payee, the transaction equipment, the IP and the like of the transaction. Another type is summary data such as the number of transactions, the cumulative amount of transactions, etc. over a user's historical period of time.

The internal information may also include device medium information, which mainly describes the attributes and comprehensive scores of a device, such as the activation time of the device and the number of accounts that have historically logged in on it. It also includes the comprehensive evaluations of the device made by risk control systems, such as whether account theft has occurred on the device and whether the device has a history of false transactions.

Input information that involves user privacy can be obtained with the user's authorization. In a specific implementation, the internal information may be chosen according to the needs of the actual scenario, and the present application is not limited in this respect.

Further, regarding the external information of the user: much information about a user is disclosed on the public network, and the external information corresponding to risky users may follow certain patterns. For example, some websites expose a user's illegal activities or dishonest behavior, and some illegal websites carry information related to the user. The external information may also include the user's professional background, the names of certain network marketing platforms, and whether negative events have occurred at the user's company. The external information therefore has significant reference value for building the risk identification model. Accordingly, the internal information of the training sample can be used to crawl the user's external information. The internal information used for this purpose can be the identity information the user has left in the system, such as name, address, and contact information, which is searched for on the public network to obtain the external information.

The key information extracted from the internal information, such as name, contact information, and address, can be searched for through a preset search engine, which returns multiple search results. Because the external information returned by a search engine is voluminous and noisy, only the abstract of each search result, together with the URL of the result, is extracted as external information in order to reduce the amount of processing. External information can also be crawled from preset portal websites, such as exposure websites and blacklist websites: the key information extracted from the internal information, such as name or contact information, is searched for on the preset portal website, and the corresponding search results are used as external information.
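The crawling step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the specification does not name a search engine or an API, so the engine is passed in as a hypothetical `search` callable, and the result keys `url` and `summary`, the identity field names, and the URL de-duplication are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical result shape: {"url": ..., "summary": ...}. The engine is
# injected as a callable because no concrete search API is specified.
SearchFn = Callable[[str], List[Dict[str, str]]]


def collect_external_info(identity: Dict[str, str], search: SearchFn) -> List[Dict[str, str]]:
    """Use identity fields as key information; keep only URL and abstract."""
    external: List[Dict[str, str]] = []
    seen_urls = set()
    for field in ("name", "contact", "address"):  # assumed field names
        key = identity.get(field)
        if not key:
            continue  # skip identity fields that are not on record
        for result in search(key):
            url = result.get("url", "")
            if url in seen_urls:
                continue  # assumption: one page matched by several keys counts once
            seen_urls.add(url)
            # Drop the full page body to limit the amount of text processed.
            external.append({"url": url, "summary": result.get("summary", "")})
    return external


def fake_search(query: str) -> List[Dict[str, str]]:
    # Offline stand-in for a real search engine, for demonstration only.
    return [{"url": "https://example.com/" + query,
             "summary": "about " + query,
             "body": "full page text"}]


info = collect_external_info({"name": "alice", "address": "road-1"}, fake_search)
```

Injecting the search function keeps the sketch runnable offline and leaves the choice of engine or portal website open, as the text does.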

Then, for the external information corresponding to each training sample, the multiple pieces of external text crawled for that user are vectorized. The external text can be converted into vector form with a preset word vector algorithm. Feature extraction is then performed on the internal information and external information of the training sample along preset dimensions to form the feature vector corresponding to the training sample. In a specific implementation, the preset word vector algorithm may be one-hot representation, Skip-gram, the continuous bag-of-words model (CBOW), and so on. It may be chosen according to actual needs, and the present application is not limited in this respect.
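Of the word vector algorithms listed, one-hot representation is the simplest to show. The sketch below assumes the crawled text has already been split into tokens; the tokenizer, the sample tokens, and the function names are hypothetical.

```python
from typing import Dict, List


def build_vocabulary(documents: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct token an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tokens in documents:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def one_hot(token: str, vocab: Dict[str, int]) -> List[int]:
    """Return a vector with a single 1 at the token's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


# Two already-tokenized abstracts (made-up tokens).
docs = [["fraud", "report"], ["report", "court"]]
vocab = build_vocabulary(docs)
vector = one_hot("report", vocab)
```

One-hot vectors grow with the vocabulary and carry no notion of similarity between words, which is one reason a Skip-gram or CBOW embedding might be preferred in practice.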

Furthermore, the method in this embodiment may perform feature extraction from multiple dimensions on information in the form of a word vector converted from external information and internal information, and form a feature vector corresponding to the training sample.

The preset dimension comprises a similarity dimension, and the feature extraction from the similarity dimension can be realized through the following steps: determining a target characteristic vector corresponding to a black sample in a training sample set, converting external information and internal information of the training sample into information in a word vector form for each training sample in the training sample set, and extracting a similarity characteristic vector corresponding to the target characteristic vector from the information in the word vector form.

Specifically, for a black sample with known attributes, a document corresponding to internal information and external information of the black sample can be converted into a word vector form, and the target feature vector may be a feature vector of a mean value of features of the information document corresponding to the black sample, or a feature vector customized by counting the external information of the black sample. Furthermore, for other training samples, for each crawled document, corresponding feature vectors can be extracted in a word2vec mode, for the feature vectors of the training samples, the similarity between the feature vectors and the target feature vectors is calculated, and the similarity feature vectors are formed after the mean value of the similarity is taken.
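A minimal sketch of the similarity feature follows. It assumes each document has already been turned into a numeric vector, takes the target feature vector to be the mean of the black-sample document vectors (one of the two options mentioned above), and uses cosine similarity as the measure. The specification does not name a similarity measure, so cosine similarity is an assumption.

```python
import math
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def similarity_feature(doc_vectors: Sequence[Sequence[float]],
                       target: Sequence[float]) -> float:
    """Mean similarity between a sample's documents and the target vector."""
    sims = [cosine_similarity(v, target) for v in doc_vectors]
    return sum(sims) / len(sims) if sims else 0.0


black_docs = [[1.0, 0.0], [1.0, 0.0]]   # document vectors of black samples
target = mean_vector(black_docs)         # target feature vector
sample_docs = [[1.0, 0.0], [0.0, 1.0]]   # documents crawled for one sample
feature = similarity_feature(sample_docs, target)
```

Here one of the sample's documents matches the black-sample direction exactly and the other is orthogonal to it, so the averaged similarity is one half.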

Further, the preset dimension comprises a statistical dimension, and the extracting of the statistical feature vector from the statistical dimension can be realized by the following steps: for each training sample in the training sample set, a statistical feature vector is extracted from internal information and external information of the training sample.

Specifically, for the external information of the training sample, basic statistical features of the sample can be computed, including the total number of pieces of information the search engine can crawl, the number of pieces under each key website, the length of each piece, and so on. Suppose the external information of a training sample consists of 4 search results corresponding to 4 articles. The statistical features then include the 2 websites on which the articles are located and the count of 4 articles, and these features can be expressed in vector form to obtain the statistical feature vector. A statistical feature vector can be obtained from the internal information in the same way, which is not described again here.
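The statistical features can be sketched as below. The choice of three features (result count, number of distinct websites, mean abstract length) and the `url` and `summary` keys are assumptions made for illustration; the text leaves the exact feature list open.

```python
from typing import Dict, List
from urllib.parse import urlparse


def statistical_features(results: List[Dict[str, str]]) -> List[float]:
    """Return [result count, distinct website count, mean abstract length]."""
    total = len(results)
    sites = {urlparse(r["url"]).netloc for r in results}
    lengths = [len(r["summary"]) for r in results]
    mean_len = sum(lengths) / total if total else 0.0
    return [float(total), float(len(sites)), mean_len]


# Four search results spread over two websites, mirroring the example above.
results = [
    {"url": "https://site-a.example/1", "summary": "abcd"},
    {"url": "https://site-a.example/2", "summary": "ab"},
    {"url": "https://site-b.example/1", "summary": "abcdef"},
    {"url": "https://site-b.example/2", "summary": "abcd"},
]
stats = statistical_features(results)
```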

Further, the predetermined dimension includes a uniqueness dimension from which the extraction of the uniqueness feature vector can be achieved by:

For each training sample in the training sample set, the external information and internal information of the training sample are converted into word vector form, and a uniqueness feature vector of the training sample is then extracted from the information in word vector form. The uniqueness feature vector comprises the uniqueness values of a plurality of elements corresponding to the training sample, where the uniqueness value of each element is the ratio of the number of times the element appears in the information corresponding to the training sample to the total number of times the element appears in the information corresponding to the whole training sample set.

Specifically, for training samples, for each crawled document, words in the document can be vectorized in a word2vec manner, so that the uniqueness of the sample in the word vector dimension can be calculated among all samples. Specifically, the training sample set includes 100 training samples, the number of times that the word a appears in the document information corresponding to the training sample 1 is 2, the total number of times that the word a appears in the document information corresponding to all 100 training samples is 10, and the uniqueness of the word a in the training sample 1 is 2/10= 0.2.
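The ratio in this example can be reproduced with a short sketch. It counts raw tokens rather than word2vec vectors, which is a simplification, and the function and variable names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def uniqueness_values(sample_tokens: List[str],
                      all_samples_tokens: List[List[str]]) -> Dict[str, float]:
    """Ratio of a word's count in one sample to its count over all samples.

    The sample itself must be included in all_samples_tokens, so every
    denominator is at least as large as its numerator.
    """
    corpus_counts: Counter = Counter()
    for tokens in all_samples_tokens:
        corpus_counts.update(tokens)
    sample_counts = Counter(sample_tokens)
    return {word: count / corpus_counts[word]
            for word, count in sample_counts.items()}


sample_1 = ["a", "a", "b"]               # word "a" appears twice in sample 1
corpus = [sample_1, ["a"] * 8, ["c"]]    # "a" appears 10 times in total
u = uniqueness_values(sample_1, corpus)  # u["a"] is 2/10
```

A word that occurs only in one sample gets a uniqueness of 1, while a word spread evenly over many samples gets a value close to 0.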

Further, the preset dimension comprises an informative dimension, and the extraction of the informative feature vector from the informative dimension can be realized by the following steps:

For each training sample in the training sample set, the external information and internal information of the training sample are converted into vector form, and the information in vector form is then processed either by taking the feature mean or by principal component analysis to obtain the informative feature vector.

Specifically, each document crawled for a training sample can be vectorized with doc2vec. After the documents crawled for a user have been converted into vectors by doc2vec, the vectors are averaged to form the informative feature vector. Alternatively, the informative feature vector is obtained through PCA. For example, if the external information corresponding to a training sample comprises four articles, each article is converted into a 100-dimensional vector, and PCA is applied to the four vectors to obtain the transformed informative feature vector. Suppose the 2 eigenvectors corresponding to the largest eigenvalues are selected; together they form a 200-dimensional informative feature vector.
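The PCA variant can be sketched in plain Python as follows. The text only says that PCA is applied and that the leading eigenvectors are kept. Computing them by power iteration with deflation, the iteration count, and the toy two-dimensional data are assumptions made for this example, which also takes the documents as already vectorized.

```python
import math
from typing import List


def top_components(doc_vectors: List[List[float]], k: int,
                   iters: int = 200) -> List[float]:
    """Concatenate the k leading principal directions of the document vectors."""
    n, d = len(doc_vectors), len(doc_vectors[0])
    mean = [sum(col) / n for col in zip(*doc_vectors)]
    centered = [[x - m for x, m in zip(row, mean)] for row in doc_vectors]
    # Covariance matrix (d x d) of the centered document vectors.
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    features: List[float] = []
    for _ in range(k):
        v = [1.0 / math.sqrt(d)] * d  # fixed start vector for repeatability
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance left in any direction
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                         for i in range(d))
        # Deflate so the next pass converges to the next-largest direction.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
        features.extend(v)
    return features


# Four toy "articles" in 2 dimensions; most variance lies along the first axis.
docs = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
informative = top_components(docs, k=2)
```

With four 100-dimensional article vectors and `k=2`, the same function would return a 200-dimensional vector, matching the example in the text. A numerical library would normally replace the hand-written eigenvector computation.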

In this embodiment, the feature vector extraction may be performed from the 4 dimensions, including extracting a similarity feature vector from the similarity dimension, extracting a statistical feature vector from the statistical dimension, extracting a uniqueness feature vector from the uniqueness dimension, extracting an informative feature vector from the informative dimension, and then performing subsequent model training.

Furthermore, for each training sample, different feature vectors can be extracted from the multiple dimensions, then a high-dimensional feature vector is formed, the internally marked (reported) sample is used as a black sample, a large number of other internally unmarked samples are used as unmarked samples, learning is carried out through a semi-supervised machine learning algorithm to obtain a corresponding target risk recognition model, and the attributes of the rest training samples can be predicted by adopting the target risk recognition model.

In this embodiment, the semi-supervised machine learning algorithm includes the PU Learning (Positive and Unlabeled Learning) algorithm. PU Learning is a semi-supervised machine learning algorithm in which only some of the training samples used to train the model are marked and the rest are unmarked, and the unmarked samples assist the learning from the marked samples. It applies to machine learning on marked positive samples and unmarked samples, where the training samples collected by the modeling party contain only a few marked black samples and the remaining samples are unmarked unknown samples.

After the black samples and the unknown samples in the training sample set have been constructed, the model can be trained on the training samples with a semi-supervised machine learning algorithm to build the target risk identification model. Specifically, a semi-supervised learning algorithm is trained on the feature information of each training sample. Semi-supervised machine learning algorithms can involve a variety of machine learning strategies. The typical strategies fall into two classes: two-stage methods and cost-sensitive methods. In a two-stage method, the algorithm first mines potential reliable negative samples from the unmarked samples based on the known positive samples and the unmarked samples, and then converts the problem into a traditional supervised machine learning process that trains a classification model on the known positive samples and the mined reliable negative samples.

For the cost-sensitive strategy, the algorithm assumes that the proportion of positive samples in unmarked samples is very low, and sets a higher cost-sensitive weight for the positive samples relative to the negative samples by directly treating the unmarked samples as the negative samples. For example, a higher cost-sensitive weight is usually set for the loss function corresponding to the positive sample in the objective equation based on the cost-sensitive semi-supervised machine learning algorithm. By setting higher cost-sensitive weight for the positive samples, the cost of mistaking one positive sample by the finally trained classification model is far higher than the cost of mistaking one negative sample, so that the unknown samples can be classified by directly learning a cost-sensitive classifier by using the positive samples and the unmarked samples (as the negative samples). In this embodiment, the training samples may be trained based on a cost-sensitive semi-supervised machine learning algorithm, or may be trained by a two-stage method. In the specific implementation process, the setting can be performed as required, and the application is not limited herein.
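The effect of the cost-sensitive weight can be illustrated with a weighted log loss. This is a sketch under assumptions: the text does not give a specific loss function or weight, so the logistic loss, the weight of 10, and the clipping constant are all chosen for illustration.

```python
import math
from typing import Sequence


def weighted_log_loss(labels: Sequence[int], probs: Sequence[float],
                      positive_weight: float = 10.0,
                      eps: float = 1e-12) -> float:
    """Mean log loss in which errors on positive (black) samples cost more.

    labels: 1 for marked positive samples, 0 for unmarked samples that are
    treated as negatives. probs: predicted probability of being positive.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        if y == 1:
            total += -positive_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(labels)


# Equally confident mistakes: a missed positive versus a false alarm.
miss_positive = weighted_log_loss([1], [0.1])
false_alarm = weighted_log_loss([0], [0.9])
```

With these settings a missed positive is penalized ten times as heavily as an equally confident false alarm on an unmarked sample, which is the asymmetry the strategy relies on.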

In the present embodiment, the two-stage semi-supervised machine learning algorithm is mainly used for the detailed description. Taking the anti-money laundering scenario as an example, assume that a black sample in the training sample set is marked as 1, indicating a known money laundering sample, and a white sample is marked as 0, indicating that the data corresponding to the training sample is normal. A binary classification model is trained on the black samples and on white samples drawn according to the white sample sampling probability, which yields a risk identification model. That model is then used to evaluate the unknown samples and gives each unknown sample a black sample score, a value in the range 0 to 1 that indicates the probability that the unknown sample is a black sample. Of course, black samples, white samples, and the corresponding black sample scores may be defined in other ways, and the present application is not limited in this respect.

The training samples are trained for multiple rounds in this manner. A corresponding risk identification model is obtained after each round, and whether that model meets a preset convergence condition is judged. If the model has converged, the risk identification model obtained in that round is taken as the final target risk identification model. If the model has not converged, training continues in the same manner after the black sample concentration of each unknown sample is updated, until the trained model meets the convergence condition.

Specifically, in this embodiment, each training round yields a risk identification model for that round. The model is used to compute a black sample score for each unknown sample, and the unknown sample can be labeled according to the score. In this way, the attribute evaluation result of each unknown sample in the training round can be determined, and the evaluation result can include the black sample score and the attribute information of the unknown sample in that round.

Specifically, in this embodiment, if the unknown samples marked as black samples in the previous round are the same as those marked as black samples in the current round, the mark of each unknown sample has not changed and the model has converged. For example, suppose the unknown samples marked as black in the previous training round are unknown sample 1, unknown sample 2, unknown sample 5, and unknown sample 10. If the unknown samples marked as black in the current round are also unknown samples 1, 2, 5, and 10, the marks have not changed, and the risk identification model trained in the current round has converged. The risk identification model obtained in this round is then taken as the target risk identification model.
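The multi-round procedure with this convergence check can be sketched as follows. This is a toy illustration, not the method of the specification: the classifier is replaced by a hypothetical nearest-centroid scorer, the white sample sampling probability is omitted (all unknown samples not currently marked black are used as white samples), and the names train_until_stable and black_score and the threshold 0.5 are assumptions.

```python
import math

def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def black_score(model, x):
    """Toy score in (0, 1): higher when x is nearer the black centroid."""
    black_c, white_c = model
    return 1.0 / (1.0 + math.exp(distance(x, black_c) - distance(x, white_c)))

def train_until_stable(black, unknown, threshold=0.5, max_rounds=20):
    flagged = set()  # indices of unknown samples currently marked black
    model = None
    for _ in range(max_rounds):
        # Unknown samples not marked black serve as white samples this round.
        white = [x for i, x in enumerate(unknown) if i not in flagged]
        if not white:
            break
        model = (centroid(black), centroid(white))
        marked = {i for i, x in enumerate(unknown)
                  if black_score(model, x) > threshold}
        if marked == flagged:  # marks unchanged since the previous round
            break
        flagged = marked
    return model, flagged

black = [[5.0, 5.0], [6.0, 5.0]]
unknown = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.5, 5.0], [6.0, 6.0]]
model, flagged = train_until_stable(black, unknown)
```

In this toy data the two unknown samples lying near the black samples are marked in the first round, the marks repeat in the second round, and the loop stops.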

Further, the preset convergence condition used to determine whether the model has converged may be set according to actual needs; the above is only one specific implementation and does not limit the present application. For example, the condition may be that the proportion of unknown samples whose marking as black samples is consistent with the previous round reaches a preset ratio, which indicates that the marks of the unknown samples have essentially stopped changing and the model has converged.
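A ratio-based condition of this kind might look like the sketch below. The exact definition of the ratio is not given in the text, so this is one possible reading, stated as an assumption: the share of all unknown samples whose mark did not flip between two rounds. The function name converged and the default 0.95 are hypothetical.

```python
def converged(prev_black, curr_black, total_unknown, min_ratio=0.95):
    """True when the share of unknown samples with an unchanged mark reaches min_ratio.

    prev_black / curr_black: sets of unknown-sample ids marked black in the
    previous and the current round. min_ratio is an illustrative value.
    """
    changed = len(prev_black ^ curr_black)  # samples whose mark flipped
    return (total_unknown - changed) / total_unknown >= min_ratio

same = converged({1, 2, 5, 10}, {1, 2, 5, 10}, total_unknown=100)
loose = converged({1, 2, 5, 10}, {1, 2, 5, 11}, total_unknown=100)
strict = converged({1, 2, 5, 10}, {1, 2, 5, 11}, total_unknown=100, min_ratio=0.99)
```

Setting min_ratio to 1.0 recovers the strict condition in which no mark may change.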

After the target risk identification model is trained, risk identification can be performed on the sample to be identified. Specifically, in step S201, internal information of the sample to be identified is obtained, and based on the internal information, external information corresponding to the sample to be identified is obtained by crawling from the public network.

The external information of the sample to be identified is acquired in the same way as the external information of the training samples: identity information in the internal information is used as key information, the key information is searched through a preset search engine, and the abstract information of each search result is crawled from the search results as external information; and/or identity information in the internal information is used as key information, and information corresponding to the key information crawled from a preset portal website is used as external information.

Specifically, the internal information of the sample to be identified is information that exists inside the system, including the identity information, registration-related information, and historical operation information in the system of the user corresponding to the sample. The system may be a financial system, in which case the historical operation information may be historical financial transaction information corresponding to the user. The system may also be a shopping system, in which case the historical operation information may be historical shopping information corresponding to the user.

For example, the internal information includes basic attributes of the user corresponding to the sample, such as age, gender, occupation, and hobbies, and comprehensive evaluation indexes, such as the account security level of the user, the spam registration risk score, and the cheating risk score. The portrait data characterizes the user from basic information through to account risk, and the comprehensive scoring portrait mainly comes from the evaluation and characterization of the account in some risk control systems or marketing systems.

The internal information may also include device media information, which mainly describes the attributes and comprehensive scores of a device, such as the activation time of the device and the number of accounts that have historically logged in on it. It also includes the comprehensive scores given to the device by some risk control systems, such as whether theft has occurred on the device and whether the device has historically carried out false transactions.

Any of the above information that relates to user privacy can be obtained with user authorization. In a specific implementation process, the internal information may be set according to the needs of the actual scenario, and the present application is not limited herein.

Furthermore, regarding the external information of the sample to be identified, much information about a user is disclosed on external networks, and the external information corresponding to risky users may follow certain patterns. For example, a public website may expose information about the user's participation in illegal activities or about the user's dishonest behavior, or some illegal websites may carry information related to the user. The external information may also include the professional background of the user, the names of some network marketing platforms, and whether negative events have occurred at the company where the user works. Therefore, the internal information of the sample to be identified can be used to crawl the external information of the user. The internal information used can be the identity information that the user has left in the system, such as name, address, and contact information, which is used to crawl the public network for external information.

The key information extracted from the internal information, such as name, contact information, and address, can be searched through a preset search engine, which returns a plurality of search results. Because the external information found through a search engine is voluminous and complicated, only the abstract information of each search result is extracted as external information in order to reduce the amount of information processing; the external information also includes the website address of each search result. Meanwhile, external information can be crawled from preset portal websites, such as exposure websites and blacklist websites: the key information extracted from the internal information, such as name or contact information, is searched in the preset portal website, and the corresponding search results are used as external information.

Then, for the external information corresponding to the sample to be identified, the multiple pieces of related external text crawled for the user are vectorized. The external text information can be converted into vector form by a preset word vector algorithm. The preset word vector algorithm may be one-hot encoding, Skip-gram, the continuous bag-of-words model (CBOW), or the like, and may be set according to actual needs in a specific implementation process, which is not limited in this application. Feature extraction is then performed on the preset dimensions, consistent with those used for the training samples when the target risk identification model was trained, so that the dimension of the feature vector of the sample to be identified is consistent with that of the training samples.
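Of the word vector algorithms named above, one-hot encoding is the simplest to show. The following is a minimal sketch; whitespace tokenization and the function names build_vocab and one_hot are assumptions for illustration, and Chinese text would in practice need a word segmenter first.

```python
def build_vocab(docs):
    """Assign each distinct token an index in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc.split():  # whitespace tokenization is an assumption
            vocab.setdefault(token, len(vocab))
    return vocab

def one_hot(token, vocab):
    """Vector with a single 1 at the token's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

docs = ["fraud case exposed", "case closed"]
vocab = build_vocab(docs)
case_vec = one_hot("case", vocab)
```

Every vector has the length of the vocabulary, which is one reason denser representations such as Skip-gram or CBOW are often preferred for large corpora.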

Further, in step S202, feature information of the sample to be recognized is determined, specifically, for information in the form of word vectors converted from external information and internal information, feature extraction may be performed from multiple dimensions, and a feature vector corresponding to the sample to be recognized is formed.

The preset dimension comprises a similarity dimension, and the feature extraction from the similarity dimension can be realized through the following steps: determining a target characteristic vector corresponding to a preset black sample; and after converting the internal information and the external information into information in a word vector form, extracting a similarity characteristic vector corresponding to the target characteristic vector from the information in the word vector form.

Specifically, for a black sample with known attributes, the documents corresponding to its internal information and external information can be converted into word vector form. The target feature vector may be the mean of the feature vectors of the information documents corresponding to the black samples, or a feature vector customized from statistics on the external information of the black samples. For the sample to be identified, a corresponding feature vector can be extracted from each crawled document by means of word2vec; the similarity between each such feature vector and the target feature vector is calculated, and the mean of the similarities is taken to form the similarity feature vector.
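The similarity computation can be sketched as below, assuming the document vectors have already been produced (for example by word2vec). Cosine similarity is an assumption here, since the text does not name a similarity measure, and the function names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise mean, used here to build the target feature vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def similarity_feature(doc_vectors, target):
    """Mean similarity between each crawled document and the target vector."""
    sims = [cosine(v, target) for v in doc_vectors]
    return sum(sims) / len(sims)

# Target: mean of the document vectors of known black samples (toy 2-d values).
target = mean_vector([[1.0, 0.0], [1.0, 0.0]])
# Sample to be identified: one document aligned with the target, one orthogonal.
feature = similarity_feature([[1.0, 0.0], [0.0, 1.0]], target)
```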

Further, the preset dimension comprises a statistical dimension, and the extracting of the statistical feature vector from the statistical dimension can be realized by the following steps: and extracting statistical characteristic vectors from the internal information and the external information of the sample to be identified aiming at the sample to be identified.

Specifically, for the external information of the sample to be identified, basic statistical features can be computed, including the total amount of information that can be crawled through the search engine, the amount of information under each key website, the length of each piece of information, and the like. Suppose the external information of the sample to be identified includes 4 search results corresponding to 4 articles located on 2 websites; the statistical features then include the 2 websites and the 4 articles, and can be expressed in vector form to obtain the statistical feature vector. For the internal information, the statistical feature vector may be obtained in the same manner, which is not described herein again.
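The example of 4 articles on 2 websites can be sketched as follows. The record layout (a dict with "url" and "summary" keys), the choice and order of the four features, and the placeholder host names are assumptions for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

def statistical_features(results):
    """Return [total results, distinct sites, max results on one site, mean summary length]."""
    sites = Counter(urlparse(r["url"]).netloc for r in results)
    lengths = [len(r["summary"]) for r in results]
    return [
        len(results),
        len(sites),
        max(sites.values(), default=0),
        sum(lengths) / len(lengths) if lengths else 0.0,
    ]

# Four crawled search results located on two (placeholder) websites.
results = [
    {"url": "https://site-a.example/1", "summary": "abcd"},
    {"url": "https://site-a.example/2", "summary": "ab"},
    {"url": "https://site-b.example/1", "summary": "abcdef"},
    {"url": "https://site-b.example/2", "summary": "abcd"},
]
features = statistical_features(results)
```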

Further, the predetermined dimension includes a uniqueness dimension, and the extracting of the uniqueness feature vector from the uniqueness dimension may be performed by:

after the internal information and the external information are converted into information in a word vector form, a uniqueness feature vector is extracted from the information in the word vector form, wherein the uniqueness feature vector comprises uniqueness values corresponding to a plurality of elements appearing in the internal information and the external information respectively, and the uniqueness value of each element is the ratio of the number of times of the element appearing in the internal information and the external information to the total number of times of the element appearing in the information corresponding to a preset sample set.

Specifically, for the sample to be identified, the words in each crawled document can be vectorized by means of word2vec, so that the uniqueness of the sample in the word vector dimension can be calculated across all samples. For example, suppose the preset sample set includes 100 samples to be identified, the word A appears 2 times in the document information corresponding to sample 1 to be identified, and the word A appears 10 times in total in the document information corresponding to all 100 samples; the uniqueness of the word A in the document information corresponding to sample 1 is then 2/10 = 0.2.
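The uniqueness ratio in this example can be reproduced with a short sketch. Token lists stand in for the documents, the function name uniqueness is hypothetical, and the sample itself is assumed to be part of the preset sample set so that every denominator is non-zero.

```python
from collections import Counter

def uniqueness(sample_tokens, all_samples_tokens):
    """Per-word ratio: count in this sample / count across the whole sample set.

    all_samples_tokens is assumed to include sample_tokens itself.
    """
    total = Counter()
    for tokens in all_samples_tokens:
        total.update(tokens)
    own = Counter(sample_tokens)
    return {word: own[word] / total[word] for word in own}

sample_1 = ["A", "A", "B"]      # word A appears 2 times in sample 1
other_samples = [["A"] * 8]     # the remaining 8 occurrences of A in the set
values = uniqueness(sample_1, [sample_1] + other_samples)
```

A word that appears only in this sample gets the value 1.0, while a word spread evenly over many samples gets a value near 0.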

Further, the preset dimension includes an informative dimension, and the extraction of the informative feature vector from the informative dimension may be performed as follows: after the internal information and the external information are converted into information in word vector form, the information in word vector form is processed by feature averaging to obtain the informative feature vector, or is processed by principal component analysis to obtain the informative feature vector.

Specifically, for the sample to be identified, each crawled document can be vectorized by means of doc2vec. After the passages crawled for the customer are converted into vectors through doc2vec, the vectors are combined and averaged to form the informative feature vector. Alternatively, the informative feature vector is obtained through PCA processing. For example, if the external information corresponding to the sample to be identified comprises four articles, each of the four articles is converted into a 100-dimensional vector, and PCA is applied to the four vectors to obtain the transformed informative feature vector. Suppose the 2 eigenvectors corresponding to the largest eigenvalues are selected to form the informative feature vector; a 200-dimensional informative feature vector is thus formed.
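The averaging variant described above can be sketched as follows. The four 100-dimensional vectors are placeholder values standing in for doc2vec output, and the function name mean_pool is hypothetical; the PCA variant is not shown.

```python
def mean_pool(doc_vectors):
    """Average equal-length document vectors into one vector of the same length."""
    n = len(doc_vectors)
    return [sum(col) / n for col in zip(*doc_vectors)]

# Four articles, each already converted to a 100-dimensional vector (toy values).
doc_vectors = [[float(i)] * 100 for i in range(1, 5)]
pooled = mean_pool(doc_vectors)
```

Averaging keeps the feature length fixed at the document vector size regardless of how many articles were crawled, which helps keep the input dimension consistent between training samples and samples to be identified.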

In this embodiment, for a sample to be recognized, the dimension of feature extraction on the sample to be recognized needs to be consistent with that of a training sample, and it is ensured that feature information of the sample to be recognized can be used as an input feature of a target risk recognition model.

If the training sample of the target risk identification model performs feature vector extraction from the above 4 dimensions, then the features of the sample to be identified also need to be extracted from the 4 dimensions.

Finally, in step S203, the feature information of the sample to be recognized is input into the target risk recognition model, and the sample to be recognized is subjected to risk recognition through the target risk recognition model, so as to obtain a risk recognition result.

Specifically, in this embodiment, model training is performed using the known black samples and the white samples that remain, with high confidence, after potential black samples are screened out, so the resulting target risk identification model has high evaluation accuracy. The model can then perform risk identification on a newly entered sample, and the evaluation result can include the black sample score, which indicates the risk score of the newly entered sample. Further, a risk score threshold can be set: if the risk score of the newly entered sample is greater than the threshold, the sample is identified as a risk sample and prompt information is output, so that the relevant personnel can learn the risk level of the sample in time and carry out risk control promptly.
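The thresholding step can be sketched as below. The function name identify, the threshold 0.8, and the wording of the prompt are assumptions for illustration; the specification leaves the threshold to be set as needed.

```python
def identify(risk_score, threshold=0.8):
    """Flag a newly entered sample when its risk score exceeds the threshold.

    Returns (is_risk_sample, prompt); prompt is None when no alert is needed.
    """
    is_risk_sample = risk_score > threshold
    prompt = (
        f"risk sample: score {risk_score:.2f} exceeds threshold {threshold:.2f}"
        if is_risk_sample
        else None
    )
    return is_risk_sample, prompt

flagged, prompt = identify(0.93)
cleared, no_prompt = identify(0.40)
```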

In a second aspect, based on the same inventive concept, an embodiment of the present specification provides a risk recognition model training method, including:

for each training sample in a training sample set, determining characteristic information corresponding to the training sample based on internal information and external information corresponding to each training sample in the training sample set;

and training a risk recognition model by adopting a semi-supervised machine learning algorithm based on the characteristic information of all the training samples in the training sample set to obtain a target risk recognition model.

The specific process of training the target risk recognition model in this embodiment has been described in detail in the foregoing first aspect embodiment, and is not described herein again.

In a third aspect, based on the same inventive concept, an embodiment of this specification provides a risk identification apparatus, please refer to fig. 3, including:

the acquisition unit 301 is configured to acquire internal information of a sample to be identified, and acquire external information corresponding to the sample to be identified by crawling from a public network based on the internal information;

a determination unit 302 for determining feature information based on the internal information and the external information;

and the identifying unit 303 is configured to input the feature information into the target risk identification model, and perform risk identification on the sample to be identified through the target risk identification model to obtain a risk identification result.

In an optional implementation manner, the obtaining unit 301 is specifically configured to:

searching the key information by using identity information in the internal information as key information through a preset search engine, and crawling the summary information of each search result from the search results as external information; and/or

And using the identity information in the internal information as key information, and using the information corresponding to the key information and crawled from a preset portal website as external information.

In an alternative implementation, the determining unit 302 is specifically configured to:

and performing feature extraction on the internal information and the external information from a preset dimension to form a feature vector corresponding to the sample to be identified.

In an optional implementation manner, the preset dimensions include a similarity dimension, and the determining unit 302 is specifically configured to:

determining a target characteristic vector corresponding to a preset black sample;

and after converting the internal information and the external information into information in a word vector form, extracting a similarity characteristic vector corresponding to the target characteristic vector from the information in the word vector form.

In an optional implementation manner, the preset dimensions include statistical dimensions, and the determining unit 302 is specifically configured to:

and extracting statistical characteristics from the internal information and the external information to form a statistical characteristic vector.

In an alternative implementation, the preset dimension includes a uniqueness dimension, and the determining unit 302 is specifically configured to:

In an optional implementation manner, the preset dimension includes an informative dimension, and the determining unit 302 is specifically configured to:

In an alternative implementation, the apparatus further includes a model training unit, and the model training unit is specifically configured to:

before inputting the characteristic information into a target risk identification model, crawling external information of a training sample from an open network according to internal information of the training sample aiming at each training sample in a training sample set, wherein the training sample set comprises a black sample with calibrated attributes and an unknown sample with uncalibrated attributes;

In a fourth aspect, based on the same inventive concept as the foregoing embodiment, the present invention further provides a risk recognition model training apparatus, including:

the acquisition unit is used for crawling external information of a training sample from an open network based on internal information of the training sample for each training sample in a training sample set, wherein the training sample set comprises a black sample with calibrated attributes and an unknown sample without calibrated attributes;

the determining unit is used for determining the characteristic information corresponding to a training sample based on the internal information and the external information corresponding to each training sample in a training sample set;

and the training unit is used for training a risk recognition model by adopting a semi-supervised machine learning algorithm based on the characteristic information of all the training samples in the training sample set to obtain a target risk recognition model.

In a fifth aspect, based on the same inventive concept as the previous embodiment, the present invention further provides a server, as shown in fig. 4, including a memory 404, a processor 402, and a computer program stored in the memory 404 and executable on the processor 402, wherein the processor 402 executes the computer program to implement the steps of any one of the foregoing risk identification method and risk identification model training method.

In fig. 4, a bus architecture (represented by bus 400) is shown. Bus 400 may include any number of interconnected buses and bridges, and bus 400 links together various circuits including one or more processors, represented by processor 402, and memory, represented by memory 404. The bus 400 may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface 406 provides an interface between the bus 400 and the receiver 401 and transmitter 403. The receiver 401 and the transmitter 403 may be the same element, i.e., a transceiver, providing a means for communicating with various other apparatus over a transmission medium. The processor 402 is responsible for managing the bus 400 and general processing, while the memory 404 may be used for storing data used by the processor 402 in performing operations.

In a sixth aspect, based on the inventive concept of risk identification model training as in the previous embodiments, the present invention further provides a computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of any one of the foregoing risk identification method and the method of risk identification model training.

This specification is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the preferred embodiments of the present specification have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all changes and modifications that fall within the scope of the specification.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present specification without departing from the spirit and scope of the specification. Thus, if such modifications and variations of the present specification fall within the scope of the claims of the present specification and their equivalents, the specification is intended to include such modifications and variations.

Claims

1. A risk identification method, comprising:

obtaining internal information of a sample to be identified, and based on the internal information, crawling from a public network to obtain external information corresponding to the sample to be identified;

determining feature information based on the internal information and the external information, including: performing feature extraction on the internal information and the external information from preset dimensions to form feature vectors corresponding to the samples to be identified, wherein when the preset dimensions comprise similarity dimensions, target feature vectors corresponding to preset black samples are determined; after the internal information and the external information are converted into information in a word vector form, extracting a similarity characteristic vector corresponding to the target characteristic vector from the information in the word vector form;

inputting the characteristic information into a target risk identification model, and carrying out risk identification on the sample to be identified through the target risk identification model to obtain a risk identification result.

2. The method of claim 1, wherein the crawling external information corresponding to the sample to be identified from an open network based on the internal information comprises:

identity information in the internal information is used as key information, the key information is searched through a preset search engine, and abstract information of each search result is obtained from the search results in a crawling mode and is used as external information; and/or

And taking the identity information in the internal information as key information, and taking information corresponding to the key information and crawled from a preset portal website as external information.

3. The method of claim 1, wherein the preset dimensions include statistical dimensions, and the extracting features from the internal information and the external information from the preset dimensions to form feature vectors corresponding to the samples to be identified comprises:

4. The method according to claim 1, wherein the preset dimension comprises a uniqueness dimension, and the feature extraction of the internal information and the external information from the preset dimension to form a feature vector corresponding to the sample to be recognized comprises:

after the internal information and the external information are converted into information in a word vector form, a uniqueness feature vector is extracted from the information in the word vector form, wherein the uniqueness feature vector comprises uniqueness values respectively corresponding to a plurality of elements appearing in the internal information and the external information, and the uniqueness value of each element is the ratio of the number of times of the element appearing in the internal information and the external information to the total number of times of the element appearing in the information corresponding to a preset sample set.

5. The method of claim 1, wherein the preset dimension comprises an informative dimension, and the extracting the features of the internal information and the external information from the preset dimension to form a feature vector corresponding to the sample to be identified comprises:

and after the internal information and the external information are converted into information in a word vector form, processing the information in the word vector form through a characteristic mean value to obtain an informative characteristic vector or analyzing and processing the information in the word vector form through a principal component to obtain the informative characteristic vector.

6. The method of any of claims 1-5, wherein before inputting the feature information into a target risk recognition model, the target risk recognition model is obtained by training:

and training a risk recognition model by adopting a semi-supervised machine learning algorithm based on the characteristic information of all training samples in the training sample set to obtain a target risk recognition model.

7. A risk recognition model training method comprises the following steps:

for each training sample in a training sample set, determining feature information corresponding to the training sample based on internal information and external information corresponding to the training sample, including: extracting features of the internal information and the external information from preset dimensions to form feature vectors corresponding to the samples to be identified, wherein when the preset dimensions comprise similarity dimensions, target feature vectors corresponding to preset black samples are determined; after the internal information and the external information are converted into information in a word vector form, extracting a similarity characteristic vector corresponding to the target characteristic vector from the information in the word vector form;

8. A risk identification device comprising:

an acquisition unit, configured to acquire internal information of a sample to be identified and to acquire external information corresponding to the sample to be identified from a public network based on the internal information;

a determination unit configured to determine feature information based on the internal information and the external information, including: extracting features of the internal information and the external information from preset dimensions to form feature vectors corresponding to the samples to be identified, wherein when the preset dimensions comprise similarity dimensions, target feature vectors corresponding to preset black samples are determined; after the internal information and the external information are converted into information in a word vector form, extracting a similarity characteristic vector corresponding to the target characteristic vector from the information in the word vector form;

and the identification unit is used for inputting the characteristic information into a target risk identification model, and performing risk identification on the sample to be identified through the target risk identification model to obtain a risk identification result.

9. The apparatus according to claim 8, wherein the acquisition unit is specifically configured to:

use identity information in the internal information as key information, search for the key information through a preset search engine, and crawl summary information of each search result from the search results as the external information; and/or

10. The apparatus according to claim 8, wherein the preset dimensions comprise a statistical dimension, and the determination unit is specifically configured to:

11. The apparatus according to claim 8, wherein the preset dimensions comprise a uniqueness dimension, and the determination unit is specifically configured to:

12. The apparatus according to claim 8, wherein the preset dimensions comprise an informative dimension, and the determination unit is specifically configured to:

13. The apparatus according to any of claims 8-12, the apparatus further comprising a model training unit, the model training unit being specifically configured to:

before the feature information is input into the target risk identification model, for each training sample in a training sample set, crawl external information of the training sample from a public network according to internal information of the training sample, wherein the training sample set comprises black samples whose attributes have been labeled and unknown samples whose attributes have not been labeled;

14. A risk identification model training apparatus, comprising:

a determination unit configured to determine, for each training sample in a training sample set, feature information corresponding to the training sample based on internal information and external information corresponding to the training sample, including: extracting features of the internal information and the external information from preset dimensions to form feature vectors corresponding to the samples to be identified, wherein when the preset dimensions comprise a similarity dimension, target feature vectors corresponding to preset black samples are determined; and after the internal information and the external information are converted into word-vector form, extracting, from the information in word-vector form, a similarity feature vector corresponding to the target feature vectors;

and a training unit configured to train a risk identification model with a semi-supervised machine learning algorithm, based on the feature information of all the training samples in the training sample set, to obtain a target risk identification model.
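The claim leaves the semi-supervised algorithm open. Purely as a sketch, the training unit could be approximated with a simple two-step positive-unlabeled heuristic: unlabeled samples farthest from the black samples are treated as reliable negatives, and new samples are classified by the nearer centroid. The function names, the `reliable_fraction` parameter, and the centroid classifier are all assumptions made for illustration and are not taken from the patent.

```python
import math


def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train_pu_model(black, unknown, reliable_fraction=0.5):
    """Learn (black_centroid, white_centroid) from labeled black samples
    and unlabeled samples. Assumes both lists are non-empty."""
    black_c = centroid(black)
    # Assumption: the unlabeled samples farthest from the black centroid
    # are unlikely to be risky, so they serve as reliable negatives.
    ranked = sorted(unknown, key=lambda row: distance(row, black_c), reverse=True)
    keep = max(1, int(len(ranked) * reliable_fraction))
    white_c = centroid(ranked[:keep])
    return black_c, white_c


def predict_risky(model, features):
    """True if the sample lies closer to the black centroid."""
    black_c, white_c = model
    return distance(features, black_c) < distance(features, white_c)


# Usage with made-up two-dimensional feature vectors.
black = [[0.9, 0.8], [1.0, 0.9]]
unknown = [[0.1, 0.0], [0.0, 0.2], [0.95, 0.85], [0.2, 0.1]]
model = train_pu_model(black, unknown)
```

In practice a stronger base classifier and an iterative relabeling loop would likely replace the centroid rule; the sketch only shows how labeled black samples and unlabeled samples can both contribute to one model.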

15. A server comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the program.

16. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the method of any one of claims 1 to 7.