CN112308453B - Risk identification model training method, user risk identification method and related devices

Info

Publication number: CN112308453B
Application number: CN202011301542.4A
Authority: CN
Inventors: 刘宏剑; 杨青
Original assignee: Du Xiaoman Technology Beijing Co Ltd
Current assignee: Du Xiaoman Technology Beijing Co Ltd
Priority date: 2020-11-19
Filing date: 2020-11-19
Publication date: 2023-04-28
Anticipated expiration: 2040-11-19
Also published as: CN112308453A

Abstract

The invention discloses a risk identification model training method, a user risk identification method and related devices. The training method comprises: performing de-duplication processing on the search logs in an initial sample to obtain words, and sorting the words by using a keyword dictionary, wherein the order of the keyword dictionary is set according to the importance degree of the words; intercepting the sorting result into at least one input text according to a preset length; and training a risk identification model by taking the at least one input text as a training sample to obtain a target risk identification model. Because the training samples are obtained by de-duplicating the search logs, sorting the resulting words according to the keyword dictionary, and intercepting the result to the preset length, the training samples are shorter than those produced by splicing, which improves training efficiency. Because the words are sorted based on the keyword dictionary, the words with higher importance are retained even after interception, which preserves training accuracy.

Description

Risk identification model training method, user risk identification method and related devices

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a risk identification model training method, a user risk identification method, and related devices.

Background

When a user searches for information over a network, a large number of search logs are generated. These search logs usually exist as text, and the text can be used to identify the user's risk. In the existing risk identification process, a model is first built and trained using methods such as TextCNN, LSTM or a pre-trained neural network to obtain a risk identification model, and risk identification is then performed with that model. The training process for the TextCNN model includes: splicing the user's search logs into a long text, and training on the users' risk labels with a TextCNN neural network. The training process for the LSTM model includes: splicing the user's search logs into a long text, and training on the users' risk labels with a long short-term memory (LSTM) recurrent neural network. The training process for the pre-trained neural network model includes: splicing the user's search logs into a long text, training on a large-scale corpus to obtain a pre-trained neural network, and fine-tuning it on the users' risk labels.

However, processing long text input with a neural network currently takes a long time and yields poor recognition results, so the accuracy and efficiency of the trained neural network model cannot meet high requirements, which directly affects the user risk identification process.

Disclosure of Invention

In view of the above, the present invention provides a risk identification model training method, a user risk identification method and related devices, to solve the problem that processing long text input with a neural network currently takes a long time and yields poor recognition results, so that the accuracy and efficiency of the trained neural network model cannot meet higher requirements and the user risk identification process is directly affected. The specific scheme is as follows:

a risk identification model training method, comprising:

acquiring an initial sample;

performing de-duplication processing on the search logs in the initial sample to obtain each word;

sorting the words by using a keyword dictionary to obtain a sorting result, wherein the keyword dictionary is pre-established and comprises a plurality of words, and the order of the words is set according to the importance degree of the words;

intercepting the sorting result into at least one input text according to a preset length;

and training the risk recognition model by taking the at least one input text as a training sample to obtain a target risk recognition model, wherein the risk recognition model is constructed based on an Embedding layer and a Transformer structure.

In the above method, optionally, the process of establishing the keyword dictionary includes:

all search logs of each user are spliced to obtain spliced texts;

word segmentation is carried out on the spliced text to obtain each word;

calculating the high-low risk discrimination degree and the frequency of occurrence corresponding to each word, and taking the product of the high-low risk discrimination degree and the frequency as the importance value of the word;

and sequencing the words based on the importance values to obtain the keyword dictionary.

In the above method, optionally, calculating the high-low risk discrimination degree corresponding to each word includes:

counting, among the users who searched the word, the proportion H of high-risk users and the proportion L of low-risk users;

acquiring the proportion H' of high-risk users and the proportion L' of low-risk users among all users, and calculating the high-low risk discrimination degree based on a preset formula:

[formula not preserved in the source text]

where R represents the high-low risk discrimination degree.

In the above method, optionally, constructing the risk identification model based on the Embedding layer and the Transformer structure includes:

training a text prediction model based on a preset training corpus to obtain a target text prediction model, wherein the target text prediction model comprises the Embedding layer and the Transformer structure;

when training is completed, acquiring the Embedding layer and the Transformer structure;

and adding a risk identification layer, and constructing the risk identification model in the order of the Embedding layer, the Transformer structure and the risk identification layer.

In the above method, optionally, the method further comprises:

acquiring the length of the sorting result;

and, in a case where the length is smaller than the preset length, adding blanks to the sorting result to pad it to the preset length.

A user risk identification method, comprising:

under the condition that a risk identification request for a current user is received, a target risk identification model is called, wherein the target risk identification model is obtained by training based on the training method;

obtaining a current search log of the current user, and performing deduplication processing on the current search log to obtain each current word;

sequencing each current word according to the keyword dictionary to obtain a current sequencing result;

intercepting the sequencing result into a current input text according to a preset length;

and transmitting the current input text to the target risk recognition model to perform risk recognition.

A risk identification model training device, comprising:

the initial sample acquisition module is used for acquiring an initial sample;

the first duplicate removal module is used for carrying out duplicate removal processing on the search logs in the initial sample to obtain each word;

the first ordering module is used for ordering the words by utilizing a keyword dictionary to obtain an ordering result, wherein the keyword dictionary is pre-established and comprises a plurality of words, and the order of the words is set according to the importance degree of the words;

the first intercepting module is used for intercepting the sequencing result into at least one input text according to a preset length;

the training module is used for training the risk recognition model by taking the at least one input text as a training sample to obtain a target risk recognition model, wherein the risk recognition model is constructed based on an Embedding layer and a Transformer structure.

In the above apparatus, optionally, the process of establishing the keyword dictionary in the first ranking module includes:

the splicing unit is used for splicing all the search logs of each user to obtain a spliced text;

the word segmentation unit is used for segmenting the spliced text to obtain each word;

the computing unit is used for computing the high-low risk discrimination degree and the frequency of occurrence corresponding to each word, and taking the product of the high-low risk discrimination degree and the frequency as the importance value of the word;

and the ordering unit is used for ordering the words based on the importance values to obtain the keyword dictionary.

In the above apparatus, optionally, for constructing the risk identification model based on the Embedding layer and the Transformer structure, the training module comprises:

the training unit is used for training the text prediction model based on a preset training corpus to obtain a target text prediction model, wherein the target text prediction model comprises the Embedding layer and the Transformer structure;

the acquisition unit is used for acquiring the Embedding layer and the Transformer structure when training is completed;

the building unit is used for adding a risk identification layer and building the risk identification model in the order of the Embedding layer, the Transformer structure and the risk identification layer.

A user risk identification device comprising:

the invoking module is used for invoking a target risk identification model under the condition of receiving a risk identification request for a current user, wherein the target risk identification model is obtained by training based on the training method;

the second deduplication module is used for acquiring the current search log of the current user, and performing deduplication processing on the current search log to obtain each current word;

the second ordering module is used for ordering each current word according to the keyword dictionary to obtain a current ordering result;

the second intercepting module is used for intercepting the sequencing result into a current input text according to a preset length;

and the recognition module is used for transmitting the current input text to the target risk recognition model to perform risk recognition.

Compared with the prior art, the invention has the following advantages:

The invention discloses a risk identification model training method, a user risk identification method and related devices. The training method comprises: acquiring an initial sample; performing de-duplication processing on the search logs in the initial sample to obtain words; sorting the words by using a keyword dictionary to obtain a sorting result, wherein the keyword dictionary is pre-established and comprises a plurality of words, and the order of the words is set according to the importance degree of the words; intercepting the sorting result into at least one input text according to a preset length; and training the risk identification model by taking the at least one input text as a training sample to obtain a target risk identification model. In the training process, the training samples are obtained by de-duplicating the search logs, sorting the resulting words according to the keyword dictionary, and intercepting the sorting result to the preset length. Compared with the prior art, the training samples are shorter, which improves training efficiency; and because the words are sorted based on the keyword dictionary, the words with higher importance are retained even after interception, which preserves training accuracy.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a risk identification model training method according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of a text prediction model according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a risk identification model according to an embodiment of the present disclosure;

FIG. 4 is a flowchart of a method for identifying risk of a user according to an embodiment of the present application;

FIG. 5 is a block diagram of a risk identification model training device according to an embodiment of the present disclosure;

FIG. 6 is a block diagram of a user risk identification device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The invention discloses a risk identification model training method, a user risk identification method and related devices, which apply to the process of identifying user risk, where a risk identification model must be trained before user risk identification is performed. Existing risk identification models are typically constructed based on methods such as TextCNN, LSTM and pre-trained neural networks. In the existing training process, a user's search logs are spliced into a long text, and the long text is used as a training sample. However, processing long text input with a neural network currently takes a long time and yields poor recognition results, so the accuracy and efficiency of the trained neural network model cannot meet higher requirements, which directly affects the user risk identification process. Therefore, to address the low accuracy and low efficiency, the invention provides a risk identification model training method. The execution flow of the training method is shown in FIG. 1 and includes the following steps:

S101, acquiring an initial sample;

In the embodiment of the invention, the initial sample is obtained from a preset sample library, wherein the initial sample comprises the risk level of at least one user and the search logs corresponding to that user.

S102, performing deduplication processing on the search logs in the initial sample to obtain each word;

In the embodiment of the invention, the search logs in the initial sample are obtained, and de-duplication processing is performed on each search log. The de-duplication method is as follows: the search log is parsed and segmented into words to obtain the initial words corresponding to the search log (the specific parsing method is not limited), and repeated words among the initial words are removed by hashing to obtain the words.
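The de-duplication in S102 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `deduplicate_words` is hypothetical, whitespace splitting stands in for the unspecified word segmentation (Chinese text would need a real segmenter), and de-duplicating across all of a user's logs while keeping first-seen order is an assumption.

```python
def deduplicate_words(search_logs, segment=str.split):
    """Segment each search log and drop repeated words, keeping first-seen order.

    `segment` is a placeholder for the unspecified word segmentation step;
    whitespace splitting is only an assumption for this sketch.
    """
    seen = set()   # hash-based membership test, as in "removed by hashing"
    words = []
    for log in search_logs:
        for word in segment(log):
            if word not in seen:
                seen.add(word)
                words.append(word)
    return words


logs = ["cash loan fast", "fast loan no credit", "cash advance"]
print(deduplicate_words(logs))
```

A set gives constant-time duplicate checks on average, so the pass is linear in the total number of segmented words.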

S103, ordering the words by using a keyword dictionary to obtain an ordering result, wherein the keyword dictionary is pre-established and comprises a plurality of words, and the order of the words is set according to the importance degree of the words;

In the embodiment of the invention, a keyword dictionary is pre-established. The establishment process includes: splicing the search log texts of each user and performing word segmentation; then traversing the search logs of all users and counting, for each word, the number of users who searched the word, recorded as the word frequency T; at the same time, counting the proportion H of high-risk users and the proportion L of low-risk users among the users who searched the word, and recording the proportion H' of high-risk users and the proportion L' of low-risk users among all users; and calculating the high-low risk discrimination degree R based on a preset formula:

[formula not preserved in the source text]

The word importance is calculated as T·R. The words are sorted in descending order of word importance to construct the keyword dictionary.

Further, different formulas may be used when calculating word importance from the frequency and the risk discrimination degree of a word, but the main idea is to use the product of the frequency and the risk discrimination degree.
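The dictionary construction can be sketched as below, under clearly stated assumptions. The preset formula for R is not available in this text, so `absolute_deviation` is a purely hypothetical stand-in chosen only so the example runs; any function of H, L, H' and L' can be passed in its place. The function names, the input layout (`user_words`, `high_risk_users`) and the alphabetical tie-break are likewise assumptions, and every user is assumed to be either high risk or low risk.

```python
from collections import defaultdict


def absolute_deviation(h, l, h_all, l_all):
    # HYPOTHETICAL stand-in for R: the patent's preset formula is not
    # reproduced in the source text, so this is NOT the patented formula.
    return abs(h - h_all) + abs(l - l_all)


def build_keyword_dictionary(user_words, high_risk_users,
                             discrimination=absolute_deviation):
    """Return {word: rank}, rank 0 being the most important word.

    user_words: {user_id: iterable of words searched by that user}
    high_risk_users: set of user_ids labelled high risk (others are low risk)
    """
    users = list(user_words)
    h_all = sum(1 for u in users if u in high_risk_users) / len(users)  # H'
    l_all = 1.0 - h_all                                                 # L'

    searchers = defaultdict(int)        # T: number of users who searched the word
    high_searchers = defaultdict(int)   # high-risk users among those searchers
    for user, words in user_words.items():
        for word in set(words):         # count each user at most once per word
            searchers[word] += 1
            if user in high_risk_users:
                high_searchers[word] += 1

    importance = {}
    for word, t in searchers.items():
        h = high_searchers[word] / t    # H
        l = 1.0 - h                     # L
        importance[word] = t * discrimination(h, l, h_all, l_all)  # T * R

    # Descending importance; ties broken alphabetically (an assumption).
    ranked = sorted(importance, key=lambda w: (-importance[w], w))
    return {word: rank for rank, word in enumerate(ranked)}


user_words = {
    "u1": ["loan", "casino", "weather"],
    "u2": ["loan", "weather"],
    "u3": ["weather", "recipe"],
    "u4": ["weather"],
}
keyword_rank = build_keyword_dictionary(user_words, high_risk_users={"u1", "u2"})
print(keyword_rank)
```

In this toy data, "weather" is searched by everyone and so separates the two groups least, while "loan" is both frequent and searched only by high-risk users, which is the combination of coverage and discrimination that the product T·R is meant to reward.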

In the embodiment of the invention, after the construction of the keyword dictionary is completed, the keyword dictionary is traversed for each word to determine the position of that word in the keyword dictionary, and the words are sorted according to these positions to obtain the sorting result.
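Sorting words by their position in the keyword dictionary can be sketched as follows. The helper name `sort_by_dictionary` and the `{word: rank}` layout are hypothetical. The text does not say how words missing from the dictionary are handled; placing them last, in their original order, is an assumption of this sketch.

```python
def sort_by_dictionary(words, keyword_rank):
    """Order words by dictionary rank (0 = most important).

    Words absent from the dictionary are placed last; this handling is an
    assumption, since the source does not specify it.
    """
    unknown = len(keyword_rank)   # rank assigned to out-of-dictionary words
    # sorted() is stable, so equally ranked words keep their original order.
    return sorted(words, key=lambda w: keyword_rank.get(w, unknown))


rank = {"loan": 0, "casino": 1, "weather": 2}
print(sort_by_dictionary(["weather", "pizza", "loan", "casino"], rank))
```

Storing ranks in a hash map replaces the per-word traversal of the dictionary with a constant-time lookup, which gives the same ordering.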

S104, intercepting the sequencing result into at least one input text according to a preset length;

In the embodiment of the invention, the preset length is set based on experience or the specific application scenario. Preferably, the length of each sorting result in the initial sample is obtained and compared with the preset length. When the length is greater than the preset length, the sorting result is intercepted (that is, truncated) according to the preset length to form the input text; when the length is smaller than the preset length, blanks are added to the sorting result to pad it to the preset length, and the padded result is taken as the input text. There is at least one input text.
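The interception and padding in S104 can be sketched as below. The name `fit_to_length`, measuring length in words rather than characters, and using an empty string as the blank are all assumptions; the sketch also produces a single input text, whereas the method allows at least one.

```python
def fit_to_length(sorted_words, preset_length, pad_token=""):
    """Truncate or pad a sorted word list to exactly `preset_length` items."""
    if len(sorted_words) >= preset_length:
        # The most important words come first, so the head is kept.
        return sorted_words[:preset_length]
    padding = [pad_token] * (preset_length - len(sorted_words))
    return sorted_words + padding


print(fit_to_length(["loan", "casino", "weather", "pizza"], 3))
print(fit_to_length(["loan"], 3))
```

Because the list is already ordered by importance, cutting from the tail discards the least important words first.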

S105, training the risk recognition model by taking the at least one input text as a training sample to obtain a target risk recognition model, wherein the risk recognition model is constructed based on an Embedding layer and a Transformer structure.

In the embodiment of the invention, a risk identification model is pre-built based on an Embedding layer and a Transformer structure. The Embedding layer and the Transformer structure are taken from a target text prediction model, and the target text prediction model is obtained by training a text prediction model on a preset training corpus. A schematic diagram of the text prediction model is shown in FIG. 2. In the text prediction model, the Embedding layer maps characters to vectors of a fixed dimension; the MaskedMultiSelfAttention layer represents a self-attention mechanism; the LayerNorm layer represents normalization; the FeedForward layer represents two fully connected layers; and the text prediction layer represents one fully connected layer and a loss function.

The entire portion inside the dashed box is called the Transformer structure, and this structure is repeated 4 times. It is repeated four times for two reasons: more parameters become available for adjustment, and a multi-layer structure can capture information at higher levels of abstraction. For example, lower layers capture sentence length and word-meaning information, higher layers capture grammatical structure information, and still higher layers capture semantic information. The number of repetitions is not limited to four, but repeating too many times increases the amount of computation without markedly improving the effect, so four is a compromise between the amount of computation and the effect. An additional character is placed before the text at the text input layer to represent the information of the entire sentence. For a text input of a certain length, an output of the same length is produced before the text prediction layer; each position holds a vector representing the text information at the corresponding position, and the first position represents the information of the entire sentence.

The text prediction model is trained on the preset training corpus using a general neural network training method; the specific training process depends on the choice of loss function and is not described here. After training is completed, the target text prediction model is obtained, which is capable of representing the characteristics of a text, and prediction is performed based on the target text prediction model. For example, if the preset training corpus is "today's weather true good", some words or characters in the corpus are randomly replaced by a mask, so that the corpus may become "today's [M] weather true good" after random replacement. The replaced text is then input to the target text prediction model, and after word embedding (Embedding) and the 4 Transformer structures, the replaced content is predicted: in this example, the character "day" is predicted at the position corresponding to [M], and a classification loss is used.
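The random mask replacement described above can be sketched as follows. Only the data preparation is shown, not the model. The function name `mask_tokens`, the default masking ratio of 0.15 and the seeding are assumptions made for the illustration; the source states only that some words or characters are replaced at random by a mask.

```python
import random


def mask_tokens(tokens, mask_ratio=0.15, mask_token="[M]", seed=None):
    """Randomly replace some tokens with a mask.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token the model should predict there.
    The ratio 0.15 is an assumed default, not a value from the source.
    """
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]   # label for the classification loss
        masked[pos] = mask_token
    return masked, targets


tokens = ["today", "weather", "is", "really", "good", "indeed"]
masked, targets = mask_tokens(tokens, seed=0)
print(masked, targets)
```

Each entry in `targets` is one classification example: the model's output vector at that position is scored against the original token.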

The Embedding layer and the Transformer structure are acquired from the target text prediction model, a risk identification layer is added, and the risk identification model is constructed in the order of the Embedding layer, the Transformer structure and the risk identification layer. A structural block diagram of the risk identification model is shown in FIG. 3. In the risk identification model, the Embedding layer maps characters to vectors of a fixed dimension; the MaskedMultiSelfAttention layer represents a self-attention mechanism; the LayerNorm layer represents normalization; and the FeedForward layer represents two fully connected layers. The entire portion inside the dashed box is called the Transformer structure, and this structure is repeated 4 times. Preferably, the risk labels of the risk identification layer are high risk or low risk, so a binary classification label is output. The at least one input text is passed to the risk identification model as a training sample for training, so that the target risk identification model is obtained.

Further, the target risk identification model may also perform risk identification based on LSTM and TextCNN, and the risk identification layer in the target risk identification model may output labels in other forms; the embodiment of the invention does not limit the specific form of the label.

The invention discloses a risk identification model training method, which comprises: acquiring an initial sample; performing de-duplication processing on the search logs in the initial sample to obtain words; sorting the words by using a keyword dictionary to obtain a sorting result, wherein the keyword dictionary is pre-established and comprises a plurality of words, and the order of the words is set according to the importance degree of the words; intercepting the sorting result into at least one input text according to a preset length; and training the risk identification model by taking the at least one input text as a training sample to obtain a target risk identification model. In the training process, the training samples are obtained by de-duplicating the search logs, sorting the resulting words according to the keyword dictionary, and intercepting the sorting result to the preset length. Compared with the prior art, the training samples are shorter, which improves training efficiency; and because the words are sorted based on the keyword dictionary, the words with higher importance are retained even after interception, which preserves training accuracy.

Based on the target risk identification model, the embodiment of the present invention further provides a user risk identification method. The execution flow of the identification method is shown in FIG. 4 and includes the following steps:

S201, under the condition that a risk identification request for a current user is received, a target risk identification model is called, wherein the target risk identification model is obtained by training based on the training method;

in the embodiment of the invention, the target risk recognition model is invoked under the condition that the risk recognition request for the current user is received, wherein the target risk recognition model is obtained by training based on the training method, and the risk recognition is performed on the current user based on the target risk recognition model.

S202, acquiring a current search log of the current user, and performing deduplication processing on the current search log to obtain each current word;

in the embodiment of the present invention, a current search log corresponding to the current user is obtained based on the name, the number or other preferred identifiers of the current user, where the current search log is a log of the search of the current user, and the current log is subjected to deduplication processing to obtain each current word, where a process of deduplication processing is the same as a process described in S102, and is not described herein again.

S203, sorting the current words according to a keyword dictionary to obtain a current sorting result;

in the embodiment of the present invention, the sorting process is the same as that described in S103, and will not be described here again.

S204, intercepting the sequencing result into a current input text according to a preset length;

In the embodiment of the present invention, the intercepting process is the same as that described in S104, and will not be described here again.

S205, transmitting the current input text to the target risk recognition model to perform risk recognition.

In the embodiment of the invention, the current input text is transmitted to the risk recognition model for recognition, and whether the current user is a low risk user or a high risk user is determined.
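Steps S202 to S205 can be sketched end to end as below. This is an illustration under assumptions, not the patented implementation: `identify_user_risk` and `toy_model` are hypothetical names, whitespace splitting stands in for word segmentation, out-of-dictionary words are placed last, and the model is any callable that maps the input text to a label. The toy model is a trivial rule used only so the example runs; it is not a trained Transformer.

```python
def identify_user_risk(search_logs, keyword_rank, preset_length, model,
                       segment=str.split, pad_token=""):
    """Preprocess a user's current search logs and pass them to the model."""
    # S202: segment and de-duplicate (dict keys are hashed and keep order).
    words = list(dict.fromkeys(w for log in search_logs for w in segment(log)))
    # S203: sort by dictionary rank; unknown words go last (an assumption).
    unknown = len(keyword_rank)
    words.sort(key=lambda w: keyword_rank.get(w, unknown))
    # S204: intercept (truncate) or pad to the preset length.
    words = words[:preset_length]
    words += [pad_token] * (preset_length - len(words))
    # S205: hand the input text to the target risk identification model.
    return model(words)


def toy_model(tokens):
    # Hypothetical stand-in for the trained model, for demonstration only.
    return "high" if "loan" in tokens else "low"


rank = {"loan": 0, "weather": 1}
print(identify_user_risk(["weather today", "quick loan"], rank, 3, toy_model))
```

The same keyword dictionary and preset length used in training must be reused here, since the model was trained on inputs ordered and sized this way.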

The invention discloses a user risk identification method, which comprises: when a risk identification request for a current user is received, calling a target risk identification model, wherein the target risk identification model is obtained by training based on the above training method; obtaining a current search log of the current user, and performing de-duplication processing on the current search log to obtain current words; sorting the current words according to the keyword dictionary to obtain a current sorting result; intercepting the current sorting result into a current input text according to a preset length; and passing the current input text to the target risk identification model to perform risk identification. In this identification process, the current input text is obtained by de-duplicating the current search log, sorting the resulting words according to the keyword dictionary, and intercepting the sorting result to the preset length. Compared with the direct splicing used in the prior art, the current input text is shorter, which improves efficiency; and because the words are sorted based on the keyword dictionary, the words with higher importance are retained even after interception, which preserves identification accuracy.

Further, the risk identification method disclosed by the invention constructs the keyword dictionary by jointly using the frequency of words and the risk discrimination of words. Considering both the coverage of a word and its discriminative power lets the selected words better distinguish risky users. The keyword dictionary is used to extract key information from long text: important words are placed first, so that they are retained as far as possible when the text is cut off. The risk identification model is fine-tuned on the user search logs and risk labels to obtain the target risk identification model, and risk identification is performed based on the target risk identification model.

Based on the above-mentioned risk identification model training method, in the embodiment of the present invention, a training device for a risk identification model is further provided, where a structural block diagram of the training device is shown in fig. 5, and the training device includes:

an initial sample acquisition module 301, a first deduplication module 302, a first ordering module 303, a first interception module 304, and a training module 305.

Wherein:

the initial sample acquiring module 301 is configured to acquire an initial sample;

the first deduplication module 302 is configured to perform deduplication processing on the search logs in the initial sample, so as to obtain each word;

the first ranking module 303 is configured to rank the words by using a keyword dictionary, where the keyword dictionary is pre-established and includes a plurality of words, and the order of the words is set according to the importance degrees of the words;

the first intercepting module 304 is configured to intercept the ranking result into at least one input text according to a preset length;

the training module 305 is configured to train the risk recognition model with the at least one input text as a training sample to obtain a target risk recognition model, where the risk recognition model is constructed based on an Embedding layer and a Transformer structure.

The invention discloses a risk identification model training device, which is configured to: acquire an initial sample; perform de-duplication processing on the search logs in the initial sample to obtain words; sort the words by using a keyword dictionary to obtain a sorting result, wherein the keyword dictionary is pre-established and comprises a plurality of words, and the order of the words is set according to the importance degree of the words; intercept the sorting result into at least one input text according to a preset length; and train the risk identification model by taking the at least one input text as a training sample to obtain a target risk identification model. In the training process, the training samples are obtained by de-duplicating the search logs, sorting the resulting words according to the keyword dictionary, and intercepting the sorting result to the preset length. Compared with the prior art, the training samples are shorter, which improves training efficiency; and because the words are sorted based on the keyword dictionary, the words with higher importance are retained even after interception, which preserves training accuracy.

In the embodiment of the present invention, the keyword dictionary used by the first ranking module 303 is established by:

a splicing unit 306, a word segmentation unit 307, a calculation unit 308 and a ranking unit 309.

the splicing unit 306 is configured to splice all the search logs of each user to obtain a spliced text;

the word segmentation unit 307 is configured to segment the spliced text to obtain each word;

the calculating unit 308 is configured to calculate a high-low risk discrimination degree and a frequency of occurrence corresponding to each word, and take a product of the high-low risk discrimination degree and the frequency as an importance value of the word;

the ranking unit 309 is configured to rank the words based on importance values, to obtain the keyword dictionary.
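The work of the four units can be sketched as follows. This is only an illustration: the name `build_keyword_dictionary` is hypothetical, whitespace splitting stands in for word segmentation, and the high-low risk discrimination degree is taken as a caller-supplied function because its formula is not given in this passage.

```python
from collections import Counter


def build_keyword_dictionary(user_logs, discrimination):
    """user_logs: {user_id: [search log, ...]}; discrimination: word -> R value."""
    freq = Counter()
    for logs in user_logs.values():
        spliced = " ".join(logs)        # splice all search logs of one user
        freq.update(spliced.split())    # assumption: whitespace segmentation
    # Importance value = discrimination degree * frequency of occurrence.
    importance = {w: discrimination(w) * t for w, t in freq.items()}
    # Descending importance; ties broken alphabetically (an assumption).
    ranked = sorted(importance, key=lambda w: (-importance[w], w))
    return {w: rank for rank, w in enumerate(ranked)}


# Hypothetical discrimination values, for illustration only.
r_values = {"loan": 0.5, "overdue": 2.0, "weather": 0.1}
user_logs = {"u1": ["loan overdue", "loan"], "u2": ["weather loan"]}
dictionary = build_keyword_dictionary(user_logs, r_values.get)
```

Here "overdue" outranks "loan" even though it occurs less often, because its assumed discrimination value is higher.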

In the embodiment of the present invention, the training module 305, which constructs the risk identification model based on the Embedding layer and the Transformer structure, includes:

a training unit 310, an acquisition unit 311, and a construction unit 312.

the training unit 310 is configured to train the text prediction model based on a preset training corpus to obtain a target text prediction model, where the target text prediction model includes: the Embedding layer and the Transformer structure;

the acquisition unit 311 is configured to acquire the Embedding layer and the Transformer structure when training is completed;

the construction unit 312 is configured to add a risk identification layer, and construct the risk identification model based on the order of the Embedding layer, the Transformer structure, and the risk identification layer.

Based on the above-mentioned user risk identification method, in the embodiment of the present invention, there is further provided a user risk identification device, where a structural block diagram of the identification device is shown in fig. 6, and the identification device includes:

an invoking module 401, a second deduplication module 402, a second ordering module 403, a second intercepting module 404, and a recognition module 405.

the invoking module 401 is configured to invoke a target risk recognition model when a risk recognition request for a current user is received, where the target risk recognition model is obtained by training based on the training method;

the second deduplication module 402 is configured to obtain a current search log of the current user, and perform deduplication processing on the current search log to obtain each current word;

the second ordering module 403 is configured to order the current words according to a keyword dictionary, to obtain a current ordering result;

the second intercepting module 404 is configured to intercept the sorting result into a current input text according to a preset length;

the recognition module 405 is configured to transmit the current input text to the target risk recognition model for risk recognition.

The invention discloses a user risk identification device, which: invokes a target risk identification model when a risk identification request for a current user is received, wherein the target risk identification model is obtained by training based on the training method; obtains a current search log of the current user, and performs deduplication processing on the current search log to obtain each current word; sorts the current words according to the keyword dictionary to obtain a current sorting result; intercepts the current sorting result into a current input text according to a preset length; and transmits the current input text to the target risk recognition model to perform risk recognition. In the above identification process, the current input text is obtained by deduplicating the current search log, sorting the resulting words according to the keyword dictionary, and intercepting the sorting result to a preset length. Compared with the direct splicing mode in the prior art, this shortens the current input text and improves identification efficiency. Because the words are sorted based on the keyword dictionary, the words with higher importance are retained even after interception, so the accuracy of identification is also ensured.
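The identification flow applies the same preprocessing as training before the model is called. The sketch below illustrates this under stated assumptions: `identify_risk` and `stub_model` are hypothetical names, the stub is a stand-in for the trained target risk recognition model, and whitespace splitting replaces real word segmentation.

```python
def identify_risk(current_logs, keyword_rank, preset_len, model):
    """Prepare the current input text as in training, then call the model."""
    # dict.fromkeys deduplicates via hashing while keeping first-seen order.
    words = list(dict.fromkeys(
        w for log in current_logs for w in log.split()  # assumption: whitespace segmentation
    ))
    last = len(keyword_rank)  # assumption: words missing from the dictionary sort last
    words.sort(key=lambda w: keyword_rank.get(w, last))
    return model(words[:preset_len])


def stub_model(words):
    # Hypothetical stand-in for the trained target risk recognition model.
    return "high" if "overdue" in words else "low"


keyword_rank = {"overdue": 0, "loan": 1, "weather": 2}
label = identify_risk(["weather loan", "loan overdue"], keyword_rank, 2, stub_model)
```

Using the same keyword dictionary and preset length at identification time as at training time keeps the model's inputs in the form it was trained on.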

It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Because the device embodiments are substantially similar to the method embodiments, their description is relatively brief, and reference is made to the description of the method embodiments for relevant points.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

For convenience of description, the above devices are described as being divided by function into various units. Of course, when implementing the present invention, the functions of the units may be implemented in one or more pieces of software and/or hardware.

From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present invention.

The risk recognition model training method, the user risk recognition method and the related devices have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present invention, and the above description of the embodiments is only intended to help understand the method and core ideas of the present invention. Meanwhile, those skilled in the art may, in accordance with the ideas of the present invention, make changes to the specific embodiments and the scope of application. In view of the above, the content of this specification should not be construed as limiting the present invention.

Claims

1. A risk identification model training method, comprising:

acquiring an initial sample;

performing de-duplication processing on the search logs in the initial sample to obtain each word, including: performing word segmentation on the search logs to obtain initial words corresponding to the search logs; and removing repeated words from the initial words by hashing;

sorting the words by using a keyword dictionary to obtain a sorting result; the sorting result is obtained by sorting the words in descending order of word importance; the keyword dictionary is pre-established and comprises a plurality of words, and the order of the words is set according to the importance degree of the words;

the establishment process of the keyword dictionary comprises the following steps: splicing all search logs of each user to obtain spliced texts; performing word segmentation on the spliced text to obtain each word; traversing the search logs of all users, and counting the word frequency T of each word; counting the proportion H of high-risk users and the proportion L of low-risk users among the users who searched the word; acquiring the proportion H' of high-risk users and the proportion L' of low-risk users among all users, and calculating a high-low risk discrimination based on a preset formula R = [formula not reproduced], wherein R represents the high-low risk discrimination; obtaining the importance of the word based on a word importance calculation formula R × T; sorting all words in descending order of word importance, and constructing the keyword dictionary;

training the risk recognition model by taking the at least one input text as a training sample to obtain a target risk recognition model, wherein the risk recognition model is constructed based on an Embedding layer and a Transformer structure;

the constructing a risk identification model based on the Embedding layer and the Transformer structure comprises the following steps:

training a text prediction model based on a preset training corpus to obtain a target text prediction model, wherein the target text prediction model comprises: the Embedding layer and the Transformer structure; when training is completed, acquiring the Embedding layer and the Transformer structure; and adding a risk identification layer, and constructing the risk identification model based on the sequence of the Embedding layer, the Transformer structure and the risk identification layer.

2. The method as recited in claim 1, further comprising:

acquiring the length of the sequencing result;

3. A method for identifying risk of a user, comprising:

under the condition that a risk identification request for a current user is received, a target risk identification model is called, wherein the target risk identification model is obtained by training based on the training method according to any one of claims 1-2;

obtaining a current search log of the current user, and performing deduplication processing on the current search log to obtain each current word, including: performing word segmentation on the search logs to obtain initial words corresponding to the search logs; and removing repeated words from the initial words by hashing;

sorting the current words according to the keyword dictionary to obtain a current sorting result; the sorting result is obtained by sorting the words in descending order of word importance; the keyword dictionary is pre-established and comprises a plurality of words, and the order of the words is set according to the importance degree of the words;

the establishment process of the keyword dictionary comprises the following steps: splicing all search logs of each user to obtain spliced texts; performing word segmentation on the spliced text to obtain each word; traversing the search logs of all users, and counting the word frequency T of each word; counting the proportion H of high-risk users and the proportion L of low-risk users among the users who searched the word; acquiring the proportion H' of high-risk users and the proportion L' of low-risk users among all users, and based on a preset formula R=

4. A risk identification model training device, comprising:

the initial sample acquisition module is used for acquiring an initial sample;

the first deduplication module is configured to perform deduplication processing on the search logs in the initial sample to obtain each word, including: performing word segmentation on the search logs to obtain initial words corresponding to the search logs; and removing repeated words from the initial words by hashing;

the first ordering module is used for sorting the words by using a keyword dictionary to obtain a sorting result, wherein the sorting result is obtained by sorting the words in descending order of word importance, the keyword dictionary is pre-established and comprises a plurality of words, and the order of the words is set according to the importance degree of the words;

the keyword dictionary in the first ordering module is established by: a splicing unit, a word segmentation unit, a calculation unit and a sequencing unit;

the calculation unit is used for traversing the search logs of all users and counting the word frequency T of each word; counting the proportion H of high-risk users and the proportion L of low-risk users among the users who searched the word; acquiring the proportion H' of high-risk users and the proportion L' of low-risk users among all users, and calculating a high-low risk discrimination based on a preset formula R = [formula not reproduced], wherein R represents the high-low risk discrimination; and obtaining the importance of the word based on a word importance calculation formula R × T;

the sequencing unit is used for sorting the words in descending order of word importance, and constructing the keyword dictionary;

the training module is used for training the risk recognition model by taking the at least one input text as a training sample to obtain a target risk recognition model, wherein the risk recognition model is constructed based on an Embedding layer and a Transformer structure;

the training module comprises: a training unit, an acquisition unit and a construction unit;

the training unit is configured to train the text prediction model based on a preset training corpus to obtain a target text prediction model, where the target text prediction model includes: the Embedding layer and the Transformer structure;

the acquisition unit is used for acquiring the Embedding layer and the Transformer structure when training is completed;

the construction unit is used for adding a risk identification layer and constructing the risk identification model based on the sequence of the Embedding layer, the Transformer structure and the risk identification layer.

5. A user risk identification device, comprising:

the invoking module is used for invoking a target risk recognition model under the condition of receiving a risk recognition request for a current user, wherein the target risk recognition model is obtained by training based on the training method according to any one of claims 1-2;

the second deduplication module is configured to obtain a current search log of the current user, and perform deduplication processing on the current search log to obtain each current word, including: performing word segmentation on the search logs to obtain initial words corresponding to the search logs; and removing repeated words from the initial words by hashing;

the second ordering module is used for sorting the current words according to the keyword dictionary to obtain a current sorting result; the sorting result is obtained by sorting the words in descending order of word importance; the keyword dictionary is pre-established and comprises a plurality of words, and the order of the words is set according to the importance degree of the words;