CN113220999A - User feature generation method and device, electronic equipment and storage medium - Google Patents

User feature generation method and device, electronic equipment and storage medium

Info

Publication number
CN113220999A
CN113220999A
Authority
CN
China
Prior art keywords
participle
topic
user
determining
participles
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110529089.0A
Other languages
Chinese (zh)
Inventor
李原
杨德将
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110529089.0A
Publication of CN113220999A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q 10/063 Operations research, analysis or management
    • G06Q 10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/03 Credit; Loans; Processing thereof

Abstract

The disclosure provides a user feature generation method and device, electronic equipment and a storage medium, and relates to the field of computer technology, in particular to artificial intelligence fields such as natural language processing and deep learning. The specific implementation scheme is as follows: acquiring first historical text data corresponding to a target user; parsing the first historical text data to determine a first word segmentation set corresponding to the target user; determining the number of participles under each topic contained in the first participle set according to the matching degree between each participle in the first participle set and each participle corresponding to each topic; and determining the user characteristics corresponding to the target user according to the number of participles under each topic contained in the first participle set. In this way, the user characteristics are determined based on the number of participles of the target user under each topic, which improves the accuracy of the obtained user characteristics.

Description

User feature generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology, in particular to artificial intelligence fields such as natural language processing and deep learning, and more particularly to a user feature generation method and apparatus, a model training method and apparatus, an electronic device, and a storage medium.
Background
With the continuous development of internet technology, many internet-based products and services have emerged. To improve service quality and user experience, feature analysis can be performed on users, and personalized, accurate services can be provided based on the user characteristics.
Therefore, how to improve the accuracy of the obtained user features is an urgent problem to be solved.
Disclosure of Invention
The disclosure provides a user feature generation method and device, a model training method and device, electronic equipment and a storage medium.
According to an aspect of the present disclosure, there is provided a method for generating a user feature, including:
acquiring first historical text data corresponding to a target user;
analyzing the first historical text data to determine a first word segmentation set corresponding to the target user;
determining the number of the participles under each topic contained in the first participle set according to the matching degree between each participle in the first participle set and each participle corresponding to each topic;
and determining the user characteristics corresponding to the target user according to the number of the participles under each topic contained in the first participle set.
According to another aspect of the present disclosure, there is provided a model training method, including:
acquiring a training data set, wherein the training data set comprises a plurality of historical text data respectively corresponding to a plurality of users;
analyzing a plurality of historical text data corresponding to each user respectively to determine a word segmentation set corresponding to each user;
determining the participles under each topic contained in the participle set corresponding to each user and the labeling risk level corresponding to each user;
inputting the participles under each theme and the corresponding themes in the participle set corresponding to each user into an initial neural network model to obtain the predicted risk level output by the initial neural network model;
and correcting the initial neural network model according to the difference between the predicted risk level and the labeled risk level to generate a wind control model.
According to another aspect of the present disclosure, there is provided a user feature generation apparatus, including:
the first acquisition module is used for acquiring first historical text data corresponding to a target user;
the first analysis module is used for analyzing the first historical text data to determine a first word segmentation set corresponding to the target user;
the first determining module is used for determining the number of the participles under each topic contained in the first participle set according to the matching degree between each participle in the first participle set and each participle corresponding to each topic;
and the second determining module is used for determining the user characteristics corresponding to the target user according to the number of the participles under each topic contained in the first participle set.
According to another aspect of the present disclosure, there is provided a model training apparatus including:
the second acquisition module is used for acquiring a training data set, wherein the training data set comprises a plurality of historical text data respectively corresponding to a plurality of users;
the second analysis module is used for respectively analyzing a plurality of historical text data corresponding to each user so as to determine a word segmentation set corresponding to each user;
an eighth determining module, configured to determine the participles under each topic included in the participle set corresponding to each user, and the labeled risk level corresponding to each user;
the second training module is used for inputting the participles under each theme and the corresponding themes in the participle set corresponding to each user into the initial neural network model so as to obtain the predicted risk level output by the initial neural network model; and correcting the initial neural network model according to the difference between the predicted risk level and the labeled risk level to generate a wind control model.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any of the embodiments described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to any of the above embodiments.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method according to any of the embodiments described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flowchart of a method for generating user characteristics according to an embodiment of the present disclosure;
fig. 2 is a schematic flow chart of another method for generating user characteristics according to an embodiment of the present disclosure;
fig. 3 is a schematic flow chart of another method for generating user characteristics according to an embodiment of the present disclosure;
fig. 4 is a schematic flow chart of another method for generating user characteristics according to the embodiment of the present disclosure;
fig. 5 is a schematic flow chart of another method for generating user characteristics according to an embodiment of the present disclosure;
FIG. 6 is a schematic flow chart diagram illustrating a model training method according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an apparatus for generating user characteristics according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure;
FIG. 9 is a block diagram of an electronic device used to implement methods of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
A user feature generation method and apparatus, an electronic device, and a storage medium according to embodiments of the present disclosure are described below with reference to the drawings.
Artificial intelligence is the discipline that studies the use of computers to simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), covering both hardware and software technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies include computer vision, speech recognition, natural language processing, deep learning, big data processing, knowledge graph technology, and the like.
NLP (Natural Language Processing) is an important direction in the fields of computer science and artificial intelligence, and the content of NLP research includes but is not limited to the following branch fields: text classification, information extraction, automatic summarization, intelligent question answering, topic recommendation, machine translation, subject word recognition, knowledge base construction, deep text representation, named entity recognition, text generation, text analysis (lexical, syntactic, grammatical, etc.), speech recognition and synthesis, and the like.
Deep learning is a new research direction in the field of machine learning. It learns the intrinsic laws and representation levels of sample data, and the information obtained in the learning process is very helpful for interpreting data such as text, images, and sounds. Its ultimate goal is to enable machines to have human-like analysis and learning capabilities, recognizing data such as text, images, and sounds.
Fig. 1 is a schematic flow chart of a method for generating a user characteristic according to an embodiment of the present disclosure.
As shown in fig. 1, the method for generating user characteristics includes:
step 101, obtaining first historical text data corresponding to a target user.
In the present disclosure, historical text data of a certain user within a past preset time period may be acquired, and for convenience of distinction, the user may be referred to as a target user, and the acquired historical text data may be referred to as first historical text data. The first historical text data may be web page content browsed by the target user, browsed video content and the like.
In practical application, historical text data within a preset time length before the operation time can be acquired based on the time of a certain operation performed by a target user. For example, if the user a initiates a credit request at 16/20/2/2020, historical text data of the user a, such as browsed web page contents, online shopping situation, browsed video contents, etc., within 15 days before the time can be acquired.
In the present disclosure, the acquisition, storage, application, and the like of the personal information of the user are all in compliance with the regulations of the relevant laws and regulations, and do not violate the good customs of the public order.
Step 102, analyzing the first historical text data to determine a first word segmentation set corresponding to the target user.
In the present disclosure, the first historical text data may be analyzed, for example, exclamation words, auxiliary words, etc. are removed, and word segmentation processing and de-duplication processing are performed to obtain a plurality of participles, which form a participle set, and are referred to as a first participle set for convenience of distinction.
In practical applications, a plurality of historical text data of the target user may be acquired, and each piece of historical text data may be parsed to obtain a word segmentation set for it. Then, the multiple word segmentation sets may be merged in order of the generation time of the historical text data and de-duplicated to obtain the first word segmentation set. Alternatively, the plurality of historical text data may be sorted chronologically and integrated into one piece of historical text data, that is, the first historical text data may be obtained by integrating the plurality of historical text data; parsing then yields a plurality of participles, which form the first participle set.
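As a sketch of this parsing-and-merging step, assuming whitespace-separated tokens and a tiny stopword list (a real system would use a proper Chinese word segmenter rather than `split()`), the logic might look like:

```python
def build_word_set(texts, stopwords):
    """Parse several pieces of historical text data into one de-duplicated
    word segmentation set, preserving first-seen order. Simplified sketch:
    tokenization is whitespace splitting, not real word segmentation."""
    seen, word_set = set(), []
    for text in texts:  # texts assumed sorted by generation time
        for token in text.split():
            if token in stopwords:   # drop auxiliary/exclamation words
                continue
            if token not in seen:    # de-duplicate across all texts
                seen.add(token)
                word_set.append(token)
    return word_set

words = build_word_set(["hero jungler the match", "ticket the hotel match"],
                       stopwords={"the"})
```

The same function covers both variants described above: merging per-text word sets in time order is equivalent to concatenating the texts chronologically before segmenting, as long as de-duplication keeps the first occurrence.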
And 103, determining the number of the participles under each topic contained in the first participle set according to the matching degree between each participle in the first participle set and each participle corresponding to each topic.
In the present disclosure, a plurality of topics and each participle corresponding to each topic may be obtained in advance. The multiple topics and the respective segmentation words corresponding to each topic acquired here may be obtained by manually classifying the segmentation words in multiple documents.
For example, two topics "game" and "travel" are obtained, where the participles corresponding to the topic "game" include [Honor of Kings, jungler, support, lane, new hero], the participles corresponding to the topic "travel" include [scenic spot, weather, ticket, departure place, destination], and so on.
It should be noted that the word segments corresponding to the subject matters in the above examples are only examples, and should not be construed as limiting the disclosure.
After the first word segmentation set is obtained, each word segmentation in the first word segmentation set can be matched with each word segmentation corresponding to each topic, so that the word segmentation under each topic contained in the first word segmentation set is determined according to the matching degree between each word segmentation in the first word segmentation set and each word segmentation corresponding to each topic, and the number of the word segmentation under each topic contained in the first word segmentation set is determined.
When matching is carried out, the distance between the word vectors corresponding to the two participles can be calculated, and the matching degree of the two participles is measured by using the distance. Wherein, the smaller the distance, the higher the matching degree, and the larger the distance, the lower the matching degree.
For example, the participle p1 in the first participle set is matched with each participle in the topic a, and if the matching degree of the participle p1 with a participle in the topic a is greater than a preset matching degree threshold, it can be considered that the participle under the topic is included in the first participle set.
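The distance-based matching described above can be sketched as follows. Cosine similarity is used here as the matching degree (higher similarity corresponds to a smaller distance); the vectors and the 0.8 threshold are illustrative assumptions, not values from the disclosure:

```python
import math

def cosine_similarity(u, v):
    # Vectors are assumed nonzero.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def matches_topic(word_vec, topic_vecs, threshold=0.8):
    """A participle counts as falling under a topic if its vector is close
    enough to any of the topic's participle vectors, i.e. if the matching
    degree with some participle of the topic exceeds the threshold."""
    return any(cosine_similarity(word_vec, tv) >= threshold for tv in topic_vecs)
```

For example, `matches_topic(v_p1, topic_a_vectors)` reproduces the check on participle p1 against topic a described above.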
And 104, determining the user characteristics corresponding to the target user according to the number of the participles under each topic contained in the first participle set.
In the present disclosure, the number of participles under each topic contained in the first participle set may be compared with a preset number, and the topics whose participle number is greater than the preset number may be used as the user characteristics corresponding to the target user. The preset number can be determined according to actual needs.
For example, suppose the preset number is 0 and there are 8 topics; if the first participle set contains participles under 5 of those topics, these 5 topics can be used as the user characteristics corresponding to the target user.
Alternatively, the one or more topics with the largest number of participles contained in the first participle set may be used as the user characteristics corresponding to the target user. For example, if the numbers of participles under topic a and topic b contained in the first participle set are the largest, topic a and topic b may be used as the user characteristics corresponding to the target user.
Or, the number of the participles in each topic included in the first participle set can be directly used as the user characteristics corresponding to the target user.
For example, there are 5 topics, and the number of the participles under each topic included in the first participle set is 6, 5, 4, 0, and 0, respectively, so that the number of the participles corresponding to the 5 topics can be the user characteristics corresponding to the target user.
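A minimal sketch of turning per-topic participle counts into a feature vector, using exact string membership in place of the vector-distance matching described earlier (the topic names and words are made up for illustration):

```python
def topic_count_features(word_set, topic_words):
    """Count how many of the user's participles fall under each topic.
    Exact-match sketch; the disclosure instead matches by word-vector
    distance against each topic's participles."""
    return {topic: sum(1 for w in word_set if w in words)
            for topic, words in topic_words.items()}

feats = topic_count_features(
    ["jungler", "ticket", "hotel", "weather"],
    {"game": {"jungler", "new hero"},
     "travel": {"ticket", "weather", "destination"}})
```

The resulting count dictionary can be used directly as the user characteristics, or thresholded as in the preceding examples.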
In the embodiment of the disclosure, a first word segmentation set corresponding to a target user is determined by analyzing first historical text data corresponding to the target user, each word segmentation in the first word segmentation set is respectively matched with each word segmentation under each topic to determine the number of words segmentation under each topic contained in the first word segmentation set, and a user characteristic corresponding to the target user is determined according to the number of words segmentation under each topic contained in the first word segmentation set, so that the user characteristic is determined based on the number of words segmentation under each topic of the target user, and the accuracy of the obtained user characteristic is improved.
In an embodiment of the disclosure, after the user characteristics are obtained, the promotion information can be pushed to the target user based on the user characteristics, so that the pushing accuracy of the promotion information can be improved. Fig. 2 is a schematic flow chart of another user feature generation method provided in the embodiment of the present disclosure.
As shown in fig. 2, the method for generating the user characteristics includes:
step 201, obtaining first historical text data corresponding to a target user.
Step 202, analyzing the first historical text data to determine a first word segmentation set corresponding to the target user.
Step 203, determining the number of the participles under each topic contained in the first participle set according to the matching degree between each participle in the first participle set and each participle corresponding to each topic.
And 204, determining the user characteristics corresponding to the target user according to the number of the participles under each topic contained in the first participle set.
In the present disclosure, steps 201 to 204 are similar to steps 101 to 104, and thus are not described herein again.
Step 205, determining the association degree between the user characteristics and each information to be promoted.
In the present disclosure, the association degree between the user characteristics and each piece of information to be promoted can be calculated. The information to be promoted can be advertisements, videos, news, and the like.
For example, if the user features are game and travel, the association degree between each piece of information to be promoted and game, and the association degree between each piece of information to be promoted and travel, can be calculated.
And step 206, determining target popularization information according to each association degree.
After the association degree between the user characteristics and each piece of information to be promoted is obtained, the information to be promoted, of which the association degree is greater than a preset association degree threshold value, can be used as the target promotion information.
It is understood that the target promotion information may be one or more.
And step 207, pushing the target popularization information to the target user.
After the target popularization information is obtained, the target popularization information can be pushed to the target user through the client used by the target user.
For example, if the user characteristics corresponding to the target user include a game, news, software, and the like related to the game may be recommended to the user.
In the embodiment of the disclosure, after the user characteristics corresponding to the target user are determined, the association degrees between the user characteristics and each piece of information to be promoted can be determined, the target promotion information is determined according to each association degree, and the target promotion information is pushed to the target user. Therefore, the promotion information is pushed to the target user based on the user characteristics, and therefore the promotion information pushing accuracy can be improved.
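A sketch of steps 205 to 207: given precomputed association scores between each candidate item and each user feature (how those scores are computed is left open by the description, so they are plain inputs here), the items whose best score exceeds the threshold are selected as target promotion information:

```python
def select_promotions(user_features, candidates, threshold=0.5):
    """Pick the to-be-promoted items whose association degree with any
    user feature exceeds the threshold. `candidates` maps each item to a
    dict of feature -> association degree (illustrative values)."""
    targets = []
    for item, scores in candidates.items():
        if max(scores.get(f, 0.0) for f in user_features) > threshold:
            targets.append(item)
    return targets

picked = select_promotions(
    ["game", "travel"],
    {"new-hero news": {"game": 0.9},
     "tax guide": {"game": 0.1, "travel": 0.2}})
```

One or more items may pass the threshold, matching the note that the target promotion information may be one or more pieces.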
In an embodiment of the present disclosure, the information to be promoted may also be pushed to the target user after the number of the participles under each topic included in the first participle set is obtained. For example, it is determined that the number of the participles under the topic b included in the first participle set is the largest, the target promotion information can be determined based on the topic b, and the target promotion information is pushed to the target user.
In one embodiment of the present disclosure, after determining the number of segments under each topic contained in the first set of segments, it may also be determined whether the target user is a target risk user. Fig. 3 is a schematic flow chart of another method for generating user characteristics according to the embodiment of the present disclosure.
As shown in fig. 3, the method for generating user characteristics includes:
step 301, obtaining first historical text data corresponding to a target user.
Step 302, analyzing the first historical text data to determine a first word segmentation set corresponding to the target user.
In the present disclosure, steps 301 to 302 are similar to steps 101 to 102, and therefore are not described herein again.
Step 303, determining the number of the participles in the specified type topic contained in the first participle set, wherein the specified type topic is related to the target risk.
After the number of the participles under each topic contained in the first participle set is obtained, the number of the participles under the specified type topic contained in the first participle set can be determined. Wherein the specified type topic is associated with the target risk.
For example, if the specified type topic is "installment", the topic is associated with the risk of over-consumption (spending beyond one's means), and the number of participles under the installment topic contained in the first participle set can be determined. As another example, the target risk may also be the risk of overdue repayment, and so on.
It is noted that the present disclosure is not limited to the specific types of subject matter and target risks.
And 304, determining that the target user is a user with a target risk under the condition that the number of the participles under the specified type of subjects contained in the first participle set is greater than a preset threshold value.
In this disclosure, if the number of the participles in the specified type topic included in the first participle set is greater than the preset threshold, it may be determined that the target user is a user with a target risk.
For example, if the number of the participles in the installment subject included in the first participle set is 20, which is greater than the preset threshold 10, the target user may be considered as a user with a high consumption risk.
For another example, taking a credit wind control scenario as an example, if the number of the participles under the credit topic included in the participle set corresponding to a certain user is greater than a preset threshold, it may be considered that the user has a risk of being overdue for repayment, and then the user may be denied to provide a corresponding service.
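The threshold check of steps 303 and 304 can be sketched as below; the risk-topic word list and the threshold are illustrative stand-ins:

```python
def has_target_risk(word_set, risk_topic_words, threshold=10):
    """Flag the target user as having the target risk when the number of
    participles under the specified (risk-related) topic exceeds a preset
    threshold. Returns (flag, count) so the count can also be inspected."""
    count = sum(1 for w in word_set if w in risk_topic_words)
    return count > threshold, count
```

In a credit wind-control scenario, the flag returned here would feed the decision of whether to provide the corresponding service to the user.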
In the embodiment of the disclosure, after determining the number of the participles under each topic included in the first participle set, the number of the participles under a specified type topic included in the first participle set may also be determined, where the specified type topic is related to the target risk, and in a case that the number of the participles under the specified type topic included in the first participle set is greater than a preset threshold, the target user is determined to be a user with the target risk. Therefore, whether the user is a target risk user can be identified according to the number of the participles under the specified type of subjects contained in the first participle set, and whether certain services are provided for the target user can be determined.
In one embodiment of the disclosure, a plurality of topics and each participle corresponding to the topics can be determined in a clustering manner. Fig. 4 is a schematic flow chart of another user feature generation method provided in the embodiment of the present disclosure.
As shown in fig. 4, the method for generating user characteristics may further include:
step 401, a plurality of second historical text data corresponding to a plurality of users are obtained.
Step 402, analyzing the plurality of second historical text data corresponding to each user respectively to determine a second word set corresponding to each user.
In the present disclosure, the manner of obtaining the second word set is similar to the manner of obtaining the first word set, and therefore, the details are not repeated herein.
Step 403, clustering the participles in the second participle sets corresponding to the users to obtain a plurality of topic word libraries.
In this disclosure, each user corresponds to one second term set, and an LDA (Latent Dirichlet Allocation) model may be used to cluster a plurality of second term sets. After a plurality of second word subsets are obtained, the number of topics obtained by clustering can be set, the second word subsets corresponding to each user are input into the initial LDA model, and the initial LDA model is trained.
When the LDA model converges, a participle probability distribution may be obtained, where the participle probability distribution includes a probability that each participle belongs to each topic, and then the participles in the plurality of second participle sets may be clustered according to the participle probability distribution, for example, for each topic, a participle with a probability greater than a preset probability may be used as a participle corresponding to the topic, so that a plurality of topic word libraries may be obtained. Wherein, each topic word stock comprises one or more participles.
For example, when the LDA model is trained with the number of topics set to m, m topic word libraries topic_1, topic_2, ..., topic_m-1, topic_m can be determined based on the participle probability distribution when the model converges.
For example, topic_7: [Honor of Kings, jungler, support, new hero]; topic_10: [credit card, card application, credit limit, installment, finance, repayment]; topic_11: [scenic spot, weather, train ticket, bus ticket, airplane ticket, departure place, destination].
The above illustrates the participles contained in the three topic word libraries topic_7, topic_10 and topic_11, which may equally be regarded as the participles corresponding to the three topics topic_7, topic_10 and topic_11 respectively; only part of each topic word library is shown.
And step 404, determining the topics respectively corresponding to each topic word bank according to the matching degree between each participle in each topic word bank and a preset topic.
Because the topic corresponding to each topic word library is uncertain when obtained in the above manner (for example, topic_11 contains scenic spot, weather, train ticket, bus ticket, airplane ticket, departure place and destination, yet the topic of topic_11 itself is unknown), in the present disclosure the matching degree between each participle in each topic word library and a preset topic can be calculated, and the topic corresponding to each topic word library can be determined according to those matching degrees.
In the present disclosure, a plurality of preset topics may be set, and the matching degree between each participle in each topic word library and each preset topic may be calculated. If the matching degree between the participles in a topic word library and a certain preset topic is greater than a preset matching-degree threshold, the topic corresponding to that topic word library may be determined to be that preset topic. In this way, the topic corresponding to each topic word library can be determined, yielding a plurality of topics and the participles corresponding to each topic. For example, the topic corresponding to the topic word library topic_11 may be determined as "travel".
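Step 404 can be illustrated under one assumed definition of "matching degree": the fraction of a word library's participles that appear in a preset topic's keyword list. The preset topics, keyword lists, and threshold below are all hypothetical.

```python
# Label each topic word library with the preset topic it matches most
# strongly, where "matching degree" is taken here as keyword overlap.
PRESET_TOPICS = {
    "travel": {"train", "ticket", "weather", "destination",
               "departure", "scenic", "spot", "bus", "airplane"},
    "finance": {"credit", "loan", "limit", "installment",
                "repayment", "card"},
}
MATCH_THRESHOLD = 0.5  # assumed preset matching-degree threshold

def label_topic(word_library):
    for topic, keywords in PRESET_TOPICS.items():
        degree = sum(w in keywords for w in word_library) / len(word_library)
        if degree > MATCH_THRESHOLD:
            return topic
    return None  # no preset topic matched strongly enough

topic_11 = ["scenic", "spot", "weather", "train",
            "ticket", "departure", "destination"]
print(label_topic(topic_11))  # prints "travel"
```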
The user feature generation method of the embodiments of the disclosure can be widely applied to modeling and feature-development projects. For example, in a joint modeling project, credit data of a plurality of users can be acquired, such as each user's borrowing time. Historical text data related to the users within a preset time before the borrowing time, such as browsed webpage content and video content, can then be acquired, and based on this historical text data a plurality of topic word libraries and the topic corresponding to each topic word library are determined.
In this disclosure, the plurality of second historical text data corresponding to the plurality of users may be obtained and analyzed to determine the second participle set corresponding to each user; the participles in the plurality of second participle sets are clustered to obtain a plurality of topic word libraries, and the topic corresponding to each topic word library is determined according to the matching degree between the participles in that library and the preset topics. Therefore, a plurality of topic word libraries and their corresponding topics can be obtained from the plurality of second historical text data of the plurality of users, so that user features can be determined using the plurality of topics and the participles corresponding to each topic.
In an embodiment of the disclosure, after the topic corresponding to each topic word library is determined, a wind control model may be obtained through training based on the plurality of second participle sets corresponding to the plurality of users. Fig. 5 is a schematic flow chart of another user feature generation method according to an embodiment of the present disclosure.
As shown in fig. 5, after the topic corresponding to each topic word library is determined, the method further includes:
step 501, determining the participles under each topic included in each second participle set according to the matching degree between each participle in each second participle set and each participle corresponding to each topic.
In this disclosure, the method for determining the participles under each topic included in the second participle set is similar to the above-mentioned method for determining the participles under each topic included in the first participle set, and therefore, the details are not repeated herein.
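One plausible reading of this matching step is that a participle in a user's set is counted "under" a topic when it exactly matches a word in that topic's word library; fuzzier matching degrees are equally possible. The topic libraries below are illustrative.

```python
# Assign each participle in a user's participle set to the topics whose
# word libraries contain it, producing the "participles under each topic".
TOPIC_LIBRARIES = {  # hypothetical topic word libraries
    "finance": {"credit", "loan", "limit", "installment", "repayment"},
    "travel": {"train", "ticket", "weather", "destination"},
}

def participles_under_each_topic(participle_set):
    return {
        topic: sorted(set(participle_set) & library)
        for topic, library in TOPIC_LIBRARIES.items()
    }

user_set = ["credit", "limit", "train", "ticket", "news"]
print(participles_under_each_topic(user_set))
# {'finance': ['credit', 'limit'], 'travel': ['ticket', 'train']}
```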
Step 502, determining a labeling risk level corresponding to each user based on the number of the participles under each topic contained in each second participle set.
In this disclosure, the number of participles under the topic related to the risk included in each second participle set may be determined based on the number of participles under each topic included in each second participle set, so as to determine the labeled risk level corresponding to each user.
The more participles under risk-related topics a second participle set contains, the higher the labeling risk level.
For example, topics related to the overdue-repayment risk include credit, installment payment and the like, and the labeling risk level corresponding to each user can be determined according to the number of participles under the credit and installment-payment topics contained in that user's second participle set.
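Step 502 can be sketched by mapping the count of participles under risk-related topics to a labeled risk level; the topic names, count bands, and level scale are illustrative assumptions.

```python
# Derive a labeled risk level from the number of participles under
# risk-related topics: more risk-related participles, higher level.
RISK_RELATED_TOPICS = {"credit", "installment"}  # assumed risk topics

def label_risk_level(participles_by_topic):
    risk_count = sum(
        len(words) for topic, words in participles_by_topic.items()
        if topic in RISK_RELATED_TOPICS
    )
    if risk_count >= 10:
        return 3
    if risk_count >= 5:
        return 2
    if risk_count >= 1:
        return 1
    return 0

user = {
    "credit": ["loan", "limit", "card"],
    "installment": ["repayment", "installment"],
    "travel": ["ticket"],
}
print(label_risk_level(user))  # 5 risk participles, prints 2
```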
Step 503, inputting the participles under each topic included in each second participle set and the corresponding topics into the initial neural network model to obtain the predicted risk level output by the initial neural network model.
In the disclosure, the participles and corresponding topics under each topic included in the second participle set may be input into the initial neural network model, and the initial neural network model is used for prediction to obtain a predicted risk level corresponding to the user.
And step 504, correcting the initial neural network model according to the difference between the predicted risk level and the labeled risk level to generate a wind control model.
In the present disclosure, the difference between the predicted risk level and the labeled risk level can be determined; if the difference is greater than a preset threshold, the initial neural network model is corrected, and the corrected model continues to be trained with the remaining second participle sets until it converges, thereby generating the wind control model.
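The correct-and-continue training idea of steps 503 and 504 can be shown with a deliberately simplified stand-in: a one-parameter linear "model" predicts a risk level from the count of risk-topic participles and is corrected whenever the difference between predicted and labeled level exceeds a threshold. This is a sketch of the training loop only, not the actual neural network.

```python
# Minimal correct-and-continue training loop: update the parameter when
# the prediction error exceeds a tolerance, stop once all samples fit.
def train_wind_control_model(samples, lr=0.01, epochs=50, tolerance=0.5):
    weight = 0.0  # initial model parameter
    for _ in range(epochs):
        converged = True
        for risk_count, labeled_level in samples:
            predicted = weight * risk_count
            diff = labeled_level - predicted
            if abs(diff) > tolerance:      # correct the model on large error
                weight += lr * diff * risk_count
                converged = False
        if converged:                      # stop once predictions are close
            break
    return weight

# (risk-participle count, labeled risk level) pairs; levels roughly 0.4 * count.
training_set = [(2, 1), (5, 2), (10, 4), (0, 0)]
w = train_wind_control_model(training_set)
print(round(w, 2))
```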
When the wind control model is trained, deep learning may be used; compared with other machine learning methods, deep learning performs better on large data sets.
In this disclosure, the wind control model may be, for example, a credit wind control model, and when determining the tagging risk level, the tagging risk level corresponding to each user may be determined according to the credit included in the second participle set corresponding to each user and the number of participles under the subject of the installment payment, so as to obtain the credit wind control model through training. For another example, the wind control model may be an insurance wind control model, and when determining the labeled risk level, the labeled risk level corresponding to each user may be determined according to the number of the participles in the topic related to insurance included in the second participle set corresponding to each user, so as to obtain the insurance wind control model through training.
In this disclosure, after determining each topic corresponding to each topic word library, determining a participle under each topic included in each second participle set according to a matching degree between each participle in each second participle set and each participle corresponding to each topic, determining a labeling risk level corresponding to each user based on a participle number under each topic included in each second participle set, inputting the participle under each topic included in each second participle set and the corresponding topic into the initial neural network model to obtain a predicted risk level output by the initial neural network model, and correcting the initial neural network model according to a difference between the predicted risk level and the labeling risk level to generate the wind control model. Therefore, the wind control model can be obtained through training by utilizing the participles under each topic contained in the second participle set corresponding to each user.
In practical application, after determining the topics respectively corresponding to each topic word library, training may also be performed by using the participles under each topic included in the second participle set corresponding to each user to obtain a recommendation model, so as to push promotion information to a target user by using the recommendation model.
In one embodiment of the present disclosure, the determination of whether to respond to a user request of a target user may also be made using a wind control model.
In the disclosure, when a user request sent by a target user is obtained, the first historical text data of the target user within a preset time period before the request was initiated can be obtained and analyzed to determine the first participle set. The participles under each topic contained in the first participle set are determined according to the matching degree between each participle in the first participle set and the participles under each topic; those participles and their corresponding topics are then input into the wind control model, which outputs the risk level corresponding to the target user. If this risk level is smaller than a preset risk level, the risk of overdue repayment by the target user is relatively low, and the user request can be responded to. Conversely, if the risk level corresponding to the target user is greater than or equal to the preset risk level, the user request may be rejected.
Or, different risk levels corresponding to different user requests may be set, and the user request of the target user may be responded if the risk level corresponding to the target user is less than or equal to the risk level corresponding to the user request.
For example, the user request is a credit request, and the credit request of the user can be responded when the risk level corresponding to the user is determined to be smaller than the preset risk level based on the credit wind control model, so that the corresponding service can be provided for the user.
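The request-handling decision described above can be sketched as follows; the stand-in scoring rule, topic names, and risk levels are hypothetical, and a trained wind control model would replace the toy scorer.

```python
# Respond to a request only when the model's risk level for the user is
# below the preset risk level; otherwise reject.
def wind_control_model(participles_by_topic):
    # Stand-in for the trained model: one level per 3 risk participles.
    risk_words = len(participles_by_topic.get("credit", [])) \
               + len(participles_by_topic.get("installment", []))
    return risk_words // 3

def handle_request(participles_by_topic, preset_risk_level=2):
    level = wind_control_model(participles_by_topic)
    return "respond" if level < preset_risk_level else "reject"

low_risk_user = {"credit": ["limit"], "travel": ["ticket", "weather"]}
high_risk_user = {
    "credit": ["loan", "limit", "card"],
    "installment": ["repayment", "installment", "finance"],
}
print(handle_request(low_risk_user))   # 1 risk participle, level 0: respond
print(handle_request(high_risk_user))  # 6 risk participles, level 2: reject
```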
In the embodiment of the disclosure, under the condition that a user request sent by a target user is obtained, the participles under each topic and the corresponding topics contained in a first participle set corresponding to the target user can be input into a wind control model to determine the risk level corresponding to the target user; and responding to the user request under the condition that the risk level is less than the preset risk level. Therefore, whether the user request of the target user is responded or not can be determined by utilizing the wind control model based on the participles under each topic contained in the first participle set corresponding to the target user, the distinguishing effect of the model on overdue users is improved, and the economic loss can be reduced.
In order to implement the above embodiments, the embodiments of the present disclosure further provide a model training method. Fig. 6 is a schematic flowchart of a model training method according to an embodiment of the present disclosure.
As shown in fig. 6, the model training method includes:
step 601, a training data set is obtained, wherein the training data set comprises a plurality of historical text data corresponding to a plurality of users respectively.
In the present disclosure, a plurality of historical text data, such as browsed web page content, browsed video content, and the like, corresponding to a plurality of users, respectively, may be acquired as a training data set.
In the present disclosure, the acquisition, storage and application of the user's personal information all comply with the provisions of the relevant laws and regulations and do not violate public order and good morals.
Step 602, analyzing the plurality of historical text data corresponding to each user respectively to determine a word segmentation set corresponding to each user.
In the present disclosure, each of the plurality of users corresponds to a plurality of historical text data. Each historical text datum may be analyzed, for example by word segmentation and de-duplication, to obtain a participle set for that datum; the resulting participle sets are then merged and de-duplicated in order of the generation time of the historical text data to obtain the participle set corresponding to the user.
Alternatively, when analyzing the plurality of historical text data corresponding to each user, the historical text data may first be sorted chronologically and integrated into one piece of historical text data; this integrated text is then analyzed to obtain a plurality of participles, which form the participle set.
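The merge-and-deduplicate parsing described above can be sketched as follows; the trivial whitespace "segmentation" stands in for a real Chinese word segmenter, and the data are illustrative.

```python
# Segment each historical text, then merge and de-duplicate the participles
# in order of the texts' generation time, keeping first occurrences.
def build_participle_set(historical_texts_with_time):
    """historical_texts_with_time: list of (timestamp, text) pairs."""
    ordered = sorted(historical_texts_with_time, key=lambda pair: pair[0])
    merged, seen = [], set()
    for _, text in ordered:
        for participle in text.split():  # placeholder segmentation
            if participle not in seen:   # de-duplication
                seen.add(participle)
                merged.append(participle)
    return merged

texts = [
    (2, "credit limit repayment"),
    (1, "train ticket credit"),
]
print(build_participle_set(texts))
# ['train', 'ticket', 'credit', 'limit', 'repayment']
```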
Step 603, determining the participles under each topic contained in the participle set corresponding to each user and the labeling risk level corresponding to each user.
Step 604, inputting the participles under each topic and the corresponding topics included in the participle set corresponding to each user into the initial neural network model to obtain the predicted risk level output by the initial neural network model.
And 605, correcting the initial neural network model according to the difference between the predicted risk level and the labeled risk level to generate a wind control model.
In the present disclosure, the steps 603-605 are similar to the steps 502-504, and therefore will not be described herein.
In the present disclosure, the wind control model may be, for example, a credit wind control model, and when determining the tagging risk level, the tagging risk level corresponding to each user may be determined according to the credit included in the word segmentation set corresponding to each user and the number of the word segmentation under the subject of the installment payment, so as to train to obtain the credit wind control model. For another example, the wind control model may be an insurance wind control model, and when determining the labeled risk level, the labeled risk level corresponding to each user may be determined according to the number of the participles under the theme related to insurance included in the participle set corresponding to each user, so as to obtain the insurance wind control model through training.
In the embodiment of the disclosure, training data is obtained, a plurality of historical text data corresponding to each user in a training data set are analyzed respectively to determine a word segmentation set corresponding to each user, words segmentation under each topic included in the word segmentation set corresponding to each user and a labeled risk grade corresponding to each user are determined, the words segmentation under each topic included in the word segmentation set corresponding to each user and the corresponding topic are input to an initial neural network model to obtain a predicted risk grade output by the initial neural network model, and the initial neural network model is corrected according to a difference between the predicted risk grade and the labeled risk grade to generate a wind control model. Therefore, the wind control model can be obtained through obtaining the word segmentation set corresponding to each user, determining the word segmentation under each theme contained in the word segmentation set corresponding to each user, and training by using the word segmentation under each theme contained in the word segmentation set corresponding to each user, so that the accuracy of the model is improved.
In an embodiment of the present disclosure, when determining the participles under the respective topics included in the participle set corresponding to each user, a document topic generation model, such as an LDA model, may be adopted to cluster a plurality of participle sets corresponding to a plurality of users.
After a plurality of word segmentation sets are obtained, the number of topics obtained through clustering can be set, each word segmentation in the word segmentation set corresponding to each user is input into an initial document topic generation model, and the initial document topic generation model is trained.
When the document theme generation model is converged, word segmentation probability distribution can be obtained, wherein the word segmentation probability distribution comprises the probability that each word belongs to each theme, then, the words in a plurality of word segmentation sets can be clustered according to the word segmentation probability distribution, for example, for each theme, the words with the probability greater than the preset probability can be used as the words corresponding to the theme, so that each word corresponding to each theme can be obtained, and each word corresponding to each theme forms a theme word bank, namely, a plurality of theme word banks are obtained. The number of the subject word banks is the same as that of the subjects, and each subject word bank comprises one or more participles.
When a plurality of topic word libraries are obtained, a plurality of topic word libraries are obtained through word segmentation probability distribution obtained by utilizing a document topic generation model, and the accuracy of the topic word libraries is improved.
Since a plurality of topic word libraries can be obtained using the document topic generation model, but the specific topic corresponding to each library is uncertain, in the present disclosure the matching degree between each participle in each topic word library and the preset topics can be calculated, and the topic corresponding to each topic word library determined according to those matching degrees.
In the present disclosure, the manner of determining the topic corresponding to each topic thesaurus is similar to the step 404, and therefore, the description thereof is omitted here.
After the topic corresponding to each topic word library is determined, the manner of determining the participles under each topic contained in each participle set is similar to the above manner of determining the participles under each topic contained in the first participle set, and therefore the details are not repeated herein.
When labeling the risk level corresponding to each user, the number of participles under risk-related topics contained in each user's participle set can be determined from the number of participles under each topic contained in that set, and the labeled risk level corresponding to each user can then be determined on that basis.
In the present disclosure, the labeling risk level corresponding to each user is determined based on the number of participles under each topic contained in that user's participle set, which improves the accuracy of the labeling.
In order to implement the foregoing embodiment, the embodiment of the present disclosure further provides a device for generating a user characteristic. Fig. 7 is a schematic structural diagram of a user feature generation apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the apparatus 700 for generating user characteristics includes: a first obtaining module 710, a first parsing module 720, a first determining module 730, and a second determining module 740.
A first obtaining module 710, configured to obtain first historical text data corresponding to a target user;
a first parsing module 720, configured to parse the first historical text data to determine a first participle set corresponding to the target user;
a first determining module 730, configured to determine, according to matching degrees between each participle in the first participle set and each participle corresponding to each topic, a word segmentation number under each topic included in the first participle set;
the second determining module 740 is configured to determine, according to the number of the participles under each topic included in the first participle set, a user characteristic corresponding to the target user.
In a possible implementation manner of the embodiment of the present disclosure, the apparatus may further include:
the third determining module is used for determining each association degree between the user characteristics and each piece of information to be promoted;
the fourth determining module is used for determining the target popularization information according to the relevance degrees;
and the pushing module is used for pushing the target popularization information to the target user.
In a possible implementation manner of the embodiment of the present disclosure, the first determining module 730 is further configured to determine the number of participles in a specified type topic included in the first participle set, where the specified type topic is related to a target risk;
the second determining module 740 is further configured to determine that the target user is a user with the target risk when the number of the participles in the specified type topic included in the first participle set is greater than a preset threshold.
In a possible implementation manner of the embodiment of the present disclosure, the first obtaining module 710 is further configured to obtain a plurality of second historical text data corresponding to a plurality of users, respectively;
the first parsing module 720 is further configured to parse the plurality of second historical text data corresponding to each user to determine a second participle set corresponding to each user;
the apparatus may further comprise:
the clustering module is used for clustering the participles in the second participle sets corresponding to the users to obtain a plurality of topic word banks;
and the fifth determining module is used for determining the topics corresponding to the topic word banks according to the matching degree between each participle in each topic word bank and a preset topic.
In a possible implementation manner of the embodiment of the present disclosure, the first determining module 730 is further configured to determine the participle under each topic included in each second participle set according to a matching degree between each participle in each second participle set and each participle corresponding to each topic;
the apparatus may further comprise:
a sixth determining module, configured to determine, based on the number of participles under each topic included in each second participle set, a labeling risk level corresponding to each user;
the first training module is used for inputting the participles under each topic and the corresponding topics contained in each second participle set into an initial neural network model so as to obtain the predicted risk level output by the initial neural network model; and correcting the initial neural network model according to the difference between the predicted risk level and the labeled risk level to generate a wind control model.
In a possible implementation manner of the embodiment of the present disclosure, the apparatus may further include:
a seventh determining module, configured to, when a user request sent by the target user is obtained, input a participle and a corresponding topic under each topic included in a first participle set corresponding to the target user to the wind control model to determine a risk level corresponding to the target user;
and the response module is used for responding to the user request under the condition that the risk level is less than a preset risk level.
It should be noted that the explanation of the foregoing embodiment of the method for generating user features is also applicable to the apparatus for generating user features of this embodiment, and therefore, the explanation is not repeated here.
In the embodiment of the disclosure, a first word segmentation set corresponding to a target user is determined by analyzing first historical text data corresponding to the target user, each word segmentation in the first word segmentation set is respectively matched with each word segmentation under each topic to determine the number of words segmentation under each topic contained in the first word segmentation set, and a user characteristic corresponding to the target user is determined according to the number of words segmentation under each topic contained in the first word segmentation set, so that the user characteristic is determined based on the number of words segmentation under each topic of the target user, and the accuracy of the obtained user characteristic is improved.
In order to realize the embodiment, the disclosure further provides a model training device. Fig. 8 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure.
As shown in fig. 8, the model training apparatus includes: a second obtaining module 810, a second parsing module 820, an eighth determining module 830, and a second training module 840.
A second obtaining module 810, configured to obtain a training data set, where the training data set includes a plurality of historical text data corresponding to a plurality of users, respectively;
a second parsing module 820, configured to parse the multiple historical text data corresponding to each user, so as to determine a word segmentation set corresponding to each user;
an eighth determining module 830, configured to determine the participles under each topic included in the participle set corresponding to each user, and the labeled risk level corresponding to each user;
the second training module 840 is configured to input the participles under each topic included in the participle set corresponding to each user and the corresponding topics into the initial neural network model, so as to obtain a predicted risk level output by the initial neural network model; and correcting the initial neural network model according to the difference between the predicted risk level and the labeled risk level to generate a wind control model.
In a possible implementation manner of this embodiment of the present disclosure, the eighth determining module 830 includes:
the clustering unit is used for clustering the participles in the participle sets corresponding to the users to obtain a plurality of topic word banks;
the first determining unit is used for determining the topics corresponding to the topic word banks according to the matching degree between each participle in each topic word bank and a preset topic;
a second determining unit, configured to determine, according to a matching degree between each participle in each participle set and each participle corresponding to each topic, a participle under each topic included in each participle set;
and the third determining unit is used for determining the labeling risk level corresponding to each user based on the number of the participles under each topic contained in the participle set corresponding to each user.
In a possible implementation manner of the embodiment of the present disclosure, the clustering unit is configured to:
inputting each participle in the multiple participle sets into a document theme generation model to obtain participle probability distribution, wherein the participle probability distribution comprises the probability that each participle belongs to each theme;
clustering the participles in the participle sets according to the participle probability distribution to obtain a plurality of topic word libraries.
In the embodiment of the disclosure, training data is obtained, a plurality of historical text data corresponding to each user in a training data set are analyzed respectively to determine a word segmentation set corresponding to each user, words segmentation under each topic included in the word segmentation set corresponding to each user and a labeled risk grade corresponding to each user are determined, the words segmentation under each topic included in the word segmentation set corresponding to each user and the corresponding topic are input to an initial neural network model to obtain a predicted risk grade output by the initial neural network model, and the initial neural network model is corrected according to a difference between the predicted risk grade and the labeled risk grade to generate a wind control model. Therefore, the wind control model can be obtained through obtaining the word segmentation set corresponding to each user, determining the word segmentation under each theme contained in the word segmentation set corresponding to each user, and training by using the word segmentation under each theme contained in the word segmentation set corresponding to each user, so that the accuracy of the model is improved.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes in accordance with a computer program stored in a ROM (Read-Only Memory) 902 or a computer program loaded from a storage unit 908 into a RAM (Random Access Memory) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The calculation unit 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An I/O (Input/Output) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable processor, controller, microcontroller, and the like. The computing unit 901 performs the various methods and processes described above, such as the method for generating user features. For example, in some embodiments, the method for generating user features may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the method for generating user features described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the method for generating user features by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, FPGAs (Field Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application Specific Standard Products), SOCs (Systems On Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Electrically Programmable Read-Only-Memory) or flash Memory, an optical fiber, a CD-ROM (Compact Disc Read-Only-Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a Display device (e.g., a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LAN (Local Area Network), WAN (Wide Area Network), internet, and blockchain Network.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability in conventional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be noted that the electronic device described above may also execute the model training method in the present disclosure.
According to an embodiment of the present disclosure, the present disclosure further provides a computer program product; when instructions in the computer program product are executed by a processor, the method for generating user features or the model training method proposed by the above-mentioned embodiments of the present disclosure is performed.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (21)

1. A method for generating user characteristics comprises the following steps:
acquiring first historical text data corresponding to a target user;
analyzing the first historical text data to determine a first participle set corresponding to the target user;
determining the number of the participles under each topic contained in the first participle set according to the matching degree between each participle in the first participle set and each participle corresponding to each topic;
and determining the user characteristics corresponding to the target user according to the number of the participles under each topic contained in the first participle set.
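Outside the claim language, the counting step of claim 1 can be illustrated with a minimal sketch. The exact-match `match` function is a deliberately simplified stand-in for whatever matching degree an implementation actually computes (e.g., an embedding similarity), and all names are illustrative:

```python
def count_participles_per_topic(participle_set, topic_participles, threshold=1.0):
    """For each topic, count the participles in the set whose matching degree
    with any of the topic's participles reaches the threshold."""
    def match(a, b):
        # Stand-in for a real similarity score between two participles.
        return 1.0 if a == b else 0.0
    counts = {}
    for topic, words in topic_participles.items():
        counts[topic] = sum(
            1 for p in participle_set
            if any(match(p, w) >= threshold for w in words))
    return counts
```

The resulting per-topic counts are exactly the quantity the user feature of claim 1 is built from.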
2. The method of claim 1, wherein after the determining the user characteristic corresponding to the target user, further comprising:
determining the association degree between the user characteristics and each piece of information to be promoted respectively;
determining target popularization information according to each degree of association;
and pushing the target popularization information to the target user.
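The selection step of claim 2 — scoring each piece of information to be promoted by its association degree with the user feature and picking the best — might look like the following sketch, with cosine similarity as an assumed association measure and hypothetical field names:

```python
import math

def pick_promotion(user_feature, promotions):
    """Choose the promotion whose vector has the highest association degree
    (cosine similarity here, as an assumption) with the user feature."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    return max(promotions, key=lambda p: cosine(user_feature, p["vector"]))
```

The chosen item would then be pushed to the target user as the target popularization information.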
3. The method of claim 1, wherein the determining the number of the participles under each topic contained in the first participle set comprises:
determining the number of the participles under a specified type of topic contained in the first participle set, wherein the specified type of topic is related to a target risk;
the determining the user characteristics corresponding to the target user according to the number of the participles under each topic contained in the first participle set comprises:
determining that the target user is a user with the target risk under the condition that the number of the participles under the specified type of topic contained in the first participle set is greater than a preset threshold value.
4. The method of claim 1, further comprising:
acquiring a plurality of second historical text data respectively corresponding to a plurality of users;
analyzing the plurality of second historical text data corresponding to each user respectively to determine a second participle set corresponding to each user;
clustering the participles in the plurality of second participle sets corresponding to the plurality of users to obtain a plurality of topic word banks;
and determining the theme corresponding to each topic word bank according to the matching degree between each participle in each topic word bank and a preset theme.
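The topic-naming step of claim 4 can be sketched by measuring the matching degree as seed-word overlap between each clustered word bank and each preset topic. The function name, the topic names, and the overlap measure are illustrative assumptions:

```python
def name_topic_lexicons(lexicons, preset_topics):
    """Assign each clustered word bank the preset topic whose seed words it
    overlaps with most (a simple stand-in for 'matching degree')."""
    named = {}
    for lex_id, words in lexicons.items():
        best = max(preset_topics,
                   key=lambda t: len(words & preset_topics[t]))
        named[best] = words
    return named
```

A production system would use a softer similarity than raw set overlap, but the shape of the step — score each word bank against each preset topic and take the best — is the same.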
5. The method of claim 4, wherein after the determining the topic corresponding to each topic word bank, the method further comprises:
determining participles under each topic contained in each second participle set according to the matching degree between each participle in each second participle set and each participle corresponding to each topic;
determining a labeled risk level corresponding to each user based on the number of the participles under each topic contained in each second participle set;
inputting the participles under each topic contained in each second participle set and the corresponding topics into an initial neural network model to obtain a predicted risk level output by the initial neural network model;
and correcting the initial neural network model according to the difference between the predicted risk level and the labeled risk level to generate a wind control model.
6. The method of claim 5, wherein the method further comprises:
under the condition that a user request sent by the target user is obtained, inputting the participles under each topic contained in a first participle set corresponding to the target user and the corresponding topics into the wind control model so as to determine the risk level corresponding to the target user;
and responding to the user request under the condition that the risk level is smaller than a preset risk level.
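The request gating of claim 6 can be sketched as below; `risk_model` stands in for the trained wind control model and is passed as a plain callable for illustration, and the response shape is a hypothetical example:

```python
def handle_request(request, user_topic_participles, risk_model, max_level):
    """Score the requesting user with the risk-control model and respond
    only when the predicted level is below the preset level."""
    level = risk_model(user_topic_participles)
    if level < max_level:
        return {"status": "ok", "body": request}
    return {"status": "rejected", "reason": "risk level too high"}
```

In practice the model input would be the per-topic participles and topics of the target user's first participle set, as the claim specifies.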
7. A model training method, comprising:
acquiring a training data set, wherein the training data set comprises a plurality of historical text data respectively corresponding to a plurality of users;
analyzing the plurality of historical text data corresponding to each user respectively to determine a participle set corresponding to each user;
determining the participles under each topic contained in the participle set corresponding to each user and the labeled risk level corresponding to each user;
inputting the participles under each topic contained in the participle set corresponding to each user and the corresponding topics into an initial neural network model to obtain a predicted risk level output by the initial neural network model;
and correcting the initial neural network model according to the difference between the predicted risk level and the labeled risk level to generate a wind control model.
8. The method of claim 7, wherein the determining the participles under the respective topics contained in the participle set corresponding to each user and the labeled risk level corresponding to each user comprises:
clustering the participles in a plurality of participle sets corresponding to the plurality of users to obtain a plurality of topic word banks;
determining topics corresponding to each topic word bank according to the matching degree between each participle in each topic word bank and a preset topic;
determining the participles under each topic contained in each participle set according to the matching degree between each participle in each participle set and each participle corresponding to each topic;
and determining the labeled risk level corresponding to each user based on the number of the participles under each topic contained in the participle set corresponding to each user.
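The labeling step of claim 8 — deriving a risk level from participle counts under risk-related topics — might be implemented as a simple bucketing rule. The topic names and cut points below are illustrative assumptions, not values from the disclosure:

```python
def label_risk_level(topic_counts, risk_topics, cut_points=(0, 3, 10)):
    """Derive a labeled risk level from the total participle count under
    risk-related topics; level i+1 is reached once the total exceeds
    cut_points[i]."""
    total = sum(topic_counts.get(t, 0) for t in risk_topics)
    level = 0
    for i, cut in enumerate(cut_points):
        if total > cut:
            level = i + 1
    return level
```

Such rule-derived labels are what the neural network is then trained to reproduce and generalize.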
9. The method of claim 8, wherein the clustering the participles in the plurality of participle sets corresponding to the plurality of users to obtain a plurality of topic word banks comprises:
inputting each participle in the plurality of participle sets into a document topic generation model to obtain a participle probability distribution, wherein the participle probability distribution comprises the probability that each participle belongs to each topic;
clustering the participles in the plurality of participle sets according to the participle probability distribution to obtain the plurality of topic word banks.
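Claim 9's two steps — obtaining a participle probability distribution from a document topic generation model (commonly an LDA-style model, though the claim does not name one) and clustering by it — can be sketched as follows, with the probabilities assumed to be precomputed and the clustering reduced to an argmax assignment:

```python
def build_topic_lexicons(participle_probs):
    """Group participles into topic word banks by their highest-probability
    topic, given per-participle topic probabilities from a document topic
    generation model."""
    lexicons = {}
    for word, probs in participle_probs.items():
        topic = max(range(len(probs)), key=lambda t: probs[t])
        lexicons.setdefault(topic, set()).add(word)
    return lexicons
```

A soft assignment (keeping a participle in every topic whose probability exceeds a floor) is an equally plausible reading; the hard argmax is chosen here only for brevity.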
10. An apparatus for generating user characteristics, comprising:
the first acquisition module is used for acquiring first historical text data corresponding to a target user;
the first analysis module is used for analyzing the first historical text data to determine a first participle set corresponding to the target user;
the first determining module is used for determining the number of the participles under each topic contained in the first participle set according to the matching degree between each participle in the first participle set and each participle corresponding to each topic;
and the second determining module is used for determining the user characteristics corresponding to the target user according to the number of the participles under each topic contained in the first participle set.
11. The apparatus of claim 10, wherein the apparatus further comprises:
the third determining module is used for determining the association degree between the user characteristics and each piece of information to be promoted respectively;
the fourth determining module is used for determining the target popularization information according to each association degree;
and the pushing module is used for pushing the target popularization information to the target user.
12. The apparatus of claim 10, wherein the first determining module is further configured to determine a number of participles under a specified type of topic contained in the first participle set, wherein the specified type of topic is associated with a target risk;
the second determining module is further configured to determine that the target user is a user with the target risk when the number of the participles in the specified type topic included in the first participle set is greater than a preset threshold.
13. The apparatus of claim 10, wherein the first obtaining module is further configured to obtain a plurality of second historical text data corresponding to a plurality of users, respectively;
the first analysis module is further configured to analyze the plurality of second historical text data corresponding to each user respectively to determine a second participle set corresponding to each user;
the device further comprises:
the clustering module is used for clustering the participles in the second participle sets corresponding to the users to obtain a plurality of topic word banks;
and the fifth determining module is used for determining the topics corresponding to the topic word banks according to the matching degree between each participle in each topic word bank and a preset topic.
14. The apparatus of claim 13, wherein the first determining module is further configured to determine the participle under each topic included in each second participle set according to a matching degree between each participle in each second participle set and each participle corresponding to each topic;
the device further comprises:
a sixth determining module, configured to determine, based on the number of the participles under each topic contained in each second participle set, a labeled risk level corresponding to each user;
the first training module is used for inputting the participles under each topic and the corresponding topics contained in each second participle set into an initial neural network model so as to obtain the predicted risk level output by the initial neural network model; and correcting the initial neural network model according to the difference between the predicted risk level and the labeled risk level to generate a wind control model.
15. The apparatus of claim 14, wherein the apparatus further comprises:
a seventh determining module, configured to, when a user request sent by the target user is obtained, input a participle and a corresponding topic under each topic included in a first participle set corresponding to the target user to the wind control model to determine a risk level corresponding to the target user;
and the response module is used for responding to the user request under the condition that the risk level is less than a preset risk level.
16. A model training apparatus comprising:
the second acquisition module is used for acquiring a training data set, wherein the training data set comprises a plurality of historical text data respectively corresponding to a plurality of users;
the second analysis module is used for respectively analyzing a plurality of historical text data corresponding to each user to determine a participle set corresponding to each user;
an eighth determining module, configured to determine the participles under each topic included in the participle set corresponding to each user, and the labeled risk level corresponding to each user;
the second training module is used for inputting the participles under each topic contained in the participle set corresponding to each user and the corresponding topics into an initial neural network model to obtain the predicted risk level output by the initial neural network model; and correcting the initial neural network model according to the difference between the predicted risk level and the labeled risk level to generate a wind control model.
17. The apparatus of claim 16, wherein the eighth determining means comprises:
the clustering unit is used for clustering the participles in the participle sets corresponding to the users to obtain a plurality of topic word banks;
the first determining unit is used for determining the topics corresponding to the topic word banks according to the matching degree between each participle in each topic word bank and a preset topic;
a second determining unit, configured to determine, according to a matching degree between each participle in each participle set and each participle corresponding to each topic, a participle under each topic included in each participle set;
and the third determining unit is used for determining the labeling risk level corresponding to each user based on the number of the participles under each topic contained in the participle set corresponding to each user.
18. The apparatus of claim 17, wherein the clustering unit is configured to:
input each participle in the plurality of participle sets into a document topic generation model to obtain a participle probability distribution, wherein the participle probability distribution comprises the probability that each participle belongs to each topic; and
cluster the participles in the plurality of participle sets according to the participle probability distribution to obtain the plurality of topic word banks.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-9.
21. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-9.
CN202110529089.0A 2021-05-14 2021-05-14 User feature generation method and device, electronic equipment and storage medium Pending CN113220999A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110529089.0A CN113220999A (en) 2021-05-14 2021-05-14 User feature generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110529089.0A CN113220999A (en) 2021-05-14 2021-05-14 User feature generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113220999A true CN113220999A (en) 2021-08-06

Family

ID=77092063

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110529089.0A Pending CN113220999A (en) 2021-05-14 2021-05-14 User feature generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113220999A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154995A (en) * 2021-12-08 2022-03-08 河北晓博互联网科技有限公司 Abnormal payment data analysis method and system applied to big data wind control
CN115859975A (en) * 2023-02-07 2023-03-28 支付宝(杭州)信息技术有限公司 Data processing method, device and equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224521A (en) * 2015-09-28 2016-01-06 北大方正集团有限公司 Key phrases extraction method and use its method obtaining correlated digital resource and device
CN105630769A (en) * 2015-12-24 2016-06-01 东软集团股份有限公司 Document subject term extraction method and device
CN107704512A (en) * 2017-08-31 2018-02-16 平安科技(深圳)有限公司 Financial product based on social data recommends method, electronic installation and medium
WO2019184217A1 (en) * 2018-03-26 2019-10-03 平安科技(深圳)有限公司 Hotspot event classification method and apparatus, and storage medium
US20190362707A1 (en) * 2017-07-26 2019-11-28 Tencent Technology (Shenzhen) Company Limited Interactive method, interactive terminal, storage medium, and computer device
CN112052397A (en) * 2020-09-29 2020-12-08 北京百度网讯科技有限公司 User feature generation method and device, electronic equipment and storage medium
CN112632987A (en) * 2020-12-25 2021-04-09 北京百度网讯科技有限公司 Word slot recognition method and device and electronic equipment
WO2021068610A1 (en) * 2019-10-12 2021-04-15 平安国际智慧城市科技股份有限公司 Resource recommendation method and apparatus, electronic device and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination