WO2022121163A1 - User behavior tendency identification method, apparatus, and device, and storage medium

Info

Publication number: WO2022121163A1
Application number: PCT/CN2021/083480
Authority: WO
Inventors: 卢春曦; 王健宗; 黄章成
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-12-11
Filing date: 2021-03-29
Publication date: 2022-06-16
Also published as: CN112527958A

Abstract

A user behavior tendency identification method, apparatus, and device, and a storage medium, which relate to the field of artificial intelligence. The method comprises: obtaining a plurality of pieces of text information published by a plurality of sample users that have a determined behavior tendency and recording parameters; extracting a plurality of keywords from within the pieces of text information and converting the keywords into keyword vectors; using the keyword vectors and the recording parameters as training samples and randomly extracting a plurality of samples from the training samples to obtain a plurality of training sets; constructing a plurality of decision trees according to a preset discrimination indicator and generating a random forest model; and inputting text information published by a user to be detected and corresponding recording parameters into the random forest model for voting, and according to the voting result, determining whether the user has the behavior tendency. User behavior tendencies can be determined quickly by means of speech information published by a user.

Description

User behavior tendency identification method, device, equipment and storage medium

This application claims priority to Chinese patent application No. 202011436696.4, entitled "User Behavior Tendency Recognition Method, Apparatus, Equipment and Storage Medium" and filed with the China Patent Office on December 11, 2020, the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of artificial intelligence, and in particular to a method, apparatus, device and storage medium for identifying user behavior tendencies.

Background

With the development of the Internet, information spreads ever more rapidly and widely online, and the varied remarks posted there affect users in different ways. Remarks made by users with negative behavior tendencies in particular may trigger group effects and lead to serious consequences. If an information-bearing platform can identify users with negative behavior tendencies in advance and intervene, the impact of such adverse consequences can be reduced.

The inventors realized that the current way of dealing with users' harmful remarks is sensitive-word shielding. This method can only shield known sensitive words; for words that express a negative psychological state but are not themselves sensitive, shielding cannot eliminate their influence. Users with a particular behavior tendency are therefore difficult for a computer to identify and can only be determined through an after-the-fact judgment mechanism.

Summary of the Invention

The main purpose of this application is to solve the technical problem of how to flexibly identify user behavior tendencies.

To achieve the above purpose, a first aspect of the present application provides a method for identifying a user behavior tendency, which includes: acquiring a plurality of pieces of first text information published by a plurality of sample users with a determined behavior tendency, and first record parameters corresponding to each piece of first text information; extracting a plurality of keywords from the first text information, counting the number of occurrences of each keyword in each piece of first text information, and performing vectorization processing to obtain a plurality of keyword vectors; using the keyword vectors and the first record parameters as training samples, and randomly extracting a plurality of samples from the training samples multiple times to obtain a plurality of training sets; constructing, with reference to preset discriminant indicators, a decision tree corresponding to each training set, and generating a corresponding random forest model from the decision trees; acquiring a plurality of pieces of second text information published by a user to be detected and second record parameters corresponding to each piece of second text information; inputting the second text information and the second record parameters into the random forest model for voting to obtain a voting result; and determining, according to the voting result, whether the user to be detected has the behavior tendency.

A second aspect of the present application provides a user behavior tendency identification device, comprising a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps: acquiring a plurality of pieces of first text information published by a plurality of sample users with a determined behavior tendency, and first record parameters corresponding to each piece of first text information; extracting a plurality of keywords from the first text information, counting the number of occurrences of each keyword in each piece of first text information, and performing vectorization processing to obtain a plurality of keyword vectors; using the keyword vectors and the first record parameters as training samples, and randomly extracting a plurality of samples from the training samples multiple times to obtain a plurality of training sets; constructing, with reference to preset discriminant indicators, a decision tree corresponding to each training set, and generating a corresponding random forest model from the decision trees; acquiring a plurality of pieces of second text information published by a user to be detected and second record parameters corresponding to each piece of second text information; inputting the second text information and the second record parameters into the random forest model for voting to obtain a voting result; and determining, according to the voting result, whether the user to be detected has the behavior tendency.

A third aspect of the present application provides a computer-readable storage medium storing computer instructions which, when executed on a computer, cause the computer to perform the following steps: acquiring a plurality of pieces of first text information published by a plurality of sample users with a determined behavior tendency, and first record parameters corresponding to each piece of first text information; extracting a plurality of keywords from the first text information, counting the number of occurrences of each keyword in each piece of first text information, and performing vectorization processing to obtain a plurality of keyword vectors; using the keyword vectors and the first record parameters as training samples, and randomly extracting a plurality of samples from the training samples multiple times to obtain a plurality of training sets; constructing, with reference to preset discriminant indicators, a decision tree corresponding to each training set, and generating a corresponding random forest model from the decision trees; acquiring a plurality of pieces of second text information published by a user to be detected and second record parameters corresponding to each piece of second text information; inputting the second text information and the second record parameters into the random forest model for voting to obtain a voting result; and determining, according to the voting result, whether the user to be detected has the behavior tendency.

A fourth aspect of the present application provides a user behavior tendency identification apparatus, comprising: a first acquisition module, configured to acquire a plurality of pieces of first text information published by a plurality of sample users with a determined behavior tendency, and first record parameters corresponding to each piece of first text information; a vectorization module, configured to extract a plurality of keywords from the first text information, count the number of occurrences of each keyword in each piece of first text information, and perform vectorization processing to obtain a plurality of keyword vectors; a sampling module, configured to use the keyword vectors and the first record parameters as training samples and randomly extract a plurality of samples from the training samples multiple times to obtain a plurality of training sets; a building module, configured to construct, with reference to preset discriminant indicators, a decision tree corresponding to each training set, and generate a corresponding random forest model from the decision trees; a second acquisition module, configured to acquire a plurality of pieces of second text information published by a user to be detected and second record parameters corresponding to each piece of second text information; a voting module, configured to input the second text information and the second record parameters into the random forest model for voting to obtain a voting result; and a determining module, configured to determine, according to the voting result, whether the user to be detected has the behavior tendency.

In the technical solution provided by the present application, speech data published by users with the same type of behavior tendency is first collected, and keywords in the speech data are extracted as a characteristic representation of this type of user. The speech data is then used as training samples for machine learning to build a random forest model, and the speech data of the user to be detected is input into the model for identification to determine whether the user to be detected and the sample users have the same behavioral characteristics; if so, it can be determined that they have the same behavior tendency. The present application can thus extract the relevant speech features of users with the same type of behavior tendency, train a random forest model through machine learning, and then identify users whose behavior tendency is unknown to determine whether they have that type of behavior tendency.

Description of Drawings

FIG. 1 is a schematic diagram of a first embodiment of a method for identifying a user behavior tendency in an embodiment of the present application;

FIG. 2 is a schematic diagram of a second embodiment of a method for identifying a user behavior tendency in an embodiment of the present application;

FIG. 3 is a schematic diagram of an embodiment of a user behavior tendency identification apparatus in an embodiment of the present application;

FIG. 4 is a schematic diagram of an embodiment of a user behavior tendency identification device in an embodiment of the present application.

Detailed Description

Embodiments of the present application provide a method, apparatus, device, and storage medium for identifying user behavior tendencies. The terms "first", "second", "third", "fourth", etc. (if present) in the description, the claims and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or precedence. It is to be understood that data so used can be interchanged under appropriate circumstances so that the embodiments described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "comprising" or "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to those expressly listed steps or units, but may include other steps or units not expressly listed or inherent to these processes, methods, products or devices.

For ease of understanding, the specific process of the embodiment of the present application is described below. Referring to FIG. 1, the first embodiment of the method for identifying a user behavior tendency in the embodiment of the present application includes:

101. Acquire a plurality of pieces of first text information published by a plurality of sample users with certain behavioral tendencies and first record parameters corresponding to the first text information;

It can be understood that the execution subject of the present application may be a user behavior tendency identification device, and may also be a terminal or a server, which is not specifically limited here. The embodiments of the present application take the server as an execution subject as an example for description.

In this embodiment, since the application needs to determine whether an unknown user has a certain type of behavioral tendency, it is necessary to determine the speech characteristics of this type of user, and determine the behavioral tendency of the unknown user by means of feature matching. Therefore, it is necessary to obtain a large amount of sample data in order to extract the speech features that best represent users with the same type of behavioral tendencies. In this embodiment, text information published by sample users with a determined behavioral tendency and a publishing record corresponding to each text information are obtained, which are used to extract speech feature keywords and machine learning training samples.

In this embodiment, a sample user with a determined behavior tendency may be a sample user who wishes to purchase a certain commodity, a sample user with a certain negative behavior tendency, a sample user with a certain psychological characteristic, and so on, for example users of private jets, suicidal users, or depressed users. The behavior tendency of the sample users determines the recognition type of the model, and models with different recognition types can identify users with different types of behavior tendencies.

102. Extract a plurality of keywords in each of the first text information, count the number of occurrences of each of the keywords in each of the first text information, and perform vectorization processing to obtain a plurality of keyword vectors;

In this embodiment, the speech characteristic words of the users with the same type of behavior tendency are extracted from a large amount of sample data, and then the speech of the unknown user is compared with the characteristic words, and then combined with other discriminant indicators, so as to determine whether the unknown user has the same feature.

In this embodiment, after the keywords in the speech of users with the specific behavior tendency are extracted, the speech information of the sample users is analyzed. During the analysis, the hit rate of the keywords in the text information published by the sample users is counted. This hit rate serves as one of the discriminant indicators when identifying unknown users and is therefore an important reference.

Optionally, step 102 includes:

According to the keywords, respectively determine the keywords included in the first text information published by the sample users;

Counting the occurrences of each keyword in the first text information published by each sample user;

Vector transformation is performed on the occurrence times of each keyword to obtain a keyword vector corresponding to each sample user.

In this optional embodiment, calculating the hit rate of the keywords requires first converting the text into a vector. Here, the vector of a text is the number of occurrences of each keyword. For example, if the keywords extracted from the text information published by all sample users are D = (T₁, T₂, T₃, T₄, T₅), and the numbers of occurrences of these keywords in the text information published by one sample user are W = (5, 2, 0, 1, 0), then W can be used as the keyword vector of that sample user.
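The counting step can be sketched in Python as follows. This is only a minimal illustration: the function name `keyword_vector` and the placeholder keywords `T1` to `T5` are assumptions, and plain substring counting stands in for counting over segmented words.

```python
from collections import Counter

def keyword_vector(texts, keywords):
    """Count how often each keyword occurs across one user's texts.

    Assumption: occurrences are counted by plain substring matching.
    """
    counts = Counter()
    for text in texts:
        for kw in keywords:
            counts[kw] += text.count(kw)
    # Keep the order of the keyword list D so every user gets a comparable vector.
    return [counts[kw] for kw in keywords]

keywords = ["T1", "T2", "T3", "T4", "T5"]   # placeholder keyword list D
texts = ["T1 T1 T2 T4", "T1 T1 T1 T2"]      # texts published by one sample user
vector = keyword_vector(texts, keywords)    # reproduces W = (5, 2, 0, 1, 0)
```

Keeping the vector in the fixed order of D is what makes vectors from different users comparable feature by feature.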

103. Using the keyword vectors and the first recording parameters as training samples, randomly extract multiple samples from the training samples for multiple times to obtain multiple training sets;

In this embodiment, the keyword vector and the first record parameters of each sample user are used as training samples. Multiple samples are randomly drawn from the training samples with replacement to obtain training sets, a decision tree is constructed for each training set, and the trees together form a random forest model. Random sampling makes the decision trees differ from one another, so that their classification results also differ. Sampling with replacement makes the training sets of different trees overlap, which avoids one-sided decisions: the final result is produced by a vote among the decision trees, and this vote should "seek consensus". If the results generated by the decision trees were completely independent, the final vote would not help solve the problem. Therefore, this embodiment obtains the training sets by repeated random sampling with replacement.
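Sampling with replacement can be illustrated with a short Python sketch. The function name `bootstrap_sets`, the fixed seed and the toy samples are assumptions made for the example; the source does not specify how many samples each training set contains, so the sketch defaults to the size of the original sample pool.

```python
import random

def bootstrap_sets(samples, n_sets, set_size=None, seed=0):
    """Draw n_sets training sets from samples, sampling with replacement."""
    rng = random.Random(seed)           # fixed seed only to make the sketch repeatable
    size = set_size or len(samples)     # assumption: default to the pool size
    return [[rng.choice(samples) for _ in range(size)] for _ in range(n_sets)]

# Each sample pairs a feature vector with a label (1 = has the tendency).
samples = [([5, 2, 0, 1, 0], 1), ([0, 0, 1, 0, 0], 0), ([3, 1, 0, 0, 2], 1)]
training_sets = bootstrap_sets(samples, n_sets=4)
```

Because every draw is made from the full pool, the same sample can appear in several training sets, which is the overlap the paragraph above relies on.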

104. With reference to preset discriminant indicators, construct decision trees corresponding to each of the training sets, and generate a corresponding random forest model according to each of the decision trees;

In this embodiment, the sample data in one training set is used to generate one decision tree, and the CART algorithm is preferred for generating the classification decision tree. The inputs of the algorithm are the training set, a Gini index threshold and a sample number threshold, and the output is a decision tree. Generation starts from the root node and recursively builds the CART classification tree on the training set, where the features are the preset discriminant indicators. At each node: if the number of samples is less than the preset threshold or no feature remains, the decision subtree is returned and the current node stops recursing; the Gini index of the sample set is calculated, and if it is less than the threshold, the decision subtree is returned and the current node stops recursing; otherwise, the Gini index of the data set is calculated for each value of each feature at the current node, the feature and feature value with the smallest Gini index are selected as the split point, child nodes are created, and the algorithm is executed recursively on them until the conditions for completing the decision tree are met.
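The source does not write out the Gini index. For reference, the standard CART definitions are given below, where the symbols are assumptions introduced here: D is the sample set at a node, p_k is the proportion of samples in D belonging to class k, and D_1, D_2 are the two subsets obtained by splitting D on value a of feature A.

```latex
\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^{2},
\qquad
\mathrm{Gini}(D, A = a) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)
```

A smaller value indicates a purer node, which is why the feature and feature value with the smallest Gini index are chosen as the split.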

Optionally, step 104 includes:

A classification and regression tree (CART) algorithm is adopted, with the preset discriminant indicators used for feature selection of the decision tree, and decision tree classification is performed on the training samples in each training set to obtain a plurality of decision trees;

The decision trees are sequentially combined to obtain a random forest model, wherein the discriminant indicators include keyword vectors, the number of different keywords hit, the total number of hit keywords, average text length, sensitive speech time and sensitive speech days.
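As a rough illustration of how the listed discriminant indicators might be assembled into one feature row per user, consider the sketch below. The function name `build_features` is hypothetical, and because the source does not define how the sensitive speech time and sensitive speech days are measured, they are simply passed in as already-computed numbers.

```python
def build_features(keyword_vec, texts, sensitive_time, sensitive_days):
    """Assemble one feature row from the discriminant indicators.

    Assumptions: text length is measured in characters, and the two
    "sensitive speech" indicators are supplied by the caller.
    """
    distinct_hits = sum(1 for c in keyword_vec if c > 0)   # number of different keywords hit
    total_hits = sum(keyword_vec)                          # total number of keyword hits
    avg_len = sum(len(t) for t in texts) / len(texts) if texts else 0.0
    return list(keyword_vec) + [distinct_hits, total_hits, avg_len,
                                sensitive_time, sensitive_days]

row = build_features([5, 2, 0, 1, 0], ["abcd", "ab"], sensitive_time=3, sensitive_days=2)
```

Each row of this kind, together with the user's known label, would form one training sample for the decision trees.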

Optionally, step 104 further includes:

S1. Select a discriminant index as a root node, and calculate the Gini index of each discriminant index value corresponding to the root node with respect to the training set;

S2. Determine whether each Gini index is greater than a preset first threshold and the number of samples in the sample set is greater than a preset second threshold;

S3. If yes, divide the training set into a plurality of leaf nodes, select the discriminant index value with the smallest Gini index as the root node, and execute S1-S2 cyclically;

S4. If not, generate a decision tree corresponding to the training set.
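The loop described in these steps can be sketched as a small recursive tree builder. This is a simplified illustration under stated assumptions rather than the applicant's implementation: the names `gini`, `best_split`, `build_tree` and `predict`, the binary "less than or equal" split, and the default threshold values are all chosen for the example.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def best_split(rows, labels):
    """Return (weighted Gini, feature index, threshold) of the best binary split, or None."""
    best, n = None, len(rows)
    for f in range(len(rows[0])):
        for value in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= value]
            right = [y for r, y in zip(rows, labels) if r[f] > value]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, f, value)
    return best

def build_tree(rows, labels, gini_threshold=0.05, min_samples=2):
    majority = Counter(labels).most_common(1)[0][0]
    # Keep splitting only while both the Gini index and the sample count exceed their thresholds.
    if gini(labels) <= gini_threshold or len(rows) <= min_samples:
        return {"leaf": majority}
    split = best_split(rows, labels)
    if split is None:
        return {"leaf": majority}
    _, f, value = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= value]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > value]
    return {
        "feature": f,
        "threshold": value,
        "left": build_tree([r for r, _ in left], [y for _, y in left], gini_threshold, min_samples),
        "right": build_tree([r for r, _ in right], [y for _, y in right], gini_threshold, min_samples),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return that leaf's label."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]

rows = [[5, 2], [4, 1], [0, 0], [1, 0]]   # toy feature rows
labels = [1, 1, 0, 0]                     # 1 = has the tendency
tree = build_tree(rows, labels)
```

Building one such tree per training set and collecting the trees in a list gives the forest that later votes on a new user.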

105. Acquire multiple pieces of second text information published by the user to be detected and second record parameters corresponding to the second texts;

106. Input the second text information and the second record parameters into the random forest model for voting, and obtain a voting result;

In this embodiment, after the text information and record parameters published by the user to be detected are obtained, the number of occurrences of each keyword in the second text information is counted to obtain a target vector, which is one of the inputs to the random forest model. The target vector is obtained in a manner similar to the keyword vectors of the sample users, to which reference may be made.

In this embodiment, the random forest contains multiple classification trees, each of which is a weak classifier. The classification results of these weak classifiers are combined by voting to form a strong classifier, which is the bagging idea underlying random forests.

Optionally, step 106 includes:

Count the times of occurrence of each keyword in the second text information and perform vector transformation to obtain a target vector;

Inputting the target vector and the second record parameter into the random forest model for classification to obtain a classification result;

All decision trees in the random forest model are made to vote on the classification results to obtain voting results.

107. Determine whether the user to be detected has the behavioral tendency according to the voting result.

In this embodiment, the voting result is either having the behavior tendency or not having the behavior tendency. For example, if 80% of the decision trees classify the user as having the behavior tendency and 20% classify the user as not having it, then 80% and 20% are the voting ratios, and the result with the higher voting ratio is taken as the recognition result of the model; that is, the user to be detected has the same behavior tendency as the sample users.

Optionally, step 107 includes:

Obtain the voting results of all decision trees in the random forest model, wherein the voting results are having the behavioral tendency and/or not having the behavioral tendency;

Calculate the voting ratios corresponding to different behavioral inclinations according to the voting results;

The behavioral tendency with the highest voting ratio is taken as the behavioral tendency of the user to be detected.
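The voting steps can be sketched in a few lines of Python. The function name `vote` and the label strings are assumptions for illustration; the input is taken to be the list of class labels returned by the individual decision trees.

```python
from collections import Counter

def vote(tree_predictions):
    """Return the label with the highest voting ratio and all ratios."""
    counts = Counter(tree_predictions)
    total = len(tree_predictions)
    ratios = {label: n / total for label, n in counts.items()}
    winner = max(ratios, key=ratios.get)
    return winner, ratios

# Ten trees: eight vote "has the tendency", two vote "does not".
predictions = ["has_tendency"] * 8 + ["no_tendency"] * 2
result, ratios = vote(predictions)
```

With these inputs the ratios are 0.8 and 0.2, matching the 80%/20% example in the description. How ties are broken is not addressed in the source; this sketch returns whichever tied label was counted first.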

In the embodiment of the present application, speech data published by users with the same type of behavior tendency is first collected, and keywords in the speech data are extracted as a characteristic representation of this type of user. The speech data is then used as training samples for machine learning to build a random forest model, and the speech data of the user to be detected is input into the model for identification to determine whether the user to be detected and the sample users have the same behavioral characteristics; if so, it can be determined that they have the same behavior tendency. The present application can thus extract the relevant speech features of users with the same type of behavior tendency, train a random forest model through machine learning, and then identify users whose behavior tendency is unknown to determine whether they have that type of behavior tendency.

Referring to FIG. 2, the second embodiment of the method for identifying user behavior tendency in the embodiment of the present application includes:

201. Acquire a plurality of pieces of first text information published by a plurality of sample users with a certain behavioral tendency and a first record parameter corresponding to each of the first text information;

202. Perform word segmentation processing on the first text information to obtain multiple word units;

203. Calculate the degree of discrimination of each word unit by using the TF-IDF algorithm;

204. Sort the degree of discrimination of each word unit, and extract the word unit with the highest degree of discrimination from the sorting result as a keyword;

In this optional embodiment, word segmentation processing needs to be performed on the text before extracting keywords in the text information. Word segmentation is the basis for processing natural language, so that the machine can understand human language. There are many existing word segmentation algorithms. In this embodiment, the NLP word segmentation algorithm is preferred to perform word segmentation processing on the original text, thereby extracting keywords. The NLP word segmentation algorithm is in the prior art and will not be repeated here.

In this optional embodiment, the TF-IDF algorithm is used to determine speech keywords. TF-IDF (term frequency-inverse document frequency) is a bag-of-words algorithm used to evaluate how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency at which it appears across the corpus. The discrimination degree W_i of word i is calculated as:

$$W_i = tf_i \times \log_{10}\frac{N}{df_i}$$

where tf_i is the frequency of word i in the document after tokenization, N is the total number of documents in the corpus, and df_i is the number of documents containing word i. The following example illustrates how this formula is used.

For example, if the total number of words in a document is 100 and the word "purchase" appears 4 times, then the frequency of "purchase" in the document is 4/100 = 0.04, that is, the word frequency tf_i = 0.04. If "purchase" appears in 1000 documents and the total number of documents is 10000, the inverse document frequency is

$$\log_{10}\frac{N}{df_i} = \log_{10}\frac{10000}{1000} = 1$$

Finally, W_i = 0.04 × 1 = 0.04, and this result is the degree of discrimination, or importance, of the word "purchase" in the document set. In this embodiment, the discrimination degrees of the words are sorted, and the top N words are used as the keywords of users with the behavior tendency, serving as the benchmark data for vectorizing the keywords of the sample users. Here N is a preset integer greater than 0 and is distinct from the document count N in the formula.
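The worked "purchase" example can be checked with a short Python sketch. The function names `tf_idf` and `top_keywords` are assumptions, and the base-10 logarithm is inferred from the example's result (10000/1000 giving an inverse document frequency of 1) rather than stated explicitly in the source.

```python
import math

def tf_idf(term_count, doc_length, n_docs, doc_freq):
    """Discrimination degree of a word: term frequency times log10(N / df)."""
    tf = term_count / doc_length
    idf = math.log10(n_docs / doc_freq)   # base 10 is an assumption taken from the example
    return tf * idf

def top_keywords(scores, n):
    """Return the n words with the highest discrimination degree."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]

w = tf_idf(term_count=4, doc_length=100, n_docs=10000, doc_freq=1000)
# Toy scores; only "purchase" comes from the example in the text.
keywords = top_keywords({"purchase": w, "the": 0.0, "jet": 0.07}, n=2)
```

A word that appears in every document gets a score of zero under this formula, which is how very common words drop out of the keyword list.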

205. Count the number of occurrences of each keyword in each of the first text information and perform vectorization processing to obtain multiple keyword vectors;

206. Using the keyword vectors and the first recording parameters as training samples, randomly extract multiple samples from the training samples for multiple times to obtain multiple training sets;

207. With reference to a preset discriminant index, construct a decision tree corresponding to each training set, and generate a corresponding random forest model according to each decision tree;

208. Obtain multiple pieces of second text information published by the user to be detected and second record parameters corresponding to the second texts;

209. Input the second text information and the second record parameters into the random forest model for voting, and obtain a voting result;

210. Determine whether the user to be detected has the behavioral tendency according to the voting result.

In the embodiment of the present application, keywords in the text information play a crucial role when analyzing users' behavior tendencies, because they represent the speech characteristics of users with this type of behavior tendency. To extract the keywords from all of the text information, the long text is first segmented into words that cannot be divided further, the frequency of occurrence of these words is then calculated, and several high-frequency words are used as keywords. The present application obtains representative keywords through the analysis and calculation of a large amount of data. As one of the discriminant indicators for behavior tendency identification, these keywords help predict the user's behavior tendency; combined with the other discriminant indicators, they allow the user's behavior tendency to be identified accurately so that further intervention can be taken.

The method for identifying the user behavior tendency in the embodiment of the present application has been described above. The apparatus for identifying the user behavior tendency in the embodiment of the present application is described below. Referring to FIG. 3, an embodiment of the user behavior tendency identification apparatus in the embodiment of the present application includes:

A first obtaining module 301, configured to obtain a plurality of pieces of first text information published by a plurality of sample users with certain behavioral tendencies and first record parameters corresponding to the first text information;

The vectorization module 302 is configured to extract a plurality of keywords from the first text information, count the number of times each keyword appears in each piece of first text information, and perform vectorization processing to obtain a plurality of keyword vectors;

Sampling module 303, configured to use the keyword vectors and the first recording parameters as training samples, and randomly extract multiple samples from the training samples for multiple times to obtain multiple training sets;

The construction module 304 is used for constructing a decision tree corresponding to each training set with reference to a preset discriminant index, and generating a corresponding random forest model according to each decision tree;

The second obtaining module 305 is configured to obtain multiple pieces of second text information published by the user to be detected and second record parameters corresponding to the second texts;

A voting module 306, configured to input the second text information and the second record parameters into the random forest model for voting to obtain a voting result;

The determining module 307 is configured to determine whether the user to be detected has the behavioral tendency according to the voting result.

Optionally, in an embodiment, the vectorization module 302 includes:

A keyword extraction unit, configured to perform word segmentation processing on the first text information to obtain a plurality of word units; use the TF-IDF algorithm to calculate the degree of discrimination of the word units; and sort the degrees of discrimination of the word units and extract, from the sorting result, the word unit with the highest degree of discrimination as a keyword.

Optionally, in an embodiment, the vectorization module 302 further includes:

A vector transformation unit, configured to determine, according to the keywords, the keywords contained in the first text information published by each sample user; count the number of occurrences of each keyword in the first text information published by each sample user; and perform vector transformation on the number of occurrences of each keyword to obtain a keyword vector corresponding to each sample user.

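Optionally, in an embodiment, the building module 304 is specifically used for:

Adopting a classification and regression tree algorithm, with the preset discriminant indicators used for feature selection of the decision tree, and performing decision tree classification on the training samples in each training set to obtain a plurality of decision trees;

Combining the decision trees in sequence to obtain a random forest model, wherein the discriminant indicators include keyword vectors, the number of different keywords hit, the total number of hit keywords, average text length, sensitive speech time and sensitive speech days.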

Optionally, in one embodiment, the building module 304 includes:

a computing unit, configured to select a discriminant index as a root node, and calculate the Gini index of each discriminant index value corresponding to the root node to the training set;

a judgment unit, configured to judge whether each Gini index is greater than a preset first threshold and the number of samples in the sample set is greater than a preset second threshold;

a dividing unit, configured to, if each Gini index is greater than the preset first threshold and the number of samples in the sample set is greater than the preset second threshold, divide the training set into a plurality of leaf nodes, select the discriminant index value with the smallest Gini index as the root node, and cyclically invoke the computing unit and the judgment unit;

A generating unit, configured to generate a decision tree corresponding to the training set if each Gini index is less than a preset first threshold or the number of samples in the sample set is less than a preset second threshold.
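The loop formed by these units can be sketched roughly as below. This is a simplified illustration, not the application's implementation: it assumes numeric features, binary splits of the form `feature <= value`, and it compares the impurity of the current node (rather than each candidate's Gini index) against the first threshold. The names `gini_threshold` and `min_samples` are hypothetical stand-ins for the first and second thresholds.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(rows, labels):
    """Return (weighted_gini, feature, value) of the lowest-Gini binary split, or None."""
    best = None
    for feature in range(len(rows[0])):
        for value in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= value]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > value]
            if not left or not right:
                continue
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or weighted < best[0]:
                best = (weighted, feature, value)
    return best


def build_tree(rows, labels, gini_threshold=0.0, min_samples=1):
    """Split while impurity and sample count exceed both thresholds; otherwise emit a leaf."""
    split = None
    if gini(labels) > gini_threshold and len(labels) > min_samples:
        split = best_split(rows, labels)
    if split is None:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, feature, value = split
    left = [(r, l) for r, l in zip(rows, labels) if r[feature] <= value]
    right = [(r, l) for r, l in zip(rows, labels) if r[feature] > value]
    return {
        "feature": feature,
        "value": value,
        "left": build_tree([r for r, _ in left], [l for _, l in left], gini_threshold, min_samples),
        "right": build_tree([r for r, _ in right], [l for _, l in right], gini_threshold, min_samples),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["value"] else tree["right"]
    return tree["leaf"]
```

Raising either threshold stops the recursion earlier and yields a shallower tree, which is the role the two preset thresholds play in the description.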

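Optionally, in an embodiment, the voting module 306 is specifically configured to:

count the number of occurrences of each keyword in the second text information and perform vector transformation to obtain a target vector;

input the target vector and the second record parameters into the random forest model for classification to obtain a classification result;

have all decision trees in the random forest model vote on the classification result to obtain the voting result.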

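Optionally, in an embodiment, the determining module 307 is specifically configured to:

obtain the voting results of all decision trees in the random forest model, wherein each voting result is either having the behavioral tendency or not having the behavioral tendency;

calculate the voting ratio corresponding to each behavioral tendency according to the voting results;

take the behavioral tendency with the highest voting ratio as the behavioral tendency of the user to be detected.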

In the embodiment of the present application, speech data published by users with the same type of behavioral tendency is first collected, and the keywords in the speech data are extracted as the characteristic representation of this type of user. The speech data is then used as training samples for machine learning to build a random forest model. Speech data related to the user to be detected is input into the model for identification, to determine whether the user to be detected has the same behavioral characteristics as the sample users; if so, it can be determined that the user to be detected has the same behavioral tendency as the sample users. The present application can thus extract the speech features of users with the same type of behavioral tendency, train a random forest model through machine learning, and then identify whether users with unknown behavioral tendencies have that type of behavioral tendency.
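The final identification step rests on the votes of the individual decision trees. A minimal sketch of turning per-tree votes into a decision is given here, assuming each tree emits a boolean label (True for having the behavioral tendency); the function names are hypothetical, and in the event of a tie this sketch simply returns the label that was seen first.

```python
from collections import Counter


def vote_ratios(votes):
    """Map each predicted label to its share of the trees' votes."""
    total = len(votes)
    return {label: count / total for label, count in Counter(votes).items()}


def decide(votes):
    """Return the label with the highest vote share."""
    ratios = vote_ratios(votes)
    return max(ratios, key=ratios.get)
```

With four trees voting `[True, True, False, True]`, the ratios are 0.75 and 0.25, so the user to be detected would be judged to have the behavioral tendency.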

Fig. 3 above describes the user behavior tendency identification device in the embodiment of the present application in detail from the perspective of modular functional entities, and the following describes the user behavior tendency identification device in the embodiment of the present application in detail from the perspective of hardware processing.

FIG. 4 is a schematic structural diagram of a user behavior tendency identification device provided by an embodiment of the present application. The user behavior tendency identification device 400 may vary greatly depending on configuration or performance, and may include one or more central processing units (CPUs) 410 (e.g., one or more processors), a memory 420, and one or more storage media 430 (e.g., one or more mass storage devices) that store application programs 433 or data 432. The memory 420 and the storage medium 430 may provide transient storage or persistent storage. The program stored in the storage medium 430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the user behavior tendency identification device 400. Furthermore, the processor 410 may be configured to communicate with the storage medium 430 and to execute, on the user behavior tendency identification device 400, the series of instruction operations in the storage medium 430.

The user behavior tendency identification device 400 may also include one or more power supplies 440, one or more wired or wireless network interfaces 450, one or more input/output interfaces 460, and/or one or more operating systems 431, such as Windows Server, Mac OS X, Unix, Linux, and FreeBSD. Those skilled in the art can understand that the structure shown in FIG. 4 does not constitute a limitation on the user behavior tendency identification device, which may include more or fewer components than those shown in the figure, combine some components, or use a different arrangement of components.

The present application also provides a user behavior tendency identification device, comprising a memory and at least one processor, wherein instructions are stored in the memory, and the memory and the at least one processor are interconnected by a line; the at least one processor invokes the instructions in the memory so that the user behavior tendency identification device executes the steps of the above-mentioned user behavior tendency identification method.

The present application also provides a computer-readable storage medium, and the computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions, and when the computer instructions are executed on the computer, the computer performs the following steps:

Acquiring a plurality of pieces of first text information published by a plurality of sample users with certain behavioral tendencies and the first record parameters corresponding to the first text information;

Extracting a plurality of keywords in each of the first text information, counting the number of occurrences of each keyword in each of the first text information, and performing vectorization processing to obtain a plurality of keyword vectors;

Taking the keyword vectors and the first recording parameters as training samples, randomly extracting multiple samples from the training samples multiple times to obtain multiple training sets;

With reference to the preset discriminant indicators, construct a decision tree corresponding to each training set, and generate a corresponding random forest model according to each decision tree;

Acquiring multiple pieces of second text information published by the user to be detected and second record parameters corresponding to the second texts;

Inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;

According to the voting result, it is determined whether the user to be detected has the behavioral tendency.
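The sampling step among the steps above, which produces multiple training sets from one pool of training samples, can be sketched as follows. Sampling with replacement (as in conventional bagging) is an assumption here, since the text only says that samples are randomly extracted multiple times; the function name and the `set_size` and `seed` parameters are likewise hypothetical.

```python
import random


def bootstrap_training_sets(samples, n_sets, set_size=None, seed=0):
    """Draw n_sets training sets by sampling with replacement from samples."""
    rng = random.Random(seed)  # Fixed seed so the sketch is reproducible.
    size = len(samples) if set_size is None else set_size
    return [[rng.choice(samples) for _ in range(size)] for _ in range(n_sets)]
```

Each resulting training set would then be used to build one decision tree, so `n_sets` also determines the number of trees that vote.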

Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the system, device and unit described above may refer to the corresponding process in the foregoing method embodiments, which will not be repeated here.

The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application, in essence, or the part that contributes over the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

As mentioned above, the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions in the embodiments of the present application.

Claims

A method for identifying user behavior tendency, including:

Acquiring multiple pieces of first text information published by multiple sample users with certain behavioral tendencies and first record parameters corresponding to the first text information;

Extracting a plurality of keywords in each of the first text information, counting the number of occurrences of each keyword in each of the first text information, and performing vectorization processing to obtain a plurality of keyword vectors;

Taking the keyword vectors and the first recording parameters as training samples, randomly extracting multiple samples from the training samples multiple times to obtain multiple training sets;

With reference to the preset discriminant indicators, a decision tree corresponding to each training set is constructed, and a corresponding random forest model is generated according to each decision tree;

Acquiring multiple pieces of second text information published by the user to be detected and second record parameters corresponding to the second texts;

Inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;

According to the voting result, it is determined whether the user to be detected has the behavioral tendency.
The method for identifying user behavior tendency according to claim 1, wherein the extracting a plurality of keywords in each of the first text information comprises:

performing word segmentation processing on the first text information to obtain a plurality of word units;

TF-IDF algorithm is used to calculate the degree of discrimination of each word unit;

Sort the degree of discrimination of each word unit, and extract the word unit with the highest degree of discrimination from the sorting result as a keyword.
The method for identifying a user behavior tendency according to claim 1 or 2, wherein the counting the number of occurrences of each keyword in the first text information and performing vectorization processing to obtain a plurality of keyword vectors comprises:

According to the keywords, respectively determine the keywords contained in the first text information published by the sample users;

Counting the occurrences of each keyword in the first text information published by each sample user;

Vector transformation is performed on the occurrence times of each keyword to obtain a keyword vector corresponding to each sample user.
The method for identifying user behavior tendencies according to claim 1, wherein the building a decision tree corresponding to each training set with reference to a preset discriminant index, and generating a corresponding random forest model according to each decision tree comprises:

A classification and regression tree (CART) algorithm is adopted, with a preset discriminant index used for feature selection in the decision tree, and decision tree classification is performed on each training sample in each training set to obtain a plurality of decision trees;

The decision trees are sequentially combined to obtain a random forest model, wherein the discriminant indicators include keyword vectors, the number of different keywords hit, the total number of hit keywords, average text length, sensitive speech time and sensitive speech days.
The method for identifying user behavior tendencies according to claim 1 or 4, wherein, with reference to a preset discrimination index, constructing a decision tree corresponding to each training set comprises:

S1. Select a discriminant index as a root node, and calculate the Gini index of the training set for each discriminant index value corresponding to the root node;

S2. Determine whether each Gini index is greater than a preset first threshold and whether the number of samples in the sample set is greater than a preset second threshold;

S3. If yes, divide the training set into a plurality of leaf nodes, select the discriminant index value with the smallest Gini index as the root node, and execute S1-S2 cyclically;

S4. If not, generate a decision tree corresponding to the training set.
The method for identifying user behavior tendency according to claim 1, wherein the inputting the second text information and the second recording parameters into the random forest model for voting, and obtaining a voting result comprises:

Count the times of occurrence of each keyword in the second text information and perform vector transformation to obtain a target vector;

Inputting the target vector and the second record parameter into the random forest model for classification to obtain a classification result;

All decision trees in the random forest model are made to vote on the classification results to obtain voting results.
The method for identifying a user behavior tendency according to claim 1 or 6, wherein the determining whether the user to be detected has the behavior tendency according to the voting result comprises:

Obtain the voting results of all decision trees in the random forest model, wherein each voting result is either having the behavioral tendency or not having the behavioral tendency;

Calculate the voting ratio corresponding to each behavioral tendency according to the voting results;

The behavioral tendency with the highest voting ratio is taken as the behavioral tendency of the user to be detected.
A user behavior tendency identification device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:

Acquiring a plurality of pieces of first text information published by a plurality of sample users with certain behavioral tendencies and the first record parameters corresponding to the first text information;

Extracting a plurality of keywords in each of the first text information, counting the number of occurrences of each keyword in each of the first text information, and performing vectorization processing to obtain a plurality of keyword vectors;

Taking the keyword vectors and the first recording parameters as training samples, randomly extracting multiple samples from the training samples multiple times to obtain multiple training sets;

With reference to the preset discriminant indicators, a decision tree corresponding to each training set is constructed, and a corresponding random forest model is generated according to each decision tree;

Acquiring multiple pieces of second text information published by the user to be detected and second record parameters corresponding to the second texts;

Inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;

According to the voting result, it is determined whether the user to be detected has the behavioral tendency.
The user behavior tendency identification device according to claim 8, wherein the processor further implements the following steps when executing the computer program:

performing word segmentation processing on the first text information to obtain a plurality of word units;

TF-IDF algorithm is used to calculate the degree of discrimination of each word unit;

Sort the degree of discrimination of each word unit, and extract the word unit with the highest degree of discrimination from the sorting result as a keyword.
The user behavior tendency identification device according to claim 8 or 9, wherein the processor further implements the following steps when executing the computer program:

According to the keywords, respectively determine the keywords included in the first text information published by the sample users;

Counting the occurrences of each keyword in the first text information published by each sample user;

Vector transformation is performed on the occurrence times of each keyword to obtain a keyword vector corresponding to each sample user.
The user behavior tendency identification device according to claim 8, wherein the processor further implements the following steps when executing the computer program:

A classification and regression tree (CART) algorithm is adopted, with a preset discriminant index used for feature selection in the decision tree, and decision tree classification is performed on each training sample in each training set to obtain a plurality of decision trees;

The decision trees are sequentially combined to obtain a random forest model, wherein the discriminant indicators include keyword vectors, the number of different keywords hit, the total number of hit keywords, average text length, sensitive speech time and sensitive speech days.
The device for identifying user behavior tendency according to claim 8 or 11, wherein the processor further implements the following steps when executing the computer program:

S1. Select a discriminant index as a root node, and calculate the Gini index of the training set for each discriminant index value corresponding to the root node;

S2. Determine whether each Gini index is greater than a preset first threshold and whether the number of samples in the sample set is greater than a preset second threshold;

S3. If yes, divide the training set into a plurality of leaf nodes, select the discriminant index value with the smallest Gini index as the root node, and execute S1-S2 cyclically;

S4. If not, generate a decision tree corresponding to the training set.
The user behavior tendency identification device according to claim 8, wherein the processor further implements the following steps when executing the computer program:

Count the times of occurrence of each keyword in the second text information and perform vector transformation to obtain a target vector;

Inputting the target vector and the second record parameter into the random forest model for classification to obtain a classification result;

All decision trees in the random forest model are made to vote on the classification results to obtain voting results.
The user behavior tendency identification device according to claim 8 or 13, wherein the processor further implements the following steps when executing the computer program:

Obtain the voting results of all decision trees in the random forest model, wherein each voting result is either having the behavioral tendency or not having the behavioral tendency;

Calculate the voting ratio corresponding to each behavioral tendency according to the voting results;

The behavioral tendency with the highest voting ratio is taken as the behavioral tendency of the user to be detected.
A computer-readable storage medium storing computer instructions which, when executed on a computer, cause the computer to perform the following steps:

Acquiring a plurality of pieces of first text information published by a plurality of sample users with certain behavioral tendencies and the first record parameters corresponding to the first text information;

Extracting a plurality of keywords in each of the first text information, counting the number of occurrences of each keyword in each of the first text information, and performing vectorization processing to obtain a plurality of keyword vectors;

Taking the keyword vectors and the first recording parameters as training samples, randomly extracting multiple samples from the training samples multiple times to obtain multiple training sets;

With reference to the preset discriminant indicators, a decision tree corresponding to each training set is constructed, and a corresponding random forest model is generated according to each decision tree;

Acquiring multiple pieces of second text information published by the user to be detected and second record parameters corresponding to the second texts;

Inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;

According to the voting result, it is determined whether the user to be detected has the behavioral tendency.
The computer-readable storage medium of claim 15, wherein the computer instructions, when executed on a computer, further cause the computer to perform the following steps:

performing word segmentation processing on the first text information to obtain a plurality of word units;

TF-IDF algorithm is used to calculate the degree of discrimination of each word unit;

Sort the degree of discrimination of each word unit, and extract the word unit with the highest degree of discrimination from the sorting result as a keyword.
The computer-readable storage medium of claim 15 or 16, wherein the computer instructions, when executed on a computer, further cause the computer to perform the following steps:

According to the keywords, respectively determine the keywords included in the first text information published by the sample users;

Counting the occurrences of each keyword in the first text information published by each sample user;

Vector transformation is performed on the occurrence times of each keyword to obtain a keyword vector corresponding to each sample user.
The computer-readable storage medium of claim 15, wherein the computer instructions, when executed on a computer, further cause the computer to perform the following steps:

A classification and regression tree (CART) algorithm is adopted, with a preset discriminant index used for feature selection in the decision tree, and decision tree classification is performed on each training sample in each training set to obtain a plurality of decision trees;

The decision trees are sequentially combined to obtain a random forest model, wherein the discriminant indicators include keyword vectors, the number of different keywords hit, the total number of hit keywords, average text length, sensitive speech time and sensitive speech days.
The computer-readable storage medium of claim 15 or 18, wherein the computer instructions, when executed on a computer, further cause the computer to perform the following steps:

S1. Select a discriminant index as a root node, and calculate the Gini index of the training set for each discriminant index value corresponding to the root node;

S2. Determine whether each Gini index is greater than a preset first threshold and whether the number of samples in the sample set is greater than a preset second threshold;

S3. If yes, divide the training set into a plurality of leaf nodes, select the discriminant index value with the smallest Gini index as the root node, and execute S1-S2 cyclically;

S4. If not, generate a decision tree corresponding to the training set.
A user behavior tendency identification device, the user behavior tendency identification device comprising:

a first acquisition module, configured to acquire a plurality of pieces of first text information published by a plurality of sample users with certain behavioral tendencies and first record parameters corresponding to the first text information;

a vectorization module, configured to extract multiple keywords in each of the first text information, count the number of occurrences of each keyword in each of the first text information, and perform vectorization processing to obtain multiple keyword vectors;

a sampling module, configured to use the keyword vectors and the first recording parameters as training samples, and randomly extract multiple samples from the training samples for multiple times to obtain multiple training sets;

a building module for constructing a decision tree corresponding to each training set with reference to a preset discriminant index, and generating a corresponding random forest model according to each decision tree;

a second obtaining module, configured to obtain a plurality of pieces of second text information published by the user to be detected and second record parameters corresponding to the second texts;

a voting module, configured to input the second text information and the second record parameters into the random forest model for voting to obtain a voting result;

A determination module, configured to determine whether the user to be detected has the behavioral tendency according to the voting result.