CN112527958A - User behavior tendency identification method, device, equipment and storage medium
- Publication number
- CN112527958A (application CN202011436696.4A)
- Authority
- CN
- China
- Prior art keywords
- behavior tendency
- user
- voting
- text information
- keywords
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the field of artificial intelligence and discloses a method, a device, equipment and a storage medium for identifying user behavior tendency. The user behavior tendency identification method comprises the following steps: acquiring a plurality of pieces of text information issued by a plurality of sample users with a determined behavior tendency, together with the corresponding recording parameters; extracting a plurality of keywords from each piece of text information, and converting the keywords into keyword vectors; taking each keyword vector and each recording parameter as training samples, and randomly extracting a plurality of samples from the training samples to obtain a plurality of training sets; building a plurality of decision trees according to preset discrimination indexes, and generating a random forest model; inputting the text information issued by the user to be detected and the corresponding recording parameters into the random forest model for voting, and determining whether the user to be detected has the behavior tendency according to the voting result. The invention can quickly determine the behavior tendency of a user from the speech information issued by the user.
Description
Technical Field
The invention relates to the field of artificial intelligence, in particular to a user behavior tendency identification method, a device, equipment and a storage medium.
Background
With the development of the internet, information spreads across the network ever more rapidly and widely. This mass of speech information affects users in different ways; in particular, speech published by users with a negative behavior tendency may trigger a group effect and lead to serious consequences. If an information-bearing platform can identify users with a negative behavior tendency in advance and intervene, the impact of such adverse consequences can be reduced.
At present, harmful user speech is generally handled by sensitive-word shielding. Only known sensitive words can be shielded; for speech that is psychologically negative but contains no sensitive words, shielding cannot eliminate the influence. Users with a particular characteristic behavior tendency are difficult for a computer to recognize and can only be identified through an after-the-fact judgment mechanism.
Disclosure of Invention
The invention mainly aims to solve the technical problem of how to flexibly identify the behavior tendency of a user.
The invention provides a user behavior tendency identification method in a first aspect, which comprises the following steps:
acquiring a plurality of pieces of first text information issued by a plurality of sample users with determined behavior tendency and first recording parameters corresponding to the first text information;
extracting a plurality of keywords in each first text message, counting the occurrence frequency of each keyword in each first text message, and performing vectorization processing to obtain a plurality of keyword vectors;
taking each keyword vector and each first recording parameter as training samples, and randomly extracting a plurality of samples from each training sample for multiple times to obtain a plurality of training sets;
constructing decision trees corresponding to the training sets according to preset discrimination indexes, and generating corresponding random forest models according to the decision trees;
acquiring a plurality of pieces of second text information issued by a user to be detected and second recording parameters corresponding to the second texts;
inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;
and determining whether the user to be detected has the behavior tendency or not according to the voting result.
Optionally, in a first implementation manner of the first aspect of the present invention, the extracting a plurality of keywords in each piece of first text information includes:
performing word segmentation processing on the first text information to obtain a plurality of word units;
calculating the discrimination of each word unit by adopting a TF-IDF algorithm;
and sorting the discrimination of each word unit, and extracting the word unit with the highest discrimination from the sorting result as a keyword.
Optionally, in a second implementation manner of the first aspect of the present invention, the counting the occurrence frequency of each keyword in each first text message and performing vectorization processing to obtain a plurality of keyword vectors includes:
determining keywords contained in the first text information issued by each sample user respectively according to the keywords;
counting the occurrence times of the keywords in the first text information issued by each sample user;
and performing vector conversion on the occurrence times of the keywords to obtain the keyword vector corresponding to each sample user.
Optionally, in a third implementation manner of the first aspect of the present invention, the constructing decision trees corresponding to the training sets according to preset discrimination indexes, and generating a corresponding random forest model according to the decision trees includes:
performing decision tree classification on each training sample in each training set by adopting a classification and regression tree (CART) algorithm, with the preset discrimination indexes as the features for decision tree feature selection, to obtain a plurality of decision trees;
and sequentially combining the decision trees to obtain a random forest model, wherein the discrimination indexes comprise keyword vectors, the number of distinct keywords hit, the total number of keyword hits, average text length, sensitive speaking time and sensitive speaking days.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the constructing decision trees corresponding to the training sets according to preset discrimination indexes includes:
S1, selecting a discrimination index as a root node, and calculating the Gini index, with respect to the training set, of each discrimination index value corresponding to the root node;
S2, determining whether each Gini index is greater than a preset first threshold and whether the number of samples in the sample set is greater than a preset second threshold;
S3, if yes, dividing the training set into a plurality of leaf nodes, selecting the discrimination index value with the minimum Gini index as a root node, and repeating S1 and S2;
and S4, if not, generating the decision tree corresponding to the training set.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the inputting the second text information and the second recording parameters into the random forest model for voting, and obtaining a voting result includes:
counting the occurrence frequency of each keyword in the second text information and carrying out vector transformation to obtain a target vector;
inputting the target vector and the second recording parameter into the random forest model for classification to obtain a classification result;
and voting all decision trees in the random forest model on the classification result to obtain a voting result.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the determining, according to the voting result, whether the user to be detected has the behavior tendency includes:
obtaining the voting results of all decision trees in the random forest model, wherein each voting result is either that the behavior tendency exists or that the behavior tendency does not exist;
calculating voting ratios corresponding to different behavior tendencies according to the voting results;
and taking the behavior tendency with the highest voting ratio as the behavior tendency of the user to be detected.
A second aspect of the present invention provides a user behavior tendency recognition apparatus, including:
the first acquisition module is used for acquiring a plurality of pieces of first text information issued by a plurality of sample users with determined behavior tendency and first recording parameters corresponding to the first text information;
the vectorization module is used for extracting a plurality of keywords in each first text message, counting the occurrence frequency of each keyword in each first text message and carrying out vectorization processing to obtain a plurality of keyword vectors;
the sampling module is used for taking each keyword vector and each first recording parameter as training samples, and randomly extracting a plurality of samples from each training sample for a plurality of times to obtain a plurality of training sets;
the building module is used for building decision trees corresponding to the training sets according to preset discrimination indexes and generating corresponding random forest models according to the decision trees;
the second acquisition module is used for acquiring a plurality of pieces of second text information issued by the user to be detected and second recording parameters corresponding to the second texts;
the voting module is used for inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;
and the determining module is used for determining whether the user to be detected has the behavior tendency or not according to the voting result.
Optionally, in a first implementation manner of the second aspect of the present invention, the vectorization module includes:
the keyword extraction unit is used for performing word segmentation processing on the first text information to obtain a plurality of word units; calculating the discrimination of each word unit by adopting a TF-IDF algorithm; and sorting the discrimination of each word unit, and extracting the word unit with the highest discrimination from the sorting result as a keyword.
Optionally, in a second implementation manner of the second aspect of the present invention, the vectorization module further includes:
the vector conversion unit is used for respectively determining the keywords contained in the first text information issued by each sample user according to the keywords; counting the occurrence times of the keywords in the first text information issued by each sample user; and performing vector conversion on the occurrence times of the keywords to obtain the keyword vector corresponding to each sample user.
Optionally, in a third implementation manner of the second aspect of the present invention, the building module is specifically configured to:
performing decision tree classification on each training sample in each training set by adopting a classification and regression tree (CART) algorithm, with the preset discrimination indexes as the features for decision tree feature selection, to obtain a plurality of decision trees;
and sequentially combining the decision trees to obtain a random forest model, wherein the discrimination indexes comprise keyword vectors, the number of distinct keywords hit, the total number of keyword hits, average text length, sensitive speaking time and sensitive speaking days.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the building module includes:
the computing unit is used for selecting a discrimination index as a root node and calculating the Gini index, with respect to the training set, of each discrimination index value corresponding to the root node;
the judging unit is used for determining whether each Gini index is greater than a preset first threshold and whether the number of samples in the sample set is greater than a preset second threshold;
the dividing unit is used for dividing the training set into a plurality of leaf nodes if each Gini index is greater than the preset first threshold and the number of samples in the sample set is greater than the preset second threshold, selecting the discrimination index value with the minimum Gini index as a root node, and cyclically invoking the computing unit and the judging unit;
and the generating unit is used for generating the decision tree corresponding to the training set if each Gini index is smaller than the preset first threshold or the number of samples in the sample set is smaller than the preset second threshold.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the voting module is specifically configured to:
counting the occurrence frequency of each keyword in the second text information and carrying out vector transformation to obtain a target vector;
inputting the target vector and the second recording parameter into the random forest model for classification to obtain a classification result;
and voting all decision trees in the random forest model on the classification result to obtain a voting result.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the determining module is specifically configured to:
obtaining the voting results of all decision trees in the random forest model, wherein each voting result is either that the behavior tendency exists or that the behavior tendency does not exist;
calculating voting ratios corresponding to different behavior tendencies according to the voting results;
and taking the behavior tendency with the highest voting ratio as the behavior tendency of the user to be detected.
A third aspect of the present invention provides a user behavior tendency identification device, including: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the user behavior tendency identification device to perform the user behavior tendency identification method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to execute the above-mentioned user behavior tendency identification method.
According to the technical scheme provided by the invention, the speech data issued by users with the same type of characteristic behavior tendency are collected firstly, and the keywords in the speech data are extracted and used as the characteristic representation of the type of users. And then, using the speech data as a training sample for machine learning to construct a random forest model, inputting the speech data related to the user to be detected into the model for recognition, judging whether the user to be detected and the sample user have the same behavior characteristics, and if so, determining that the user to be detected and the sample user have the same behavior tendency. The invention can extract the speech characteristics related to the users with the same type of characteristic behavior tendency, train out the random forest model in a machine learning mode, further identify the users with unknown behavior tendency and determine whether the users have the same type of behavior tendency.
Drawings
FIG. 1 is a diagram of a first embodiment of a method for identifying a user behavior tendency according to an embodiment of the present invention;
FIG. 2 is a diagram of a second embodiment of a method for identifying a user behavior tendency according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment of a user behavior tendency recognition apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an embodiment of a user behavior tendency identification device in the embodiment of the present invention.
Detailed Description
The terms "first", "second", "third", "fourth", and the like (if any) in the description and claims and the above drawings are used for distinguishing similar objects and not necessarily for describing a particular order or sequence. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, a specific flow of the embodiment of the present invention is described below, and referring to fig. 1, a first embodiment of the user behavior tendency identification method in the embodiment of the present invention includes:
101. acquiring a plurality of pieces of first text information issued by a plurality of sample users with determined behavior tendency and first recording parameters corresponding to the first text information;
It is to be understood that the executing subject of the present invention may be a user behavior tendency recognition device, or may be a terminal or a server, which is not limited herein. The embodiment of the present invention is described with a server as the executing subject.
In this embodiment, because the invention needs to determine whether an unknown user has a certain type of behavior tendency, the speech characteristics of that type of user must first be determined; the behavior tendency of the unknown user is then determined by feature matching. A large amount of sample data is therefore needed to extract speech characteristics that represent users with the same type of behavior tendency. This embodiment acquires the text information issued by sample users with a determined behavior tendency, together with the issuing records corresponding to that text information, for use in extracting the speech-characteristic keywords and as machine learning training samples.
In this embodiment, a sample user with a determined behavior tendency may be a sample user with a desire to purchase a certain commodity, a sample user with a certain negative behavior tendency, a sample user with a certain psychological characteristic, and the like; examples are a user who purchases a private airplane, a suicide-prone user, and a user with depression. The behavior tendency of the sample users determines the recognition type of the model, and models with different recognition types can recognize users with different types of behavior tendency.
102. Extracting a plurality of keywords in each first text message, counting the occurrence frequency of each keyword in each first text message, and performing vectorization processing to obtain a plurality of keyword vectors;
In this embodiment, the characteristic words in the speech of users with the same type of behavior tendency are extracted from a large amount of sample data. The speech of an unknown user is compared with these characteristic words and combined with other discrimination indexes to determine whether the unknown user has the same characteristics.
In this embodiment, after the keywords have been extracted from the speech of users with the particular behavior tendency, the speech of each sample user is analyzed. During the analysis, the hit rate of the keywords in the text information issued by the sample user is counted; this hit rate serves as one of the discrimination indexes when an unknown user is identified and is a useful reference.
Optionally, step 102 includes:
determining keywords contained in the first text information issued by each sample user respectively according to the keywords;
counting the occurrence times of the keywords in the first text information issued by each sample user;
and performing vector conversion on the occurrence times of the keywords to obtain the keyword vector corresponding to each sample user.
In this optional embodiment, the text must first be converted into a vector before the keyword hit rate is calculated. Here the vector of a text is the number of times each keyword appears in it. For example, if the keywords extracted from the text information issued by all sample users are D = (T1, T2, T3, T4, T5), and the numbers of occurrences of these keywords in the text information issued by one sample user are W = (5, 2, 0, 1, 0), then W can be used as the keyword vector of that sample user.
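The counting step in the example above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `keyword_vector` is hypothetical, and the sketch assumes each text has already been segmented into a list of word units.

```python
from collections import Counter

def keyword_vector(token_lists, keywords):
    """Count how often each keyword occurs across one user's segmented texts.

    token_lists: list of texts, each already split into word units (assumption).
    keywords:    the ordered keyword list D extracted from all sample users.
    """
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Counter returns 0 for keywords the user never used.
    return [counts[kw] for kw in keywords]

# Keywords D = (T1, ..., T5) and two texts from one hypothetical sample user.
D = ["T1", "T2", "T3", "T4", "T5"]
user_texts = [["T1", "T1", "T2", "T1"], ["T4", "T1", "T2", "T1"]]
W = keyword_vector(user_texts, D)
print(W)  # [5, 2, 0, 1, 0]
```

The position of each count in W follows the order of D, so vectors from different users are directly comparable.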
103. Taking each keyword vector and each first recording parameter as training samples, and randomly extracting a plurality of samples from each training sample for multiple times to obtain a plurality of training sets;
In this embodiment, the keyword vector and the first recording parameter corresponding to each sample user are used as training samples. A plurality of samples are randomly drawn from the training samples with replacement to obtain the training sets, and a decision tree is constructed from each training set so as to generate a random forest model. Sampling is random so that the decision trees differ from one another and therefore produce different classification results. Sampling is with replacement so that the training sets of different trees overlap, which avoids one-sided decisions: the final result is produced by a vote among the decision trees, and voting is meant to find what the trees agree on. If the results of the trees were completely unrelated, the final vote would not help solve the problem. This embodiment therefore obtains the training sets by random sampling with replacement, repeated multiple times.
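Random sampling with replacement can be sketched as follows. The function name `bootstrap_training_sets`, its parameters and the sample data are illustrative assumptions; the source does not state how many samples each training set contains, so the sketch defaults to the size of the original sample pool.

```python
import random

def bootstrap_training_sets(samples, n_sets, set_size=None, seed=0):
    """Draw n_sets training sets from samples, sampling with replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    size = len(samples) if set_size is None else set_size
    return [[rng.choice(samples) for _ in range(size)] for _ in range(n_sets)]

# Each hypothetical sample pairs a feature row with a label (1 = has the tendency).
samples = [
    ([5, 2, 0, 1, 0], 1),
    ([0, 0, 1, 0, 0], 0),
    ([3, 1, 0, 0, 2], 1),
    ([0, 1, 0, 0, 0], 0),
]
training_sets = bootstrap_training_sets(samples, n_sets=3)
for i, ts in enumerate(training_sets):
    print(i, [label for _, label in ts])
```

Because the same sample may be drawn more than once, the training sets overlap while still differing from one another.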
104. Constructing decision trees corresponding to the training sets according to preset discrimination indexes, and generating corresponding random forest models according to the decision trees;
In this embodiment, the sample data in each training set is used to generate a decision tree. A CART algorithm is preferably used to generate a classification decision tree; its inputs are the training set, a Gini index threshold and a sample-count threshold, and its output is the decision tree. The CART classification tree is built recursively from the root node using the training set. If the number of samples at the current node is less than the preset number, or no feature remains, the decision subtree is returned and recursion at the current node stops; in this embodiment, a feature is a preset discrimination index. The Gini index of the sample set is then calculated; if it is smaller than the threshold, the decision subtree is returned and recursion at the current node stops. Otherwise, the Gini index of the data set is calculated for each value of each feature available at the current node, the feature and feature value with the minimum Gini index are selected as the classification node, leaf nodes are established, and the algorithm is executed recursively from the beginning until the conditions for generating the decision tree are met.
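The Gini index used above can be illustrated with the standard definition, Gini(D) = 1 - sum of p_k squared over the classes k, and its weighted form for a binary split. The source does not spell out the formula, so the standard CART definition is assumed here; the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of a label set: 1 - sum(p_k^2) over the classes k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(rows, labels, feature, threshold):
    """Weighted Gini index after splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini([1, 1, 0, 0]))  # 0.5, the maximum for two classes
print(gini([1, 1, 1]))     # 0.0, a pure node
print(split_gini([[0], [1], [2], [3]], [0, 0, 1, 1], feature=0, threshold=1))  # 0.0
```

A lower value after a split means the two resulting subsets are purer, which is why the feature value with the minimum Gini index is chosen.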
Optionally, step 104 includes:
performing decision tree classification on each training sample in each training set by adopting a classification and regression tree (CART) algorithm, with the preset discrimination indexes as the features for decision tree feature selection, to obtain a plurality of decision trees;
and sequentially combining the decision trees to obtain a random forest model, wherein the discrimination indexes comprise keyword vectors, the number of distinct keywords hit, the total number of keyword hits, average text length, sensitive speaking time and sensitive speaking days.
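One way to assemble the listed indexes into a single feature row is sketched below. This is an assumption-laden illustration: the function name `feature_row` is hypothetical, text length is measured in characters, and the two "sensitive speaking" values are passed in as precomputed numbers because this section does not define how they are measured.

```python
def feature_row(keyword_vec, texts, sensitive_time, sensitive_days):
    """Concatenate the keyword vector with the other indexes of one user.

    sensitive_time and sensitive_days are taken as given numbers (assumption);
    their exact definition is not stated in this section.
    """
    distinct_hits = sum(1 for c in keyword_vec if c > 0)  # distinct keywords hit
    total_hits = sum(keyword_vec)                          # total keyword hits
    avg_len = sum(len(t) for t in texts) / len(texts) if texts else 0.0
    return list(keyword_vec) + [distinct_hits, total_hits, avg_len,
                                sensitive_time, sensitive_days]

row = feature_row([5, 2, 0, 1, 0], ["abcd", "ab"], sensitive_time=23, sensitive_days=4)
print(row)  # [5, 2, 0, 1, 0, 3, 8, 3.0, 23, 4]
```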
Optionally, step 104 further includes:
S1, selecting a discrimination index as a root node, and calculating the Gini index, with respect to the training set, of each discrimination index value corresponding to the root node;
S2, determining whether each Gini index is greater than a preset first threshold and whether the number of samples in the sample set is greater than a preset second threshold;
S3, if yes, dividing the training set into a plurality of leaf nodes, selecting the discrimination index value with the minimum Gini index as a root node, and repeating S1 and S2;
and S4, if not, generating the decision tree corresponding to the training set.
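The loop over these steps can be sketched as a small recursive CART-style builder. This is a simplified reading, not the patent's exact procedure: splits are binary on numeric thresholds, the tree is a nested dictionary, and the parameter names `gini_threshold` and `sample_threshold` stand in for the preset first and second thresholds.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (weighted_gini, feature, threshold) with the minimum Gini index."""
    best, n = None, len(labels)
    for f in range(len(rows[0])):
        # Skip the largest value so both sides of the split are non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, gini_threshold=0.01, sample_threshold=1):
    majority = Counter(labels).most_common(1)[0][0]
    # Keep splitting only while impurity and sample count exceed both thresholds.
    if gini(labels) <= gini_threshold or len(labels) <= sample_threshold:
        return {"leaf": majority}
    split = best_split(rows, labels)
    if split is None:  # no feature can separate the remaining samples
        return {"leaf": majority}
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left],
                           gini_threshold, sample_threshold),
        "right": build_tree([r for r, _ in right], [y for _, y in right],
                            gini_threshold, sample_threshold),
    }

def predict(tree, row):
    while "leaf" not in tree:
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]

rows = [[5, 3], [4, 2], [0, 1], [1, 0]]
labels = [1, 1, 0, 0]
tree = build_tree(rows, labels)
print(tree)
print(predict(tree, [6, 3]), predict(tree, [0, 0]))
```

Building one such tree per training set and collecting the trees in a list gives a basic random forest.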
105. Acquiring a plurality of pieces of second text information issued by a user to be detected and second recording parameters corresponding to the second texts;
106. inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;
in this embodiment, after extracting the text information and the recording parameters issued by the user to be detected, the number of times of occurrence of the keywords in the second text information needs to be counted, so as to obtain a target vector as one of the parameters input into the random forest model.
In this embodiment, the random forest contains multiple classification trees, each of which is a weak classifier. The classification results of the weak classifiers are combined by voting to form a strong classifier; this is the bagging idea behind random forests.
Optionally, step 106 includes:
counting the occurrence frequency of each keyword in the second text information and carrying out vector transformation to obtain a target vector;
inputting the target vector and the second recording parameter into the random forest model for classification to obtain a classification result;
and voting all decision trees in the random forest model on the classification result to obtain a voting result.
107. And determining whether the user to be detected has the behavior tendency or not according to the voting result.
In this embodiment, each decision tree votes either that the behavior tendency exists or that it does not. For example, if 80% of the decision trees classify the user as having the behavior tendency and 20% classify the user as not having it, then 80% and 20% are the voting ratios. The voting result with the highest voting ratio is taken as the recognition result of the model; in this example, the user to be detected has the same behavior tendency as the sample users.
Optionally, step 107 includes:
obtaining the voting results of all decision trees in the random forest model, wherein each voting result is either that the behavior tendency exists or that the behavior tendency does not exist;
calculating voting ratios corresponding to different behavior tendencies according to the voting results;
and taking the behavior tendency with the highest voting ratio as the behavior tendency of the user to be detected.
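The sub-steps of step 107 amount to a majority vote. The sketch below reproduces the 80%/20% example from this embodiment; the function names and the labels "present" and "absent" are illustrative assumptions, not names taken from the source.

```python
from collections import Counter
from typing import Dict, List, Tuple


def voting_ratios(votes: List[str]) -> Dict[str, float]:
    """Share of decision trees that voted for each outcome."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    return {label: n / len(votes) for label, n in counts.items()}


def decide(votes: List[str]) -> Tuple[str, float]:
    """Return the outcome with the highest voting ratio and that ratio."""
    ratios = voting_ratios(votes)
    label = max(ratios, key=ratios.get)
    return label, ratios[label]


# Ten trees: eight vote that the tendency is present, two that it is absent.
votes = ["present"] * 8 + ["absent"] * 2
ratios = voting_ratios(votes)
label, ratio = decide(votes)
print(ratios)        # {'present': 0.8, 'absent': 0.2}
print(label, ratio)  # present 0.8
```

The source does not say how ties are resolved; this sketch simply returns whichever tied label `max` encounters first.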
In the embodiment of the invention, the speech data issued by the users with the same type of characteristic behavior tendency are collected firstly, and the keywords in the speech data are extracted to be used as the characteristic representation of the type of users. And then, using the speech data as a training sample for machine learning to construct a random forest model, inputting the speech data related to the user to be detected into the model for recognition, judging whether the user to be detected and the sample user have the same behavior characteristics, and if so, determining that the user to be detected and the sample user have the same behavior tendency. The invention can extract the speech characteristics related to the users with the same type of characteristic behavior tendency, train out the random forest model in a machine learning mode, further identify the users with unknown behavior tendency and determine whether the users have the same type of behavior tendency.
Referring to fig. 2, a second embodiment of the method for identifying a user behavior tendency according to the embodiment of the present invention includes:
201. acquiring a plurality of pieces of first text information issued by a plurality of sample users with determined behavior tendency and first recording parameters corresponding to the first text information;
202. performing word segmentation processing on the first text information to obtain a plurality of word units;
203. calculating the discrimination of each word unit by adopting a TF-IDF algorithm;
204. sorting the discrimination of each word unit, and extracting the word unit with the highest discrimination from a sorting result as a keyword;
In this optional embodiment, before keywords are extracted from the text information, word segmentation must be performed on the text. Word segmentation is a basis of natural language processing and allows a machine to process human language. Many word segmentation algorithms exist; this embodiment preferably uses an NLP word segmentation algorithm to segment the original text so that keywords can be extracted. The NLP word segmentation algorithm is prior art and is not described here.
In this alternative embodiment, the utterance keywords are determined using the TF-IDF algorithm, a term frequency-inverse text frequency algorithm based on a discrete bag-of-words representation. It evaluates the importance of a word to one document in a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears across the corpus. The degree of discrimination $W_i$ of word $i$ is calculated as follows:
$$W_i = tf_i \times \log\frac{N}{df_i}$$
where $tf_i$ is the frequency with which word $i$ occurs in the document after word segmentation, $N$ is the total number of documents in the corpus, and $df_i$ is the number of documents containing word $i$. The following example illustrates how the formula is used.
For example, if a document contains 100 words in total and the word "buy" appears 4 times, the frequency of "buy" in the document is 4/100 = 0.04, i.e., the word frequency $tf_i$ is 0.04. If the word "buy" appears in 1000 documents and the total number of documents is 10000, the inverse text frequency is $\log(10000/1000) = 1$ (taking the logarithm to base 10). Finally, $W_i = 0.04 \times 1 = 0.04$, which is the degree of discrimination, or importance, of the word "buy" in the document set. In this embodiment, the words are sorted by discrimination, and the top N words are used as the keywords of users with the behavior tendency and serve as reference data when the keywords of the sample users are vectorized, where N here is a preset integer greater than 0 (separate from the document count N in the formula).
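The worked example above can be checked with a few lines of code. This is a sketch under stated assumptions: the base-10 logarithm is inferred from the example (an inverse text frequency of 1 for 10000/1000), and the function names and the scores passed to `top_keywords` are made up for illustration.

```python
import math
from typing import Dict, List


def discrimination(term_count: int, doc_length: int,
                   total_docs: int, docs_with_term: int) -> float:
    """TF-IDF score: term frequency times log10 of inverse document share."""
    tf = term_count / doc_length
    idf = math.log10(total_docs / docs_with_term)
    return tf * idf


def top_keywords(scores: Dict[str, float], n: int) -> List[str]:
    """Sort words by discrimination and keep the n highest-scoring ones."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# The "buy" example: 4 occurrences in a 100-word document,
# present in 1000 of 10000 documents.
w_buy = discrimination(4, 100, 10000, 1000)
print(round(w_buy, 6))  # 0.04

# Hypothetical scores for three words, keeping the top two.
scores = {"buy": w_buy, "the": 0.001, "loan": 0.09}
print(top_keywords(scores, 2))  # ['loan', 'buy']
```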
205. Counting the occurrence frequency of each keyword in each first text message and carrying out vectorization processing to obtain a plurality of keyword vectors;
206. taking each keyword vector and each first recording parameter as training samples, and randomly extracting a plurality of samples from each training sample for multiple times to obtain a plurality of training sets;
207. constructing decision trees corresponding to the training sets according to preset discrimination indexes, and generating corresponding random forest models according to the decision trees;
208. acquiring a plurality of pieces of second text information issued by a user to be detected and second recording parameters corresponding to the second texts;
209. inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;
210. and determining whether the user to be detected has the behavior tendency or not according to the voting result.
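Step 206 above draws several training sets at random from the pool of training samples. A minimal sketch follows, assuming sampling with replacement as in standard bagging; the source only says that samples are randomly extracted multiple times, so the replacement policy, the set size, and the function name are assumptions.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def bootstrap_sets(samples: Sequence[T], n_sets: int,
                   set_size: int, seed: int = 0) -> List[List[T]]:
    """Draw n_sets training sets, each of set_size samples, with replacement."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    return [[rng.choice(samples) for _ in range(set_size)]
            for _ in range(n_sets)]


# Each training sample would pair a keyword vector with recording parameters;
# short strings stand in for those samples here.
pool = ["a", "b", "c", "d"]
sets = bootstrap_sets(pool, n_sets=3, set_size=4, seed=1)
for training_set in sets:
    print(training_set)
```

One decision tree is then built per training set, which is what makes the trees differ from one another.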
In the embodiment of the invention, keywords in the text information play a vital role when the behavior tendency of a user is analyzed, because they represent the speech characteristics of users with that type of behavior tendency; the embodiment therefore extracts keywords from all the text information issued by the sample users. The extraction first performs word segmentation on the long text information to obtain words that cannot be segmented further, then scores the words by occurrence frequency (the TF-IDF discrimination of steps 203 and 204), and takes the highest-scoring words as keywords. Representative characteristic keywords obtained in this way from a large amount of data serve as one of the discrimination indexes for behavior tendency identification. Combined with the other discrimination indexes, they allow the user behavior tendency to be identified accurately so that further intervention can be taken.
The method for identifying a user behavior tendency in the embodiment of the present invention is described above. Referring to fig. 3, the apparatus for identifying a user behavior tendency in the embodiment of the present invention is described below. An embodiment of the apparatus includes:
a first obtaining module 301, configured to obtain multiple pieces of first text information issued by multiple sample users with certain behavior tendencies and first recording parameters corresponding to the first text information;
a vectorization module 302, configured to extract a plurality of keywords in each first text message, count the occurrence frequency of each keyword in each first text message, and perform vectorization processing to obtain a plurality of keyword vectors;
a sampling module 303, configured to take each keyword vector and each first recording parameter as a training sample, and randomly extract multiple samples from each training sample for multiple times to obtain multiple training sets;
a building module 304, configured to build a decision tree corresponding to each training set according to a preset criterion, and generate a corresponding random forest model according to each decision tree;
a second obtaining module 305, configured to obtain multiple pieces of second text information issued by a user to be detected and second recording parameters corresponding to the second texts;
the voting module 306 is configured to input the second text information and the second recording parameters into the random forest model for voting, so as to obtain a voting result;
the determining module 307 is configured to determine whether the user to be detected has the behavior tendency according to the voting result.
Optionally, in an embodiment, the vectorization module 302 includes:
the keyword extraction unit is used for performing word segmentation processing on the first text information to obtain a plurality of word units; calculating the discrimination of each word unit by adopting a TF-IDF algorithm; and sequencing the discrimination of each word unit, and extracting the word unit with the highest discrimination from the sequencing result as a keyword.
Optionally, in an embodiment, the vectorization module 302 further includes:
the vector conversion unit is used for respectively determining the keywords contained in the first text information issued by each sample user according to the keywords; counting the occurrence times of the keywords in the first text information issued by each sample user; and performing vector conversion on the occurrence times of the keywords to obtain the keyword vector corresponding to each sample user.
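The vector conversion unit can be sketched as a per-user keyword count. The example assumes the texts have already been segmented into word lists; the keyword list, user names, and function name are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def keyword_vector(tokenized_texts: List[List[str]],
                   keywords: List[str]) -> List[int]:
    """Count how often each keyword occurs across all texts of one user."""
    keyword_set = set(keywords)
    counts = Counter(token
                     for text in tokenized_texts
                     for token in text
                     if token in keyword_set)
    # One vector position per keyword, in the order of the keyword list.
    return [counts[k] for k in keywords]


keywords = ["buy", "loan", "rate"]
texts_by_user: Dict[str, List[List[str]]] = {
    "user_a": [["buy", "loan", "now"], ["buy", "cheap"]],
    "user_b": [["rate", "rate", "low"]],
}

vectors = {user: keyword_vector(texts, keywords)
           for user, texts in texts_by_user.items()}
print(vectors)  # {'user_a': [2, 1, 0], 'user_b': [0, 0, 2]}
```

Keeping the keyword order fixed means the same position in every vector refers to the same keyword, which the decision trees rely on.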
Optionally, in an embodiment, the building module 304 is specifically configured to:
performing decision tree classification on each training sample in each training set by adopting a classification and regression tree (CART) algorithm and taking the preset discrimination indexes as the feature selection of a decision tree to obtain a plurality of decision trees;
and sequentially combining the decision trees to obtain a random forest model, wherein the discrimination indexes comprise keyword vectors, the number of different keywords hit, the total number of keywords hit, average text length, sensitive speaking time and sensitive speaking days.
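The discrimination indexes listed above can be assembled into one feature row per user, as in the sketch below. This section does not define how sensitive speaking time and sensitive speaking days are measured, so they are passed in as precomputed numbers; the function name and the column order are assumptions.

```python
from typing import List


def feature_row(keyword_vec: List[int], text_lengths: List[int],
                sensitive_time: float, sensitive_days: float) -> List[float]:
    """Concatenate the discrimination indexes into a single numeric row."""
    distinct_hits = sum(1 for count in keyword_vec if count > 0)
    total_hits = sum(keyword_vec)
    avg_length = sum(text_lengths) / len(text_lengths) if text_lengths else 0.0
    return [
        *map(float, keyword_vec),   # keyword vector
        float(distinct_hits),       # number of different keywords hit
        float(total_hits),          # total number of keywords hit
        avg_length,                 # average text length
        float(sensitive_time),      # precomputed, definition not given here
        float(sensitive_days),      # precomputed, definition not given here
    ]


row = feature_row([2, 1, 0], text_lengths=[30, 50],
                  sensitive_time=2, sensitive_days=1)
print(row)  # [2.0, 1.0, 0.0, 2.0, 3.0, 40.0, 2.0, 1.0]
```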
Optionally, in an embodiment, the building module 304 includes:
the computing unit is used for selecting a discriminant index as a root node and computing, for the training set, the Gini index of each discriminant index value corresponding to the root node;
the judging unit is used for judging whether each Gini index is larger than a preset first threshold and whether the number of samples in the sample set is larger than a preset second threshold;
the dividing unit is used for dividing the training set into a plurality of leaf nodes if each Gini index is larger than the preset first threshold and the number of samples in the sample set is larger than the preset second threshold, selecting the discriminant index value with the minimum Gini index as a root node, and cyclically executing the computing unit and the judging unit;
and the generating unit is used for generating a decision tree corresponding to the training set if each Gini index is smaller than the preset first threshold or the number of samples in the sample set is smaller than the preset second threshold.
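The units above rely on the Gini index to choose a split and on two thresholds to decide when to stop. The sketch below uses the standard Gini impurity and a binary threshold split; the function names, the binary split form, and the toy data are assumptions, and `should_split` mirrors the judging unit's condition as literally stated.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gini(rows: List[List[float]], labels: List[str],
               index: int, threshold: float) -> float:
    """Size-weighted Gini index after splitting on rows[i][index] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[index] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[index] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split(rows: List[List[float]], labels: List[str],
               index: int) -> Tuple[float, float]:
    """Return (Gini index, threshold) for the value with the minimum Gini index."""
    candidates = sorted({x[index] for x in rows})
    return min((split_gini(rows, labels, index, t), t) for t in candidates)


def should_split(gini_values: List[float], n_samples: int,
                 gini_threshold: float, min_samples: int) -> bool:
    """Keep dividing only while both preset thresholds are exceeded."""
    return all(g > gini_threshold for g in gini_values) and n_samples > min_samples


rows = [[0], [1], [3], [4]]  # one discriminant index per sample
labels = ["absent", "absent", "present", "present"]
print(gini(labels))                # 0.5
print(best_split(rows, labels, 0)) # (0.0, 1)
```

Here a split at value 1 separates the two classes perfectly, so its Gini index is 0 and no further division of either side is needed.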
Optionally, in an embodiment, the voting module 306 is specifically configured to:
counting the occurrence frequency of each keyword in the second text information and carrying out vector transformation to obtain a target vector;
inputting the target vector and the second recording parameter into the random forest model for classification to obtain a classification result;
and voting all decision trees in the random forest model on the classification result to obtain a voting result.
Optionally, in an embodiment, the determining module 307 is specifically configured to:
obtaining voting results of all decision trees in the random forest model, wherein the voting results include the behavior tendency existing and/or the behavior tendency not existing;
calculating voting ratios corresponding to different behavior tendencies according to the voting results;
and taking the behavior tendency with the highest voting ratio as the behavior tendency of the user to be detected.
In the embodiment of the invention, the speech data issued by the users with the same type of characteristic behavior tendency are collected firstly, and the keywords in the speech data are extracted to be used as the characteristic representation of the type of users. And then, using the speech data as a training sample for machine learning to construct a random forest model, inputting the speech data related to the user to be detected into the model for recognition, judging whether the user to be detected and the sample user have the same behavior characteristics, and if so, determining that the user to be detected and the sample user have the same behavior tendency. The invention can extract the speech characteristics related to the users with the same type of characteristic behavior tendency, train out the random forest model in a machine learning mode, further identify the users with unknown behavior tendency and determine whether the users have the same type of behavior tendency.
Fig. 3 describes the user behavior tendency recognition apparatus in the embodiment of the present invention in detail from the perspective of modular functional entities; the user behavior tendency recognition device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 4 is a schematic structural diagram of a user behavior tendency recognition device according to an embodiment of the present invention. The user behavior tendency recognition device 500 may vary considerably depending on its configuration or performance, and may include one or more processors (central processing units, CPUs) 510, a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 533 or data 532. The memory 520 and the storage media 530 may be transient or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations for the user behavior tendency recognition device 500. Further, the processor 510 may be configured to communicate with the storage medium 530 and to execute, on the user behavior tendency recognition device 500, the series of instruction operations stored in the storage medium 530.
The user behavior tendency recognition device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input-output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the configuration shown in fig. 4 does not limit the user behavior tendency recognition device, which may include more or fewer components than those shown, combine some components, or use a different arrangement of components.
The present invention also provides a user behavior tendency identification device, which includes a memory and a processor, where the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the user behavior tendency identification method in the foregoing embodiments.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the user behavior tendency identification method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A user behavior tendency recognition method is characterized by comprising the following steps:
acquiring a plurality of pieces of first text information issued by a plurality of sample users with determined behavior tendency and first recording parameters corresponding to the first text information;
extracting a plurality of keywords in each first text message, counting the occurrence frequency of each keyword in each first text message, and performing vectorization processing to obtain a plurality of keyword vectors;
taking each keyword vector and each first recording parameter as training samples, and randomly extracting a plurality of samples from each training sample for multiple times to obtain a plurality of training sets;
constructing decision trees corresponding to the training sets according to preset discrimination indexes, and generating corresponding random forest models according to the decision trees;
acquiring a plurality of pieces of second text information issued by a user to be detected and second recording parameters corresponding to the second texts;
inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;
and determining whether the user to be detected has the behavior tendency or not according to the voting result.
2. The method according to claim 1, wherein the extracting the plurality of keywords in each of the first text messages comprises:
performing word segmentation processing on the first text information to obtain a plurality of word units;
calculating the discrimination of each word unit by adopting a TF-IDF algorithm;
and sequencing the discrimination of each word unit, and extracting the word unit with the highest discrimination from the sequencing result as a keyword.
3. The method according to claim 1 or 2, wherein the counting the occurrence frequency of each keyword in each first text message and performing vectorization processing to obtain a plurality of keyword vectors comprises:
determining keywords contained in the first text information issued by each sample user respectively according to the keywords;
counting the occurrence times of the keywords in the first text information issued by each sample user;
and performing vector conversion on the occurrence times of the keywords to obtain the keyword vector corresponding to each sample user.
4. The method for identifying the user behavior tendency according to claim 1, wherein the constructing the decision trees corresponding to the training sets by referring to preset discriminant indexes and generating the corresponding random forest models according to the decision trees comprises:
performing decision tree classification on each training sample in each training set by adopting a classification regression tree algorithm and taking a preset discrimination index as the feature selection of a decision tree to obtain a plurality of decision trees;
and sequentially combining the decision trees to obtain a random forest model, wherein the judgment indexes comprise keyword vectors, the number of hit different keywords, the total number of hit keywords, average text length, sensitive speaking time and sensitive speaking days.
5. The method according to claim 1 or 4, wherein the constructing the decision tree corresponding to each training set by referring to a preset discriminant index comprises:
S1, selecting a discriminant index as a root node, and calculating, for the training set, the Gini index of each discriminant index value corresponding to the root node;
S2, judging whether each Gini index is larger than a preset first threshold and whether the number of samples in the sample set is larger than a preset second threshold;
S3, if yes, dividing the training set into a plurality of leaf nodes, selecting the discriminant index value with the minimum Gini index as a root node, and executing S1-S2 in a loop;
and S4, if not, generating a decision tree corresponding to the training set.
6. The method as claimed in claim 1, wherein the step of inputting the second text information and the second recording parameters into the random forest model for voting to obtain the voting result comprises:
counting the occurrence frequency of each keyword in the second text information and carrying out vector transformation to obtain a target vector;
inputting the target vector and the second recording parameter into the random forest model for classification to obtain a classification result;
and voting all decision trees in the random forest model on the classification result to obtain a voting result.
7. The method for identifying the user behavior tendency according to claim 1 or 6, wherein the determining whether the user to be detected has the behavior tendency according to the voting result comprises:
obtaining voting results of all decision trees in the random forest model, wherein the voting results include the behavior tendency existing and/or the behavior tendency not existing;
calculating voting ratios corresponding to different behavior tendencies according to the voting results;
and taking the behavior tendency with the highest voting ratio as the behavior tendency of the user to be detected.
8. A user behavior tendency recognition apparatus, characterized in that the user behavior tendency recognition apparatus comprises:
the first acquisition module is used for acquiring a plurality of pieces of first text information issued by a plurality of sample users with determined behavior tendency and first recording parameters corresponding to the first text information;
the vectorization module is used for extracting a plurality of keywords in each first text message, counting the occurrence frequency of each keyword in each first text message and carrying out vectorization processing to obtain a plurality of keyword vectors;
the sampling module is used for taking each keyword vector and each first recording parameter as training samples, and randomly extracting a plurality of samples from each training sample for a plurality of times to obtain a plurality of training sets;
the building module is used for building decision trees corresponding to the training sets according to preset discrimination indexes and generating corresponding random forest models according to the decision trees;
the second acquisition module is used for acquiring a plurality of pieces of second text information issued by the user to be detected and second recording parameters corresponding to the second texts;
the voting module is used for inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;
and the determining module is used for determining whether the user to be detected has the behavior tendency or not according to the voting result.
9. A user behavior tendency recognition device, characterized in that the user behavior tendency recognition device comprises: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invokes the instructions in the memory to cause the user behavior tendency identification device to perform the user behavior tendency identification method according to any one of claims 1-7.
10. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the user behavior tendency identification method according to any one of claims 1-7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011436696.4A CN112527958A (en) | 2020-12-11 | 2020-12-11 | User behavior tendency identification method, device, equipment and storage medium |
PCT/CN2021/083480 WO2022121163A1 (en) | 2020-12-11 | 2021-03-29 | User behavior tendency identification method, apparatus, and device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011436696.4A CN112527958A (en) | 2020-12-11 | 2020-12-11 | User behavior tendency identification method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112527958A (en) | 2021-03-19 |
Family
ID=74999586
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011436696.4A Pending CN112527958A (en) | 2020-12-11 | 2020-12-11 | User behavior tendency identification method, device, equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112527958A (en) |
WO (1) | WO2022121163A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022121163A1 (en) * | 2020-12-11 | 2022-06-16 | 平安科技(深圳)有限公司 | User behavior tendency identification method, apparatus, and device, and storage medium |
CN114663143A (en) * | 2022-03-21 | 2022-06-24 | 平安健康保险股份有限公司 | Intervention user screening method and device based on differential intervention response model |
CN114676961A (en) * | 2022-02-23 | 2022-06-28 | 深圳中科闻歌科技有限公司 | Enterprise external migration risk prediction method and device and computer readable storage medium |
CN115620853A (en) * | 2022-09-07 | 2023-01-17 | 国家康复辅具研究中心 | Model training method for TMS strategy automatic selection, automatic selection method and system |
CN116468096A (en) * | 2023-03-30 | 2023-07-21 | 之江实验室 | Model training method, device, equipment and readable storage medium |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115545085A (en) * | 2022-11-04 | 2022-12-30 | 南方电网数字电网研究院有限公司 | Weak fault current fault type identification method, device, equipment and medium |
CN116189215A (en) * | 2022-12-30 | 2023-05-30 | 中国人民财产保险股份有限公司 | Automatic auditing method and device, electronic equipment and storage medium |
CN118450342A (en) * | 2024-07-05 | 2024-08-06 | 深圳博瑞天下科技有限公司 | Method and device for processing short message node overall under high throughput |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107016107A (en) * | 2017-04-12 | 2017-08-04 | 四川九鼎瑞信软件开发有限公司 | The analysis of public opinion method and system |
CN109241418A (en) * | 2018-08-22 | 2019-01-18 | 中国平安人寿保险股份有限公司 | Abnormal user recognition methods and device, equipment, medium based on random forest |
CN111666502A (en) * | 2020-07-08 | 2020-09-15 | 腾讯科技(深圳)有限公司 | Abnormal user identification method and device based on deep learning and storage medium |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11288297B2 (en) * | 2017-11-29 | 2022-03-29 | Oracle International Corporation | Explicit semantic analysis-based large-scale classification |
CN109325106A (en) * | 2018-07-31 | 2019-02-12 | 厦门快商通信息技术有限公司 | A kind of U.S. chat robots intension recognizing method of doctor and device |
CN109934260A (en) * | 2019-01-31 | 2019-06-25 | 中国科学院信息工程研究所 | Image, text and data fusion sensibility classification method and device based on random forest |
CN111368076B (en) * | 2020-02-27 | 2023-04-07 | 中国地质大学(武汉) | Bernoulli naive Bayesian text classification method based on random forest |
CN112527958A (en) * | 2020-12-11 | 2021-03-19 | 平安科技(深圳)有限公司 | User behavior tendency identification method, device, equipment and storage medium |
- 2020-12-11: CN application CN202011436696.4A, published as CN112527958A (en), active, Pending
- 2021-03-29: WO application PCT/CN2021/083480, published as WO2022121163A1 (en), active, Application Filing
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022121163A1 (en) * | 2020-12-11 | 2022-06-16 | 平安科技(深圳)有限公司 | User behavior tendency identification method, apparatus, and device, and storage medium |
CN114676961A (en) * | 2022-02-23 | 2022-06-28 | 深圳中科闻歌科技有限公司 | Enterprise external migration risk prediction method and device and computer readable storage medium |
CN114663143A (en) * | 2022-03-21 | 2022-06-24 | 平安健康保险股份有限公司 | Intervention user screening method and device based on differential intervention response model |
CN115620853A (en) * | 2022-09-07 | 2023-01-17 | 国家康复辅具研究中心 | Model training method for TMS strategy automatic selection, automatic selection method and system |
CN116468096A (en) * | 2023-03-30 | 2023-07-21 | 之江实验室 | Model training method, device, equipment and readable storage medium |
CN116468096B (en) * | 2023-03-30 | 2024-01-02 | 之江实验室 | Model training method, device, equipment and readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2022121163A1 (en) | 2022-06-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112527958A (en) | User behavior tendency identification method, device, equipment and storage medium | |
Tixier et al. | A graph degeneracy-based approach to keyword extraction | |
WO2021047186A1 (en) | Method, apparatus, device, and storage medium for processing consultation dialogue | |
CN107085581B (en) | Short text classification method and device | |
CN109872162B (en) | Wind control classification and identification method and system for processing user complaint information | |
CN110472043B (en) | Clustering method and device for comment text | |
JPH07114572A (en) | Document classifying device | |
CN110858217A (en) | Method and device for detecting microblog sensitive topics and readable storage medium | |
CN111309916B (en) | Digest extracting method and apparatus, storage medium, and electronic apparatus | |
CN106294733A (en) | Page detection method based on text analyzing | |
CN106294736A (en) | Text feature based on key word frequency | |
CN107665221A (en) | The sorting technique and device of keyword | |
CN108536673B (en) | News event extraction method and device | |
Bhutada et al. | Semantic latent dirichlet allocation for automatic topic extraction | |
CN113626604A (en) | Webpage text classification system based on maximum interval criterion | |
CN108073567B (en) | Feature word extraction processing method, system and server | |
KR20220041337A (en) | Graph generation system of updating a search word from thesaurus and extracting core documents and method thereof | |
CN110413985B (en) | Related text segment searching method and device | |
CN106294295A (en) | Article similarity recognition method based on word frequency | |
CN115496066A (en) | Text analysis system, text analysis method, electronic device, and storage medium | |
CN107590163B (en) | The methods, devices and systems of text feature selection | |
CN115048523A (en) | Text classification method, device, equipment and storage medium | |
JP2008282111A (en) | Similar document retrieval method, program and device | |
CN113868431A (en) | Financial knowledge graph-oriented relation extraction method and device and storage medium | |
KR20220041336A (en) | Graph generation system of recommending significant keywords and extracting core documents and method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||