CN112527958A - User behavior tendency identification method, device, equipment and storage medium - Google Patents

User behavior tendency identification method, device, equipment and storage medium

Info

Publication number
CN112527958A
Authority
CN
China
Prior art keywords
behavior tendency
user
voting
text information
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011436696.4A
Other languages
Chinese (zh)
Inventor
卢春曦
王健宗
黄章成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202011436696.4A
Publication of CN112527958A
Priority to PCT/CN2021/083480 (WO2022121163A1)
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3347Query execution using vector based model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of artificial intelligence and discloses a method, a device, equipment and a storage medium for identifying user behavior tendency. The user behavior tendency identification method comprises the following steps: acquiring a plurality of pieces of text information and recording parameters issued by a plurality of sample users with determined behavior tendency; extracting a plurality of keywords in each text message, and converting the keywords into keyword vectors; taking each keyword vector and each recording parameter as training samples, and randomly extracting a plurality of samples from the training samples to obtain a plurality of training sets; building a plurality of decision trees according to preset discrimination indexes, and generating a random forest model; inputting the text information issued by the user to be detected and the corresponding recording parameters into a random forest model for voting, and determining whether the user to be detected has the behavior tendency or not according to the voting result. The invention can quickly determine the behavior tendency of the user through the speech information issued by the user.

Description

User behavior tendency identification method, device, equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a user behavior tendency identification method, a device, equipment and a storage medium.
Background
With the development of the internet, information spreads across the network ever more quickly and widely, and the mass of speech information posted online affects users in different ways. In particular, speech posted by users with a negative behavior tendency may trigger a group effect and lead to serious consequences. If an information platform can identify such users in advance and intervene, the impact of those consequences can be reduced.
At present, objectionable user speech is generally handled by sensitive-word filtering. Filtering can only block known sensitive words; it cannot eliminate the influence of speech that is psychologically negative but contains no sensitive words. Users with a particular behavior tendency are therefore difficult for a computer to recognize and can only be identified after the fact.
Disclosure of Invention
The invention mainly aims to solve the technical problem of how to flexibly identify the behavior tendency of a user.
The invention provides a user behavior tendency identification method in a first aspect, which comprises the following steps:
acquiring a plurality of pieces of first text information issued by a plurality of sample users with determined behavior tendency and first recording parameters corresponding to the first text information;
extracting a plurality of keywords from each piece of first text information, counting the occurrence frequency of each keyword in each piece of first text information, and performing vectorization processing to obtain a plurality of keyword vectors;
taking each keyword vector and each first recording parameter as training samples, and randomly extracting a plurality of samples from the training samples multiple times to obtain a plurality of training sets;
constructing decision trees corresponding to the training sets according to preset discrimination indexes, and generating a corresponding random forest model according to the decision trees;
acquiring a plurality of pieces of second text information issued by a user to be detected and second recording parameters corresponding to the second text information;
inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;
and determining whether the user to be detected has the behavior tendency or not according to the voting result.
Optionally, in a first implementation manner of the first aspect of the present invention, the extracting a plurality of keywords in each piece of first text information includes:
performing word segmentation processing on the first text information to obtain a plurality of word units;
calculating the discrimination of each word unit by adopting a TF-IDF algorithm;
and sequencing the discrimination of each word unit, and extracting the word unit with the highest discrimination from the sequencing result as a keyword.
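The keyword-extraction steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the texts are already segmented into word units, it assumes the common TF-IDF form (term frequency multiplied by the logarithm of N over document frequency), and the function name `tfidf_keywords` and the `top_k` parameter are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(documents, top_k=1):
    """Return the top_k highest-scoring word units of each document.

    documents: list of token lists (word segmentation is assumed done).
    Scoring uses the assumed form tf * log(N / df).
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each word unit appears.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    keywords = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores = {
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords.append(ranked[:top_k])
    return keywords


docs = [
    ["market", "price", "jet", "jet"],
    ["market", "price", "weather"],
    ["market", "weather", "rain"],
]
top = tfidf_keywords(docs, top_k=1)
print(top)
```

A word unit that appears in every document (here "market") scores zero, which is why it is never selected as a keyword.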
Optionally, in a second implementation manner of the first aspect of the present invention, the counting the occurrence frequency of each keyword in each piece of first text information and performing vectorization processing to obtain a plurality of keyword vectors includes:
determining keywords contained in the first text information issued by each sample user respectively according to the keywords;
counting the occurrence times of the keywords in the first text information issued by each sample user;
and performing vector conversion on the occurrence times of the keywords to obtain the keyword vector corresponding to each sample user.
Optionally, in a third implementation manner of the first aspect of the present invention, the constructing decision trees corresponding to the training sets according to preset discrimination indexes, and generating a corresponding random forest model according to the decision trees includes:
performing decision tree classification on the training samples in each training set by adopting a classification and regression tree (CART) algorithm, with the preset discrimination indexes as the features selected for the decision tree, to obtain a plurality of decision trees;
and sequentially combining the decision trees to obtain a random forest model, wherein the discrimination indexes comprise the keyword vector, the number of different keywords hit, the total number of keywords hit, the average text length, the sensitive speaking time and the number of sensitive speaking days.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the constructing decision trees corresponding to the training sets according to preset discrimination indexes includes:
S1, selecting a discrimination index as a root node, and calculating, for each value of the discrimination index corresponding to the root node, the Gini index with respect to the training set;
S2, judging whether each Gini index is greater than a preset first threshold and whether the number of samples in the sample set is greater than a preset second threshold;
S3, if yes, dividing the training set into a plurality of leaf nodes, selecting the discrimination index value with the minimum Gini index as a root node, and executing S1 to S2 in a loop;
and S4, if not, generating the decision tree corresponding to the training set.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the inputting the second text information and the second recording parameters into the random forest model for voting, and obtaining a voting result includes:
counting the occurrence frequency of each keyword in the second text information and carrying out vector transformation to obtain a target vector;
inputting the target vector and the second recording parameter into the random forest model for classification to obtain a classification result;
and voting all decision trees in the random forest model on the classification result to obtain a voting result.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the determining, according to the voting result, whether the user to be detected has the behavior tendency includes:
obtaining voting results of all decision trees in the random forest model, wherein the voting results comprise having the behavior tendency and/or not having the behavior tendency;
calculating voting ratios corresponding to different behavior tendencies according to the voting results;
and taking the behavior tendency with the highest voting ratio as the behavior tendency of the user to be detected.
A second aspect of the present invention provides a user behavior tendency recognition apparatus, including:
the first acquisition module is used for acquiring a plurality of pieces of first text information issued by a plurality of sample users with determined behavior tendency and first recording parameters corresponding to the first text information;
the vectorization module is used for extracting a plurality of keywords from each piece of first text information, counting the occurrence frequency of each keyword in each piece of first text information and performing vectorization processing to obtain a plurality of keyword vectors;
the sampling module is used for taking each keyword vector and each first recording parameter as training samples, and randomly extracting a plurality of samples from the training samples multiple times to obtain a plurality of training sets;
the building module is used for constructing decision trees corresponding to the training sets according to preset discrimination indexes and generating a corresponding random forest model according to the decision trees;
the second acquisition module is used for acquiring a plurality of pieces of second text information issued by the user to be detected and second recording parameters corresponding to the second text information;
the voting module is used for inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;
and the determining module is used for determining whether the user to be detected has the behavior tendency or not according to the voting result.
Optionally, in a first implementation manner of the second aspect of the present invention, the vectorization module includes:
the keyword extraction unit is used for performing word segmentation processing on the first text information to obtain a plurality of word units; calculating the discrimination of each word unit by adopting a TF-IDF algorithm; and sequencing the discrimination of each word unit, and extracting the word unit with the highest discrimination from the sequencing result as a keyword.
Optionally, in a second implementation manner of the second aspect of the present invention, the vectorization module further includes:
the vector conversion unit is used for respectively determining the keywords contained in the first text information issued by each sample user according to the keywords; counting the occurrence times of the keywords in the first text information issued by each sample user; and performing vector conversion on the occurrence times of the keywords to obtain the keyword vector corresponding to each sample user.
Optionally, in a third implementation manner of the second aspect of the present invention, the building module is specifically configured to:
performing decision tree classification on the training samples in each training set by adopting a classification and regression tree (CART) algorithm, with the preset discrimination indexes as the features selected for the decision tree, to obtain a plurality of decision trees;
and sequentially combining the decision trees to obtain a random forest model, wherein the discrimination indexes comprise the keyword vector, the number of different keywords hit, the total number of keywords hit, the average text length, the sensitive speaking time and the number of sensitive speaking days.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the building module includes:
the computing unit is used for selecting a discrimination index as a root node and calculating, for each value of the discrimination index corresponding to the root node, the Gini index with respect to the training set;
the judging unit is used for judging whether each Gini index is greater than a preset first threshold and whether the number of samples in the sample set is greater than a preset second threshold;
the dividing unit is used for dividing the training set into a plurality of leaf nodes if each Gini index is greater than the preset first threshold and the number of samples in the sample set is greater than the preset second threshold, selecting the discrimination index value with the minimum Gini index as a root node, and cyclically invoking the computing unit and the judging unit;
and the generating unit is used for generating the decision tree corresponding to the training set if each Gini index is smaller than the preset first threshold or the number of samples in the sample set is smaller than the preset second threshold.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the voting module is specifically configured to:
counting the occurrence frequency of each keyword in the second text information and carrying out vector transformation to obtain a target vector;
inputting the target vector and the second recording parameter into the random forest model for classification to obtain a classification result;
and voting all decision trees in the random forest model on the classification result to obtain a voting result.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the determining module is specifically configured to:
obtaining voting results of all decision trees in the random forest model, wherein the voting results comprise having the behavior tendency and/or not having the behavior tendency;
calculating voting ratios corresponding to different behavior tendencies according to the voting results;
and taking the behavior tendency with the highest voting ratio as the behavior tendency of the user to be detected.
A third aspect of the present invention provides a user behavior tendency identification device, including: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the user behavior tendency identification device to perform the user behavior tendency identification method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to execute the above-mentioned user behavior tendency identification method.
According to the technical scheme provided by the invention, the speech data issued by users with the same type of characteristic behavior tendency are collected firstly, and the keywords in the speech data are extracted and used as the characteristic representation of the type of users. And then, using the speech data as a training sample for machine learning to construct a random forest model, inputting the speech data related to the user to be detected into the model for recognition, judging whether the user to be detected and the sample user have the same behavior characteristics, and if so, determining that the user to be detected and the sample user have the same behavior tendency. The invention can extract the speech characteristics related to the users with the same type of characteristic behavior tendency, train out the random forest model in a machine learning mode, further identify the users with unknown behavior tendency and determine whether the users have the same type of behavior tendency.
Drawings
FIG. 1 is a diagram of a first embodiment of a method for identifying a user behavior tendency according to an embodiment of the present invention;
FIG. 2 is a diagram of a second embodiment of a method for identifying a user behavior tendency according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment of a user behavior tendency recognition apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an embodiment of a user behavior tendency identification device in the embodiment of the present invention.
Detailed Description
The terms "first", "second", "third", "fourth", and the like (if any) in the description and claims and the above drawings are used for distinguishing similar objects and not necessarily for describing a particular order or sequence. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, a specific flow of the embodiment of the present invention is described below, and referring to fig. 1, a first embodiment of the user behavior tendency identification method in the embodiment of the present invention includes:
101. Acquiring a plurality of pieces of first text information issued by a plurality of sample users with determined behavior tendency and first recording parameters corresponding to the first text information;
It is to be understood that the execution subject of the present invention may be a user behavior tendency recognition apparatus, or may be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as the execution subject.
In this embodiment, since the present invention needs to determine whether an unknown user has a certain type of behavior tendency, the speech characteristics of that type of user must first be determined, and the behavior tendency of the unknown user is then determined by feature matching. A large amount of sample data is therefore needed to extract speech characteristics that represent users with the same type of behavior tendency. This embodiment acquires the text information issued by sample users with a determined behavior tendency, together with the issuing records corresponding to the text information, for use in extracting the speech keywords and as machine learning training samples.
In this embodiment, a sample user with a determined behavior tendency may be a sample user who intends to purchase a certain commodity, a sample user with a certain negative behavior tendency, a sample user with a certain psychological characteristic, and the like, for example a user who purchases a private airplane, a user with a suicidal tendency, or a user with depression. The behavior tendency of the sample users determines the recognition type of the model, and models of different recognition types can recognize users with different types of behavior tendency.
102. Extracting a plurality of keywords from each piece of first text information, counting the occurrence frequency of each keyword in each piece of first text information, and performing vectorization processing to obtain a plurality of keyword vectors;
In this embodiment, the characteristic speech words of users with the same type of behavior tendency are extracted from a large amount of sample data. The speech of an unknown user is compared with these characteristic words and combined with the other discrimination indexes to determine whether the unknown user has the same characteristics.
In this embodiment, after the keywords in the speech of users with a particular behavior tendency are extracted, the speech of the sample users is analyzed. During the analysis, the hit rate of the keywords in the text information issued by each sample user is counted and used as one of the discrimination indexes when identifying unknown users, for which it is of considerable reference value.
Optionally, step 102 includes:
determining keywords contained in the first text information issued by each sample user respectively according to the keywords;
counting the occurrence times of the keywords in the first text information issued by each sample user;
and performing vector conversion on the occurrence times of the keywords to obtain the keyword vector corresponding to each sample user.
In this optional embodiment, the text must first be converted into a vector before the hit rate of the keywords is calculated. Here, the vector of a text is the number of times each keyword appears in it. For example, if the keywords extracted from the text information issued by all sample users are $D = (T_1, T_2, T_3, T_4, T_5)$, and the numbers of occurrences of these keywords in the text information issued by one sample user are $W = (5, 2, 0, 1, 0)$, then $W$ can be used as the keyword vector of that sample user.
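The count-based vectorization in this example can be sketched in a few lines. The function name `keyword_vector` and the placeholder tokens are hypothetical, and the text is assumed to be already segmented into word units.

```python
from collections import Counter


def keyword_vector(tokens, keywords):
    """Count how often each keyword occurs in one user's segmented text.

    The result has one entry per keyword, in keyword order; keywords that
    never occur get a count of 0.
    """
    counts = Counter(tokens)
    return [counts[keyword] for keyword in keywords]


# Placeholder keywords standing in for T1..T5 from the example above.
keywords = ["T1", "T2", "T3", "T4", "T5"]
# One sample user's word units: T1 five times, T2 twice, T4 once, plus noise.
tokens = ["T1"] * 5 + ["T2"] * 2 + ["T4"] + ["other"]

vector = keyword_vector(tokens, keywords)
print(vector)  # [5, 2, 0, 1, 0]
```

Word units that are not keywords (here "other") are simply ignored, so the vector length always equals the number of keywords.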
103. Taking each keyword vector and each first recording parameter as training samples, and randomly extracting a plurality of samples from the training samples multiple times to obtain a plurality of training sets;
In this embodiment, the keyword vector and the first recording parameter corresponding to each sample user are used as training samples. A plurality of samples are randomly drawn from the training samples with replacement to obtain the training sets, one decision tree is built from each training set, and the trees together form the random forest model. Sampling is random so that the decision trees differ from one another and produce different classification results. Sampling is with replacement so that the training sets of the decision trees overlap, which avoids one-sided decisions: the final result is produced by the decision trees voting, and voting is meant to find common ground. If the result of each decision tree were completely unrelated to the others, the final vote would not help solve the problem. This embodiment therefore obtains the training sets by repeated random sampling with replacement.
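Repeated random sampling with replacement (bootstrap sampling) can be sketched as below. The function name, the default set size (equal to the number of training samples) and the `seed` parameter are assumptions made for illustration; the source does not specify them.

```python
import random


def bootstrap_training_sets(samples, n_sets, set_size=None, seed=None):
    """Draw n_sets training sets from samples, sampling with replacement.

    Each set has set_size entries (defaults to len(samples)); the same
    sample may appear several times in a set and in several sets.
    """
    rng = random.Random(seed)
    size = len(samples) if set_size is None else set_size
    return [rng.choices(samples, k=size) for _ in range(n_sets)]


# Each training sample pairs a feature vector with a behavior-tendency label.
samples = [([5, 2, 0, 1, 0], 1), ([0, 0, 1, 0, 0], 0), ([3, 1, 0, 0, 2], 1)]
training_sets = bootstrap_training_sets(samples, n_sets=4, seed=42)
print(len(training_sets), [len(s) for s in training_sets])
```

Because every draw is independent, the training sets overlap with one another, which is the property the paragraph above relies on.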
104. Constructing decision trees corresponding to the training sets according to preset discrimination indexes, and generating corresponding random forest models according to the decision trees;
In this embodiment, the sample data in a training set is used to generate a decision tree, preferably with the CART algorithm. The inputs of the algorithm are the training set, a threshold for the Gini index and a threshold for the number of samples; the output is the decision tree. The tree is built recursively from the root node using the training set. If the number of samples at the current node is less than the preset number, or no feature remains, a decision subtree is returned and recursion stops at the current node; in this embodiment, a feature is a preset discrimination index. The Gini index of the sample set is then calculated; if it is less than the threshold, a decision subtree is returned and recursion stops at the current node. Otherwise, the Gini index with respect to the data set is calculated for each value of each feature available at the current node, the feature and feature value with the minimum Gini index are selected as the classification node, leaf nodes are established, and the algorithm is applied recursively from the beginning until the conditions for generating the decision tree are met.
Optionally, step 104 includes:
performing decision tree classification on the training samples in each training set by adopting a classification and regression tree (CART) algorithm, with the preset discrimination indexes as the features selected for the decision tree, to obtain a plurality of decision trees;
and sequentially combining the decision trees to obtain a random forest model, wherein the discrimination indexes comprise the keyword vector, the number of different keywords hit, the total number of keywords hit, the average text length, the sensitive speaking time and the number of sensitive speaking days.
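One way to assemble the listed indexes into a single feature row is sketched below. This is an assumption-laden illustration: the function name `build_features` is hypothetical, the average text length is measured in characters, and the sensitive speaking time and number of sensitive speaking days are taken as precomputed inputs because this section does not define how they are derived from the recording parameters.

```python
def build_features(keyword_vector, texts, sensitive_time, sensitive_days):
    """Concatenate the discrimination indexes into one flat feature row.

    keyword_vector: per-keyword occurrence counts for one user.
    texts: the raw text strings issued by that user.
    sensitive_time, sensitive_days: assumed to be supplied by the caller.
    """
    distinct_hits = sum(1 for count in keyword_vector if count > 0)
    total_hits = sum(keyword_vector)
    avg_length = sum(len(t) for t in texts) / len(texts) if texts else 0.0
    return list(keyword_vector) + [
        distinct_hits,
        total_hits,
        avg_length,
        sensitive_time,
        sensitive_days,
    ]


row = build_features([5, 2, 0, 1, 0], ["abcd", "ab"], sensitive_time=3, sensitive_days=2)
print(row)  # [5, 2, 0, 1, 0, 3, 8, 3.0, 3, 2]
```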
Optionally, step 104 further includes:
S1, selecting a discrimination index as a root node, and calculating, for each value of the discrimination index corresponding to the root node, the Gini index with respect to the training set;
S2, judging whether each Gini index is greater than a preset first threshold and whether the number of samples in the sample set is greater than a preset second threshold;
S3, if yes, dividing the training set into a plurality of leaf nodes, selecting the discrimination index value with the minimum Gini index as a root node, and executing S1 to S2 in a loop;
and S4, if not, generating the decision tree corresponding to the training set.
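The Gini-based splitting described above can be illustrated with a small sketch. It assumes numeric features and binary "less than or equal to" splits, as is usual for CART; the function names and the two threshold parameters (`gini_threshold`, `min_samples`) are hypothetical stand-ins for the preset first and second thresholds.

```python
from collections import Counter


def gini(labels):
    """Gini index of a label list: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def should_split(labels, gini_threshold, min_samples):
    """Keep splitting only while the node is impure and large enough."""
    return gini(labels) > gini_threshold and len(labels) > min_samples


def best_split(rows, labels):
    """Return (weighted Gini, feature index, threshold) of the best split."""
    best = None
    n = len(labels)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
            right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


# Feature 0: total keyword hits; feature 1: average text length (illustrative).
rows = [[5, 120], [4, 80], [0, 100], [1, 90]]
labels = [1, 1, 0, 0]
print(gini(labels), best_split(rows, labels))  # 0.5 (0.0, 0, 1)
```

In this toy data, splitting on feature 0 at threshold 1 separates the two classes perfectly, so the weighted Gini index of the split is 0.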
105. Acquiring a plurality of pieces of second text information issued by a user to be detected and second recording parameters corresponding to the second text information;
106. Inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;
In this embodiment, after the text information and the recording parameters issued by the user to be detected are extracted, the number of occurrences of each keyword in the second text information is counted to obtain a target vector, which is one of the parameters input into the random forest model.
In this embodiment, the random forest contains multiple classification trees, each of which is a weak classifier. The classification results of the weak classifiers are combined by voting to form a strong classifier, which is the bagging idea behind random forests.
Optionally, step 106 includes:
counting the occurrence frequency of each keyword in the second text information and carrying out vector transformation to obtain a target vector;
inputting the target vector and the second recording parameter into the random forest model for classification to obtain a classification result;
and voting all decision trees in the random forest model on the classification result to obtain a voting result.
107. And determining whether the user to be detected has the behavior tendency or not according to the voting result.
In this embodiment, the voting results comprise having and/or not having the behavior tendency. For example, if 80% of the decision trees classify the user as having the behavior tendency and 20% classify the user as not having it, then 80% and 20% are the voting ratios, and the result with the higher voting ratio is taken as the recognition result of the model; that is, the user to be detected has the same behavior tendency as the sample users.
Optionally, step 107 includes:
obtaining voting results of all decision trees in the random forest model, wherein the voting results comprise having the behavior tendency and/or not having the behavior tendency;
calculating voting ratios corresponding to different behavior tendencies according to the voting results;
and taking the behavior tendency with the highest voting ratio as the behavior tendency of the user to be detected.
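The voting step can be sketched as a simple majority count over the per-tree classifications. The label strings and the function name `vote` are hypothetical; the sketch reproduces the 80% versus 20% example given above.

```python
from collections import Counter


def vote(tree_predictions):
    """Return the label with the highest voting ratio and all ratios.

    tree_predictions: one predicted label per decision tree.
    """
    counts = Counter(tree_predictions)
    total = len(tree_predictions)
    ratios = {label: count / total for label, count in counts.items()}
    winner = max(ratios, key=ratios.get)
    return winner, ratios


# Ten trees: eight classify the user as having the tendency, two as not.
predictions = ["has_tendency"] * 8 + ["no_tendency"] * 2
winner, ratios = vote(predictions)
print(winner, ratios)  # has_tendency {'has_tendency': 0.8, 'no_tendency': 0.2}
```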
In the embodiment of the invention, the speech data issued by the users with the same type of characteristic behavior tendency are collected firstly, and the keywords in the speech data are extracted to be used as the characteristic representation of the type of users. And then, using the speech data as a training sample for machine learning to construct a random forest model, inputting the speech data related to the user to be detected into the model for recognition, judging whether the user to be detected and the sample user have the same behavior characteristics, and if so, determining that the user to be detected and the sample user have the same behavior tendency. The invention can extract the speech characteristics related to the users with the same type of characteristic behavior tendency, train out the random forest model in a machine learning mode, further identify the users with unknown behavior tendency and determine whether the users have the same type of behavior tendency.
Referring to fig. 2, a second embodiment of the method for identifying a user behavior tendency according to the embodiment of the present invention includes:
201. acquiring a plurality of pieces of first text information issued by a plurality of sample users with determined behavior tendency and first recording parameters corresponding to the first text information;
202. performing word segmentation processing on the first text information to obtain a plurality of word units;
203. calculating the discrimination of each word unit by adopting a TF-IDF algorithm;
204. sorting the discrimination of each word unit, and extracting the word unit with the highest discrimination from a sorting result as a keyword;
In this optional embodiment, the text must be segmented into words before keywords are extracted from the text information. Word segmentation is a foundation of natural language processing and is what allows a machine to process human language. Many word segmentation algorithms exist; this embodiment preferably segments the original text with an NLP word segmentation algorithm so that keywords can then be extracted. NLP word segmentation algorithms are prior art and are not described here.
In this optional embodiment, the utterance keywords are determined with the TF-IDF (term frequency-inverse document frequency) algorithm, a bag-of-words method used to evaluate how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency of its appearance across the corpus. The discrimination $W_i$ of word $i$ is calculated as:

$$W_i = tf_i \times \log\frac{N}{df_i}$$

where $tf_i$ is the frequency of word $i$ in the document after word segmentation, $N$ is the total number of documents in the corpus, and $df_i$ is the number of documents containing word $i$. The use of the formula is illustrated below.

For example, if a document contains 100 words in total and the word "buy" appears 4 times, the frequency of "buy" in the document is 4/100 = 0.04, that is, $tf_i = 0.04$. If "buy" appears in 1000 documents and the corpus contains 10000 documents in total, the inverse document frequency is

$$\log\frac{N}{df_i} = \log_{10}\frac{10000}{1000} = 1$$

Finally, $W_i = 0.04 \times 1 = 0.04$, which is the discrimination, or importance, of the word "buy" in the document set. In this embodiment, the words are sorted by discrimination and the top N words are taken as the keywords of users with the behavior tendency; they serve as reference data when the keywords of the sample users are vectorized. Here N is a preset integer greater than 0 and is distinct from the corpus size N in the formula.
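The discrimination calculation and the top-N keyword selection described above can be sketched as follows. The function names `discrimination` and `top_n_keywords` are hypothetical, and the base-10 logarithm is an assumption chosen because it reproduces the 0.04 × 1 result of the worked example; the scores for the words other than "buy" are made-up illustration values.

```python
import math


def discrimination(term_count, doc_length, total_docs, docs_with_term):
    """TF-IDF discrimination of one word in one document.

    Assumes a base-10 logarithm, which matches the worked example
    (10000 documents, 1000 containing the word -> IDF of 1).
    """
    tf = term_count / doc_length                   # term frequency
    idf = math.log10(total_docs / docs_with_term)  # inverse document frequency
    return tf * idf


def top_n_keywords(scores, n):
    """Return the n words with the highest discrimination, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Worked example from the text: "buy" appears 4 times in a 100-word document
# and in 1000 of 10000 documents.
w_buy = discrimination(4, 100, 10000, 1000)

# Hypothetical scores for two other words, used only to show the ranking step.
keywords = top_n_keywords({"buy": w_buy, "the": 0.0, "loan": 0.02}, n=2)
```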
205. Counting the occurrence frequency of each keyword in each first text message and carrying out vectorization processing to obtain a plurality of keyword vectors;
206. taking each keyword vector and each first recording parameter as training samples, and randomly extracting a plurality of samples from each training sample for multiple times to obtain a plurality of training sets;
207. constructing decision trees corresponding to the training sets according to preset discrimination indexes, and generating corresponding random forest models according to the decision trees;
208. acquiring a plurality of pieces of second text information issued by a user to be detected and second recording parameters corresponding to the second texts;
209. inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;
210. determining whether the user to be detected has the behavior tendency according to the voting result.
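The repeated random extraction of training sets in step 206 can be sketched as follows. The text only says that samples are randomly extracted multiple times; sampling with replacement (bootstrap sampling, as is usual for random forests) is an assumption here, and `draw_training_sets` and its parameters are hypothetical names.

```python
import random


def draw_training_sets(samples, n_sets, set_size, seed=0):
    """Draw n_sets training sets of set_size samples each.

    Sampling is done with replacement (an assumption), so one sample
    may appear several times in the same training set.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [rng.choices(samples, k=set_size) for _ in range(n_sets)]


# Hypothetical training samples: (keyword vector, label) pairs.
samples = [([3, 1, 0], 1), ([0, 0, 2], 0), ([1, 1, 1], 1), ([0, 0, 0], 0)]
training_sets = draw_training_sets(samples, n_sets=3, set_size=4, seed=42)
```

One decision tree would then be built from each entry of `training_sets`.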
When a user's behavior tendency is analyzed, the keywords in the text information play a vital role, because they represent the speech characteristics of users with that type of behavior tendency; this embodiment therefore extracts keywords from all the text information published by the sample users. Extraction first segments long text information into word units that cannot be segmented further, then calculates the TF-IDF discrimination of each word unit and takes the word units with the highest discrimination as keywords. Representative feature keywords are thereby obtained through analysis of a large amount of data. Used as one of the discrimination indexes for behavior tendency identification, and combined with the other discrimination indexes, they allow the user's behavior tendency to be identified accurately so that further intervention can be taken.
The method for identifying a user behavior tendency in the embodiment of the present invention has been described above. Referring to fig. 3, the apparatus for identifying a user behavior tendency is described below. An embodiment of the apparatus includes:
a first obtaining module 301, configured to obtain multiple pieces of first text information issued by multiple sample users with determined behavior tendency and first recording parameters corresponding to the first text information;
a vectorization module 302, configured to extract a plurality of keywords in each first text message, count the occurrence frequency of each keyword in each first text message, and perform vectorization processing to obtain a plurality of keyword vectors;
a sampling module 303, configured to take each keyword vector and each first recording parameter as a training sample, and randomly extract multiple samples from each training sample for multiple times to obtain multiple training sets;
a building module 304, configured to build a decision tree corresponding to each training set according to preset discrimination indexes, and generate a corresponding random forest model according to each decision tree;
a second obtaining module 305, configured to obtain multiple pieces of second text information issued by a user to be detected and second recording parameters corresponding to the second texts;
the voting module 306 is configured to input the second text information and the second recording parameters into the random forest model for voting, so as to obtain a voting result;
the determining module 307 is configured to determine whether the user to be detected has the behavior tendency according to the voting result.
Optionally, in an embodiment, the vectorization module 302 includes:
the keyword extraction unit is used for performing word segmentation processing on the first text information to obtain a plurality of word units; calculating the discrimination of each word unit by adopting a TF-IDF algorithm; and sequencing the discrimination of each word unit, and extracting the word unit with the highest discrimination from the sequencing result as a keyword.
Optionally, in an embodiment, the vectorization module 302 further includes:
the vector conversion unit is used for respectively determining the keywords contained in the first text information issued by each sample user according to the keywords; counting the occurrence times of the keywords in the first text information issued by each sample user; and performing vector conversion on the occurrence times of the keywords to obtain the keyword vector corresponding to each sample user.
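The work of the vector conversion unit can be sketched as follows, assuming the text information has already been segmented into word units. The function name `keyword_vector` and the sample tokens are hypothetical.

```python
def keyword_vector(texts, keywords):
    """Count how often each keyword occurs in one user's texts.

    texts:    list of texts, each already segmented into word units
    keywords: ordered list of keywords; the vector follows this order
    """
    counts = dict.fromkeys(keywords, 0)
    for tokens in texts:
        for token in tokens:
            if token in counts:
                counts[token] += 1
    return [counts[k] for k in keywords]


# Two hypothetical segmented texts published by one sample user.
user_texts = [["buy", "now", "buy"], ["loan", "buy"]]
vector = keyword_vector(user_texts, ["buy", "loan", "sell"])
```

Keywords that the user never used simply contribute a zero to the vector.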
Optionally, in an embodiment, the building module 304 is specifically configured to:
performing decision tree classification on each training sample in each training set by adopting the classification and regression tree (CART) algorithm and taking the preset discrimination indexes as the feature selection of the decision tree, to obtain a plurality of decision trees;
and sequentially combining the decision trees to obtain a random forest model, wherein the discrimination indexes comprise keyword vectors, the number of distinct keywords hit, the total number of keyword hits, average text length, sensitive speaking time and sensitive speaking days.
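Three of the text-derived discrimination indexes listed above can be computed as in the sketch below. The names are hypothetical, the average text length is measured in word units (an assumption, since the text does not state the unit), and the two time-based indexes are left out because they depend on recording parameters whose format is not given here.

```python
def keyword_features(texts, keywords):
    """Compute distinct keyword hits, total keyword hits and average text length.

    texts: list of texts, each already segmented into word units.
    """
    keyword_set = set(keywords)
    hits = [tok for tokens in texts for tok in tokens if tok in keyword_set]
    avg_len = sum(len(tokens) for tokens in texts) / len(texts) if texts else 0.0
    return {
        "distinct_hits": len(set(hits)),  # number of different keywords hit
        "total_hits": len(hits),          # total number of keyword hits
        "avg_text_length": avg_len,       # in word units (assumption)
    }


features = keyword_features(
    [["buy", "now", "buy"], ["loan", "buy"]], ["buy", "loan", "sell"]
)
```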
Optionally, in an embodiment, the building module 304 includes:
the computing unit is used for selecting a discrimination index as a root node and computing the Gini index of the training set for each discrimination index value corresponding to the root node;
the judging unit is used for judging whether each Gini index is larger than a preset first threshold and whether the number of samples in the sample set is larger than a preset second threshold;
the dividing unit is used for dividing the training set into a plurality of leaf nodes if each Gini index is larger than the preset first threshold and the number of samples in the sample set is larger than the preset second threshold, selecting the discrimination index value with the minimum Gini index as a root node, and cyclically executing the computing unit and the judging unit;
and the generating unit is used for generating a decision tree corresponding to the training set if each Gini index is smaller than the preset first threshold or the number of samples in the sample set is smaller than the preset second threshold.
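The Gini-based splitting logic of these units can be sketched as follows for a single numeric discrimination index and binary splits, as in common CART implementations. All names are hypothetical, and the stopping test in `should_split` is a simplified node-level reading of the two thresholds, not a statement of the exact procedure used in the embodiment.

```python
from collections import Counter


def gini(labels):
    """Gini index of a set of class labels (0.0 means the set is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(values, labels, threshold):
    """Weighted Gini index after splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Return (threshold, gini) of the candidate split with the minimum Gini index."""
    candidates = sorted(set(values))[:-1]  # the largest value cannot split the set
    if not candidates:
        return None
    scored = [(t, split_gini(values, labels, t)) for t in candidates]
    return min(scored, key=lambda pair: pair[1])


def should_split(node_gini, n_samples, gini_threshold, min_samples):
    """Keep splitting only while the node is impure and large enough."""
    return node_gini > gini_threshold and n_samples > min_samples


# Hypothetical index values (e.g. total keyword hits) and behavior labels.
values = [1, 2, 3, 4]
labels = [0, 0, 1, 1]
choice = best_threshold(values, labels)
```

In this toy data set, splitting at a value of 2 separates the two classes completely, so it is the split with the minimum Gini index.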
Optionally, in an embodiment, the voting module 306 is specifically configured to:
counting the occurrence frequency of each keyword in the second text information and carrying out vector transformation to obtain a target vector;
inputting the target vector and the second recording parameter into the random forest model for classification to obtain a classification result;
and voting all decision trees in the random forest model on the classification result to obtain a voting result.
Optionally, in an embodiment, the determining module 307 is specifically configured to:
obtaining the voting results of all decision trees in the random forest model, wherein each voting result is either that the behavior tendency exists or that it does not exist;
calculating voting ratios corresponding to different behavior tendencies according to the voting results;
and taking the behavior tendency with the highest voting ratio as the behavior tendency of the user to be detected.
In this embodiment of the invention, speech data published by users who share the same type of behavior tendency is collected first, and keywords are extracted from it to serve as a feature representation of that type of user. The speech data is then used as training samples for machine learning to build a random forest model. Speech data related to the user to be detected is input into the model for recognition, to judge whether the user to be detected and the sample users have the same behavior characteristics; if so, the user to be detected is determined to have the same behavior tendency as the sample users. The invention can thus extract the speech characteristics of users with the same type of behavior tendency, train a random forest model by machine learning, and then identify users whose behavior tendency is unknown to determine whether they have that type of behavior tendency.
Fig. 3 above describes the user behavior tendency recognition apparatus in the embodiment of the present invention in detail from the perspective of modular functional entities; the user behavior tendency recognition device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 4 is a schematic structural diagram of a user behavior tendency recognition device according to an embodiment of the present invention. The user behavior tendency recognition device 500 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 510, a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 533 or data 532. The memory 520 and the storage media 530 may be transient or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a sequence of instruction operations for the user behavior tendency recognition device 500. Further, the processor 510 may be configured to communicate with the storage medium 530 and to execute, on the user behavior tendency recognition device 500, the series of instruction operations in the storage medium 530.
The user behavior tendency recognition device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input-output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the configuration shown in fig. 4 does not limit the user behavior tendency recognition device, which may include more or fewer components than those shown, may combine some components, or may use a different arrangement of components.
The present invention also provides a user behavior tendency identification device, which includes a memory and a processor, where the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the user behavior tendency identification method in the foregoing embodiments.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the user behavior tendency identification method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A user behavior tendency recognition method is characterized by comprising the following steps:
acquiring a plurality of pieces of first text information issued by a plurality of sample users with determined behavior tendency and first recording parameters corresponding to the first text information;
extracting a plurality of keywords in each first text message, counting the occurrence frequency of each keyword in each first text message, and performing vectorization processing to obtain a plurality of keyword vectors;
taking each keyword vector and each first recording parameter as training samples, and randomly extracting a plurality of samples from each training sample for multiple times to obtain a plurality of training sets;
constructing decision trees corresponding to the training sets according to preset discrimination indexes, and generating corresponding random forest models according to the decision trees;
acquiring a plurality of pieces of second text information issued by a user to be detected and second recording parameters corresponding to the second texts;
inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;
and determining whether the user to be detected has the behavior tendency or not according to the voting result.
2. The method according to claim 1, wherein the extracting the plurality of keywords in each of the first text messages comprises:
performing word segmentation processing on the first text information to obtain a plurality of word units;
calculating the discrimination of each word unit by adopting a TF-IDF algorithm;
and sequencing the discrimination of each word unit, and extracting the word unit with the highest discrimination from the sequencing result as a keyword.
3. The method according to claim 1 or 2, wherein the counting the occurrence frequency of each keyword in each first text message and performing vectorization processing to obtain a plurality of keyword vectors comprises:
determining keywords contained in the first text information issued by each sample user respectively according to the keywords;
counting the occurrence times of the keywords in the first text information issued by each sample user;
and performing vector conversion on the occurrence times of the keywords to obtain the keyword vector corresponding to each sample user.
4. The method for identifying the user behavior tendency according to claim 1, wherein the constructing the decision trees corresponding to the training sets by referring to preset discriminant indexes and generating the corresponding random forest models according to the decision trees comprises:
performing decision tree classification on each training sample in each training set by adopting the classification and regression tree (CART) algorithm and taking the preset discrimination indexes as the feature selection of the decision tree, to obtain a plurality of decision trees;
and sequentially combining the decision trees to obtain a random forest model, wherein the discrimination indexes comprise keyword vectors, the number of distinct keywords hit, the total number of keyword hits, average text length, sensitive speaking time and sensitive speaking days.
5. The method according to claim 1 or 4, wherein the constructing the decision tree corresponding to each training set by referring to a preset discriminant index comprises:
S1, selecting a discrimination index as a root node, and calculating the Gini index of the training set for each discrimination index value corresponding to the root node;
S2, judging whether each Gini index is larger than a preset first threshold and whether the number of samples in the sample set is larger than a preset second threshold;
S3, if yes, dividing the training set into a plurality of leaf nodes, selecting the discrimination index value with the minimum Gini index as a root node, and executing S1-S2 in a loop;
and S4, if not, generating a decision tree corresponding to the training set.
6. The method as claimed in claim 1, wherein the step of inputting the second text information and the second recording parameters into the random forest model for voting to obtain the voting result comprises:
counting the occurrence frequency of each keyword in the second text information and carrying out vector transformation to obtain a target vector;
inputting the target vector and the second recording parameter into the random forest model for classification to obtain a classification result;
and voting all decision trees in the random forest model on the classification result to obtain a voting result.
7. The method for identifying the user behavior tendency according to claim 1 or 6, wherein the determining whether the user to be detected has the behavior tendency according to the voting result comprises:
obtaining the voting results of all decision trees in the random forest model, wherein each voting result is either that the behavior tendency exists or that it does not exist;
calculating voting ratios corresponding to different behavior tendencies according to the voting results;
and taking the behavior tendency with the highest voting ratio as the behavior tendency of the user to be detected.
8. A user behavior tendency recognition apparatus, characterized in that the user behavior tendency recognition apparatus comprises:
the first acquisition module is used for acquiring a plurality of pieces of first text information issued by a plurality of sample users with determined behavior tendency and first recording parameters corresponding to the first text information;
the vectorization module is used for extracting a plurality of keywords in each first text message, counting the occurrence frequency of each keyword in each first text message and carrying out vectorization processing to obtain a plurality of keyword vectors;
the sampling module is used for taking each keyword vector and each first recording parameter as training samples, and randomly extracting a plurality of samples from each training sample for a plurality of times to obtain a plurality of training sets;
the building module is used for building decision trees corresponding to the training sets according to preset discrimination indexes and generating corresponding random forest models according to the decision trees;
the second acquisition module is used for acquiring a plurality of pieces of second text information issued by the user to be detected and second recording parameters corresponding to the second texts;
the voting module is used for inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;
and the determining module is used for determining whether the user to be detected has the behavior tendency or not according to the voting result.
9. A user behavior tendency recognition device, characterized in that the user behavior tendency recognition device comprises: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invokes the instructions in the memory to cause the user behavior tendency identification device to perform the user behavior tendency identification method according to any one of claims 1-7.
10. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the user behavior tendency identification method according to any one of claims 1-7.
CN202011436696.4A 2020-12-11 2020-12-11 User behavior tendency identification method, device, equipment and storage medium Pending CN112527958A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011436696.4A CN112527958A (en) 2020-12-11 2020-12-11 User behavior tendency identification method, device, equipment and storage medium
PCT/CN2021/083480 WO2022121163A1 (en) 2020-12-11 2021-03-29 User behavior tendency identification method, apparatus, and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011436696.4A CN112527958A (en) 2020-12-11 2020-12-11 User behavior tendency identification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112527958A true CN112527958A (en) 2021-03-19

Family

ID=74999586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011436696.4A Pending CN112527958A (en) 2020-12-11 2020-12-11 User behavior tendency identification method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112527958A (en)
WO (1) WO2022121163A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022121163A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 User behavior tendency identification method, apparatus, and device, and storage medium
CN114663143A (en) * 2022-03-21 2022-06-24 平安健康保险股份有限公司 Intervention user screening method and device based on differential intervention response model
CN114676961A (en) * 2022-02-23 2022-06-28 深圳中科闻歌科技有限公司 Enterprise external migration risk prediction method and device and computer readable storage medium
CN115620853A (en) * 2022-09-07 2023-01-17 国家康复辅具研究中心 Model training method for TMS strategy automatic selection, automatic selection method and system
CN116468096A (en) * 2023-03-30 2023-07-21 之江实验室 Model training method, device, equipment and readable storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115545085A (en) * 2022-11-04 2022-12-30 南方电网数字电网研究院有限公司 Weak fault current fault type identification method, device, equipment and medium
CN116189215A (en) * 2022-12-30 2023-05-30 中国人民财产保险股份有限公司 Automatic auditing method and device, electronic equipment and storage medium
CN118450342A (en) * 2024-07-05 2024-08-06 深圳博瑞天下科技有限公司 Method and device for processing short message node overall under high throughput

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016107A (en) * 2017-04-12 2017-08-04 四川九鼎瑞信软件开发有限公司 The analysis of public opinion method and system
CN109241418A (en) * 2018-08-22 2019-01-18 中国平安人寿保险股份有限公司 Abnormal user recognition methods and device, equipment, medium based on random forest
CN111666502A (en) * 2020-07-08 2020-09-15 腾讯科技(深圳)有限公司 Abnormal user identification method and device based on deep learning and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11288297B2 (en) * 2017-11-29 2022-03-29 Oracle International Corporation Explicit semantic analysis-based large-scale classification
CN109325106A (en) * 2018-07-31 2019-02-12 厦门快商通信息技术有限公司 A kind of U.S. chat robots intension recognizing method of doctor and device
CN109934260A (en) * 2019-01-31 2019-06-25 中国科学院信息工程研究所 Image, text and data fusion sensibility classification method and device based on random forest
CN111368076B (en) * 2020-02-27 2023-04-07 中国地质大学(武汉) Bernoulli naive Bayesian text classification method based on random forest
CN112527958A (en) * 2020-12-11 2021-03-19 平安科技(深圳)有限公司 User behavior tendency identification method, device, equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022121163A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 User behavior tendency identification method, apparatus, and device, and storage medium
CN114676961A (en) * 2022-02-23 2022-06-28 深圳中科闻歌科技有限公司 Enterprise external migration risk prediction method and device and computer readable storage medium
CN114663143A (en) * 2022-03-21 2022-06-24 平安健康保险股份有限公司 Intervention user screening method and device based on differential intervention response model
CN115620853A (en) * 2022-09-07 2023-01-17 国家康复辅具研究中心 Model training method for TMS strategy automatic selection, automatic selection method and system
CN116468096A (en) * 2023-03-30 2023-07-21 之江实验室 Model training method, device, equipment and readable storage medium
CN116468096B (en) * 2023-03-30 2024-01-02 之江实验室 Model training method, device, equipment and readable storage medium

Also Published As

Publication number Publication date
WO2022121163A1 (en) 2022-06-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination