WO2022121163A1 - User behavior tendency identification method, apparatus, and device, and storage medium

Info

Publication number
WO2022121163A1
Authority
WO
WIPO (PCT)
Prior art keywords
text information
keyword
user
voting
computer
Prior art date
Application number
PCT/CN2021/083480
Inventor
卢春曦
王健宗
黄章成
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022121163A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • The present application relates to the field of artificial intelligence, and in particular to a user behavior tendency identification method, apparatus, device, and storage medium.
  • the main purpose of this application is to solve the technical problem of how to flexibly identify user behavior tendencies.
  • A first aspect of the present application provides a method for identifying a user behavior tendency, which includes: acquiring a plurality of pieces of first text information published by a plurality of sample users with a determined behavior tendency, and a first record parameter corresponding to each piece of first text information; extracting a plurality of keywords from the first text information, counting the number of occurrences of each keyword in the first text information, and performing vectorization processing to obtain a plurality of keyword vectors; using the keyword vectors and the first record parameters as training samples, and randomly extracting multiple samples from the training samples multiple times to obtain multiple training sets; constructing, with reference to preset discriminant indicators, a decision tree corresponding to each training set, and generating a corresponding random forest model from the decision trees; acquiring a plurality of pieces of second text information published by a user to be detected and a second record parameter corresponding to each piece of second text information; inputting the second text information and the second record parameters into the random forest model for voting to obtain a voting result; and determining, according to the voting result, whether the user to be detected has the behavior tendency.
  • A second aspect of the present application provides a user behavior tendency identification device, comprising a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps: acquiring a plurality of pieces of first text information published by a plurality of sample users with a determined behavior tendency, and a first record parameter corresponding to each piece of first text information; extracting a plurality of keywords from the first text information, counting the number of occurrences of each keyword in the first text information, and performing vectorization processing to obtain a plurality of keyword vectors; using the keyword vectors and the first record parameters as training samples, and randomly extracting multiple samples from the training samples multiple times to obtain multiple training sets; constructing, with reference to preset discriminant indicators, a decision tree corresponding to each training set, and generating a corresponding random forest model from the decision trees; acquiring a plurality of pieces of second text information published by a user to be detected and a second record parameter corresponding to each piece of second text information; inputting the second text information and the second record parameters into the random forest model for voting to obtain a voting result; and determining, according to the voting result, whether the user to be detected has the behavior tendency.
  • A third aspect of the present application provides a computer-readable storage medium storing computer instructions which, when executed on a computer, cause the computer to perform the following steps: acquiring a plurality of pieces of first text information published by a plurality of sample users with a determined behavior tendency, and a first record parameter corresponding to each piece of first text information; extracting a plurality of keywords from the first text information, counting the number of occurrences of each keyword in the first text information, and performing vectorization processing to obtain a plurality of keyword vectors; using the keyword vectors and the first record parameters as training samples, and randomly extracting multiple samples from the training samples multiple times to obtain multiple training sets; constructing, with reference to preset discriminant indicators, a decision tree corresponding to each training set, and generating a corresponding random forest model from the decision trees; acquiring a plurality of pieces of second text information published by a user to be detected and a second record parameter corresponding to each piece of second text information; inputting the second text information and the second record parameters into the random forest model for voting to obtain a voting result; and determining, according to the voting result, whether the user to be detected has the behavior tendency.
  • A fourth aspect of the present application provides a user behavior tendency identification apparatus, comprising: a first acquisition module, configured to acquire a plurality of pieces of first text information published by a plurality of sample users with a determined behavior tendency, and a first record parameter corresponding to each piece of first text information; a vectorization module, configured to extract a plurality of keywords from the first text information, count the number of occurrences of each keyword in the first text information, and perform vectorization processing to obtain a plurality of keyword vectors; a sampling module, configured to use the keyword vectors and the first record parameters as training samples and randomly extract multiple samples from the training samples multiple times to obtain multiple training sets; a building module, configured to construct, with reference to preset discriminant indicators, a decision tree corresponding to each training set and generate a corresponding random forest model from the decision trees; a second acquisition module, configured to acquire a plurality of pieces of second text information published by a user to be detected and a second record parameter corresponding to each piece of second text information; a voting module, configured to input the second text information and the second record parameters into the random forest model for voting to obtain a voting result; and a determining module, configured to determine, according to the voting result, whether the user to be detected has the behavior tendency.
  • Speech data published by users with the same type of characteristic behavior tendency is first collected, and the keywords in this speech data are extracted as a characteristic representation of this type of user. The speech data is then used as training samples for machine learning to build a random forest model. The speech data of the user to be detected is input into the model for identification, to determine whether the user to be detected and the sample users have the same behavioral characteristics; if so, it can be determined that they have the same behavioral tendency.
  • This application can thus extract the speech features of users with the same type of characteristic behavior tendency, train a random forest model through machine learning, and then assess users whose behavioral tendency is unknown to determine whether they have the same type of behavioral tendency.
  • FIG. 1 is a schematic diagram of a first embodiment of the method for identifying a user behavior tendency in an embodiment of the present application;
  • FIG. 2 is a schematic diagram of a second embodiment of the method for identifying a user behavior tendency in an embodiment of the present application;
  • FIG. 3 is a schematic diagram of an embodiment of the user behavior tendency identification apparatus in an embodiment of the present application;
  • FIG. 4 is a schematic diagram of an embodiment of the user behavior tendency identification device in an embodiment of the present application.
  • Embodiments of the present application provide a user behavior tendency identification method, apparatus, device, and storage medium.
  • The terms “first”, “second”, “third”, “fourth”, etc. (if present) in the description and claims of the present application and in the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or precedence. It is to be understood that data so used can be interchanged under appropriate circumstances, so that the embodiments described herein can be practiced in sequences other than those illustrated or described herein.
  • the first embodiment of the method for identifying a user behavior tendency in the embodiment of the present application includes:
  • the execution subject of the present application may be a user behavior tendency identification device, and may also be a terminal or a server, which is not specifically limited here.
  • the embodiments of the present application take the server as an execution subject as an example for description.
  • a sample user with a certain behavioral tendency may be a sample user with a desire to purchase a certain commodity, a sample user with a certain negative behavioral tendency, a sample user with a certain psychological characteristic, etc.
  • the behavioral tendency of sample users determines the recognition type of the model, and models with different recognition types can identify users with different types of behavioral tendencies.
  • the speech characteristic words of the users with the same type of behavior tendency are extracted from a large amount of sample data, and then the speech of the unknown user is compared with the characteristic words, and then combined with other discriminant indicators, so as to determine whether the unknown user has the same feature.
  • the speech information of the sample users is analyzed.
  • The hit rate of the keywords in the text information published by the sample users needs to be counted. This hit rate is used as one of the discriminant indicators when identifying unknown users and is an important reference.
  • step 102 includes:
  • According to the extracted keywords, determine the keywords contained in the first text information published by each sample user, and count the number of occurrences of each keyword in that first text information;
  • Perform vector transformation on the number of occurrences of each keyword to obtain a keyword vector corresponding to each sample user.
  • the text needs to be converted into a vector first.
  • the vector of the text refers to the number of occurrences of each keyword, for example, the extracted texts published by all sample users
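The counting step described above can be sketched as follows. This is a minimal illustration, assuming each user's texts have already been segmented into word lists; the function name `keyword_vector` and its parameters are hypothetical and do not come from the application.

```python
from collections import Counter

def keyword_vector(tokenized_texts, keywords):
    """Return one count per keyword, in keyword order, for a single user.

    tokenized_texts: list of token lists (one list per published text).
    keywords: ordered list of reference keywords (hypothetical input).
    """
    keyword_set = set(keywords)
    counts = Counter()
    for tokens in tokenized_texts:
        # Only tokens that are reference keywords contribute to the vector.
        counts.update(t for t in tokens if t in keyword_set)
    # Counter returns 0 for keywords that never occurred.
    return [counts[k] for k in keywords]
```

A keyword that never appears simply yields a zero in its position, so every user gets a vector of the same length.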
  • The keyword vector corresponding to each sample user and the first record parameters are used as training samples. Multiple samples are randomly drawn from the training samples with replacement to obtain the training sets, and a decision tree is constructed for each training set to generate a random forest model.
  • Sampling is random so that each decision tree is different and produces different classification results; sampling is done with replacement so that the training sets of the decision trees overlap, which avoids one-sided decisions.
  • The result is produced by voting among these decision trees, and this voting should reach a "consensus". If the results generated by the decision trees were completely independent, the final voting result would not help solve the problem at all. Therefore, in this embodiment the multiple training sets are obtained by random sampling with replacement.
  • the sample data in a training set is used as the generation data of a decision tree.
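Drawing several training sets with replacement, as described above, can be sketched as follows. This is only an illustration; the function name, the `seed` parameter, and the choice to fix each set's size are assumptions rather than details given in the application.

```python
import random

def bootstrap_training_sets(samples, n_sets, set_size, seed=0):
    """Draw n_sets training sets, each of set_size samples, with replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [
        [rng.choice(samples) for _ in range(set_size)]  # same sample may repeat
        for _ in range(n_sets)
    ]
```

Because every draw is made from the full sample pool, different training sets can share samples, which is the overlap between decision trees that the text relies on.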
  • The CART algorithm is preferred for generating the classification decision trees.
  • The inputs of the algorithm are the training set, the Gini index threshold, and the sample number threshold; the output is a decision tree.
  • The generation process starts from the root node and uses the training set to build the CART classification tree recursively.
  • When a stopping condition is met, the decision subtree is returned and the current node stops recursing. Here the features are the preset discriminant indicators.
  • Calculate the Gini index of the sample set; if the Gini index is less than the threshold, return the decision subtree and stop recursing at the current node. Otherwise, calculate the Gini index of each feature value of each feature of the current node with respect to the data set, select the feature and corresponding feature value with the smallest Gini index as the classification node, establish leaf nodes, and recursively execute the algorithm from the beginning until the conditions for generating a decision tree are met.
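The Gini-based split selection can be illustrated with a short sketch. It assumes numeric feature values and binary splits of the form "value <= threshold", which is how CART is commonly implemented; the application does not spell out these details, and all names below are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of a label list: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(rows, labels, feature, value):
    """Weighted Gini index after splitting on rows[i][feature] <= value."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= value]
    right = [y for x, y in zip(rows, labels) if x[feature] > value]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(rows, labels):
    """Return the (feature index, value) pair with the smallest split Gini."""
    candidates = {(f, x[f]) for x in rows for f in range(len(x))}
    # Ties are broken by the (feature, value) pair to keep the result stable.
    return min(candidates, key=lambda fv: (split_gini(rows, labels, *fv), fv))
```

A full tree builder would call `best_split` recursively on each side of the split and stop once the Gini index or the sample count falls below its threshold.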
  • step 104 includes:
  • Using the classification and regression tree (CART) algorithm, with the preset discriminant indicators as the features for decision tree feature selection, perform decision tree classification on the training samples in each training set to obtain a plurality of decision trees;
  • the decision trees are sequentially combined to obtain a random forest model, wherein the discriminant indicators include keyword vectors, the number of different keywords hit, the total number of hit keywords, average text length, sensitive speech time and sensitive speech days.
  • step 104 further includes:
  • S1 select a discriminant index as a root node, and calculate the Gini index of each discriminant index value corresponding to the root node to the training set;
  • The target vector serves as one of the parameters input into the random forest model.
  • The target vector in this example is acquired in a way similar to the keyword vectors of the sample users, which can be used for reference.
  • Each tree is a weak classifier.
  • The classification results of several weak classifiers are combined by voting to form a strong classifier, which is the bagging idea behind random forests.
  • step 106 includes:
  • The voting result is either having the behavioral tendency or not having the behavioral tendency. For example, if 80% of the decision trees classify the user as having the behavioral tendency and 20% classify the user as not having it, then 80% and 20% are the voting ratios, and the result with the higher voting ratio is taken as the recognition result of the model; that is, the user to be detected has the same behavioral tendency as the sample users.
  • step 107 includes:
  • the behavioral tendency with the highest voting ratio is taken as the behavioral tendency of the user to be detected.
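The voting step can be sketched in a few lines. This is a minimal illustration of majority voting over per-tree predictions; the function name `forest_vote` and the boolean labels are assumptions made for the example.

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Return the label with the highest voting ratio and all voting ratios."""
    counts = Counter(tree_predictions)
    total = len(tree_predictions)
    ratios = {label: c / total for label, c in counts.items()}
    # On a tie, the label encountered first among the predictions wins.
    winner = max(ratios, key=ratios.get)
    return winner, ratios

# Ten trees: eight vote "has the tendency" (True), two vote False.
winner, ratios = forest_vote([True] * 8 + [False] * 2)
```

With the 80% / 20% split from the example in the text, `winner` is `True` and `ratios` maps `True` to 0.8 and `False` to 0.2.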
  • Speech data published by users with the same type of characteristic behavior tendency is first collected, and the keywords in this speech data are extracted as a characteristic representation of this type of user. The speech data is then used as training samples for machine learning to build a random forest model. The speech data of the user to be detected is input into the model for identification, to determine whether the user to be detected and the sample users have the same behavioral characteristics; if so, it can be determined that they have the same behavioral tendency.
  • This application can thus extract the speech features of users with the same type of characteristic behavior tendency, train a random forest model through machine learning, and then assess users whose behavioral tendency is unknown to determine whether they have the same type of behavioral tendency.
  • the second embodiment of the method for identifying user behavior tendency in the embodiment of the present application includes:
  • word segmentation processing needs to be performed on the text before extracting keywords in the text information.
  • Word segmentation is the basis for processing natural language, so that the machine can understand human language.
  • the NLP word segmentation algorithm is preferred to perform word segmentation processing on the original text, thereby extracting keywords.
  • the NLP word segmentation algorithm is in the prior art and will not be repeated here.
  • the TF-IDF algorithm is used to determine speech keywords.
  • The TF-IDF algorithm is a term frequency-inverse document frequency algorithm based on a discrete bag-of-words representation, and is used to evaluate how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases as its frequency in the corpus increases.
  • The discrimination degree W_i of word i is calculated from the following quantities:
  • tf_i is the frequency of word i in the document after tokenization;
  • N is the total number of documents in the corpus;
  • df_i is the number of documents containing word i.
  • the distinguishing degree of each word is sorted, and the top N words of the distinguishing degree are used as the keywords of the behavior-oriented user, which are used as the benchmark data for the vectorization processing of the keywords of the sample users, where N is a preset parameter, N is an integer greater than 0.
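The ranking step can be sketched as follows. The formula itself did not survive in the text above, so this sketch assumes the common form W_i = tf_i · log(N / df_i), which is consistent with the three quantities defined but is not confirmed by the source. It also assumes tf_i is the word's relative frequency over all sample texts; the function name `top_keywords` and the parameter `top_n` are hypothetical.

```python
import math
from collections import Counter

def top_keywords(docs, top_n):
    """Rank words by an assumed TF-IDF score and return the top_n words.

    docs: list of tokenized documents (lists of words).
    """
    n_docs = len(docs)
    # df: number of documents containing each word.
    df = Counter(w for doc in docs for w in set(doc))
    # tf: relative frequency of each word over all documents (an assumption).
    tf = Counter(w for doc in docs for w in doc)
    total = sum(tf.values())
    scores = {w: (tf[w] / total) * math.log(n_docs / df[w]) for w in tf}
    # Sort by descending score; break ties alphabetically for stability.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]
```

A word that occurs in every document gets log(N / df_i) = 0 and therefore ranks last, which matches the text's point that words common across the corpus are less discriminative.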
  • the keywords in the text information play a crucial role, and can represent the speech characteristics of users with this type of behavior tendency. Keywords in all textual information.
  • the extraction method is to first perform word segmentation on long text information to obtain words that cannot be further divided, then calculate the frequency of occurrence of these words, and use multiple words with high frequency as keywords.
  • the present application obtains representative characteristic keywords through the analysis and calculation of a large amount of data.
  • Used as one of the discriminant indicators for behavioral tendency identification, the keywords help predict the user's behavioral tendency. Combined with the other discriminant indicators, they allow the user's behavioral tendency to be identified accurately, so that further intervention can be taken.
  • An embodiment of the device for identifying the user behavior tendency in the embodiment of the present application includes:
  • a first obtaining module 301 configured to obtain a plurality of pieces of first text information published by a plurality of sample users with certain behavioral tendencies and first record parameters corresponding to the first text information;
  • the vectorization module 302 is configured to extract a plurality of keywords in the first text information, count the number of times each keyword appears in the first text information, and perform vectorization processing to obtain a plurality of keyword vectors;
  • Sampling module 303 configured to use the keyword vectors and the first recording parameters as training samples, and randomly extract multiple samples from the training samples for multiple times to obtain multiple training sets;
  • the construction module 304 is used for constructing a decision tree corresponding to each training set with reference to a preset discriminant index, and generating a corresponding random forest model according to each decision tree;
  • the second obtaining module 305 is configured to obtain multiple pieces of second text information published by the user to be detected and second record parameters corresponding to the second texts;
  • a voting module 306, configured to input the second text information and the second record parameters into the random forest model for voting to obtain a voting result
  • the determining module 307 is configured to determine whether the user to be detected has the behavioral tendency according to the voting result.
  • the vectorization module 302 includes:
  • a keyword extraction unit configured to perform word segmentation processing on the first text information to obtain a plurality of word units; use the TF-IDF algorithm to calculate the degree of discrimination of the word units; sort the degree of discrimination of the word units , and extract the word unit with the highest degree of discrimination as the keyword from the ranking result.
  • the vectorization module 302 further includes:
  • a vector transformation unit configured to determine, according to the keywords, the keywords contained in the first text information published by each sample user; count the number of occurrences of each keyword in the first text information published by each sample user; and perform vector transformation on the number of occurrences of each keyword to obtain a keyword vector corresponding to each sample user.
  • the building module 304 is specifically used for:
  • using the classification and regression tree (CART) algorithm, with the preset discriminant indicators as the features for decision tree feature selection, perform decision tree classification on the training samples in each training set to obtain a plurality of decision trees;
  • the decision trees are sequentially combined to obtain a random forest model, wherein the discriminant indicators include keyword vectors, the number of different keywords hit, the total number of hit keywords, average text length, sensitive speech time and sensitive speech days.
  • the building module 304 includes:
  • a computing unit configured to select a discriminant index as a root node, and calculate the Gini index of each discriminant index value corresponding to the root node to the training set;
  • a judgment unit configured to judge whether each Gini index is greater than a preset first threshold and the number of samples in the sample set is greater than a preset second threshold
  • a dividing unit configured to, if each Gini index is greater than the preset first threshold and the number of samples in the sample set is greater than the preset second threshold, divide the training set into a plurality of leaf nodes, select the discriminant index value with the smallest Gini index as the root node, and execute the computing unit and the judgment unit cyclically;
  • a generating unit configured to generate a decision tree corresponding to the training set if each Gini index is less than a preset first threshold or the number of samples in the sample set is less than a preset second threshold.
  • the voting module 306 is specifically configured to:
  • the determining module 307 is specifically configured to:
  • the behavioral tendency with the highest voting ratio is taken as the behavioral tendency of the user to be detected.
  • Speech data published by users with the same type of characteristic behavior tendency is first collected, and the keywords in this speech data are extracted as a characteristic representation of this type of user. The speech data is then used as training samples for machine learning to build a random forest model. The speech data of the user to be detected is input into the model for identification, to determine whether the user to be detected and the sample users have the same behavioral characteristics; if so, it can be determined that they have the same behavioral tendency.
  • This application can thus extract the speech features of users with the same type of characteristic behavior tendency, train a random forest model through machine learning, and then assess users whose behavioral tendency is unknown to determine whether they have the same type of behavioral tendency.
  • Fig. 3 describes the user behavior tendency identification device in the embodiment of the present application in detail from the perspective of modular functional entities, and the following describes the user behavior tendency identification device in the embodiment of the present application in detail from the perspective of hardware processing.
  • FIG. 4 is a schematic structural diagram of a user behavior tendency identification device provided by an embodiment of the present application.
  • the user behavior tendency identification device 400 may vary greatly depending on configuration or performance, and may include one or more central processing units (CPU) 410 (e.g., one or more processors), a memory 420, and one or more storage media 430 (e.g., one or more mass storage devices) that store application programs 433 or data 432.
  • the memory 420 and the storage medium 430 may be short-term storage or persistent storage.
  • the program stored in the storage medium 430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the user behavior tendency recognition device 400 .
  • the processor 410 may be configured to communicate with the storage medium 430 to execute a series of instruction operations in the storage medium 430 on the user behavior tendency identification device 400 .
  • the user behavior tendency identification device 400 may also include one or more power supplies 440, one or more wired or wireless network interfaces 450, one or more input and output interfaces 460, and/or one or more operating systems 431, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and more.
  • the present application also provides a user behavior tendency identification device, comprising a memory and at least one processor, wherein instructions are stored in the memory, and the memory and the at least one processor are interconnected by a line; the at least one processor invokes the instructions in the memory, so that the user behavior tendency identification device executes the steps of the above-mentioned user behavior tendency identification method.
  • the present application also provides a computer-readable storage medium, and the computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.
  • the computer-readable storage medium stores computer instructions, and when the computer instructions are executed on the computer, the computer performs the following steps:
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A user behavior tendency identification method, apparatus, and device, and a storage medium, which relate to the field of artificial intelligence. The method comprises: obtaining a plurality of pieces of text information published by a plurality of sample users that have a determined behavior tendency and recording parameters; extracting a plurality of keywords from within the pieces of text information and converting the keywords into keyword vectors; using the keyword vectors and the recording parameters as training samples and randomly extracting a plurality of samples from the training samples to obtain a plurality of training sets; constructing a plurality of decision trees according to a preset discrimination indicator and generating a random forest model; and inputting text information published by a user to be detected and corresponding recording parameters into the random forest model for voting, and according to the voting result, determining whether the user has the behavior tendency. User behavior tendencies can be determined quickly by means of speech information published by a user.

Description

用户行为倾向识别方法、装置、设备及存储介质User behavior tendency identification method, device, equipment and storage medium
本申请要求于2020年12月11日提交中国专利局、申请号为202011436696.4、发明名称为“用户行为倾向识别方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在申请中。This application claims the priority of the Chinese patent application with the application number 202011436696.4 and the invention titled "User Behavior Tendency Recognition Method, Apparatus, Equipment and Storage Medium" filed with the China Patent Office on December 11, 2020, the entire contents of which are by reference incorporated in the application.
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a user behavior tendency identification method, apparatus, device, and storage medium.
Background
With the development of the Internet, information spreads across the network ever more quickly and widely, and the mass of published speech affects users in different ways. In particular, remarks made by users with negative behavior tendencies may trigger group effects and lead to serious consequences. If a platform that carries this information can identify users with negative behavior tendencies in advance and intervene, the impact of such consequences can be reduced.
The inventors realized that undesirable user speech is currently handled mainly by sensitive-word blocking. This approach can block only known sensitive words; negative but non-sensitive psychological vocabulary cannot be blocked, so its influence cannot be eliminated this way. Users with a particular characteristic behavior tendency are also difficult for a computer to identify and can only be determined after the fact.
Summary of the Invention
The main purpose of the present application is to solve the technical problem of how to flexibly identify user behavior tendencies.
To achieve the above purpose, a first aspect of the present application provides a user behavior tendency identification method, comprising: acquiring a plurality of pieces of first text information published by a plurality of sample users having a determined behavior tendency, and first record parameters corresponding to each piece of first text information; extracting a plurality of keywords from the first text information, counting the number of occurrences of each keyword in each piece of first text information, and performing vectorization to obtain a plurality of keyword vectors; using the keyword vectors and the first record parameters as training samples, and randomly drawing a plurality of samples from the training samples multiple times to obtain a plurality of training sets; constructing, with reference to preset discriminant indicators, a decision tree corresponding to each training set, and generating a corresponding random forest model from the decision trees; acquiring a plurality of pieces of second text information published by a user to be detected, and second record parameters corresponding to each piece of second text information; inputting the second text information and the second record parameters into the random forest model for voting to obtain a voting result; and determining, according to the voting result, whether the user to be detected has the behavior tendency.
A second aspect of the present application provides a user behavior tendency identification device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps: acquiring a plurality of pieces of first text information published by a plurality of sample users having a determined behavior tendency, and first record parameters corresponding to each piece of first text information; extracting a plurality of keywords from the first text information, counting the number of occurrences of each keyword in each piece of first text information, and performing vectorization to obtain a plurality of keyword vectors; using the keyword vectors and the first record parameters as training samples, and randomly drawing a plurality of samples from the training samples multiple times to obtain a plurality of training sets; constructing, with reference to preset discriminant indicators, a decision tree corresponding to each training set, and generating a corresponding random forest model from the decision trees; acquiring a plurality of pieces of second text information published by a user to be detected, and second record parameters corresponding to each piece of second text information; inputting the second text information and the second record parameters into the random forest model for voting to obtain a voting result; and determining, according to the voting result, whether the user to be detected has the behavior tendency.
A third aspect of the present application provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps: acquiring a plurality of pieces of first text information published by a plurality of sample users having a determined behavior tendency, and first record parameters corresponding to each piece of first text information; extracting a plurality of keywords from the first text information, counting the number of occurrences of each keyword in each piece of first text information, and performing vectorization to obtain a plurality of keyword vectors; using the keyword vectors and the first record parameters as training samples, and randomly drawing a plurality of samples from the training samples multiple times to obtain a plurality of training sets; constructing, with reference to preset discriminant indicators, a decision tree corresponding to each training set, and generating a corresponding random forest model from the decision trees; acquiring a plurality of pieces of second text information published by a user to be detected, and second record parameters corresponding to each piece of second text information; inputting the second text information and the second record parameters into the random forest model for voting to obtain a voting result; and determining, according to the voting result, whether the user to be detected has the behavior tendency.
A fourth aspect of the present application provides a user behavior tendency identification apparatus, comprising: a first acquisition module, configured to acquire a plurality of pieces of first text information published by a plurality of sample users having a determined behavior tendency, and first record parameters corresponding to each piece of first text information; a vectorization module, configured to extract a plurality of keywords from the first text information, count the number of occurrences of each keyword in each piece of first text information, and perform vectorization to obtain a plurality of keyword vectors; a sampling module, configured to use the keyword vectors and the first record parameters as training samples and randomly draw a plurality of samples from the training samples multiple times to obtain a plurality of training sets; a construction module, configured to construct, with reference to preset discriminant indicators, a decision tree corresponding to each training set, and generate a corresponding random forest model from the decision trees; a second acquisition module, configured to acquire a plurality of pieces of second text information published by a user to be detected, and second record parameters corresponding to each piece of second text information; a voting module, configured to input the second text information and the second record parameters into the random forest model for voting to obtain a voting result; and a determination module, configured to determine, according to the voting result, whether the user to be detected has the behavior tendency.
In the technical solution provided by the present application, speech data published by users with the same type of characteristic behavior tendency is first collected, and keywords are extracted from that data as a feature representation of this type of user. The speech data is then used as machine-learning training samples to build a random forest model, and the speech data of a user to be detected is input into the model to judge whether that user and the sample users share the same behavioral characteristics. If they do, the user to be detected is determined to have the same behavior tendency as the sample users. The present application can thus extract the speech features of users with the same type of characteristic behavior tendency, train a random forest model by machine learning, and then identify users whose behavior tendency is unknown to determine whether they have that type of behavior tendency.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a first embodiment of the user behavior tendency identification method in the embodiments of the present application;
FIG. 2 is a schematic diagram of a second embodiment of the user behavior tendency identification method in the embodiments of the present application;
FIG. 3 is a schematic diagram of an embodiment of the user behavior tendency identification apparatus in the embodiments of the present application;
FIG. 4 is a schematic diagram of an embodiment of the user behavior tendency identification device in the embodiments of the present application.
Detailed Description
Embodiments of the present application provide a user behavior tendency identification method, apparatus, device, and storage medium. The terms "first", "second", "third", "fourth", and the like (if present) in the description, the claims, and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.
For ease of understanding, the specific flow of the embodiments of the present application is described below. Referring to FIG. 1, a first embodiment of the user behavior tendency identification method in the embodiments of the present application includes:
101. Acquire a plurality of pieces of first text information published by a plurality of sample users having a determined behavior tendency, and first record parameters corresponding to each piece of first text information;
It can be understood that the execution subject of the present application may be a user behavior tendency identification apparatus, or may be a terminal or a server, which is not specifically limited here. The embodiments of the present application are described with a server as the execution subject.
In this embodiment, because the present application needs to determine whether an unknown user has a certain type of behavior tendency, the speech features of that type of user must first be determined, and the behavior tendency of the unknown user is then judged by feature matching. A large amount of sample data is therefore required in order to extract the speech features that best represent users with the same type of behavior tendency. This embodiment acquires the text information published by sample users having a determined behavior tendency, together with the publishing record corresponding to each piece of text information, for use in extracting speech feature keywords and as machine-learning training samples.
In this embodiment, a sample user having a determined behavior tendency may be a sample user who wishes to purchase a certain commodity, a sample user with a certain negative behavior tendency, a sample user with a certain psychological characteristic, and so on; examples include users who purchase private jets, users with suicidal tendencies, and users with depression. The behavior tendency of the sample users determines the identification type of the model, and models of different identification types can identify users with different types of behavior tendency.
102. Extract a plurality of keywords from the first text information, count the number of occurrences of each keyword in each piece of first text information, and perform vectorization to obtain a plurality of keyword vectors;
In this embodiment, the characteristic speech vocabulary of users with the same type of behavior tendency is extracted from a large amount of sample data. The speech of an unknown user is then compared against this vocabulary and combined with other discriminant indicators to determine whether the unknown user has the same characteristics.
In this embodiment, after the keywords in the speech of users with the particular behavior tendency have been extracted, the speech information of the sample users is analyzed. The analysis counts the keyword hit rate in the text information published by each sample user; this hit rate serves as one of the discriminant indicators when identifying unknown users and is a highly informative reference.
Optionally, step 102 includes:
determining, according to the keywords, the keywords contained in the first text information published by each sample user;
counting the number of occurrences of each keyword in the first text information published by each sample user;
converting the occurrence counts of the keywords into a vector to obtain the keyword vector corresponding to each sample user.
In this optional embodiment, calculating the keyword hit rate requires first converting the text into a vector. Here, the vector of a text refers to the number of occurrences of each keyword. For example, if the keywords extracted from the text information published by all sample users are D = (T1, T2, T3, T4, T5), and the occurrence counts of these keywords in the text information published by one sample user are W = (5, 2, 0, 1, 0), then W can be used as the keyword vector of that sample user.
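The counting step above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `keyword_vector`, the English keywords, and the sample posts are hypothetical, and plain substring counting stands in for whatever matching the real system applies after word segmentation.

```python
from collections import Counter

def keyword_vector(texts, keywords):
    """Count how often each keyword occurs across one user's posts.

    Assumption: simple substring counting; a real system would likely
    count segmented word units instead.
    """
    counts = Counter()
    for text in texts:
        for kw in keywords:
            counts[kw] += text.count(kw)
    # Keep the order of the keyword list D so every user's vector is aligned.
    return [counts[kw] for kw in keywords]

# Hypothetical keyword list D = (T1..T5) and one sample user's posts.
keywords = ["buy", "jet", "loan", "price", "pilot"]
posts = ["buy buy a jet", "buy now, buy later, buy again at any price", "jet"]
vec = keyword_vector(posts, keywords)
print(vec)  # [5, 2, 0, 1, 0]
```

The resulting list plays the role of W in the example above; keeping a fixed keyword order is what makes vectors from different users comparable.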
103. Use the keyword vectors and the first record parameters as training samples, and randomly draw a plurality of samples from the training samples multiple times to obtain a plurality of training sets;
In this embodiment, the keyword vector and the first record parameters of each sample user are used as a training sample. A plurality of samples are drawn at random, with replacement, from the training samples to obtain a training set, and one decision tree is built from each training set, thereby generating the random forest model. Random sampling is used so that every decision tree is different and therefore produces different classification results. Sampling with replacement is used so that the decision trees overlap, which avoids one-sided decisions. The final result is produced by a vote among these trees, and such a vote should seek common ground: if the result of each tree were completely independent, the final vote would not help solve the problem at all. This embodiment therefore obtains the training sets by repeated random sampling with replacement.
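Sampling with replacement, as described above, can be sketched as follows. The function name `bootstrap_sets`, the fixed seed, and the toy samples (a keyword vector paired with a 0/1 label) are assumptions made for illustration; the source does not specify set sizes or how record parameters are encoded.

```python
import random

def bootstrap_sets(samples, n_sets, set_size=None, seed=0):
    """Draw n_sets training sets by random sampling with replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    size = len(samples) if set_size is None else set_size
    return [[rng.choice(samples) for _ in range(size)] for _ in range(n_sets)]

# Hypothetical training samples: (keyword vector, label).
samples = [
    ([5, 2, 0, 1, 0], 1),
    ([0, 0, 1, 0, 0], 0),
    ([3, 1, 0, 0, 2], 1),
    ([0, 1, 0, 0, 0], 0),
]
training_sets = bootstrap_sets(samples, n_sets=3)
```

Because each draw is independent, a sample may appear several times in one training set and not at all in another, which is what makes the resulting trees differ while still overlapping.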
104. With reference to preset discriminant indicators, construct a decision tree corresponding to each training set, and generate a corresponding random forest model from the decision trees;
In this embodiment, the sample data in one training set is used to generate one decision tree. This embodiment preferably uses the CART algorithm to generate classification decision trees; the inputs of the algorithm are the training set, a Gini index threshold, and a sample count threshold, and the output is a decision tree. Generation starts from the root node and recursively builds the CART classification tree from the training set. When the number of samples is less than the preset threshold or no features remain, the decision subtree is returned and recursion stops at the current node; in this embodiment, the features are the preset discriminant indicators. The Gini index of the sample set is calculated, and if it is less than the threshold, the decision subtree is returned and recursion stops at the current node. Otherwise, the Gini index of the data set is calculated for each value of each feature available at the current node, the feature and feature value with the smallest Gini index are selected as the classification node, leaf nodes are established, and the algorithm is executed recursively from the beginning until the conditions for generating the decision tree are met.
Optionally, step 104 includes:
using the classification and regression tree (CART) algorithm, with the preset discriminant indicators as the feature selection of the decision tree, performing decision tree classification on the training samples in each training set to obtain a plurality of decision trees;
combining the decision trees in sequence to obtain the random forest model, wherein the discriminant indicators include the keyword vector, the number of distinct keywords hit, the total number of keyword hits, the average text length, the sensitive speech time, and the number of sensitive speech days.
Optionally, step 104 further includes:
S1. Select a discriminant indicator as the root node, and calculate the Gini index of the training set for each discriminant indicator value corresponding to the root node;
S2. Judge whether each Gini index is greater than a preset first threshold and whether the number of samples in the sample set is greater than a preset second threshold;
S3. If so, divide the training set into a plurality of leaf nodes, select the discriminant indicator value with the smallest Gini index as the root node, and execute S1-S2 in a loop;
S4. If not, generate the decision tree corresponding to the training set.
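The Gini-based split selection and the two stopping thresholds in the steps above can be sketched as below. This is a minimal illustration rather than the patented implementation: the function names, the binary "less than or equal to a value" split form, and the toy rows are assumptions, and only a single split is searched instead of a full recursive tree.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a label list: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def should_split(labels, gini_threshold, min_samples):
    """Continue only if impurity and sample count both exceed their thresholds."""
    return gini(labels) > gini_threshold and len(labels) > min_samples

def best_split(rows, labels):
    """Return (weighted_gini, feature_index, value) of the best binary split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

# Hypothetical rows: [total keyword hits, distinct keywords hit]; label 1 = has tendency.
rows = [[5, 2], [3, 1], [0, 1], [0, 0]]
labels = [1, 1, 0, 0]
split = best_split(rows, labels)
print(gini(labels), split)  # 0.5 (0.0, 0, 0)
```

In this toy data, splitting on the first feature at value 0 separates the two classes completely, so the weighted Gini index of the split drops from 0.5 to 0.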
105. Acquire a plurality of pieces of second text information published by a user to be detected, and second record parameters corresponding to each piece of second text information;
106. Input the second text information and the second record parameters into the random forest model for voting to obtain a voting result;
In this embodiment, after the text information and record parameters of the user to be detected have been acquired, the number of keyword occurrences in the second text information is counted to obtain a target vector, which is one of the parameters input into the random forest model. The target vector in this embodiment is obtained in a manner similar to the keyword vectors of the sample users, which may be referred to for comparison.
In this embodiment, the random forest contains multiple classification trees, each of which is a weak classifier. The classification results of several weak classifiers are combined by voting to form a strong classifier; this is the bagging idea behind random forests.
Optionally, step 106 includes:
counting the number of occurrences of each keyword in the second text information and converting the counts into a vector to obtain a target vector;
inputting the target vector and the second record parameters into the random forest model for classification to obtain classification results;
having all decision trees in the random forest model vote on the classification results to obtain the voting result.
107. Determine, according to the voting result, whether the user to be detected has the behavior tendency.
In this embodiment, the voting result includes having the behavior tendency and/or not having the behavior tendency. For example, if 80% of the decision trees classify the user as having the behavior tendency and 20% classify the user as not having it, then 80% and 20% are the voting ratios. The voting result with the higher ratio is taken as the identification result of the model; in this example, the detected user is identified as having the same behavior tendency as the sample users.
Optionally, step 107 includes:
acquiring the voting results of all decision trees in the random forest model, wherein each voting result is having the behavior tendency and/or not having the behavior tendency;
calculating, according to the voting results, the voting ratio corresponding to each behavior tendency;
taking the behavior tendency with the highest voting ratio as the behavior tendency of the user to be detected.
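The voting ratio calculation above can be sketched as follows. The function name `vote` and the 0/1 label encoding are assumptions for illustration; how ties are broken is not stated in the source, so this sketch simply returns whichever tied label was seen first.

```python
from collections import Counter

def vote(tree_predictions):
    """Return the majority label and the voting ratio of every label."""
    counts = Counter(tree_predictions)
    total = len(tree_predictions)
    ratios = {label: c / total for label, c in counts.items()}
    # Ties fall to the first label encountered (an assumption of this sketch).
    winner = max(ratios, key=ratios.get)
    return winner, ratios

# Ten hypothetical trees: eight say "has the tendency" (1), two say "does not" (0).
predictions = [1] * 8 + [0] * 2
winner, ratios = vote(predictions)
print(winner, ratios)  # 1 {1: 0.8, 0: 0.2}
```

This reproduces the 80% and 20% example given earlier: the label with the higher ratio becomes the identification result.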
In the embodiments of the present application, speech data published by users with the same type of characteristic behavior tendency is first collected, and keywords are extracted from that data as a feature representation of this type of user. The speech data is then used as machine-learning training samples to build a random forest model, and the speech data of a user to be detected is input into the model to judge whether that user and the sample users share the same behavioral characteristics. If they do, the user to be detected is determined to have the same behavior tendency as the sample users. The present application can thus extract the speech features of users with the same type of characteristic behavior tendency, train a random forest model by machine learning, and then identify users whose behavior tendency is unknown to determine whether they have that type of behavior tendency.
Referring to FIG. 2, a second embodiment of the user behavior tendency identification method in the embodiments of the present application includes:
201. Acquire a plurality of pieces of first text information published by a plurality of sample users having a determined behavior tendency, and first record parameters corresponding to each piece of first text information;
202. Perform word segmentation on the first text information to obtain a plurality of word units;
203. Calculate the degree of discrimination of each word unit using the TF-IDF algorithm;
204. Sort the word units by degree of discrimination, and extract the word units with the highest degree of discrimination from the sorted result as keywords;
In this optional embodiment, the text must be segmented into words before keywords can be extracted from it. Word segmentation is the basis of natural language processing and allows a machine to work with human language. Many word segmentation algorithms exist; this embodiment preferably uses an NLP word segmentation algorithm to segment the original text so that keywords can be extracted. NLP word segmentation is prior art and is not described further here.
In this optional embodiment, the TF-IDF algorithm is used to determine the speech keywords. TF-IDF is a term frequency-inverse document frequency algorithm based on a discrete bag-of-words representation, used to evaluate how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, and decreases in inverse proportion to how frequently it appears across the corpus. The degree of discrimination $W_i$ of word $i$ is calculated as:
$$W_i = tf_i \times \log\frac{N}{df_i}$$
where $tf_i$ is the frequency of word $i$ in the document after word segmentation, $N$ is the total number of documents in the corpus, and $df_i$ is the number of documents containing word $i$. The following example illustrates how the formula is used.
For example, if a document contains 100 words in total and the word "purchase" appears 4 times, the frequency of "purchase" in that document is 4/100 = 0.04, that is, $tf_i = 0.04$. If "purchase" appears in 1,000 documents and the corpus contains 10,000 documents, its inverse document frequency is $\log(10000/1000) = 1$, taking the logarithm to base 10. Finally, $W_i = 0.04 \times 1 = 0.04$, and this result is the degree of discrimination, or importance, of the word "purchase" in the document set. In this embodiment, the words are sorted by degree of discrimination, and the top N words are taken as the keywords of users with this behavior tendency; they serve as the reference data when the keywords of the sample users are vectorized. Here N is a preset parameter, an integer greater than 0, and is distinct from the corpus size used in the formula.
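The worked example can be checked with a short Python sketch. It assumes the base-10 logarithm implied by the example (log of 10 equals 1); the function names `tf_idf` and `top_n` and the toy scores are hypothetical and not part of the source.

```python
import math

def tf_idf(term_count, doc_length, n_docs, doc_freq):
    """Degree of discrimination: term frequency times log10(N / df)."""
    tf = term_count / doc_length
    idf = math.log10(n_docs / doc_freq)  # base 10 is an assumption from the example
    return tf * idf

def top_n(scores, n):
    """Return the n words with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]

# "purchase": 4 occurrences in a 100-word document, found in 1,000 of 10,000 documents.
w = tf_idf(4, 100, 10000, 1000)

# Hypothetical scores for three word units; keep the two most discriminative.
keywords = top_n({"purchase": w, "the": 0.0, "jet": 0.09}, 2)
```

Here `w` comes out at approximately 0.04, matching the example, and `top_n` mirrors the step of keeping the highest-ranked word units as keywords.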
205. Count the number of occurrences of each keyword in each piece of first text information and perform vectorization to obtain a plurality of keyword vectors;
206. Use the keyword vectors and the first record parameters as training samples, and randomly draw a plurality of samples from the training samples multiple times to obtain a plurality of training sets;
207. With reference to preset discriminant indicators, construct a decision tree corresponding to each training set, and generate a corresponding random forest model from the decision trees;
208. Acquire a plurality of pieces of second text information published by a user to be detected, and second record parameters corresponding to each piece of second text information;
209. Input the second text information and the second record parameters into the random forest model for voting to obtain a voting result;
210. Determine, according to the voting result, whether the user to be detected has the behavior tendency.
In the embodiments of the present application, the keywords in the text information play a crucial role in analyzing user behavior tendencies, because they represent the speech features of users with that type of behavior tendency. The embodiments therefore extract keywords from all the text information published by the sample users. The extraction method first segments long text information into words that cannot be divided further, then calculates how frequently these words occur, and takes multiple high-frequency words as keywords. Through the analysis and calculation of a large amount of data, the present application obtains representative feature keywords, which serve as one of the discriminant indicators for behavior tendency identification and predict user behavior tendencies well. Combined with several other discriminant indicators, they allow user behavior tendencies to be identified accurately so that further intervention can be taken.
The user behavior tendency identification method of the embodiments of the present application has been described above. The user behavior tendency identification apparatus of the embodiments is described below. Referring to FIG. 3, one embodiment of the user behavior tendency identification apparatus includes:
a first obtaining module 301, configured to obtain multiple pieces of first text information published by multiple sample users with a determined behavior tendency, and the first record parameters corresponding to each piece of first text information;
a vectorization module 302, configured to extract multiple keywords from the first text information, count the number of times each keyword occurs in the first text information, and perform vectorization to obtain multiple keyword vectors;
a sampling module 303, configured to take the keyword vectors and the first record parameters as training samples, and repeatedly draw multiple samples at random from the training samples to obtain multiple training sets;
a construction module 304, configured to construct, with reference to preset discriminant indicators, a decision tree for each training set, and generate a corresponding random forest model from the decision trees;
a second obtaining module 305, configured to obtain multiple pieces of second text information published by a user to be detected and the second record parameters corresponding to each piece of second text information;
a voting module 306, configured to input the second text information and the second record parameters into the random forest model for voting, to obtain a voting result;
a determining module 307, configured to determine, according to the voting result, whether the user to be detected has the behavior tendency.
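The repeated random draws performed by the sampling module 303 can be sketched as follows. This is only an illustration under stated assumptions: the application does not say whether sampling is with replacement, so the sketch assumes bootstrap sampling (with replacement), as is common for random forests. The names `bootstrap_sets`, `n_sets`, `set_size`, and `seed`, and the toy samples, are hypothetical.

```python
import random

def bootstrap_sets(samples, n_sets, set_size, seed=0):
    """Draw n_sets training sets, each with set_size samples.

    Assumption: sampling is with replacement (bootstrap); the source text
    only says samples are drawn at random multiple times.
    """
    rng = random.Random(seed)
    return [
        [rng.choice(samples) for _ in range(set_size)]
        for _ in range(n_sets)
    ]

# Each training sample pairs a keyword vector with a label (1 = has tendency).
samples = [([2, 0, 1], 1), ([0, 0, 0], 0), ([1, 3, 0], 1), ([0, 1, 0], 0)]
training_sets = bootstrap_sets(samples, n_sets=3, set_size=4)
```

Each resulting training set would then be used to build one decision tree.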
Optionally, in an embodiment, the vectorization module 302 includes:
a keyword extraction unit, configured to perform word segmentation on the first text information to obtain multiple word units; compute the degree of discrimination of each word unit using the TF-IDF algorithm; and sort the word units by degree of discrimination and extract the word units with the highest degree of discrimination from the sorted result as keywords.
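A minimal sketch of the TF-IDF ranking performed by the keyword extraction unit is given below. It assumes each document is already a list of word units and uses the common form tf × log(N / df). The application does not specify how scores are combined across documents, so taking the maximum per word is an assumption, as are the names `tfidf_keywords` and `k`.

```python
import math
from collections import Counter

def tfidf_keywords(docs, k=2):
    """Rank word units by their highest TF-IDF score across docs.

    docs: list of token lists (already segmented).
    Assumption: a word's score is its maximum TF-IDF over all documents.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency counts each doc once
    best = {}
    for tokens in docs:
        if not tokens:
            continue
        for word, count in Counter(tokens).items():
            score = (count / len(tokens)) * math.log(n_docs / df[word])
            best[word] = max(best.get(word, 0.0), score)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(best, key=lambda w: (-best[w], w))[:k]

docs = [["a", "b", "b"], ["a", "c"], ["a", "d", "d", "d"]]
top = tfidf_keywords(docs, k=2)
```

The word `a` appears in every document, so its inverse document frequency is zero and it is never selected, which is the behavior that makes TF-IDF a measure of discrimination rather than raw frequency.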
Optionally, in an embodiment, the vectorization module 302 further includes:
a vector conversion unit, configured to determine, according to the keywords, the keywords contained in the first text information published by each sample user; count the number of times each keyword occurs in the first text information published by each sample user; and convert the occurrence counts of the keywords into a vector to obtain the keyword vector corresponding to each sample user.
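The count-to-vector conversion can be sketched as below. The sketch assumes pre-segmented, whitespace-separated text and a fixed keyword order shared by all users; the function name `keyword_vector` and the sample data are hypothetical.

```python
from collections import Counter

def keyword_vector(user_texts, keywords):
    """Count how often each keyword occurs in one user's texts.

    The vector has one entry per keyword, in the order given by `keywords`.
    """
    counts = Counter()
    for text in user_texts:
        counts.update(text.split())
    return [counts[k] for k in keywords]

keywords = ["x", "y", "z"]
vector = keyword_vector(["x y x", "z q"], keywords)
```

Keeping the keyword order fixed ensures that position i of every user's vector refers to the same keyword, which the decision trees rely on.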
Optionally, in an embodiment, the construction module 304 is specifically configured to:
use the classification and regression tree (CART) algorithm, with the preset discriminant indicators as the features for decision tree feature selection, to perform decision tree classification on the training samples in each training set, obtaining multiple decision trees;
combine the decision trees in turn to obtain the random forest model, wherein the discriminant indicators include the keyword vector, the number of distinct keywords hit, the total number of keyword hits, the average text length, the sensitive posting time, and the number of sensitive posting days.
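Several of the discriminant indicators listed above can be derived from a user's posts as in the following sketch. The application does not define "sensitive posting days" precisely, so the sketch assumes it means the number of distinct dates on which at least one keyword was hit; the sensitive posting time is omitted because its definition is not given. The function name `indicator_features`, the dictionary keys, and the sample posts are hypothetical.

```python
def indicator_features(posts, keywords):
    """Compute discriminant indicators for one user.

    posts: list of (date, text) pairs with pre-segmented text.
    Assumption: a 'sensitive day' is any date with at least one keyword hit.
    """
    index = {k: i for i, k in enumerate(keywords)}
    vector = [0] * len(keywords)
    hit_days = set()
    total_len = 0
    for date, text in posts:
        total_len += len(text)
        for token in text.split():
            if token in index:
                vector[index[token]] += 1
                hit_days.add(date)
    return {
        "keyword_vector": vector,
        "distinct_hits": sum(1 for c in vector if c > 0),
        "total_hits": sum(vector),
        "avg_text_length": total_len / len(posts) if posts else 0.0,
        "sensitive_days": len(hit_days),
    }

posts = [("2021-01-01", "x y x"), ("2021-01-01", "q"), ("2021-01-02", "z")]
features = indicator_features(posts, ["x", "y", "z"])
```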
Optionally, in an embodiment, the construction module 304 includes:
a calculation unit, configured to select one discriminant indicator as the root node and compute, for each value of that discriminant indicator, the Gini index with respect to the training set;
a judging unit, configured to judge whether each Gini index is greater than a preset first threshold and the number of samples in the sample set is greater than a preset second threshold;
a dividing unit, configured to, if each Gini index is greater than the preset first threshold and the number of samples in the sample set is greater than the preset second threshold, divide the training set into multiple leaf nodes, select the discriminant indicator value with the smallest Gini index as the root node, and repeat the operations of the calculation unit and the judging unit;
a generating unit, configured to, if each Gini index is less than the preset first threshold or the number of samples in the sample set is less than the preset second threshold, generate the decision tree corresponding to the training set.
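The Gini computation and the stopping rule used by these units can be illustrated with the sketch below. It uses the standard CART definition, Gini = 1 − Σ p², and a binary split on a threshold, which is an assumption about how the indicator values partition the training set. The names `gini`, `split_gini`, `best_threshold`, `should_split`, `gini_min`, and `size_min` are hypothetical, and the last two stand in for the preset first and second thresholds.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a label list: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(rows, labels, feature, threshold):
    """Weighted Gini of the two subsets produced by feature <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(rows, labels, feature):
    """Pick the indicator value whose split has the smallest Gini index."""
    candidates = sorted({x[feature] for x in rows})
    return min(candidates, key=lambda t: split_gini(rows, labels, feature, t))

def should_split(labels, gini_min, size_min):
    """Keep splitting only while the node is impure and large enough."""
    return gini(labels) > gini_min and len(labels) > size_min

rows = [[0], [1], [3], [4]]   # one indicator per sample
labels = [0, 0, 1, 1]         # 1 = has the behavior tendency
threshold = best_threshold(rows, labels, feature=0)
```

In this toy data, splitting at the value 1 separates the two classes completely, so the weighted Gini index of that split is zero and no further splitting is needed below it.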
Optionally, in an embodiment, the voting module 306 is specifically configured to:
count the number of times each keyword occurs in the second text information and convert the counts into a vector to obtain a target vector;
input the target vector and the second record parameters into the random forest model for classification, to obtain classification results;
have all decision trees in the random forest model vote on the classification results, to obtain the voting result.
Optionally, in an embodiment, the determining module 307 is specifically configured to:
obtain the voting results of all decision trees in the random forest model, wherein each voting result is having the behavior tendency and/or not having the behavior tendency;
compute, according to the voting results, the voting ratio corresponding to each behavior tendency;
take the behavior tendency with the highest voting ratio as the behavior tendency of the user to be detected.
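The ratio-based decision made by the determining module can be sketched as follows, assuming each decision tree has already produced one label for the user to be detected. The function name `vote` and the label strings are hypothetical.

```python
from collections import Counter

def vote(tree_predictions):
    """Return the label with the highest voting ratio and all ratios.

    tree_predictions: one predicted label per decision tree.
    On a tie, the label encountered first wins (an arbitrary choice here).
    """
    counts = Counter(tree_predictions)
    total = len(tree_predictions)
    ratios = {label: c / total for label, c in counts.items()}
    winner = max(ratios, key=ratios.get)
    return winner, ratios

winner, ratios = vote(["has", "has", "not", "has"])
```

With three of four trees voting for `has`, its ratio is 0.75 and it is taken as the behavior tendency of the user to be detected.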
In this embodiment of the present application, speech data published by users with the same type of characteristic behavior tendency is first collected, and the keywords in that speech data are extracted as the feature representation of this type of user. The speech data is then used as training samples for machine learning to build a random forest model, and the speech data of the user to be detected is input into the model for identification, to judge whether the user to be detected and the sample users have the same behavioral characteristics. If they do, it can be determined that the user to be detected has the same behavior tendency as the sample users. The present application can thus extract the speech features of users with the same type of characteristic behavior tendency, train a random forest model by machine learning, and then identify users whose behavior tendency is unknown to determine whether they have the same type of behavior tendency.
FIG. 3 above describes the user behavior tendency identification apparatus of the embodiments of the present application in detail from the perspective of modular functional entities. The user behavior tendency identification device of the embodiments is described in detail below from the perspective of hardware processing.
FIG. 4 is a schematic structural diagram of a user behavior tendency identification device provided by an embodiment of the present application. The user behavior tendency identification device 400 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 410, a memory 420, and one or more storage media 430 (for example, one or more mass storage devices) that store application programs 433 or data 432. The memory 420 and the storage medium 430 may provide transient or persistent storage. The program stored in the storage medium 430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the user behavior tendency identification device 400. Further, the processor 410 may be configured to communicate with the storage medium 430 and execute, on the user behavior tendency identification device 400, the series of instruction operations in the storage medium 430.
The user behavior tendency identification device 400 may also include one or more power supplies 440, one or more wired or wireless network interfaces 450, one or more input/output interfaces 460, and/or one or more operating systems 431, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will understand that the device structure shown in FIG. 4 does not limit the user behavior tendency identification device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present application also provides a user behavior tendency identification device, including a memory and at least one processor, wherein instructions are stored in the memory, and the memory and the at least one processor are interconnected by a line; the at least one processor invokes the instructions in the memory so that the user behavior tendency identification device executes the steps of the user behavior tendency identification method described above.
The present application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the following steps:
obtaining multiple pieces of first text information published by multiple sample users with a determined behavior tendency, and the first record parameters corresponding to each piece of first text information;
extracting multiple keywords from the first text information, counting the number of times each keyword occurs in the first text information, and performing vectorization to obtain multiple keyword vectors;
taking the keyword vectors and the first record parameters as training samples, and repeatedly drawing multiple samples at random from the training samples to obtain multiple training sets;
constructing, with reference to preset discriminant indicators, a decision tree for each training set, and generating a corresponding random forest model from the decision trees;
obtaining multiple pieces of second text information published by a user to be detected and the second record parameters corresponding to each piece of second text information;
inputting the second text information and the second record parameters into the random forest model for voting, to obtain a voting result;
determining, according to the voting result, whether the user to be detected has the behavior tendency.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A user behavior tendency identification method, comprising:
    obtaining multiple pieces of first text information published by multiple sample users with a determined behavior tendency, and the first record parameters corresponding to each piece of first text information;
    extracting multiple keywords from the first text information, counting the number of times each keyword occurs in the first text information, and performing vectorization to obtain multiple keyword vectors;
    taking the keyword vectors and the first record parameters as training samples, and repeatedly drawing multiple samples at random from the training samples to obtain multiple training sets;
    constructing, with reference to preset discriminant indicators, a decision tree for each training set, and generating a corresponding random forest model from the decision trees;
    obtaining multiple pieces of second text information published by a user to be detected and the second record parameters corresponding to each piece of second text information;
    inputting the second text information and the second record parameters into the random forest model for voting, to obtain a voting result;
    determining, according to the voting result, whether the user to be detected has the behavior tendency.
  2. The user behavior tendency identification method according to claim 1, wherein the extracting multiple keywords from the first text information comprises:
    performing word segmentation on the first text information to obtain multiple word units;
    computing the degree of discrimination of each word unit using the TF-IDF algorithm;
    sorting the word units by degree of discrimination, and extracting the word units with the highest degree of discrimination from the sorted result as keywords.
  3. The user behavior tendency identification method according to claim 1 or 2, wherein the counting the number of times each keyword occurs in the first text information and performing vectorization to obtain multiple keyword vectors comprises:
    determining, according to the keywords, the keywords contained in the first text information published by each sample user;
    counting the number of times each keyword occurs in the first text information published by each sample user;
    converting the occurrence counts of the keywords into a vector to obtain the keyword vector corresponding to each sample user.
  4. The user behavior tendency identification method according to claim 1, wherein the constructing, with reference to preset discriminant indicators, a decision tree for each training set, and generating a corresponding random forest model from the decision trees comprises:
    using the classification and regression tree (CART) algorithm, with the preset discriminant indicators as the features for decision tree feature selection, to perform decision tree classification on the training samples in each training set, obtaining multiple decision trees;
    combining the decision trees in turn to obtain the random forest model, wherein the discriminant indicators include the keyword vector, the number of distinct keywords hit, the total number of keyword hits, the average text length, the sensitive posting time, and the number of sensitive posting days.
  5. The user behavior tendency identification method according to claim 1 or 4, wherein the constructing, with reference to preset discriminant indicators, a decision tree for each training set comprises:
    S1. selecting one discriminant indicator as the root node, and computing, for each value of that discriminant indicator, the Gini index with respect to the training set;
    S2. judging whether each Gini index is greater than a preset first threshold and the number of samples in the sample set is greater than a preset second threshold;
    S3. if so, dividing the training set into multiple leaf nodes, selecting the discriminant indicator value with the smallest Gini index as the root node, and repeating S1-S2;
    S4. if not, generating the decision tree corresponding to the training set.
  6. The user behavior tendency identification method according to claim 1, wherein the inputting the second text information and the second record parameters into the random forest model for voting, to obtain a voting result comprises:
    counting the number of times each keyword occurs in the second text information and converting the counts into a vector to obtain a target vector;
    inputting the target vector and the second record parameters into the random forest model for classification, to obtain classification results;
    having all decision trees in the random forest model vote on the classification results, to obtain the voting result.
  7. The user behavior tendency identification method according to claim 1 or 6, wherein the determining, according to the voting result, whether the user to be detected has the behavior tendency comprises:
    obtaining the voting results of all decision trees in the random forest model, wherein each voting result is having the behavior tendency and/or not having the behavior tendency;
    computing, according to the voting results, the voting ratio corresponding to each behavior tendency;
    taking the behavior tendency with the highest voting ratio as the behavior tendency of the user to be detected.
  8. A user behavior tendency identification device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    obtaining multiple pieces of first text information published by multiple sample users with a determined behavior tendency, and the first record parameters corresponding to each piece of first text information;
    extracting multiple keywords from the first text information, counting the number of times each keyword occurs in the first text information, and performing vectorization to obtain multiple keyword vectors;
    taking the keyword vectors and the first record parameters as training samples, and repeatedly drawing multiple samples at random from the training samples to obtain multiple training sets;
    constructing, with reference to preset discriminant indicators, a decision tree for each training set, and generating a corresponding random forest model from the decision trees;
    obtaining multiple pieces of second text information published by a user to be detected and the second record parameters corresponding to each piece of second text information;
    inputting the second text information and the second record parameters into the random forest model for voting, to obtain a voting result;
    determining, according to the voting result, whether the user to be detected has the behavior tendency.
  9. The user behavior tendency identification device according to claim 8, wherein the processor further implements the following steps when executing the computer program:
    performing word segmentation on the first text information to obtain multiple word units;
    computing the degree of discrimination of each word unit using the TF-IDF algorithm;
    sorting the word units by degree of discrimination, and extracting the word units with the highest degree of discrimination from the sorted result as keywords.
  10. The user behavior tendency identification device according to claim 8 or 9, wherein the processor further implements the following steps when executing the computer program:
    determining, according to the keywords, the keywords contained in the first text information published by each sample user;
    counting the number of times each keyword occurs in the first text information published by each sample user;
    converting the occurrence counts of the keywords into a vector to obtain the keyword vector corresponding to each sample user.
  11. The user behavior tendency identification device according to claim 8, wherein the processor further implements the following steps when executing the computer program:
    using the classification and regression tree (CART) algorithm, with the preset discriminant indicators as the features for decision tree feature selection, to perform decision tree classification on the training samples in each training set, obtaining multiple decision trees;
    combining the decision trees in turn to obtain the random forest model, wherein the discriminant indicators include the keyword vector, the number of distinct keywords hit, the total number of keyword hits, the average text length, the sensitive posting time, and the number of sensitive posting days.
  12. The user behavior tendency identification device according to claim 8 or 11, wherein the processor further implements the following steps when executing the computer program:
    S1. selecting one discriminant indicator as the root node, and computing, for each value of that discriminant indicator, the Gini index with respect to the training set;
    S2. judging whether each Gini index is greater than a preset first threshold and the number of samples in the sample set is greater than a preset second threshold;
    S3. if so, dividing the training set into multiple leaf nodes, selecting the discriminant indicator value with the smallest Gini index as the root node, and repeating S1-S2;
    S4. if not, generating the decision tree corresponding to the training set.
  13. The user behavior tendency identification device according to claim 8, wherein the processor further implements the following steps when executing the computer program:
    counting the number of times each keyword occurs in the second text information and converting the counts into a vector to obtain a target vector;
    inputting the target vector and the second record parameters into the random forest model for classification, to obtain classification results;
    having all decision trees in the random forest model vote on the classification results, to obtain the voting result.
  14. The user behavior tendency identification device according to claim 8 or 13, wherein the processor further implements the following steps when executing the computer program:
    obtaining the voting results of all decision trees in the random forest model, wherein each voting result is having the behavior tendency and/or not having the behavior tendency;
    computing, according to the voting results, the voting ratio corresponding to each behavior tendency;
    taking the behavior tendency with the highest voting ratio as the behavior tendency of the user to be detected.
  15. 一种计算机可读存储介质,所述计算机可读存储介质中存储计算机指令,当所述计算机指令在计算机上运行时,使得计算机执行如下步骤:A computer-readable storage medium, storing computer instructions in the computer-readable storage medium, when the computer instructions are executed on a computer, the computer is made to perform the following steps:
    获取具有确定行为倾向的多个样本用户发布的多条第一文本信息及所述各第一文本信息对应的第一记录参数;Acquiring a plurality of pieces of first text information published by a plurality of sample users with certain behavioral tendencies and the first record parameters corresponding to the first text information;
    提取所述各第一文本信息中的多个关键词,统计所述各第一文本信息中所述各关键词出现的次数并进行向量化处理,得到多个关键词向量;Extracting a plurality of keywords in each of the first text information, counting the number of occurrences of each keyword in each of the first text information, and performing vectorization processing to obtain a plurality of keyword vectors;
    以所述各关键词向量及所述各第一记录参数为训练样本,从所述各训练样本中多次随机抽取多个样本,得到多个训练集;Taking the keyword vectors and the first recording parameters as training samples, randomly extracting multiple samples from the training samples multiple times to obtain multiple training sets;
    参照预置判别指标,构建所述各训练集对应的决策树,并根据所述各决策树生成对应的随机森林模型;With reference to the preset discriminant indicators, a decision tree corresponding to each training set is constructed, and a corresponding random forest model is generated according to each decision tree;
    获取待检测用户发布的多条第二文本信息及所述各第二文本对应的第二记录参数;Acquiring multiple pieces of second text information published by the user to be detected and second record parameters corresponding to the second texts;
    将所述各第二文本信息和所述各第二记录参数输入所述随机森林模型进行投票,得到投票结果;Inputting the second text information and the second recording parameters into the random forest model for voting to obtain a voting result;
    根据所述投票结果,确定所述待检测用户是否具有所述行为倾向。According to the voting result, it is determined whether the user to be detected has the behavioral tendency.
  16. The computer-readable storage medium of claim 15, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    performing word segmentation on the first text information to obtain a plurality of word units;
    calculating the degree of discrimination of each word unit using the TF-IDF algorithm;
    sorting the word units by degree of discrimination, and extracting the word units with the highest degree of discrimination from the sorted result as keywords.
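As an illustration of the keyword-extraction steps above, the following Python sketch scores word units with a plain TF-IDF formula and keeps the top-ranked ones. It assumes the texts are already segmented into token lists, and it takes a word's degree of discrimination to be its highest TF-IDF score across texts; both are assumptions of this sketch, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return {word: highest TF-IDF score of that word over all docs}.

    docs is a list of token lists (already segmented text).
    TF = count / doc length, IDF = ln(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    scores = {}
    for tokens in docs:
        if not tokens:
            continue
        for word, count in Counter(tokens).items():
            score = (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            scores[word] = max(scores.get(word, 0.0), score)
    return scores


def top_keywords(docs, k):
    """Sort word units by score (ties broken alphabetically), keep top k."""
    scores = tfidf_scores(docs)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


docs = [["a", "b", "b"], ["a", "c"], ["a", "d"]]
keywords = top_keywords(docs, 2)
```

A word that appears in every text, such as `"a"` here, gets an IDF of zero and therefore never ranks as a keyword.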
  17. The computer-readable storage medium of claim 15 or 16, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    determining, based on the keywords, which keywords are contained in the first text information published by each sample user;
    counting the number of occurrences of each keyword in the first text information published by each sample user;
    converting the occurrence counts of the keywords into a vector to obtain the keyword vector corresponding to each sample user.
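The counting and vectorization steps above amount to building one count vector per user over a fixed keyword list. A minimal Python sketch, assuming each user's texts are already segmented into tokens and that the vector simply holds raw counts in keyword order (the function name and sample words are hypothetical):

```python
from collections import Counter


def keyword_vector(user_texts, keywords):
    """Count how often each keyword occurs across one user's texts.

    user_texts is a list of token lists; the result has one entry per
    keyword, in the order given by keywords.
    """
    counts = Counter(token for tokens in user_texts for token in tokens)
    return [counts[keyword] for keyword in keywords]


vector = keyword_vector(
    [["loan", "late"], ["loan", "today"]],
    ["loan", "late", "fraud"],
)
```

Keeping the keyword order fixed across users is what makes the resulting vectors comparable as training samples.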
  18. The computer-readable storage medium of claim 15, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    using a classification and regression tree (CART) algorithm, with the preset discriminant indicators as the features selected for the decision tree, performing decision tree classification on the training samples in each training set to obtain a plurality of decision trees;
    combining the decision trees in sequence to obtain the random forest model, wherein the discriminant indicators include the keyword vector, the number of distinct keywords hit, the total number of keyword hits, the average text length, the sensitive speech time, and the number of sensitive speech days.
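To make some of the listed discriminant indicators concrete, the sketch below derives four of them from a user's posts. The definitions are assumptions of this sketch rather than the patent's: text length is measured in tokens, and a "sensitive speech day" is taken to be a calendar date on which at least one post hits a keyword. Sensitive speech time is left out because this claim does not define it. The function name and the data layout are hypothetical.

```python
from datetime import datetime


def discriminant_features(posts, keywords):
    """Compute indicator values from posts given as (tokens, timestamp)."""
    keyword_set = set(keywords)
    hit_keywords = set()
    total_hits = 0
    sensitive_dates = set()

    for tokens, timestamp in posts:
        hits = [t for t in tokens if t in keyword_set]
        if hits:
            hit_keywords.update(hits)
            total_hits += len(hits)
            sensitive_dates.add(timestamp.date())

    avg_length = sum(len(tokens) for tokens, _ in posts) / len(posts) if posts else 0.0
    return {
        "distinct_hits": len(hit_keywords),   # number of different keywords hit
        "total_hits": total_hits,             # total number of keyword hits
        "avg_length": avg_length,             # average text length in tokens
        "sensitive_days": len(sensitive_dates),
    }


posts = [
    (["a", "b", "a"], datetime(2020, 1, 1, 23, 0)),
    (["c"], datetime(2020, 1, 2, 9, 0)),
    (["b"], datetime(2020, 1, 1, 8, 0)),
]
features = discriminant_features(posts, ["a", "b"])
```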
  19. The computer-readable storage medium of claim 15 or 18, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    S1. selecting a discriminant indicator as the root node, and calculating the Gini index, with respect to the training set, of each discriminant indicator value corresponding to the root node;
    S2. determining whether each Gini index is greater than a preset first threshold and whether the number of samples in the sample set is greater than a preset second threshold;
    S3. if so, dividing the training set into a plurality of leaf nodes, selecting the discriminant indicator value with the smallest Gini index as the root node, and repeating S1-S2;
    S4. if not, generating the decision tree corresponding to the training set.
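The Gini computation and the stopping test in the steps above can be sketched in Python as follows. This is a simplified illustration that assumes numeric features, binary splits of the form `value <= threshold`, and a stopping test applied to the impurity of the current node; the claim does not spell out these details, and all function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a label list: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gini(rows, labels, feature, threshold):
    """Weighted Gini index of splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_split(rows, labels):
    """Return (gini_index, feature, threshold) with the smallest Gini index."""
    return min(
        (split_gini(rows, labels, f, x[f]), f, x[f])
        for f in range(len(rows[0]))
        for x in rows
    )


def should_split(labels, gini_threshold, min_samples):
    """Keep splitting only if impurity and sample count exceed the thresholds."""
    return gini(labels) > gini_threshold and len(labels) > min_samples


rows = [[1], [2], [3], [4]]
labels = [0, 0, 1, 1]
best = best_split(rows, labels)
```

A full tree builder would call `should_split` at each node and recurse on the two halves produced by `best_split` until the test fails.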
  20. A user behavior tendency identification apparatus, comprising:
    a first acquisition module, configured to acquire a plurality of pieces of first text information published by a plurality of sample users having a determined behavioral tendency, and the first record parameters corresponding to each piece of first text information;
    a vectorization module, configured to extract a plurality of keywords from the first text information, count the number of occurrences of each keyword in each piece of first text information, and vectorize the counts to obtain a plurality of keyword vectors;
    a sampling module, configured to take the keyword vectors and the first record parameters as training samples, and randomly draw a plurality of samples from the training samples multiple times to obtain a plurality of training sets;
    a construction module, configured to construct, with reference to preset discriminant indicators, a decision tree corresponding to each training set, and generate a corresponding random forest model from the decision trees;
    a second acquisition module, configured to acquire a plurality of pieces of second text information published by a user to be detected, and the second record parameters corresponding to each piece of second text information;
    a voting module, configured to input the second text information and the second record parameters into the random forest model for voting to obtain a voting result;
    a determination module, configured to determine, according to the voting result, whether the user to be detected has the behavioral tendency.
PCT/CN2021/083480 2020-12-11 2021-03-29 User behavior tendency identification method, apparatus, and device, and storage medium WO2022121163A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011436696.4A CN112527958A (en) 2020-12-11 2020-12-11 User behavior tendency identification method, device, equipment and storage medium
CN202011436696.4 2020-12-11

Publications (1)

Publication Number Publication Date
WO2022121163A1 (en) 2022-06-16

Family

ID=74999586

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083480 WO2022121163A1 (en) 2020-12-11 2021-03-29 User behavior tendency identification method, apparatus, and device, and storage medium

Country Status (2)

Country Link
CN (1) CN112527958A (en)
WO (1) WO2022121163A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115545085A (en) * 2022-11-04 2022-12-30 南方电网数字电网研究院有限公司 Weak fault current fault type identification method, device, equipment and medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112527958A (en) * 2020-12-11 2021-03-19 平安科技(深圳)有限公司 User behavior tendency identification method, device, equipment and storage medium
CN114663143B (en) * 2022-03-21 2024-06-28 平安健康保险股份有限公司 Intervention user screening method and device based on differential intervention response model
CN115620853A (en) * 2022-09-07 2023-01-17 国家康复辅具研究中心 Model training method for TMS strategy automatic selection, automatic selection method and system
CN116468096B (en) * 2023-03-30 2024-01-02 之江实验室 Model training method, device, equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325106A (en) * 2018-07-31 2019-02-12 厦门快商通信息技术有限公司 Intent recognition method and device for a medical aesthetics chatbot
US20190163817A1 (en) * 2017-11-29 2019-05-30 Oracle International Corporation Approaches for large-scale classification and semantic text summarization
CN109934260A (en) * 2019-01-31 2019-06-25 中国科学院信息工程研究所 Image, text and data fusion sensibility classification method and device based on random forest
CN111368076A (en) * 2020-02-27 2020-07-03 中国地质大学(武汉) Bernoulli naive Bayesian text classification method based on random forest
CN112527958A (en) * 2020-12-11 2021-03-19 平安科技(深圳)有限公司 User behavior tendency identification method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016107B (en) * 2017-04-12 2020-05-12 四川九鼎瑞信软件开发有限公司 Public opinion analysis method and system
CN109241418B (en) * 2018-08-22 2024-04-09 中国平安人寿保险股份有限公司 Abnormal user identification method and device based on random forest, equipment and medium
CN111666502A (en) * 2020-07-08 2020-09-15 腾讯科技(深圳)有限公司 Abnormal user identification method and device based on deep learning and storage medium

Also Published As

Publication number Publication date
CN112527958A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
WO2022121163A1 (en) User behavior tendency identification method, apparatus, and device, and storage medium
WO2018086470A1 (en) Keyword extraction method and device, and server
CN105893478B (en) A kind of tag extraction method and apparatus
CN111898366B (en) Document subject word aggregation method and device, computer equipment and readable storage medium
CN104991891B (en) A kind of short text feature extracting method
TWI554896B (en) Information Classification Method and Information Classification System Based on Product Identification
CN108090216B (en) Label prediction method, device and storage medium
WO2022156328A1 (en) Restful-type web service clustering method fusing service cooperation relationships
Angeli et al. Stanford’s 2014 slot filling systems
CN110472043B (en) Clustering method and device for comment text
CN107844533A (en) A kind of intelligent Answer System and analysis method
CN110472203B (en) Article duplicate checking and detecting method, device, equipment and storage medium
CN112329460A (en) Text topic clustering method, device, equipment and storage medium
Bhutada et al. Semantic latent dirichlet allocation for automatic topic extraction
WO2023173537A1 (en) Text sentiment analysis method and apparatus, device and storage medium
CN103970888B (en) Document classifying method based on network measure index
CN110619212B (en) Character string-based malicious software identification method, system and related device
JP3765801B2 (en) Parallel translation expression extraction apparatus, parallel translation extraction method, and parallel translation extraction program
CN109753646B (en) Article attribute identification method and electronic equipment
Angeli et al. Stanford’s distantly supervised slot filling systems for KBP 2014
Elbarougy et al. Graph-Based Extractive Arabic Text Summarization Using Multiple Morphological Analyzers.
CN110413985B (en) Related text segment searching method and device
KR102126911B1 (en) Key player detection method in social media using KeyplayerRank
CN111341404B (en) Electronic medical record data set analysis method and system based on ernie model
JP4125951B2 (en) Text automatic classification method and apparatus, program, and recording medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21901879

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21901879

Country of ref document: EP

Kind code of ref document: A1