CN108304490B - Text-based similarity determination method and device and computer equipment - Google Patents

Text-based similarity determination method and device and computer equipment

Info

Publication number
CN108304490B
Authority
CN
China
Prior art keywords
text
user
users
candidate
conditional probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810015523.1A
Other languages
Chinese (zh)
Other versions
CN108304490A (en)
Inventor
周涛
李百川
李展铿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Youmi Technology Co ltd
Original Assignee
Youmi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Youmi Technology Co ltd filed Critical Youmi Technology Co ltd
Priority to CN201810015523.1A priority Critical patent/CN108304490B/en
Publication of CN108304490A publication Critical patent/CN108304490A/en
Application granted granted Critical
Publication of CN108304490B publication Critical patent/CN108304490B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a text-based similarity determination method and device, and belongs to the technical field of the internet. The method comprises the following steps: acquiring historical network browsing records of a candidate user, and obtaining a text set corresponding to the candidate user according to the historical network browsing records; acquiring the pre-calculated conditional probability of each text in the text set falling into the text set corresponding to a reference user; obtaining a first text feature vector corresponding to the candidate user according to the text set corresponding to the candidate user and the conditional probability of each text in the text set; and inputting the first text feature vector of the candidate user into a pre-trained random forest model, and obtaining the similarity value between the candidate user and the reference user according to the output of the random forest model. The technical scheme addresses the problem that similarity between users cannot be accurately calculated: the similarity between a candidate user and a reference user is computed accurately from text-related information, so that users similar to the reference user can be found.

Description

Text-based similarity determination method and device and computer equipment
Technical Field
The invention relates to the technical field of internet, in particular to a text-based similarity determination method, a text-based similarity determination device, a computer-readable storage medium and computer equipment.
Background
Currently, finding users similar to existing ones and pushing messages or advertisements to them has become an effective marketing approach. The premise of this marketing approach is accurately determining and calculating the similarity between users. Traditional text-based similarity determination methods include k-means clustering and the like. In the process of implementing the invention, the inventors found that the prior art has at least the following problems: traditional similarity determination methods are either unsuitable for determining similarity based on words, or their results are highly random, so that clustering the same batch of users at different times yields different results. Therefore, a method that calculates the similarity between users from text-related information is needed.
Disclosure of Invention
In view of this, embodiments of the present invention provide a text-based similarity determination method and device that can accurately calculate the similarity between users from text-related information, so that users similar to a reference user can be determined.
The content of the embodiment of the invention is as follows:
a text-based similarity determination method includes: acquiring historical network browsing records of candidate users, and acquiring a text set corresponding to the candidate users according to the historical network browsing records; acquiring pre-calculated conditional probability of each text in the text set falling into a text set corresponding to a reference user; obtaining a first text feature vector corresponding to the candidate user according to the text set corresponding to the candidate user and the conditional probability of each text in the text set; and inputting the first text feature vector of the candidate user into a pre-trained random forest model, and obtaining the similarity value between the candidate user and a reference user according to the output of the random forest model.
In one embodiment, before the step of inputting the first text feature vector of the candidate user into a pre-trained random forest model, the method further includes: constructing a sample user set, wherein the sample user set comprises reference users and non-reference users; acquiring historical network browsing records of all sample users in a sample user set to obtain a text set corresponding to all sample users; calculating the conditional probability of each text in the text set of each sample user; obtaining a second text feature vector corresponding to each sample user in the sample user set according to the text set corresponding to each sample user and the conditional probability of each text in the text set; selecting a plurality of text feature vectors from the second text feature vectors as training text feature vectors of corresponding sample users; and training a random forest model according to the training text feature vector.
In one embodiment, the step of selecting a plurality of text feature vectors from the second text feature vectors as training text feature vectors of corresponding sample users includes: selecting, according to the magnitude of the conditional probability values in the second text feature vectors, several top-ranked and several bottom-ranked text feature vectors as training text feature vectors of the corresponding sample users.
In one embodiment, the step of obtaining the text set corresponding to the candidate user according to the historical web browsing record includes: and obtaining words corresponding to the candidate users according to the historical network browsing records, and removing stop words in the words to obtain a text set corresponding to the candidate users.
In one embodiment, before the step of obtaining the pre-calculated conditional probability that each text in the text set falls into the text set corresponding to the reference user, the method further includes: and acquiring the word frequency characteristics of each word in the text set, and respectively calculating the conditional probability of each word falling into the text set corresponding to the reference user according to the word frequency characteristics.
In one embodiment, the conditional probability that each word falls into the text set corresponding to the reference user is calculated by the following formula:
θ_{yi} = (N_{yi} + α) / (N_y + αn)

λ_i = θ_{1i} / (θ_{0i} + θ_{1i})

wherein y is a text set label, 0 represents the text set corresponding to the candidate users and 1 represents the text set corresponding to the reference users; i identifies the ith word, with n words in total; θ_{yi} is the (smoothed) frequency with which the ith word appears in text set y; N_{yi} is the number of times the ith word appears in text set y, and N_y is the number of occurrences of all words in text set y; α is a preset smoothing factor; λ_i is the probability that the ith word falls into the text set corresponding to the reference user.
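As a non-limiting sketch (not part of the original disclosure; the Counter-based inputs and the helper name conditional_probabilities are assumptions), the smoothed frequencies θ and the conditional probabilities λ could be computed as follows:

```python
from collections import Counter

def conditional_probabilities(reference_counts: Counter,
                              candidate_counts: Counter,
                              alpha: float = 1.0) -> dict:
    """Return lambda_i for every word i: the probability that the word falls
    into the text set corresponding to the reference users (y = 1).
    reference_counts / candidate_counts: word -> number of occurrences (N_yi)."""
    vocabulary = set(reference_counts) | set(candidate_counts)
    n = len(vocabulary)                      # total number of distinct words
    n_1 = sum(reference_counts.values())     # N_y for the reference set (y = 1)
    n_0 = sum(candidate_counts.values())     # N_y for the candidate set (y = 0)

    lambdas = {}
    for word in vocabulary:
        theta_1 = (reference_counts[word] + alpha) / (n_1 + alpha * n)
        theta_0 = (candidate_counts[word] + alpha) / (n_0 + alpha * n)
        lambdas[word] = theta_1 / (theta_0 + theta_1)
    return lambdas
```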
In one embodiment, after the step of inputting the first text feature vector of the candidate user into a pre-trained random forest model and obtaining the similarity value between the candidate user and a reference user according to the output of the random forest model, the method further includes: and if the similarity value corresponding to the candidate user is higher than a preset threshold value, the candidate user is a similar user of the reference user.
In one embodiment, the voting function in the random forest is:
H(x) = (1/T) · Σ_{t=1}^{T} h_t(x)

wherein H(x) is the voting function; x is the input text feature vector; h_t is the tth decision tree; and the random forest contains T trees in total.
Correspondingly, an embodiment of the present invention provides a text-based similarity determination apparatus, including: the conditional probability calculation module is used for acquiring historical network browsing records of candidate users and obtaining a text set corresponding to the candidate users according to the historical network browsing records; acquiring pre-calculated conditional probability of each text in the text set falling into a text set corresponding to a reference user; the first feature vector acquisition module is used for acquiring a first text feature vector corresponding to the candidate user according to the text set corresponding to the candidate user and the conditional probability of each text in the text set; and the similarity value determining module is used for inputting the first text feature vector of the candidate user into a pre-trained random forest model and obtaining the similarity value between the candidate user and a reference user according to the output of the random forest model.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the method described above.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method described above when executing the program.
According to the technical scheme, a text set corresponding to a candidate user is obtained according to the historical network browsing records of the candidate user; the pre-calculated conditional probability of each text in the text set falling into the text set corresponding to a reference user is acquired; and a first text feature vector corresponding to the candidate user is determined and input into a pre-trained random forest model to obtain the similarity value between the candidate user and the reference user. In this way, the similarity between the candidate user and the reference user can be accurately calculated and it can be determined whether the candidate user is a similar user of the reference user, so that corresponding operations can be performed on similar users in a targeted manner rather than on all users, which effectively reduces operation costs.
Drawings
FIG. 1 is a schematic flow diagram of a text-based similarity determination method in one embodiment;
FIG. 2 is an example of an application of the text-based similarity determination method in an embodiment;
fig. 3 is a schematic structural diagram of a text-based similarity determination apparatus in an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the invention is described by taking network marketing as an example, but the text-based similarity determination method, the text-based similarity determination device, the computer equipment and the storage medium in the embodiment of the invention are not limited to solving the problem in network marketing, and can also be used for solving the problem in other application scenarios of similarity determination.
When a network marketing campaign is carried out, the user group for delivering advertisements is often determined from the historical data of network users. Among these network users, some often perform specific network operations; for example, in the network marketing of a television advertisement, some users frequently watch network videos related to the advertised product, and these users are the seed users (reference users) for that product. Users whose network operations resemble those of such reference users are likely to have similar intentions, so advertisements delivered to them are more targeted and yield a higher return. For example: given a small batch of active game seed users, the cost of delivering game advertisements to these seed users is low, and delivering in large volume to users similar to the seed users can bring high returns. Therefore, it is necessary to calculate the similarity between users and find the similar users of the reference users.
The conventional method for determining similar users of a reference user includes:
the method comprises the following steps: step 1, expressing an installation package list of a user as 1/0 characteristics by a bag-of-words (bow) method, and training a logistic regression model through the list; and 2, adding other three characteristics (the installed basic application proportion, the payment application number and the average payment price) to the output of the logistic regression model as input to train a GBDT (Gradient Boosting Decision Tree) classification model, wherein the classified GBDT is similar users if the classification is 1. The application effect of the bow feature in the text class is not ideal, and the effect of the adopted conditional probability feature in the experiment is obviously improved compared with that of the bow feature; in addition, the two-layer model adopted in the method is based on the additional payment information characteristic and is not applicable to the task of text information.
The second method comprises the following steps: step 1, mapping the video tags of a video medium into x-dimensional tag vectors, then summing all tag vectors of each video and taking the average to obtain an x-dimensional vector for each video; step 2, clustering the videos to obtain clusters of similar videos; step 3, converting the similar-video clustering result into a similar-user clustering result; step 4, extracting the clusters containing the seed users and ranking by similarity, thereby obtaining a ranking of users. This method first clusters videos and then clusters users; because k-means clustering is sensitive to initial values, the choice of initial values is highly random and the clustering result differs from run to run, so the same batch of users may be clustered differently each time.
The embodiment of the invention provides a text-based similarity determining method and a corresponding text-based similarity determining device. The following are detailed below.
Fig. 1 is a schematic flowchart of a text-based similarity determination method according to an embodiment. The text-based similarity determination method provided by the embodiment mainly includes steps S110 to S130, which are described in detail as follows:
s110, obtaining historical network browsing records of candidate users, and obtaining a text set corresponding to the candidate users according to the historical network browsing records; and acquiring the pre-calculated conditional probability of each text in the text set falling into the text set corresponding to the reference user.
Optionally, the sample users are all users who have performed the corresponding network browsing operations, and they include reference users and non-reference users. A reference user is a seed user, and the sample users other than the seed users are non-seed users; the non-reference users may be the non-seed users or a subset selected from them. The candidate users may be some of the non-reference users, all of the non-reference users, or users extracted from the whole set of sample users. If the candidate users are all of the non-reference users, the embodiment of the invention can determine the similarity between every user other than the reference users and the reference users, and further determine the similar users of the reference users among them. Whether a candidate user is a similar user of the reference users can be determined by calculating the similarity between the candidate user and the reference users.
The historical network browsing record is generated after the user performs network operation. The network operation may be watching a certain network video, searching a certain web page, playing a network game, and the like.
Optionally, the text set consists of the texts used by the sample user, extracted from the sample user's historical network browsing records, such as: the search terms the user entered to watch a certain video, or the text corresponding to an operation the user performed in a certain online game.
Optionally, the method further includes obtaining a pre-calculated conditional probability that each text in the text set falls into a text set corresponding to a non-reference user.
In this step, the probability that each text is a text corresponding to the reference user is calculated; the probability that each text is a text corresponding to the candidate user may also be calculated.
And S120, obtaining a first text feature vector corresponding to the candidate user according to the text set corresponding to the candidate user and the conditional probability of each text in the text set.
The first text feature vector refers to a list for characterizing user feature information, and the list is composed of a text set and conditional probabilities corresponding to each text. Other parameters, such as the number of transactions of the user, may also be included in the first text feature vector.
S130, inputting the first text feature vector of the candidate user into a pre-trained random forest model, and obtaining a similarity value between the candidate user and a reference user according to the output of the random forest model.
A random forest model (random forest) is a classifier that uses multiple decision trees to train on and predict samples. The random forest model in the embodiment of the invention adopts a bagging method: it trains multiple random decision trees simultaneously by randomly sampling both samples and features, and the decision trees vote on whether the input first text feature vector belongs to the text set corresponding to the reference users, thereby yielding the similarity between the candidate user and the reference users; whether the candidate user is a similar user of the reference users can then be determined from this similarity.
The embodiment calculates the conditional probability of each text corresponding to the candidate user, inputs each text and the corresponding conditional probability into the trained random forest model as feature information, and finally obtains the similarity of the candidate user relative to the reference user. The similarity between the candidate user and the reference user can be accurately calculated according to the characteristics of the users, and whether the candidate user is the similar user of the reference user or not can be further determined.
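As a non-limiting sketch of steps S110 to S130 (not part of the original disclosure; the helper names build_feature_vector and similarity, the constants K_TOP and K_BOTTOM, and the scikit-learn style predict_proba call are all assumptions), the inference phase for one candidate user could look like this, reusing the conditional_probabilities sketch above:

```python
# Inference sketch: map the candidate's words to pre-computed conditional
# probabilities, keep the largest and smallest values as the first text
# feature vector, and query the trained random forest for a similarity value.
K_TOP, K_BOTTOM = 20, 20   # illustrative values (see the training embodiment below)

def build_feature_vector(text_set, lambdas):
    """lambdas: dict mapping word -> conditional probability lambda_i."""
    probs = sorted(lambdas.get(word, 0.0) for word in text_set)
    vec = probs[-K_TOP:] + probs[:K_BOTTOM]
    # pad so the vector length is always K_TOP + K_BOTTOM
    return vec + [0.0] * (K_TOP + K_BOTTOM - len(vec))

def similarity(model, text_set, lambdas):
    """Similarity value: the forest's estimated probability that the candidate's
    feature vector belongs to the reference-user class (label 1)."""
    features = build_feature_vector(text_set, lambdas)
    return model.predict_proba([features])[0][1]
```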
In an embodiment, before the step of inputting the first text feature vector of the candidate user into a pre-trained random forest model, the method further includes: constructing a sample user set, wherein the sample user set comprises reference users and non-reference users; acquiring historical network browsing records of all sample users in a sample user set to obtain a text set corresponding to all sample users; calculating the conditional probability of each text in the text set of each sample user; obtaining a second text feature vector corresponding to each sample user in the sample user set according to the text set corresponding to each sample user and the conditional probability of each text in the text set; selecting a plurality of text feature vectors from the second text feature vectors as training text feature vectors of corresponding sample users; and training a random forest model according to the training text feature vector.
Wherein, the sample user can refer to all network users; it may also refer to users who meet a certain condition, such as: if it is desired to determine similar users of the reference users of a certain network game from among users of the network game, the sample user may be a user who has played the network game.
In this embodiment, the conditional probability corresponding to each text is calculated from the text information of the sample user set to obtain the second text feature vectors, a representative training text feature vector is selected for each sample user from its second text feature vector, and the random forest model is trained with these training text feature vectors. The random forest model can fully integrate the feature information in all the training text feature vectors, producing a model that can reasonably judge candidate users.
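A minimal sketch of this training phase, assuming scikit-learn's RandomForestClassifier as the bagging-based forest and reusing build_feature_vector from the earlier sketch (the input data structures are illustrative, not part of the original disclosure):

```python
from sklearn.ensemble import RandomForestClassifier

def train_model(seed_text_sets, non_seed_text_sets, lambdas, n_trees=100):
    """seed_text_sets / non_seed_text_sets: one word collection per sample user;
    lambdas: pre-computed conditional probabilities per word."""
    X, y = [], []
    for text_set in seed_text_sets:          # reference (seed) users -> label 1
        X.append(build_feature_vector(text_set, lambdas))
        y.append(1)
    for text_set in non_seed_text_sets:      # non-reference users -> label 0
        X.append(build_feature_vector(text_set, lambdas))
        y.append(0)
    # bootstrap sampling of users and random feature sub-sampling per split
    # give the "random samples, random features" bagging behaviour described above
    model = RandomForestClassifier(n_estimators=n_trees, bootstrap=True,
                                   max_features="sqrt", random_state=0)
    model.fit(X, y)
    return model
```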
In an embodiment, the step of selecting a plurality of text feature vectors from the second text feature vectors as training text feature vectors of corresponding sample users includes: selecting, according to the magnitude of the conditional probability values in the second text feature vectors, several top-ranked and several bottom-ranked text feature vectors as training text feature vectors of the corresponding sample users.
Optionally, the step of determining the training text feature vector in this embodiment includes determining the training text feature vector corresponding to each sample user.
Optionally, in this embodiment, the training text feature vector is selected from the second text feature vector of a given sample user as follows: the conditional probabilities in the second text feature vector are sorted by magnitude, the k1 largest conditional probabilities λ1 and the k2 smallest conditional probabilities λ2 are determined, the conditional probabilities λ1 and λ2 are taken as the conditional probability features of that sample user, and the text feature vectors corresponding to these conditional probability features are taken as the training text feature vector of the sample user. Here k1 and k2 can be any integers greater than 0; they may be equal or different, and specifically k1 and k2 may both be 20. Optionally, the number of training text feature vectors may differ between sample users.
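For illustration only (not part of the original disclosure; the helper and parameter names are assumptions), the selection of the k1 top-ranked and k2 bottom-ranked conditional probabilities for one sample user could look like this, keeping each selected probability paired with its word:

```python
# Illustrative selection of training features for one sample user.
# lambdas_for_user maps each word of the user's text set to its conditional
# probability lambda_i; k1 = k2 = 20 matches the concrete values given above.
def select_training_features(lambdas_for_user: dict, k1: int = 20, k2: int = 20):
    ranked = sorted(lambdas_for_user.items(), key=lambda kv: kv[1], reverse=True)
    # k1 largest conditional probabilities, then k2 smallest
    return ranked[:k1] + ranked[-k2:]
```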
In this way, the training text feature vectors of each sample user are determined according to the conditional probabilities; they represent the feature information of the sample user (for example, the higher a conditional probability value, the more likely the corresponding text is one used by the reference users). Training the random forest model with these training text feature vectors yields a trained random forest that can accurately determine the similarity value between users.
In an embodiment, the step of obtaining the text set corresponding to the candidate user according to the historical web browsing record includes: and obtaining words corresponding to the candidate users according to the historical network browsing records, and removing stop words in the words to obtain a text set corresponding to the candidate users.
In this step, word segmentation is performed on the text information corresponding to the historical network browsing records to obtain the words corresponding to the candidate user, the stop words, which carry no user information, are removed, and the remaining words are combined into the text set corresponding to the candidate user. A text set without stop words characterizes the user more accurately, and at the same time storage space is saved and processing efficiency improved.
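As a minimal sketch of this step (assuming Chinese-language browsing records and the jieba segmenter; the stop-word list and record format are illustrative assumptions, not part of the original disclosure):

```python
import jieba

STOP_WORDS = {"的", "了", "是", "在", "和"}   # hypothetical stop-word list

def text_set_from_records(browsing_records):
    """browsing_records: iterable of text fields (e.g. video titles) taken
    from a user's historical network browsing records."""
    words = []
    for record in browsing_records:
        words.extend(w.strip() for w in jieba.cut(record))
    # drop empty tokens and stop words to obtain the user's text set
    return [w for w in words if w and w not in STOP_WORDS]
```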
In an embodiment, before the step of obtaining the pre-calculated conditional probability that each text in the text set falls into the text set corresponding to the reference user, the method further includes: and acquiring the word frequency characteristics of each word in the text set, and respectively calculating the conditional probability of each word falling into the text set corresponding to the reference user according to the word frequency characteristics.
Optionally, the word frequency characteristics of each word include: the number of times that each word appears in the text set corresponding to the sample user set, the number of times that each word appears in the text set corresponding to the reference user, and the frequency that each word appears in the text set corresponding to the reference user (for example, the ratio of the number of times that a certain word appears in the text set corresponding to the reference user to the number of times that all words appear in the text set corresponding to the reference user).
Optionally, before the step of obtaining the pre-calculated conditional probability that each text in the text set falls into the text set corresponding to the reference user, the method further includes: and acquiring the word frequency characteristics of each word in the text set, and respectively calculating the conditional probability of each word falling into the text set corresponding to the non-reference user according to the word frequency characteristics.
Optionally, the conditional probabilities are calculated as part of the training phase, where the training phase refers to the process of training the random forest model.
The conditional probability of each word is calculated from its word frequency features, so the conditional probabilities, and hence the feature information of each user, can be obtained with simple computation. In this embodiment the conditional probabilities are calculated in advance, so that when the similarity between users is computed, the pre-calculated conditional probabilities can be looked up directly, which effectively improves the efficiency of the similarity calculation.
In one embodiment, the conditional probability that each word falls into the text set corresponding to the reference user is calculated by the following formula:
θ_{yi} = (N_{yi} + α) / (N_y + αn)

λ_i = θ_{1i} / (θ_{0i} + θ_{1i})

wherein y is a text set label, 0 represents the text set corresponding to the candidate users and 1 represents the text set corresponding to the reference users; i identifies the ith word, with n words in total; θ_{yi} is the (smoothed) frequency with which the ith word appears in text set y; N_{yi} is the number of times the ith word appears in text set y, and N_y is the number of occurrences of all words in text set y; α is a preset smoothing factor; λ_i is the probability that the ith word falls into the text set corresponding to the reference user.
When the smoothing factor α is 1, this is the commonly used Laplace smoothing. Of course, α can take any other value greater than 0.
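As a worked illustration with hypothetical counts (not taken from the original disclosure): suppose the vocabulary contains n = 50 words, the ith word appears N_{1i} = 5 times among the N_1 = 100 word occurrences of the reference users' text set and N_{0i} = 1 time among the N_0 = 200 word occurrences of the candidate users' text set, and α = 1. Then θ_{1i} = (5 + 1) / (100 + 50) = 0.04, θ_{0i} = (1 + 1) / (200 + 50) = 0.008, and λ_i = 0.04 / (0.008 + 0.04) ≈ 0.83, i.e. the word is far more characteristic of the reference users than of the candidate users.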
In this embodiment, the conditional probability that each word falls into the text set corresponding to the reference user is calculated through a formula, so that the first feature text vector of the candidate user can be determined, and the training text feature vector of each sample user for training the random forest model can also be determined.
In an embodiment, after the step of inputting the first text feature vector of the candidate user into a pre-trained random forest model and obtaining the similarity value between the candidate user and the reference user according to the output of the random forest model, the method further includes: and if the similarity value corresponding to the candidate user is higher than a preset threshold value, the candidate user is a similar user of the reference user.
The preset threshold generally lies between 0.5 and 1.0, although values outside this range may also be used. A threshold close to 1 selects only candidate users who are extremely similar to, and may effectively coincide with, the reference users. Alternatively, the number of similar users obtained can be adjusted by tuning the preset threshold (for example, starting from 0.5).
According to the implementation, the similar users are determined by comparing the similarity with the preset threshold, and corresponding operations can be performed on the similar users in a targeted manner after the similar users are determined, so that the cost waste caused by the operation performed on the non-similar users is reduced.
In one embodiment, the voting function in the random forest is:
H(x) = (1/T) · Σ_{t=1}^{T} h_t(x)

wherein H(x) is the voting function, i.e. the aggregated vote on whether the input text feature vector belongs to the reference users; x is the input text feature vector; h_t(x) is the vote cast by the tth decision tree after the text feature vector x is input; and the random forest contains T trees in total. The votes of all trees together give the voting result for the text set.
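To make the voting step concrete (a sketch only, assuming a scikit-learn RandomForestClassifier whose individual trees are exposed via estimators_; it mirrors H(x) by counting 0/1 votes rather than averaging tree probabilities):

```python
import numpy as np

def vote(model, x):
    """x: one text feature vector. Returns H(x), the fraction of trees
    voting that x belongs to the reference users' text set."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    votes = [int(tree.predict(x)[0]) for tree in model.estimators_]
    return sum(votes) / len(votes)   # H(x) = (1/T) * sum_t h_t(x)
```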
In this embodiment, the random forest votes for the input text feature vector through a voting function to obtain a voting result of whether the text feature vector belongs to a text set corresponding to a reference user, and according to the voting result, the random forest may obtain a similarity between a corresponding candidate user and the reference user, so as to determine whether the candidate user is a similar user to the reference user.
In order to better understand the above method, an application example of the text-based similarity determination method of the present invention is described in detail below. As shown in fig. 2, fig. 2 is an application example of the text-based similarity determination method.
A DSP (Demand-Side Platform) is a platform on the demand side that serves advertisers; its goal is to obtain as many converted users as possible at as little cost as possible. It can simply be understood as a platform connecting advertisers with traffic providers, able to serve advertisements for advertisers across various traffic sources (e.g., iQIYI, Jinri Toutiao, etc.).
Suppose there is a small number of active game seed users and the cost of delivering game advertisements to this group is low. The DSP wants to deliver in large volume to users of this kind, so similar users are selected from the non-seed users. The process of determining the similar users is as follows:
1) acquiring the lists of titles of videos watched by the seed users and the non-seed users, performing word segmentation on the video titles to obtain the words corresponding to each user, and removing the stop words among them;
2) calculating the word frequency features of each word for each user; aggregating the word frequencies of the seed users' words to obtain the text set corresponding to the reference users, aggregating the word frequencies of the non-seed users' words to obtain the text set corresponding to the non-reference user set, and sampling from the latter to obtain the text set corresponding to the non-reference users;
3) for each user, calculating the conditional probability features of all of the user's words, i.e. mapping each word to its conditional probability; sorting the conditional probabilities, selecting the 20 largest and the 20 smallest, and taking these 40 conditional probabilities together with the corresponding words as the user's feature vector; determining the feature vectors of all users in this way;
4) using the characteristic vectors of the users as a training set for training a random forest (random forest) to obtain a trained random forest model;
5) classifying the non-seed users with the trained model to obtain their similarity to the seed users, and treating the non-seed users whose similarity value is greater than 0.5 as similar users.
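Putting the pieces together, a non-limiting end-to-end sketch of this DSP look-alike expansion (all input structures are placeholders, and the pipeline reuses the earlier illustrative helpers text_set_from_records, conditional_probabilities, build_feature_vector, train_model and vote):

```python
from collections import Counter

SIMILARITY_THRESHOLD = 0.5   # similarity values above this mark similar users

def expand_seed_users(seed_titles_by_user, non_seed_titles_by_user):
    # 1) segment the watched-video titles and remove stop words
    seed_sets = {u: text_set_from_records(t) for u, t in seed_titles_by_user.items()}
    cand_sets = {u: text_set_from_records(t) for u, t in non_seed_titles_by_user.items()}

    # 2) word-frequency features -> conditional probabilities
    #    (the original samples from the non-seed set; all non-seed words are used here for brevity)
    seed_counts = Counter(w for s in seed_sets.values() for w in s)
    cand_counts = Counter(w for s in cand_sets.values() for w in s)
    lambdas = conditional_probabilities(seed_counts, cand_counts)

    # 3)-4) build feature vectors and train the random forest
    model = train_model(list(seed_sets.values()), list(cand_sets.values()), lambdas)

    # 5) classify the non-seed users; similarity > 0.5 -> similar user
    return [u for u, s in cand_sets.items()
            if vote(model, build_feature_vector(s, lambdas)) > SIMILARITY_THRESHOLD]
```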
After a batch of users similar to the active game users was expanded in this way, offline tests were carried out on games actually delivered by several DSPs. The control group is the traffic covered by the actual delivery, test group 1 is the traffic covered by the expanded similar users, and test group 2 is the traffic covered by the seed users. The CPA (average cost per conversion) of test group 1 was on average about 40% lower than that of the control group, and the CPA of test group 2 was on average about 70% lower than that of the control group; the number of converted users in test group 1 was 5-10 times that of test group 2. These results show that the text-based similarity determination method can accurately calculate the similarity between seed and non-seed users and accurately find similar users using only the titles of the videos users watched.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention.
Based on the same idea as the text-based similarity determination method in the above embodiments, the present invention also provides a text-based similarity determination apparatus, which can be used to execute the above text-based similarity determination method. For convenience of description, the schematic structural diagram of the apparatus embodiment only shows the parts related to the embodiment of the present invention; those skilled in the art will understand that the illustrated structure does not limit the apparatus, which may include more or fewer components than illustrated, combine some components, or arrange the components differently.
As shown in fig. 3, the text-based similarity determination apparatus includes a conditional probability calculation module 310, a first feature vector acquisition module 320, and a similarity value determination module 330, which are described in detail as follows:
the conditional probability calculation module 310 is configured to obtain a historical web browsing record of a candidate user, and obtain a text set corresponding to the candidate user according to the historical web browsing record; and acquiring the pre-calculated conditional probability of each text in the text set falling into the text set corresponding to the reference user.
The first feature vector obtaining module 320 is configured to obtain a first text feature vector corresponding to the candidate user according to the text set corresponding to the candidate user and the conditional probability of each text in the text set.
And the similarity value determining module 330 is configured to input the first text feature vector of the candidate user into a pre-trained random forest model, and obtain a similarity value between the candidate user and a reference user according to an output of the random forest model.
In one embodiment, the text-based similarity determination apparatus further includes: a sample user set construction module, configured to construct a sample user set, where the sample user set includes reference users and non-reference users; a second feature vector acquisition module, configured to acquire the historical network browsing records of all sample users in the sample user set to obtain the text set corresponding to each sample user, calculate the conditional probability of each text in each sample user's text set, and obtain the second text feature vector corresponding to each sample user according to the sample user's text set and the conditional probability of each text in it; and a random forest training module, configured to select, from the second text feature vectors and according to the magnitude of the conditional probability values, several top-ranked and several bottom-ranked text feature vectors as training text feature vectors of the corresponding sample users, and to train a random forest model according to the training text feature vectors.
In an embodiment, the random forest training module is further configured to sort the second text feature vectors by conditional probability value and take several of the top-ranked and several of the bottom-ranked text feature vectors, respectively, as the training text feature vectors of the corresponding sample users.
In an embodiment, the conditional probability calculating module 310 is further configured to obtain a word corresponding to the candidate user according to the historical web browsing record, remove a stop word in the word, and obtain a text set corresponding to the candidate user.
In an embodiment, the conditional probability calculating module 310 is further configured to obtain a word frequency feature of each word in the text set, and calculate a conditional probability that each word falls into the text set corresponding to the reference user according to the word frequency feature.
In an embodiment, the conditional probability calculating module 310 is further configured to calculate the conditional probability that each word falls into the text set corresponding to the reference user by the following formula:
θ_{yi} = (N_{yi} + α) / (N_y + αn)

λ_i = θ_{1i} / (θ_{0i} + θ_{1i})

wherein y is a text set label, 0 represents the text set corresponding to the candidate users and 1 represents the text set corresponding to the reference users; i identifies the ith word, with n words in total; θ_{yi} is the (smoothed) frequency with which the ith word appears in text set y; N_{yi} is the number of times the ith word appears in text set y, and N_y is the number of occurrences of all words in text set y; α is a preset smoothing factor; λ_i is the probability that the ith word falls into the text set corresponding to the reference user.
In an embodiment, the text-based similarity determining apparatus further includes a similar user determining module, configured to determine that the candidate user is a similar user of the reference user if the similarity value corresponding to the candidate user is higher than a preset threshold.
In one embodiment, the voting function in the random forest is:
H(x) = (1/T) · Σ_{t=1}^{T} h_t(x)

wherein H(x) is the voting function; x is the input text feature vector; h_t is the tth decision tree; and the random forest contains T trees in total.
It should be noted that the text-based similarity determination apparatus of the present invention corresponds one-to-one to the text-based similarity determination method of the present invention, and the technical features and their advantages described in the embodiments of the method all apply to the embodiments of the apparatus; for specific details, reference may be made to the description of the method embodiments, which is not repeated here.
In addition, in the embodiments of the text-based similarity determination apparatus, the logical division into program modules is only an example; in practical applications, the functions may be allocated to different program modules as needed, for example to meet the configuration requirements of particular hardware or to simplify the implementation of software. That is, the internal structure of the text-based similarity determination apparatus may be divided into different program modules to implement all or part of the functions described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which is stored in a computer readable storage medium and sold or used as a stand-alone product. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It should be noted that the terms "first/second/third" in the embodiments of the present invention are merely used to distinguish similar objects and do not imply a specific ordering of those objects; where permissible, "first/second/third" may be interchanged in a specific order or sequence, so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described.
The terms "comprises" and "comprising," and any variations thereof, of embodiments of the present invention are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or (module) elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-described examples merely represent several embodiments of the present invention and should not be construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A text-based user similarity determination method is characterized by comprising the following steps:
acquiring historical network browsing records of candidate users, and acquiring a text set corresponding to the candidate users according to the historical network browsing records;
acquiring pre-calculated conditional probability of each text in the text set falling into a text set corresponding to a reference user;
obtaining a first text feature vector corresponding to the candidate user according to the text set corresponding to the candidate user and the conditional probability of each text in the text set; the first text feature vector comprises a text set corresponding to the candidate user and the conditional probability of each text;
inputting the first text feature vector of the candidate user into a pre-trained random forest model, and obtaining a similarity value between the candidate user and a reference user according to the output of the random forest model; the pre-trained random forest model is obtained by training a random forest model according to training text feature vectors corresponding to sample users; the training text feature vector is a second text feature vector obtained by calculating the conditional probability corresponding to each text according to the text information of the sample user set; and respectively selecting a plurality of text characteristic vectors which are ranked in the front and a plurality of text characteristic vectors which are ranked in the back according to the size of the conditional probability value in the second text characteristic vector as training text characteristic vectors corresponding to the sample user.
2. The method of claim 1, wherein the step of inputting the first text feature vector of the candidate user into a pre-trained random forest model is preceded by the step of:
constructing a sample user set, wherein the sample user set comprises reference users and non-reference users;
acquiring historical network browsing records of all sample users in a sample user set to obtain a text set corresponding to all sample users; calculating the conditional probability of each text in the text set of each sample user; obtaining a second text feature vector corresponding to each sample user in the sample user set according to the text set corresponding to each sample user and the conditional probability of each text in the text set;
selecting a plurality of text feature vectors from the second text feature vectors as training text feature vectors of corresponding sample users; and training a random forest model according to the training text feature vector.
3. The method according to any one of claims 1 to 2, wherein the step of obtaining the text set corresponding to the candidate user according to the historical web browsing record comprises:
and obtaining words corresponding to the candidate users according to the historical network browsing records, and removing stop words in the words to obtain a text set corresponding to the candidate users.
4. The method according to claim 3, wherein before the step of obtaining the pre-computed conditional probability that each text in the text set falls into the text set corresponding to the reference user, the method further comprises:
and acquiring the word frequency characteristics of each word in the text set, and respectively calculating the conditional probability of each word falling into the text set corresponding to the reference user according to the word frequency characteristics.
5. The text-based user similarity determination method according to claim 4, wherein the conditional probability that each word falls into the text set corresponding to the reference user is calculated by the following formula:
θ_{yi} = (N_{yi} + α) / (N_y + αn)

λ_i = θ_{1i} / (θ_{0i} + θ_{1i})

wherein y is a text set label, 0 represents the text set corresponding to the candidate users and 1 represents the text set corresponding to the reference users; i identifies the ith word, with n words in total; θ_{yi} is the frequency with which the ith word appears in text set y; N_{yi} is the number of times the ith word appears in text set y, and N_y is the number of occurrences of all words in text set y; α is a preset smoothing factor; λ_i is the probability that the ith word falls into the text set corresponding to the reference user.
6. The method as claimed in claim 1, wherein the step of inputting the first text feature vector of the candidate user into a pre-trained random forest model and obtaining the similarity value between the candidate user and the reference user according to the output of the random forest model further comprises:
and if the similarity value corresponding to the candidate user is higher than a preset threshold value, the candidate user is a similar user of the reference user.
7. The text-based user similarity determination method according to claim 1, wherein the voting function in the random forest is:
H(x) = (1/T) · Σ_{t=1}^{T} h_t(x)

wherein H(x) is the voting function; x is the input text feature vector; h_t is the tth decision tree; and the random forest contains T trees in total.
8. A text-based user similarity determination apparatus, comprising:
the conditional probability calculation module is used for acquiring historical network browsing records of candidate users and obtaining a text set corresponding to the candidate users according to the historical network browsing records; acquiring pre-calculated conditional probability of each text in the text set falling into a text set corresponding to a reference user;
the first feature vector acquisition module is used for acquiring a first text feature vector corresponding to the candidate user according to the text set corresponding to the candidate user and the conditional probability of each text in the text set; the first text feature vector comprises a text set corresponding to the candidate user and the conditional probability of each text;
the similarity value determining module is used for inputting the first text feature vector of the candidate user into a pre-trained random forest model and obtaining the similarity value between the candidate user and a reference user according to the output of the random forest model; the pre-trained random forest model is obtained by training a random forest model according to training text feature vectors corresponding to sample users; the training text feature vector is a second text feature vector obtained by calculating the conditional probability corresponding to each text according to the text information of the sample user set; and respectively selecting a plurality of text characteristic vectors which are ranked in the front and a plurality of text characteristic vectors which are ranked in the back according to the size of the conditional probability value in the second text characteristic vector as training text characteristic vectors corresponding to the sample user.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a text-based user similarity determination method according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the text based user similarity determination method according to any of claims 1 to 7 when executing the program.
CN201810015523.1A 2018-01-08 2018-01-08 Text-based similarity determination method and device and computer equipment Active CN108304490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810015523.1A CN108304490B (en) 2018-01-08 2018-01-08 Text-based similarity determination method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810015523.1A CN108304490B (en) 2018-01-08 2018-01-08 Text-based similarity determination method and device and computer equipment

Publications (2)

Publication Number Publication Date
CN108304490A CN108304490A (en) 2018-07-20
CN108304490B true CN108304490B (en) 2020-12-15

Family

ID=62868406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810015523.1A Active CN108304490B (en) 2018-01-08 2018-01-08 Text-based similarity determination method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN108304490B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027994B (en) * 2018-10-09 2023-08-01 百度在线网络技术(北京)有限公司 Similar object determining method, device, equipment and medium
CN109492687A (en) * 2018-10-31 2019-03-19 北京字节跳动网络技术有限公司 Method and apparatus for handling information
CN110988317B (en) * 2019-11-27 2021-04-20 兰州大学第一医院 Detection method and system for biological samples in refrigerating chamber
CN113139034A (en) * 2020-01-17 2021-07-20 深圳市优必选科技股份有限公司 Statement matching method, statement matching device and intelligent equipment
CN111310840B (en) * 2020-02-24 2023-10-17 北京百度网讯科技有限公司 Data fusion processing method, device, equipment and storage medium
CN112287236A (en) * 2020-11-19 2021-01-29 每日互动股份有限公司 Text message pushing method and device, computer equipment and storage medium
CN112651439B (en) * 2020-12-25 2023-12-22 平安科技(深圳)有限公司 Material classification method, device, computer equipment and storage medium
CN112987940B (en) * 2021-04-27 2021-08-27 广州智品网络科技有限公司 Input method and device based on sample probability quantization and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101122909A (en) * 2006-08-10 2008-02-13 株式会社日立制作所 Text message indexing unit and text message indexing method
JP2010021761A (en) * 2008-07-10 2010-01-28 Nippon Hoso Kyokai <Nhk> Video image automatic recording control device
CN103116637A (en) * 2013-02-08 2013-05-22 无锡南理工科技发展有限公司 Text sentiment classification method facing Chinese Web comments
CN104142998A (en) * 2014-08-01 2014-11-12 中国传媒大学 Text classification method
CN107341233A (en) * 2017-07-03 2017-11-10 北京拉勾科技有限公司 A kind of position recommends method and computing device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6504013B2 (en) * 2015-10-13 2019-04-24 富士通株式会社 Cryptographic processing method, cryptographic processing device, and cryptographic processing program

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101122909A (en) * 2006-08-10 2008-02-13 株式会社日立制作所 Text message indexing unit and text message indexing method
JP2010021761A (en) * 2008-07-10 2010-01-28 Nippon Hoso Kyokai <Nhk> Video image automatic recording control device
CN103116637A (en) * 2013-02-08 2013-05-22 无锡南理工科技发展有限公司 Text sentiment classification method facing Chinese Web comments
CN104142998A (en) * 2014-08-01 2014-11-12 中国传媒大学 Text classification method
CN107341233A (en) * 2017-07-03 2017-11-10 北京拉勾科技有限公司 A kind of position recommends method and computing device

Also Published As

Publication number Publication date
CN108304490A (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN108304490B (en) Text-based similarity determination method and device and computer equipment
Chu et al. Geo-aware networks for fine-grained recognition
CN108280155B (en) Short video-based problem retrieval feedback method, device and equipment
CN110019943B (en) Video recommendation method and device, electronic equipment and storage medium
CN109918560A (en) A kind of answering method and device based on search engine
CN109815314A (en) A kind of intension recognizing method, identification equipment and computer readable storage medium
CN108319723A (en) A kind of picture sharing method and device, terminal, storage medium
CN105023165A (en) Method, device and system for controlling release tasks in social networking platform
CN111125429B (en) Video pushing method, device and computer readable storage medium
CN112199582B (en) Content recommendation method, device, equipment and medium
CN111309940A (en) Information display method, system, device, electronic equipment and storage medium
CN111597446B (en) Content pushing method and device based on artificial intelligence, server and storage medium
CN110597987A (en) Search recommendation method and device
CN109548691A (en) A kind of pet recognition methods, device, medium and electronic equipment
CN108229999B (en) Method and device for evaluating competitive products
CN111368138A (en) Method and device for sorting video category labels, electronic equipment and storage medium
EP3340073A1 (en) Systems and methods for processing of user content interaction
CN111523035A (en) Recommendation method, device, server and medium for APP browsing content
CN113849681A (en) Vehicle-mounted music recommendation method, device, equipment and storage medium
CN114491149A (en) Information processing method and apparatus, electronic device, storage medium, and program product
CN108804492B (en) Method and device for recommending multimedia objects
CN111354013A (en) Target detection method and device, equipment and storage medium
CN112507214B (en) User name-based data processing method, device, equipment and medium
CN113704623A (en) Data recommendation method, device, equipment and storage medium
CN113946740A (en) Processing method and device for user cabin search association, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant