CN108304490A - Text-based similarity determination method, apparatus, and computer device - Google Patents
Text-based similarity determination method, apparatus, and computer device
- Publication number
- CN108304490A, CN201810015523.1A
- Authority
- CN
- China
- Prior art keywords
- text
- user
- collection
- users
- candidate user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a text-based similarity determination method and apparatus, and belongs to the field of Internet technology. The method includes: obtaining the historical network browsing records of a candidate user, and obtaining the text set corresponding to the candidate user from those records; obtaining the precomputed conditional probability that each text in the text set falls into the text set corresponding to a reference user; obtaining a first text feature vector corresponding to the candidate user from the candidate user's text set and the conditional probability of each text in it; and inputting the candidate user's first text feature vector into a pre-trained random forest model, and obtaining the similarity value between the candidate user and the reference user from the output of the random forest model. This technical solution addresses the problem that the similarity between users cannot be calculated accurately: the similarity between a candidate user and a reference user can be calculated accurately from text-related information, so that users similar to the reference user can be found.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a text-based similarity determination method and apparatus, a computer-readable storage medium, and a computer device.
Background
At present, finding similar users and pushing messages or delivering advertisements to them has become an effective marketing approach. Its premise is that the similarity between users can be determined accurately. Traditional text-based methods for determining similarity include k-means clustering, among others. In the course of realizing the present invention, the inventors found that the prior art has at least the following problems: traditional methods either are not suited to determining similarity based on words, or produce results with a large degree of randomness, so that clustering the same batch of users yields a different result each time. It is therefore necessary to find a method that can calculate the similarity between users from text-related information.
Summary of the invention
In view of this, the present invention provides a text-based similarity determination method and apparatus that can accurately calculate the similarity between users from text-related information, so that users similar to a reference user can be determined.
The content of the embodiments of the present invention is as follows:
A text-based similarity determination method, including: obtaining the historical network browsing records of a candidate user, and obtaining the text set corresponding to the candidate user from the historical network browsing records; obtaining the precomputed conditional probability that each text in the text set falls into the text set corresponding to a reference user; obtaining a first text feature vector corresponding to the candidate user from the candidate user's text set and the conditional probability of each text in it; and inputting the candidate user's first text feature vector into a pre-trained random forest model, and obtaining the similarity value between the candidate user and the reference user from the output of the random forest model.
In one embodiment, before the step of inputting the candidate user's first text feature vector into the pre-trained random forest model, the method further includes: constructing a sample user set that includes reference users and non-reference users; obtaining the historical network browsing records of each sample user in the set, and obtaining the text set corresponding to each sample user; calculating the conditional probability of each text in each sample user's text set; obtaining a second text feature vector corresponding to each sample user from that user's text set and the conditional probability of each text in it; selecting multiple text feature vectors from the second text feature vector as the training text feature vectors of the corresponding sample user; and training the random forest model on the training text feature vectors.
In one embodiment, the step of selecting multiple text feature vectors from the second text feature vector as the training text feature vectors of the corresponding sample user includes: selecting, from the second text feature vector, the multiple text feature vectors ranked first and the multiple text feature vectors ranked last when sorted by conditional probability value, as the training text feature vectors of the corresponding sample user.
In one embodiment, the step of obtaining the text set corresponding to the candidate user from the historical network browsing records includes: obtaining the words corresponding to the candidate user from the historical network browsing records, and removing the stop words from those words to obtain the text set corresponding to the candidate user.
In one embodiment, before the step of obtaining the precomputed conditional probability that each text in the text set falls into the text set corresponding to the reference user, the method further includes: obtaining the word-frequency feature of each word in the text set, and calculating from the word-frequency features the conditional probability that each word falls into the text set corresponding to the reference user.
In one embodiment, the conditional probability that each word falls into the text set corresponding to the reference user is calculated by the following formula:
[formula not reproduced in this text]
where y is the text set label, with 0 denoting the text set corresponding to the candidate user and 1 denoting the text set corresponding to the reference user; i is the word index, denoting the i-th word, with n words in total; θ_yi is the frequency with which the i-th word appears in text set y; N_yi is the number of times the i-th word appears in text set y; N_y is the total number of occurrences of all words in text set y; α is a preset smoothing coefficient; and λ_i is the probability that the i-th word falls into the text set corresponding to the reference user.
In one embodiment, after the step of inputting the candidate user's first text feature vector into the pre-trained random forest model and obtaining the similarity value between the candidate user and the reference user from the output of the random forest model, the method further includes: if the similarity value corresponding to the candidate user is higher than a preset threshold, determining that the candidate user is a similar user of the reference user.
In one embodiment, the voting function in the random forest is:
[formula not reproduced in this text]
where H(x) is the voting function; x is the input text feature vector; h is a decision tree; t denotes the t-th tree; and the random forest contains T trees in total.
Correspondingly, an embodiment of the present invention provides a text-based similarity determination apparatus, including: a conditional probability calculation module, configured to obtain the historical network browsing records of a candidate user, obtain the text set corresponding to the candidate user from the historical network browsing records, and obtain the precomputed conditional probability that each text in the text set falls into the text set corresponding to a reference user; a first feature vector acquisition module, configured to obtain a first text feature vector corresponding to the candidate user from the candidate user's text set and the conditional probability of each text in it; and a similarity value determination module, configured to input the candidate user's first text feature vector into a pre-trained random forest model and obtain the similarity value between the candidate user and the reference user from the output of the random forest model.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method described above.
A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method described above when executing the program.
The above technical solution obtains the text set corresponding to a candidate user from the candidate user's historical network browsing records; obtains the precomputed conditional probability that each text in the text set falls into the text set corresponding to a reference user; determines the first text feature vector corresponding to the candidate user; and inputs the first text feature vector into a pre-trained random forest model to obtain the similarity value between the candidate user and the reference user. In this way the similarity between the candidate user and the reference user can be calculated accurately, and it can be determined whether the candidate user is a similar user of the reference user. The corresponding operation can then be directed specifically at similar users rather than at all users, which effectively reduces operating cost.
Description of the drawings
Fig. 1 is a schematic flowchart of the text-based similarity determination method in one embodiment;
Fig. 2 is an application example of the text-based similarity determination method in one embodiment;
Fig. 3 is a schematic structural diagram of the text-based similarity determination apparatus in one embodiment.
Detailed description
To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the present invention and do not limit it.
The embodiments of the present invention are described using network marketing as an example, but the text-based similarity determination method, apparatus, computer device, and storage medium of the embodiments are not limited to problems in network marketing; they can also be used in other application scenarios that require similarity determination.
When carrying out network marketing, the user group to which advertisements are delivered is often determined from the historical data of network users. Some of these users frequently perform a particular network operation. For example, for network marketing that delivers TV advertisements, certain users frequently watch online videos of a product corresponding to the advertisement; these users are the seed users (reference users) of that product. Users whose network operations are similar to those of the reference users are comparatively likely to be interested, so delivering the corresponding advertisement to them is better targeted and yields a higher return. For example, given a small batch of active game seed users, delivering game advertisements to the seed users themselves costs relatively little, while targeted large-scale delivery to users similar to the seed users can yield higher revenue. It is therefore necessary to calculate the similarity between users and then find the users similar to the reference users. The embodiments of the present invention aim to solve the problem of calculating the similarity between users from the users' text information.
Traditional methods for determining the similar users of a reference user include the following.
Method one: Step 1, the user's installed-package list is expressed as 1/0 features by the bag-of-words (BoW) method, and a logistic regression model is trained on this list. Step 2, the output of the logistic regression model, together with three other features (the proportion of basic applications installed, the number of paid applications, and the average payment cost), is used as input to train a GBDT (Gradient Boosting Decision Tree, an iterative decision tree algorithm) classification model; users classified as 1 are similar users. However, BoW features perform poorly on text tasks; in experiments, conditional probability features performed significantly better than BoW features. In addition, the two-layer model used in this method relies on additional payment-information features and is not suited to tasks based on text information.
Method two: Step 1, the video tags of the video media are mapped to x-dimensional tag vectors, and all tag vectors of a video are then summed and averaged to obtain an x-dimensional video vector for each video. Step 2, the videos are clustered to obtain a similar-video clustering result. Step 3, the similar-video clustering result is converted into a similar-user clustering result. Step 4, the clustering results are extracted for the seed users and sorted by similarity to determine the ranking of users. This method clusters videos first and then clusters users. Because k-means clustering is sensitive to its initial values and the choice of initial values is highly random, each clustering run gives a different result, so the same batch of seed users yields different clusters each time.
The embodiments of the present invention provide a text-based similarity determination method and a corresponding text-based similarity determination apparatus, each described in detail below.
Fig. 1 is a schematic flowchart of the text-based similarity determination method of one embodiment. The method provided by this embodiment mainly includes steps S110 to S130, described in detail as follows:
S110: obtain the historical network browsing records of a candidate user, and obtain the text set corresponding to the candidate user from the historical network browsing records; obtain the precomputed conditional probability that each text in the text set falls into the text set corresponding to a reference user.
Optionally, the sample users are all users who have performed the corresponding network browsing operations; they include reference users and non-reference users. A reference user is a seed user, and the sample users other than the seed users are non-seed users. Non-reference users may be all non-seed users or a subset selected from them. Candidate users may be some of the non-reference users, all of the non-reference users, or users drawn from the whole set of sample users. If the candidate users are all of the non-reference users, the embodiment can determine the similarity between the reference user and every user other than the reference user, and then determine the reference user's similar users from among them. Whether a candidate user is a similar user of the reference user can be determined by calculating the similarity between the candidate user and the reference user.
Historical network browsing records are the records generated after a user performs network operations. A network operation may be watching an online video, searching for a web page, playing an online game, or a similar operation.
Optionally, the text set consists of the texts used by a sample user, extracted from that sample user's historical network browsing records, for example the search terms the user entered to find a video, or the text information corresponding to an operation the user performed in an online game.
Optionally, this step further includes obtaining the precomputed conditional probability that each text in the text set falls into the text set corresponding to the non-reference users.
This step calculates the probability that each text is a text corresponding to the reference user; it can also calculate the probability that each text is a text corresponding to the candidate user.
S120: obtain the first text feature vector corresponding to the candidate user from the candidate user's text set and the conditional probability of each text in it.
Here, the first text feature vector is a list that characterizes the user's feature information; the list consists of the text set and the conditional probability corresponding to each text. The first text feature vector may also include other parameters, such as the user's number of transactions.
S130: input the candidate user's first text feature vector into the pre-trained random forest model, and obtain the similarity value between the candidate user and the reference user from the output of the random forest model.
Here, a random forest model is a classifier that trains on and predicts samples using multiple trees. The random forest model of this embodiment uses the bagging method: by randomly sampling samples and randomly sampling features, it trains multiple random decision trees at the same time. These random decision trees jointly vote on whether the input first text feature vector belongs to the text set corresponding to the reference user, which in turn yields the similarity between the candidate user and the reference user. Whether the candidate user is a similar user of the reference user can be determined from this similarity.
This embodiment calculates the conditional probability of each text corresponding to the candidate user, inputs each text and its conditional probability as feature information into the trained random forest model, and finally obtains the candidate user's similarity to the reference user. The similarity between the candidate user and the reference user can thus be calculated accurately from the user's features, and it can then be determined whether the candidate user is a similar user of the reference user.
In one embodiment, before the step of inputting the candidate user's first text feature vector into the pre-trained random forest model, the method further includes: constructing a sample user set that includes reference users and non-reference users; obtaining the historical network browsing records of each sample user in the set, and obtaining the text set corresponding to each sample user; calculating the conditional probability of each text in each sample user's text set; obtaining a second text feature vector corresponding to each sample user from that user's text set and the conditional probability of each text in it; selecting multiple text feature vectors from the second text feature vector as the training text feature vectors of the corresponding sample user; and training the random forest model on the training text feature vectors.
Here, the sample users may be all network users, or users who meet a certain condition. For example, if the similar users of an online game's reference users are to be determined from among the users of that game, the sample users may be the users who have played the game.
This embodiment calculates the conditional probability of each text from the text information of the sample user set, obtains the second text feature vectors, selects from them the training text feature vectors that are representative of each sample user, and trains the random forest model on those training text feature vectors. The random forest model can fully integrate the feature information in all the training text feature vectors, giving a model that can make reasonable judgments about candidate users.
In one embodiment, the step of selecting multiple text feature vectors from the second text feature vector as the training text feature vectors of the corresponding sample user includes: selecting, from the second text feature vector, the multiple text feature vectors ranked first and the multiple text feature vectors ranked last when sorted by conditional probability value, as the training text feature vectors of the corresponding sample user.
Optionally, the step of determining training text feature vectors in this embodiment includes determining the training text feature vectors corresponding to each sample user.
Optionally, to select training text feature vectors from the second text feature vector of a given sample user, this embodiment sorts the conditional probabilities in the second text feature vector by size, determines the k1 conditional probabilities λ1 ranked first and the k2 conditional probabilities λ2 ranked last, takes λ1 and λ2 as the user's conditional probability features, and determines the text feature vectors corresponding to these conditional probability features as the training text feature vectors of that sample user. Here k1 and k2 may be any integers greater than 0, and they may be equal or different. In a specific example, k1 and k2 are both 20. Optionally, the number of training text feature vectors may differ between sample users.
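The selection of the k1 highest and k2 lowest conditional probabilities can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `select_training_features`, the dictionary input, and the zero padding that gives every user a vector of the same length are all assumptions made for the example.

```python
def select_training_features(cond_probs, k1=20, k2=20):
    """Return the k1 largest and k2 smallest conditional probabilities.

    cond_probs maps each text (word) of one sample user to its
    precomputed conditional probability (hypothetical input format).
    """
    ranked = sorted(cond_probs.values(), reverse=True)
    top = ranked[:k1]
    # Take the bottom values only from what is left, so a short list
    # does not contribute the same value twice.
    rest = ranked[k1:]
    bottom = rest[-k2:] if k2 > 0 else []
    # Assumption: pad with 0.0 so every user yields k1 + k2 values.
    top = top + [0.0] * (k1 - len(top))
    bottom = [0.0] * (k2 - len(bottom)) + bottom
    return top + bottom
```

Padding is one possible way to handle users with fewer than k1 + k2 texts; the source only notes that the number of training text feature vectors may differ between users.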
This embodiment determines the training text feature vectors of each sample user according to the conditional probabilities. These training text feature vectors represent the feature information of the sample user (for example, a higher conditional probability value means the corresponding text is more likely to be a text used by a reference user). Training the random forest model on these training text feature vectors yields a trained model that can accurately determine the similarity value between users.
In one embodiment, the step of obtaining the text set corresponding to the candidate user from the historical network browsing records includes: obtaining the words corresponding to the candidate user from the historical network browsing records, and removing the stop words from those words to obtain the text set corresponding to the candidate user.
This step segments the text information in the historical network browsing records into words to obtain the words corresponding to the candidate user, removes the stop words that carry no user information, and consolidates the remaining words into the text set corresponding to the candidate user. The text set with stop words removed represents the user's features more accurately, and also saves storage space and improves processing efficiency.
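As a rough illustration of word segmentation followed by stop-word removal, the sketch below uses a simplified regular-expression tokenizer on English text. The stop-word list, the function name `build_text_collection`, and the tokenization rule are assumptions for the example; Chinese text, as in the source's scenario, would need a dedicated word segmenter instead.

```python
import re

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

def build_text_collection(browsing_records, stop_words=STOP_WORDS):
    """Segment each record's text into words and drop stop words."""
    words = []
    for text in browsing_records:
        # Simplified segmentation: runs of lowercase letters or digits.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        words.extend(t for t in tokens if t not in stop_words)
    return words

titles = ["The Legend of Heroes", "Top 10 games in 2018"]
print(build_text_collection(titles))
```

The result keeps repeated words, since later steps rely on word counts.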
In one embodiment, before the step of obtaining the precomputed conditional probability that each text in the text set falls into the text set corresponding to the reference user, the method further includes: obtaining the word-frequency feature of each word in the text set, and calculating from the word-frequency features the conditional probability that each word falls into the text set corresponding to the reference user.
Optionally, the word-frequency feature of each word includes: the number of times the word appears in the text set corresponding to the sample user set, the number of times the word appears in the text set corresponding to the reference user, and the frequency with which the word appears in the text set corresponding to the reference user (for example, the number of times a word appears in the reference user's text set as a proportion of the total occurrences of all words in that text set).
Optionally, before the step of obtaining the precomputed conditional probability that each text in the text set falls into the text set corresponding to the reference user, the method further includes: obtaining the word-frequency feature of each word in the text set, and calculating from the word-frequency features the conditional probability that each word falls into the text set corresponding to the non-reference users.
Optionally, the calculation of conditional probabilities belongs to the training stage, that is, the process of training the random forest model.
This embodiment calculates the conditional probability of each word from its word-frequency feature. The conditional probability of each word, and hence the feature information of each user, is obtained with a simple calculation. Because the conditional probabilities are precomputed, they can be looked up directly when the similarity between users is calculated, which effectively improves the efficiency of the similarity calculation.
In one embodiment, the conditional probability that each word falls into the text set corresponding to the reference user is calculated by the following formula:
[formula not reproduced in this text]
where y is the text set label, with 0 denoting the text set corresponding to the candidate user and 1 denoting the text set corresponding to the reference user; i is the word index, denoting the i-th word, with n words in total; θ_yi is the frequency with which the i-th word appears in text set y; N_yi is the number of times the i-th word appears in text set y; N_y is the total number of occurrences of all words in text set y; α is a preset smoothing coefficient; and λ_i is the probability that the i-th word falls into the text set corresponding to the reference user.
When the smoothing coefficient α is 1, this is the common Laplace smoothing. α can also take any other value greater than 0.
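The formula itself does not survive in this text. One form consistent with the symbol definitions is the usual additively smoothed frequency, which reduces to Laplace smoothing at α = 1. The expression for λ_i below is an assumption (normalizing the two smoothed frequencies against each other) and is not confirmed by the source:

```latex
\theta_{yi} = \frac{N_{yi} + \alpha}{N_{y} + \alpha n},
\qquad
\lambda_{i} = \frac{\theta_{1i}}{\theta_{0i} + \theta_{1i}}
```

Under this reading, λ_i approaches 1 for words that occur almost only in the reference user's text set and approaches 0 for words that occur almost only in the other set.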
This embodiment uses the formula to calculate the conditional probability that each word falls into the text set corresponding to the reference user. From these probabilities the first text feature vector of the candidate user can be determined, as can the training text feature vectors of each sample user used to train the random forest model.
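A minimal Python sketch of this calculation is given below. It assumes that the smoothed frequency takes the standard additive-smoothing form θ_yi = (N_yi + α) / (N_y + αn), and that λ_i is obtained by normalizing the reference-set frequency against the sum of both frequencies; the second step is an assumption, since the source formula is not reproduced. The function and variable names are hypothetical.

```python
from collections import Counter

def conditional_probabilities(reference_words, other_words, alpha=1.0):
    """Estimate, per word, the probability of falling into the reference set.

    Assumed form: theta_yi = (N_yi + alpha) / (N_y + alpha * n),
    lambda_i = theta_1i / (theta_0i + theta_1i).
    """
    ref_counts = Counter(reference_words)    # N_1i
    other_counts = Counter(other_words)      # N_0i
    vocab = set(ref_counts) | set(other_counts)
    n = len(vocab)                           # total number of distinct words
    n_ref = sum(ref_counts.values())         # N_1
    n_other = sum(other_counts.values())     # N_0
    probs = {}
    for word in vocab:
        theta_1 = (ref_counts[word] + alpha) / (n_ref + alpha * n)
        theta_0 = (other_counts[word] + alpha) / (n_other + alpha * n)
        probs[word] = theta_1 / (theta_0 + theta_1)
    return probs

probs = conditional_probabilities(["game", "game", "raid"], ["news", "game"])
```

Because the result is a plain dictionary, it can be computed once in the training stage and looked up per word when candidate users are scored.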
In one embodiment, after the step of inputting the candidate user's first text feature vector into the pre-trained random forest model and obtaining the similarity value between the candidate user and the reference user from the output of the random forest model, the method further includes: if the similarity value corresponding to the candidate user is higher than a preset threshold, determining that the candidate user is a similar user of the reference user.
The preset threshold is generally between 0.5 and 1.0, although values outside this range can also be used. When the threshold is 1, the candidate user is a very similar user, and may even be the reference user itself.
Optionally, the number of similar users can be adjusted by adjusting the preset threshold, for example 0.5.
This embodiment determines similar users by comparing the similarity with the preset threshold. Once the similar users are determined, the corresponding operation can be directed specifically at them, which reduces the cost wasted on performing the operation for non-similar users.
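The threshold comparison can be sketched as below. The function name `select_similar_users`, the dictionary of per-user similarity values, and the choice to return users sorted by descending similarity are assumptions for illustration.

```python
def select_similar_users(similarities, threshold=0.5):
    """Return candidates whose similarity exceeds the threshold, best first."""
    return sorted(
        (user for user, score in similarities.items() if score > threshold),
        key=lambda user: similarities[user],
        reverse=True,
    )

scores = {"u1": 0.9, "u2": 0.4, "u3": 0.6}
print(select_similar_users(scores))        # default threshold of 0.5
print(select_similar_users(scores, 0.8))   # stricter threshold, fewer users
```

Raising the threshold shrinks the set of similar users, which matches the adjustment described above.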
In one embodiment, the voting function in the random forest is:
[formula not reproduced in this text]
where H(x) is the voting function, that is, the vote obtained after text feature vector x is input on whether x belongs to the reference user; x is the input text feature vector; h is a decision tree; t denotes the t-th tree; and the random forest contains T trees in total.
h_t(x) denotes that, after text feature vector x is input, the t-th decision tree votes on x, finally yielding the voting result for the text set.
In this embodiment, the random forest votes on the input text feature vector through the voting function and obtains a voting result on whether the text feature vector belongs to the text collection corresponding to the reference user. From these voting results, the random forest can obtain the similarity between the corresponding candidate user and the reference user, and it can then be determined whether the candidate user is a similar user of the reference user.
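The voting step can be illustrated with the standard majority-vote form of a random forest, H(x) = argmax over labels Y of the number of trees t (t = 1..T) with h_t(x) = Y. This form is an assumption, since the formula itself is not reproduced in this text. Using the share of trees that vote for the reference user as the similarity value is also an assumption, and the three "trees" below are hypothetical threshold rules rather than trained decision trees.

```python
from collections import Counter

def forest_vote(trees, x):
    """Majority vote: the label predicted by the most trees wins."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

def vote_share(trees, x, label=1):
    """Fraction of the T trees that vote for `label` (1 = reference user)."""
    return sum(1 for tree in trees if tree(x) == label) / len(trees)

# Three hypothetical trees, each reduced to a single threshold rule
trees = [
    lambda x: 1 if x[0] > 0.5 else 0,
    lambda x: 1 if x[1] > 0.5 else 0,
    lambda x: 1 if x[0] + x[1] > 1.2 else 0,
]

x = [0.9, 0.2]
label = forest_vote(trees, x)   # only the first tree votes 1
share = vote_share(trees, x)    # one of three trees
```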
To better understand the above method, an application example of the text-based similarity determination method of the present invention is described in detail below. Fig. 2 shows this application example.
A DSP (Demand-Side Platform) is a demand-side platform that serves advertisers; its goal is to bring in as many converted users as possible at the lowest possible cost. A DSP can be understood simply as a platform that connects advertisers with traffic providers and places advertisements for advertisers on various traffic sources (such as iQiyi and Toutiao).
Suppose there is a small number of active game seed users, and the cost of delivering game advertisements to this batch of seed users is relatively low. The DSP wishes to target delivery at a large number of users of this kind, so similar users are selected from the non-seed users. The process of determining similar users is as follows:
1) Obtain the title lists of the videos watched by the seed users and the non-seed users, segment the video titles into words to obtain the words corresponding to all users, and remove the stop words from these words;
2) Calculate the word-frequency feature of each word corresponding to each user. Computing word frequencies for the seed users yields the text collection corresponding to the reference users; computing word frequencies for the non-seed users yields the text collection pool corresponding to the non-reference users, and the text collection corresponding to the non-reference users can be obtained by sampling from this pool;
3) For a given user, calculate the conditional probability feature of each word so that every word is mapped to its conditional probability; sort these conditional probabilities and select the 20 highest-ranked and the 20 lowest-ranked; use these 40 conditional probabilities and their corresponding words as the feature vector of the user. Determine the feature vectors of all users in this way;
4) Use the feature vectors of these users as the training set for training a random forest, and obtain the trained random forest model;
5) Classify the non-seed users with the trained model to obtain the similarity between each non-seed user and the seed users, and classify non-seed users whose similarity value is greater than 0.5 as similar users.
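Step 3) above, which keeps the highest-ranked and lowest-ranked conditional probabilities, can be sketched as follows. The function name is hypothetical, and returning every word when a user has no more than 2k words is an assumption, since the example does not say how such users are handled. The demonstration uses k = 2 instead of 20 to stay short.

```python
def select_extreme_features(cond_prob, k=20):
    """Keep the k highest and k lowest conditional probabilities with their words."""
    ranked = sorted(cond_prob.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) <= 2 * k:
        # Assumption: with too few words, keep all of them
        return ranked
    return ranked[:k] + ranked[-k:]

# Hypothetical conditional probabilities for one user's words
probs = {"a": 0.9, "b": 0.8, "c": 0.6, "d": 0.4, "e": 0.2, "f": 0.1}
features = select_extreme_features(probs, k=2)
```

The words in the middle of the ranking carry the least evidence for or against membership in the reference users' text collection, which is one plausible reason for keeping only the two ends.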
After a batch of active game users was actually expanded in this way, offline tests were run on several games actually delivered by the DSP. The control group is the traffic covered by the actual delivery, test group 1 is the traffic covered by the expanded similar users, and test group 2 is the traffic covered by the seed users. The CPA (average cost per conversion) of test group 1 was on average about 40% lower than that of the control group, and the CPA of test group 2 was on average about 70% lower than that of the control group; the number of converted users in test group 1 was 5 to 10 times that in test group 2. These results show that the text-based similarity determination method of the embodiments of the present invention can accurately calculate the similarity between seed users and non-seed users, and accurately find similar users, knowing only the titles of the videos the users watched.
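For reference, the CPA comparison is simple arithmetic: cost divided by conversions, then a relative difference against the control group. The sketch below uses made-up figures chosen only to illustrate a 40% reduction; they are not the test data of this example, and the function names are hypothetical.

```python
def cpa(cost, conversions):
    """Average cost consumed per conversion."""
    return cost / conversions

def relative_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`."""
    return (baseline - value) / baseline

# Hypothetical figures, not taken from the test described above
control_cpa = cpa(cost=1000.0, conversions=100)
expanded_cpa = cpa(cost=600.0, conversions=100)
reduction = relative_reduction(control_cpa, expanded_cpa)
```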
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention certain steps may be performed in another order or simultaneously.
Based on the same idea as the text-based similarity determination method in the above embodiments, the present invention also provides a text-based similarity determination device, which can be used to perform the above text-based similarity determination method. For ease of description, the structural schematic diagram of the device embodiment shows only the parts related to the embodiments of the present invention. Those skilled in the art will understand that the illustrated structure does not limit the device, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
As shown in Fig. 3, the text-based similarity determination device includes a conditional probability calculation module 310, a first feature vector acquisition module 320 and a similarity value determination module 330, described in detail as follows:
The conditional probability calculation module 310 is configured to obtain the historical web browsing record of a candidate user and obtain the text collection corresponding to the candidate user according to the historical web browsing record, and to obtain the precalculated conditional probability that each text in the text collection falls into the text collection corresponding to the reference user.
The first feature vector acquisition module 320 is configured to obtain the first text feature vector corresponding to the candidate user according to the text collection corresponding to the candidate user and the conditional probability of each text therein.
The similarity value determination module 330 is configured to input the first text feature vector of the candidate user into the pre-trained random forest model and obtain the similarity value between the candidate user and the reference user according to the output of the random forest model.
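The division into three modules can be pictured with the following sketch. The class and method names are hypothetical, whitespace splitting stands in for a real word segmenter, and the model is any callable that maps a feature vector to a similarity value; a trained random forest would take its place in practice.

```python
class SimilarityDevice:
    """Hypothetical wiring of modules 310, 320 and 330."""

    def __init__(self, cond_prob, model, stop_words=frozenset()):
        self.cond_prob = cond_prob    # precalculated: word -> conditional probability
        self.model = model            # callable: feature vector -> similarity value
        self.stop_words = stop_words

    def text_collection(self, history):
        # Module 310: words from the browsing history, stop words removed
        words = [w for title in history for w in title.split()]
        return [w for w in words if w not in self.stop_words]

    def feature_vector(self, words):
        # Module 320: map each known word to its conditional probability
        known = set(words) & set(self.cond_prob)
        return sorted((self.cond_prob[w] for w in known), reverse=True)

    def similarity(self, history):
        # Module 330: feed the feature vector to the pre-trained model
        return self.model(self.feature_vector(self.text_collection(history)))


# Stand-in model (assumption): the mean of the conditional probabilities
mean_model = lambda v: sum(v) / len(v) if v else 0.0
device = SimilarityDevice({"game": 0.8, "news": 0.2}, mean_model, stop_words={"the"})
score = device.similarity(["the game", "game news"])
```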
In one embodiment, the text-based similarity determination device further includes: a sample user set construction module, configured to construct a sample user set that includes reference users and non-reference users; a second feature vector acquisition module, configured to obtain the historical web browsing record of each sample user in the sample user set and obtain the text collection corresponding to each sample user, to calculate the conditional probability of each text in the text collection of each sample user, and to obtain the second text feature vector corresponding to each sample user in the sample user set according to the text collection corresponding to each sample user and the conditional probability of each text therein; and a random forest training module, configured to select, from the second text feature vector, the multiple text feature vectors ranked highest by conditional probability value and the multiple text feature vectors ranked lowest, as the training text feature vector of the corresponding sample user.
In one embodiment, the random forest training module is further configured to sort the second text feature vector by conditional probability value both in descending order and in ascending order, and to take the multiple highest-ranked text feature vectors from each ordering as the training text feature vector of the corresponding sample user.
In one embodiment, the conditional probability calculation module 310 is further configured to obtain the words corresponding to the candidate user according to the historical web browsing record, remove the stop words from these words, and obtain the text collection corresponding to the candidate user.
In one embodiment, the conditional probability calculation module 310 is further configured to obtain the word-frequency feature of each word in the text collection, and to calculate, according to the word-frequency feature, the conditional probability that each word falls into the text collection corresponding to the reference user.
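Stop-word removal and word-frequency counting can be sketched with the standard library. The stop-word list and titles are hypothetical, and whitespace splitting is an assumption that only suits space-delimited text; Chinese titles would need a word segmenter instead.

```python
from collections import Counter

STOP_WORDS = {"the", "of", "a"}  # hypothetical stop-word list

def word_frequencies(titles, stop_words=STOP_WORDS):
    """Count how often each non-stop word occurs across the titles."""
    words = [w.lower() for title in titles for w in title.split()]
    return Counter(w for w in words if w not in stop_words)

freq = word_frequencies(["The King of Games", "Games of the Week"])
```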
In one embodiment, the conditional probability calculation module 310 is further configured to calculate, by the following formula, the conditional probability that each word falls into the text collection corresponding to the reference user:
Where y is the text collection label, 0 denoting the text collection corresponding to the candidate user and 1 denoting the text collection corresponding to the reference user; i is the index of a word, denoting the i-th word, and there are n words in total; θ_yi is the frequency with which the i-th word occurs in text collection y; N denotes a count, N_yi being the number of times the i-th word occurs in text collection y and N_y being the number of times all words occur in text collection y; α is a preset smoothing factor; and λ_i is the probability that, given that the i-th word occurs, the word falls into the text collection corresponding to the reference user.
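The formula itself is not reproduced in this text. One reading that is consistent with the symbol definitions above is the standard additive-smoothing estimate for θ together with a normalized ratio for λ; the second expression in particular is an assumption and not a quotation of the patent's formula.

```latex
\theta_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n},
\qquad
N_y = \sum_{i=1}^{n} N_{yi},
\qquad
\lambda_i = \frac{\theta_{1i}}{\theta_{0i} + \theta_{1i}}
```

With α = 1 the first expression is Laplace smoothing, and any α greater than 0 keeps θ_yi strictly positive, so λ_i is always defined.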
In one embodiment, the text-based similarity determination device further includes a similar user determination module, configured to determine that the candidate user is a similar user of the reference user if the similarity value corresponding to the candidate user is higher than a preset threshold.
In one embodiment, the voting function in the random forest is:
Where H(x) is the voting function; x is the input text feature vector; h is a decision tree; t denotes the t-th tree; and the random forest contains T trees in total.
It should be noted that the text-based similarity determination device of the present invention corresponds one-to-one with the text-based similarity determination method of the present invention. The technical features described in the embodiments of the method, and their beneficial effects, apply equally to the embodiments of the device; for details, refer to the description in the method embodiments of the present invention, which is not repeated here. It is hereby declared.
In addition, in the embodiments of the text-based similarity determination device exemplified above, the logical division into program modules is merely illustrative. In practical applications, the above functions may be assigned to different program modules as needed, for example in view of the configuration requirements of the corresponding hardware or the convenience of the software implementation; that is, the internal structure of the text-based similarity determination device may be divided into different program modules.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and sold or used as an independent product. More specific examples (a non-exhaustive list) of computer-readable media include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or, if necessary, processing it in another suitable manner, and then stored in a computer memory.
It should be understood that each part of the present invention can be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods can be implemented by software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following technologies well known in the art may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
It should be noted that the terms "first", "second" and "third" used in the embodiments of the present invention merely distinguish similar objects and do not represent a particular ordering of those objects. It is understood that "first", "second" and "third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the present invention described herein can be implemented in an order other than those illustrated or described herein.
The terms "comprising" and "having" in the embodiments of the present invention, and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or units (modules) is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product or device.
The technical features of the embodiments described above can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to fall within the scope of this specification.
The embodiments described above express only several implementations of the present invention and should not be construed as limiting the patent scope of the present invention. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.
Claims (11)
1. A text-based similarity determination method, characterized by comprising the following steps:
obtaining a historical web browsing record of a candidate user, and obtaining a text collection corresponding to the candidate user according to the historical web browsing record; obtaining a precalculated conditional probability that each text in the text collection falls into a text collection corresponding to a reference user;
obtaining a first text feature vector corresponding to the candidate user according to the text collection corresponding to the candidate user and the conditional probability of each text therein;
inputting the first text feature vector of the candidate user into a pre-trained random forest model, and obtaining a similarity value between the candidate user and the reference user according to the output of the random forest model.
2. The text-based similarity determination method according to claim 1, characterized in that, before the step of inputting the first text feature vector of the candidate user into the pre-trained random forest model, the method further comprises:
constructing a sample user set, the sample user set including reference users and non-reference users;
obtaining the historical web browsing record of each sample user in the sample user set, and obtaining the text collection corresponding to each sample user; calculating the conditional probability of each text in the text collection of each sample user; obtaining a second text feature vector corresponding to each sample user in the sample user set according to the text collection corresponding to each sample user and the conditional probability of each text therein;
selecting multiple text feature vectors from the second text feature vector as the training text feature vector of the corresponding sample user; training the random forest model according to the training text feature vector.
3. The text-based similarity determination method according to claim 2, characterized in that the step of selecting multiple text feature vectors from the second text feature vector as the training text feature vector of the corresponding sample user comprises:
selecting, from the second text feature vector, the multiple text feature vectors ranked highest by conditional probability value and the multiple text feature vectors ranked lowest, as the training text feature vector of the corresponding sample user.
4. The text-based similarity determination method according to any one of claims 1 to 3, characterized in that the step of obtaining the text collection corresponding to the candidate user according to the historical web browsing record comprises:
obtaining the words corresponding to the candidate user according to the historical web browsing record, removing the stop words from the words, and obtaining the text collection corresponding to the candidate user.
5. The text-based similarity determination method according to claim 4, characterized in that, before the step of obtaining the precalculated conditional probability that each text in the text collection falls into the text collection corresponding to the reference user, the method further comprises:
obtaining the word-frequency feature of each word in the text collection, and calculating, according to the word-frequency feature, the conditional probability that each word falls into the text collection corresponding to the reference user.
6. The text-based similarity determination method according to claim 5, characterized in that the conditional probability that each word falls into the text collection corresponding to the reference user is calculated by the following formula:
where y is the text collection label, 0 denoting the text collection corresponding to the candidate user and 1 denoting the text collection corresponding to the reference user; i is the index of a word, denoting the i-th word, and there are n words in total; θ_yi is the frequency with which the i-th word occurs in text collection y; N_yi is the number of times the i-th word occurs in text collection y, and N_y is the number of times all words occur in text collection y; α is a preset smoothing factor; and λ_i is the probability that, given that the i-th word occurs, the word falls into the text collection corresponding to the reference user.
7. The text-based similarity determination method according to claim 1, characterized in that, after the step of inputting the first text feature vector of the candidate user into the pre-trained random forest model and obtaining the similarity value between the candidate user and the reference user according to the output of the random forest model, the method further comprises:
if the similarity value corresponding to the candidate user is higher than a preset threshold, the candidate user is a similar user of the reference user.
8. The text-based similarity determination method according to claim 1, characterized in that the voting function in the random forest is:
where H(x) is the voting function; x is the input text feature vector; h is a decision tree; t denotes the t-th tree; and the random forest contains T trees in total.
9. A text-based similarity determination device, characterized by comprising:
a conditional probability calculation module, configured to obtain a historical web browsing record of a candidate user and obtain a text collection corresponding to the candidate user according to the historical web browsing record, and to obtain a precalculated conditional probability that each text in the text collection falls into a text collection corresponding to a reference user;
a first feature vector acquisition module, configured to obtain a first text feature vector corresponding to the candidate user according to the text collection corresponding to the candidate user and the conditional probability of each text therein;
a similarity value determination module, configured to input the first text feature vector of the candidate user into a pre-trained random forest model and obtain a similarity value between the candidate user and the reference user according to the output of the random forest model.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the text-based similarity determination method according to any one of claims 1 to 8.
11. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the text-based similarity determination method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810015523.1A CN108304490B (en) | 2018-01-08 | 2018-01-08 | Text-based similarity determination method and device and computer equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810015523.1A CN108304490B (en) | 2018-01-08 | 2018-01-08 | Text-based similarity determination method and device and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108304490A (en) | 2018-07-20 |
CN108304490B (en) | 2020-12-15 |
Family
ID=62868406
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810015523.1A Active CN108304490B (en) | 2018-01-08 | 2018-01-08 | Text-based similarity determination method and device and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108304490B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101122909A (en) * | 2006-08-10 | 2008-02-13 | 株式会社日立制作所 | Text message indexing unit and text message indexing method |
JP2010021761A (en) * | 2008-07-10 | 2010-01-28 | Nippon Hoso Kyokai <Nhk> | Video image automatic recording control device |
CN103116637A (en) * | 2013-02-08 | 2013-05-22 | 无锡南理工科技发展有限公司 | Text sentiment classification method facing Chinese Web comments |
CN104142998A (en) * | 2014-08-01 | 2014-11-12 | 中国传媒大学 | Text classification method |
US20170104752A1 (en) * | 2015-10-13 | 2017-04-13 | Fujitsu Limited | Method of processing a ciphertext, apparatus, and storage medium |
CN107341233A (en) * | 2017-07-03 | 2017-11-10 | 北京拉勾科技有限公司 | A kind of position recommends method and computing device |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111027994A (en) * | 2018-10-09 | 2020-04-17 | 百度在线网络技术(北京)有限公司 | Similar object determination method, device, equipment and medium |
CN109492687A (en) * | 2018-10-31 | 2019-03-19 | 北京字节跳动网络技术有限公司 | Method and apparatus for handling information |
CN111753079A (en) * | 2019-03-11 | 2020-10-09 | 阿里巴巴集团控股有限公司 | Text classification method and device, electronic equipment and computer-readable storage medium |
CN110988317A (en) * | 2019-11-27 | 2020-04-10 | 兰州大学第一医院 | Detection method and system for biological samples in refrigerating chamber |
CN110988317B (en) * | 2019-11-27 | 2021-04-20 | 兰州大学第一医院 | Detection method and system for biological samples in refrigerating chamber |
CN113139034A (en) * | 2020-01-17 | 2021-07-20 | 深圳市优必选科技股份有限公司 | Statement matching method, statement matching device and intelligent equipment |
CN111310840A (en) * | 2020-02-24 | 2020-06-19 | 北京百度网讯科技有限公司 | Data fusion processing method, device, equipment and storage medium |
CN111310840B (en) * | 2020-02-24 | 2023-10-17 | 北京百度网讯科技有限公司 | Data fusion processing method, device, equipment and storage medium |
CN111753763A (en) * | 2020-06-28 | 2020-10-09 | 广联达科技股份有限公司 | Method and device for identifying table content of construction drawing and computer equipment |
CN112287236A (en) * | 2020-11-19 | 2021-01-29 | 每日互动股份有限公司 | Text message pushing method and device, computer equipment and storage medium |
CN112651439A (en) * | 2020-12-25 | 2021-04-13 | 平安科技(深圳)有限公司 | Material classification method and device, computer equipment and storage medium |
CN112651439B (en) * | 2020-12-25 | 2023-12-22 | 平安科技(深圳)有限公司 | Material classification method, device, computer equipment and storage medium |
CN112987940A (en) * | 2021-04-27 | 2021-06-18 | 广州智品网络科技有限公司 | Input method and device based on sample probability quantization and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN108304490B (en) | 2020-12-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||