CN108304490A - Text-based similarity determination method, apparatus and computer equipment - Google Patents

Text-based similarity determination method, apparatus and computer equipment

Info

Publication number
CN108304490A
Authority
CN
China
Prior art keywords
text
user
collection
users
candidate user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810015523.1A
Other languages
Chinese (zh)
Other versions
CN108304490B (en
Inventor
周涛
李百川
李展铿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Umi-Tech Co Ltd
Original Assignee
Umi-Tech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Umi-Tech Co Ltd
Priority to CN201810015523.1A
Publication of CN108304490A
Application granted
Publication of CN108304490B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a text-based similarity determination method and apparatus, and belongs to the field of Internet technology. The method includes: obtaining the historical web browsing records of a candidate user, and obtaining the text collection corresponding to the candidate user from those records; obtaining the precomputed conditional probability that each text in the text collection falls into the text collection corresponding to a reference user; obtaining a first text feature vector corresponding to the candidate user from the candidate user's text collection and the conditional probability of each text in it; and inputting the first text feature vector of the candidate user into a pre-trained random forest model, and obtaining the similarity value between the candidate user and the reference user from the output of the random forest model. This technical solution addresses the problem that the similarity between users cannot be calculated accurately: it calculates the similarity between a candidate user and a reference user from text-related information, so that the users similar to the reference user can be found.

Description

Text-based similarity determination method, apparatus and computer equipment
Technical Field
The present invention relates to the field of Internet technology, and in particular to a text-based similarity determination method and apparatus, a computer-readable storage medium, and computer equipment.
Background
At present, finding similar users and pushing messages or delivering advertisements to them has become an effective marketing approach. This approach depends on accurately determining the similarity between users. Traditional text-based methods for determining similarity include k-means clustering, among others. In the course of realizing the present invention, the inventors found at least the following problems in the prior art: traditional methods for determining similarity are either unsuitable for determining similarity based on words, or produce highly random results, so that clustering the same batch of users yields a different result each time. A method is therefore needed that can calculate the similarity between users from text-related information.
Summary of the Invention
In view of this, the present invention provides a text-based similarity determination method and apparatus that can accurately calculate the similarity between users from text-related information, and can thereby determine the users similar to a reference user.
The embodiments of the present invention are as follows:
A text-based similarity determination method, comprising: obtaining the historical web browsing records of a candidate user, and obtaining the text collection corresponding to the candidate user from the historical web browsing records; obtaining the precomputed conditional probability that each text in the text collection falls into the text collection corresponding to a reference user; obtaining a first text feature vector corresponding to the candidate user from the candidate user's text collection and the conditional probability of each text in it; and inputting the first text feature vector of the candidate user into a pre-trained random forest model, and obtaining the similarity value between the candidate user and the reference user from the output of the random forest model.
In one of the embodiments, before the step of inputting the first text feature vector of the candidate user into the pre-trained random forest model, the method further comprises: building a sample user set that includes reference users and non-reference users; obtaining the historical web browsing records of each sample user in the sample user set to obtain the text collection corresponding to each sample user; calculating the conditional probability of each text in the text collection of each sample user; obtaining a second text feature vector corresponding to each sample user from that user's text collection and the conditional probability of each text in it; choosing multiple text feature vectors from the second text feature vectors as the training text feature vectors of the corresponding sample user; and training the random forest model on the training text feature vectors.
In one of the embodiments, the step of choosing multiple text feature vectors from the second text feature vectors as the training text feature vectors of the corresponding sample user comprises: sorting the second text feature vectors by conditional probability value, and choosing the multiple text feature vectors ranked first and the multiple text feature vectors ranked last as the training text feature vectors of the corresponding sample user.
In one of the embodiments, the step of obtaining the text collection corresponding to the candidate user from the historical web browsing records comprises: obtaining the words corresponding to the candidate user from the historical web browsing records, and removing the stop words from those words to obtain the text collection corresponding to the candidate user.
In one of the embodiments, before the step of obtaining the precomputed conditional probability that each text in the text collection falls into the text collection corresponding to the reference user, the method further comprises: obtaining the word frequency features of each word in the text collection, and calculating from the word frequency features the conditional probability that each word falls into the text collection corresponding to the reference user.
In one of the embodiments, the conditional probability that each word falls into the text collection corresponding to the reference user is calculated by the following formula:
where y is the text collection label, with 0 denoting the text collection corresponding to the candidate user and 1 denoting the text collection corresponding to the reference user; i is the word index, denoting the i-th word of n words in total; θ_yi is the frequency with which the i-th word appears in text collection y; N_yi is the number of times the i-th word appears in text collection y; N_y is the number of occurrences of all words in text collection y; α is a preset smoothing factor; and λ_i is the probability that the i-th word falls into the text collection corresponding to the reference user.
In one of the embodiments, after the step of inputting the first text feature vector of the candidate user into the pre-trained random forest model and obtaining the similarity value between the candidate user and the reference user from the output of the random forest model, the method further comprises: if the similarity value corresponding to the candidate user is higher than a preset threshold, determining that the candidate user is a similar user of the reference user.
In one of the embodiments, the voting function in the random forest is:
where H(x) is the voting function; x is the input text feature vector; h is a decision tree and t denotes the t-th tree; the random forest contains T trees in total.
Correspondingly, an embodiment of the present invention provides a text-based similarity determination apparatus, comprising: a conditional probability calculation module, configured to obtain the historical web browsing records of a candidate user, obtain the text collection corresponding to the candidate user from the historical web browsing records, and obtain the precomputed conditional probability that each text in the text collection falls into the text collection corresponding to a reference user; a first feature vector acquisition module, configured to obtain a first text feature vector corresponding to the candidate user from the candidate user's text collection and the conditional probability of each text in it; and a similarity value determination module, configured to input the first text feature vector of the candidate user into a pre-trained random forest model and obtain the similarity value between the candidate user and the reference user from the output of the random forest model.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method described above.
A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method described above when executing the program.
The above technical solution obtains the text collection corresponding to a candidate user from the candidate user's historical web browsing records; obtains the precomputed conditional probability that each text in the text collection falls into the text collection corresponding to the reference user; determines the first text feature vector corresponding to the candidate user; and inputs the first text feature vector into a pre-trained random forest model to obtain the similarity value between the candidate user and the reference user. In this way the similarity between the candidate user and the reference user can be calculated accurately, and it can be determined whether the candidate user is a similar user of the reference user. The corresponding operation can then be targeted at similar users rather than performed on all users, which effectively reduces operating costs.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of the text-based similarity determination method in one embodiment;
Fig. 2 is an application example of the text-based similarity determination method in one embodiment;
Fig. 3 is a schematic structural diagram of the text-based similarity determination apparatus in one embodiment.
Detailed Description
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.
The embodiments of the present invention are described using network marketing as an example, but the text-based similarity determination method, apparatus, computer equipment and storage medium of the embodiments are not limited to problems in network marketing and can also be applied in other scenarios that require similarity determination.
In network marketing, the group of users to whom an advertisement is delivered is often determined from the historical data of network users. Some of these users frequently perform a particular network operation. For example, in a campaign that delivers TV advertisements, certain users often watch online videos of the product the advertisement promotes; these users are the seed users (reference users) of that product. Users whose network operations resemble those of the reference users are relatively likely to be interested, so delivering the corresponding advertisement to them is more targeted and gives a higher return. For example, given a small batch of active game seed users, delivering game advertisements to this batch costs relatively little, and delivering in bulk to users similar to the seed users can yield higher returns. It is therefore necessary to calculate the similarity between users and then find the users similar to the reference users. The embodiments of the present invention aim to solve the problem of calculating the similarity between users from the users' text information.
Traditional methods for determining the similar users of a reference user include:
Method one: Step 1, the list of packages installed by a user is expressed as 0/1 features using the bag-of-words (bow) method, and a logistic regression model is trained on this list; Step 2, the output of the logistic regression model, together with three other features (the proportion of basic applications installed, the number of paid applications, and the average amount paid), is used as input to train a GBDT (Gradient Boosting Decision Tree, an iterative decision tree algorithm) classification model, and users classified as 1 are taken as similar users. However, bow features perform poorly on text; in experiments, conditional probability features gave a marked improvement over bow features. In addition, the two-layer model used in this method relies on additional payment information features and is not suitable for tasks based on text information.
Method two: Step 1, the video tags of video media are mapped to x-dimensional tag vectors, and all tag vectors of a video are then summed and averaged to obtain an x-dimensional video vector for each video; Step 2, the videos are clustered to obtain a similar-video clustering result; Step 3, the similar-video clustering result is converted into a similar-user clustering result; Step 4, the clustering results are extracted for the seed users and sorted by similarity to determine the ranking of users. This method clusters videos first and then users. Because k-means clustering is sensitive to its initial values, and the choice of initial values is highly random, each clustering run gives a different result, so the same batch of seed users yields different clusters each time.
The embodiments of the present invention provide a text-based similarity determination method and a corresponding text-based similarity determination apparatus, each of which is described in detail below.
Fig. 1 is a schematic flowchart of the text-based similarity determination method of one embodiment. The method provided by this embodiment mainly includes steps S110 to S130, described in detail as follows:
S110: obtain the historical web browsing records of a candidate user, and obtain the text collection corresponding to the candidate user from the historical web browsing records; obtain the precomputed conditional probability that each text in the text collection falls into the text collection corresponding to a reference user.
Optionally, the sample users are all users who have performed the relevant web browsing operations, and they include reference users and non-reference users. Reference users are the seed users, and the sample users other than the seed users are non-seed users. Non-reference users may be all non-seed users or a subset chosen from the non-seed users. Candidate users may be some of the non-reference users, all of the non-reference users, or users drawn from the entire set of sample users. If the candidate users are all of the non-reference users, the embodiment can determine the similarity to the reference users of every user other than the reference users, and then determine the similar users of the reference users from among them. Calculating the similarity between a candidate user and the reference user makes it possible to determine whether the candidate user is a similar user of the reference user.
Historical web browsing records are the records generated after a user performs network operations. A network operation may be watching an online video, searching for a web page, playing an online game, and so on.
Optionally, the text collection consists of the texts used by a sample user, extracted from that user's historical web browsing records, for example the search terms the user entered to find a video, or the text information corresponding to an operation the user performed in an online game.
Optionally, the method further includes obtaining the precomputed conditional probability that each text in the text collection falls into the text collection corresponding to non-reference users.
This step calculates the probability that each text is a text corresponding to the reference user; it may also calculate the probability that each text is a text corresponding to the candidate user.
S120: obtain the first text feature vector corresponding to the candidate user from the candidate user's text collection and the conditional probability of each text in it.
Here, the first text feature vector is a list that characterizes the user's feature information; the list is made up of the text collection and the conditional probability corresponding to each text. The first text feature vector may also include other parameters, such as the user's number of transactions.
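The construction described in S120 can be sketched as pairing each text with its precomputed conditional probability. This is a minimal illustration, not the patent's implementation: the function name `build_feature_vector`, the example words, and the fallback value `default_prob` for texts without a precomputed probability are all assumptions.

```python
from typing import Dict, List, Tuple


def build_feature_vector(
    text_collection: List[str],
    cond_prob: Dict[str, float],
    default_prob: float = 0.5,  # assumed fallback for unseen texts
) -> List[Tuple[str, float]]:
    """Pair each distinct text with its precomputed conditional probability."""
    seen = set()
    vector: List[Tuple[str, float]] = []
    for text in text_collection:
        if text in seen:
            continue  # keep one entry per distinct text
        seen.add(text)
        vector.append((text, cond_prob.get(text, default_prob)))
    return vector


# Hypothetical candidate-user texts and precomputed probabilities.
candidate_texts = ["strategy", "game", "strategy", "recipe"]
precomputed = {"strategy": 0.9, "game": 0.8}
first_vector = build_feature_vector(candidate_texts, precomputed)
print(first_vector)
```

Other parameters mentioned in the text, such as a transaction count, could be appended to this list as extra entries.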
S130: input the first text feature vector of the candidate user into a pre-trained random forest model, and obtain the similarity value between the candidate user and the reference user from the output of the random forest model.
Here, a random forest model is a classifier that trains on and predicts samples using multiple trees. The random forest model of this embodiment uses the bagging method: multiple random decision trees are trained at the same time by randomly sampling both the samples and the features. These random decision trees jointly vote on whether the input first text feature vector belongs to the text collection corresponding to the reference user, which in turn gives the similarity between the candidate user and the reference user. From this similarity it can be determined whether the candidate user is a similar user of the reference user.
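The sampling side of bagging can be sketched as follows: each tree receives a bootstrap sample of the training rows and a random subset of the feature columns. This is an illustrative sketch only; the function name `draw_tree_sample` and the sizes are assumptions, and many random forest implementations sample features at each split rather than once per tree.

```python
import random
from typing import List, Tuple


def draw_tree_sample(
    n_samples: int,
    n_features: int,
    n_sub_features: int,
    rng: random.Random,
) -> Tuple[List[int], List[int]]:
    """Return row indices (with replacement) and feature indices (without)."""
    rows = [rng.randrange(n_samples) for _ in range(n_samples)]
    cols = sorted(rng.sample(range(n_features), n_sub_features))
    return rows, cols


rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
# One (rows, cols) pair per tree; here 3 trees over 10 samples and 40 features.
plans = [draw_tree_sample(10, 40, 6, rng) for _ in range(3)]
for rows, cols in plans:
    print(len(rows), cols)
```

Each tree would then be fitted on its own rows and columns, and the trees' votes combined at prediction time.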
This embodiment calculates the conditional probability of each text corresponding to the candidate user, inputs each text and its conditional probability into the trained random forest model as feature information, and finally obtains the similarity of the candidate user relative to the reference user. The similarity between the candidate user and the reference user can thus be calculated accurately from the user's features, and it can then be determined whether the candidate user is a similar user of the reference user.
In one embodiment, before the step of inputting the first text feature vector of the candidate user into the pre-trained random forest model, the method further comprises: building a sample user set that includes reference users and non-reference users; obtaining the historical web browsing records of each sample user in the sample user set to obtain the text collection corresponding to each sample user; calculating the conditional probability of each text in the text collection of each sample user; obtaining a second text feature vector corresponding to each sample user from that user's text collection and the conditional probability of each text in it; choosing multiple text feature vectors from the second text feature vectors as the training text feature vectors of the corresponding sample user; and training the random forest model on the training text feature vectors.
Here, the sample users may be all network users, or users who meet a certain condition. For example, if the similar users of the reference users of an online game are to be determined from among the users of that game, the sample users may be the users who have played that game.
This embodiment calculates the conditional probability of each text from the text information of the sample user set to obtain the second text feature vectors, selects from the second text feature vectors the training text feature vectors that are representative of each sample user, and trains the random forest model on those training text feature vectors. The random forest model can fully integrate the feature information in all training text feature vectors, yielding a model that can make reasonable judgments about candidate users.
In one embodiment, the step of choosing multiple text feature vectors from the second text feature vectors as the training text feature vectors of the corresponding sample user comprises: sorting the second text feature vectors by conditional probability value, and choosing the multiple text feature vectors ranked first and the multiple text feature vectors ranked last as the training text feature vectors of the corresponding sample user.
Optionally, in this embodiment the step of determining the training text feature vectors includes determining the training text feature vectors corresponding to each sample user.
Optionally, to choose the training text feature vectors from the second text feature vectors of a given sample user, this embodiment sorts the conditional probabilities in the second text feature vectors by size and determines the k1 top-ranked conditional probabilities λ1 and the k2 bottom-ranked conditional probabilities λ2. The conditional probabilities λ1 and λ2 are taken as the conditional probability features, and the text feature vectors corresponding to these conditional probability features are determined to be the training text feature vectors of that sample user. Here, k1 and k2 may be any integers greater than 0, and they may be equal or different. In a specific example, k1 and k2 are both 20. Optionally, the number of training text feature vectors may differ from one sample user to another.
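The top-k1 and bottom-k2 selection described above can be sketched as a sort followed by two slices. This is a minimal sketch under assumptions: features are represented as (text, probability) pairs, the name `select_training_features` is hypothetical, and returning everything when a user has at most k1 + k2 features is a choice made here, not something the text specifies.

```python
from typing import List, Tuple

Feature = Tuple[str, float]  # (text, conditional probability)


def select_training_features(
    features: List[Feature], k1: int = 20, k2: int = 20
) -> List[Feature]:
    """Keep the k1 highest-probability and k2 lowest-probability features."""
    ranked = sorted(features, key=lambda item: item[1], reverse=True)
    if len(ranked) <= k1 + k2:
        return ranked  # assumed behavior when there are too few features
    return ranked[:k1] + ranked[len(ranked) - k2:]


# Hypothetical second text feature vector of one sample user.
second_vector = [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.7), ("e", 0.3)]
training = select_training_features(second_vector, k1=2, k2=1)
print(training)
```

Keeping both ends of the ranking retains the texts most and least indicative of a reference user, while dropping the uninformative middle.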
This embodiment determines the training text feature vectors of each sample user from the conditional probabilities. These training text feature vectors represent the feature information of the sample user (for example, the higher the conditional probability value, the more likely the corresponding text is one used by a reference user). A random forest model trained on these training text feature vectors can accurately determine the similarity value between users.
In one embodiment, the step of obtaining the text collection corresponding to the candidate user from the historical web browsing records comprises: obtaining the words corresponding to the candidate user from the historical web browsing records, and removing the stop words from those words to obtain the text collection corresponding to the candidate user.
This step performs word segmentation on the text information corresponding to the historical web browsing records to obtain the words corresponding to the candidate user, removes the stop words that carry no user information, and consolidates the remaining words into the text collection corresponding to the candidate user. A text collection with stop words removed represents the user's features more accurately, saves storage space, and improves processing efficiency.
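The segmentation and stop-word removal step can be sketched as below. The sketch makes several assumptions: the records are short English strings, segmentation is a simple regular-expression split (Chinese text would need a dedicated word segmenter), and both the `STOP_WORDS` set and the function name `extract_text_collection` are hypothetical.

```python
import re
from typing import Iterable, List, Set

# Hypothetical stop-word list; a real system would load a fuller one.
STOP_WORDS: Set[str] = {"the", "a", "of", "and", "to", "in"}


def extract_text_collection(
    records: Iterable[str], stop_words: Set[str] = STOP_WORDS
) -> List[str]:
    """Split browsing-record text into words and drop stop words."""
    words: List[str] = []
    for record in records:
        for token in re.findall(r"\w+", record.lower()):
            if token not in stop_words:
                words.append(token)
    return words


# Hypothetical search terms taken from a user's browsing history.
history = ["The best strategy game of 2017", "How to play the game"]
collection = extract_text_collection(history)
print(collection)
```

Repeated words are kept in the result so that word counts can still be derived from the collection afterwards.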
In one embodiment, before the step of obtaining the precomputed conditional probability that each text in the text collection falls into the text collection corresponding to the reference user, the method further comprises: obtaining the word frequency features of each word in the text collection, and calculating from the word frequency features the conditional probability that each word falls into the text collection corresponding to the reference user.
Optionally, the word frequency features of each word include: the number of times the word appears in the text collection corresponding to the sample user set; the number of times the word appears in the text collection corresponding to the reference users; and the frequency with which the word appears in the text collection corresponding to the reference users (for example, the ratio of the number of times the word appears in the reference users' text collection to the number of occurrences of all words in that collection).
Optionally, before the step of obtaining the precomputed conditional probability that each text in the text collection falls into the text collection corresponding to the reference user, the method further comprises: obtaining the word frequency features of each word in the text collection, and calculating from the word frequency features the conditional probability that each word falls into the text collection corresponding to non-reference users.
Optionally, the calculation of the conditional probabilities belongs to the training stage, that is, the process of training the random forest model.
This embodiment calculates the conditional probability of each word from its word frequency features. A simple calculation yields the conditional probability of each word, and from these the feature information of each user. Because the conditional probabilities are calculated in advance, they can be looked up directly when the similarity between users is calculated, which effectively improves the efficiency of the similarity calculation.
In one embodiment, the conditional probability that each word falls into the text collection corresponding to the reference user is calculated by the following formula:
where y is the text collection label, with 0 denoting the text collection corresponding to the candidate user and 1 denoting the text collection corresponding to the reference user; i is the word index, denoting the i-th word of n words in total; θ_yi is the frequency with which the i-th word appears in text collection y; N_yi is the number of times the i-th word appears in text collection y; N_y is the number of occurrences of all words in text collection y; α is a preset smoothing factor; and λ_i is the probability that the i-th word falls into the text collection corresponding to the reference user.
When the smoothing factor α is 1, this is the common Laplace smoothing. α may also take any other value greater than 0.
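The symbol definitions above match the standard additively smoothed frequency estimate, which would read as the first line below. How the two frequencies are combined into λ_i cannot be recovered from the text here; the second line is only one plausible normalization, given as an assumption rather than as the patent's formula.

```latex
% Smoothed frequency of word i in text collection y (standard additive smoothing)
\theta_{yi} = \frac{N_{yi} + \alpha}{N_{y} + \alpha n},
\qquad y \in \{0, 1\},\quad i = 1, \dots, n

% ASSUMPTION: one possible way to turn the two frequencies into a probability
\lambda_{i} = \frac{\theta_{1i}}{\theta_{0i} + \theta_{1i}}
```

With α = 1 the first line reduces to Laplace smoothing, which keeps θ_yi strictly positive even for a word that never appears in collection y.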
This embodiment uses the formula to calculate the conditional probability that each word falls into the text collection corresponding to the reference user. From these probabilities it can determine the first text feature vector of a candidate user, as well as the training text feature vectors of each sample user used to train the random forest model.
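A word-level calculation of this kind can be sketched in a few lines. The smoothed frequency follows the symbol definitions in the text (count plus α, divided by total count plus α times the vocabulary size n). The step that turns the two frequencies into a single probability, `theta_1 / (theta_0 + theta_1)`, is an assumption made for illustration, as are the function names and the example words.

```python
from collections import Counter
from typing import Dict, List


def smoothed_frequencies(
    words: List[str], vocabulary: List[str], alpha: float = 1.0
) -> Dict[str, float]:
    """theta_yi = (N_yi + alpha) / (N_y + alpha * n) for one text collection."""
    counts = Counter(words)
    total = sum(counts[w] for w in vocabulary)  # N_y
    n = len(vocabulary)
    return {w: (counts[w] + alpha) / (total + alpha * n) for w in vocabulary}


def reference_probabilities(
    reference_words: List[str], other_words: List[str], alpha: float = 1.0
) -> Dict[str, float]:
    """ASSUMED normalization: lambda_i = theta_1i / (theta_0i + theta_1i)."""
    vocabulary = sorted(set(reference_words) | set(other_words))
    theta_1 = smoothed_frequencies(reference_words, vocabulary, alpha)
    theta_0 = smoothed_frequencies(other_words, vocabulary, alpha)
    return {w: theta_1[w] / (theta_0[w] + theta_1[w]) for w in vocabulary}


# Hypothetical word lists for the reference (y = 1) and other (y = 0) collections.
reference = ["game", "game", "strategy"]
others = ["recipe", "game", "recipe"]
lam = reference_probabilities(reference, others)
print(lam)
```

Because the result is a plain dictionary, it can be computed once during training and looked up when a candidate user's feature vector is built.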
In one embodiment, after the step of inputting the first text feature vector of the candidate user into the pre-trained random forest model and obtaining the similarity value between the candidate user and the reference user from the output of the random forest model, the method further includes: if the similarity value corresponding to the candidate user is higher than a preset threshold, the candidate user is a similar user of the reference user.
The preset threshold is generally between 0.5 and 1.0; of course, the threshold may also take values outside this range. When the threshold is 1, the candidate user is a highly similar user, and may even be the reference user itself.
Optionally, the number of similar users can be adjusted accordingly by raising or lowering the preset threshold (for example, 0.5).
In this embodiment, similar users are determined by comparing the similarity value with the preset threshold. Once the similar users are determined, operations can be targeted at them, which reduces the cost wasted on performing operations on non-similar users.
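The threshold comparison can be sketched in a few lines of Python. The user ids and similarity values below are hypothetical, and the function name is introduced only for illustration.

```python
def select_similar_users(similarities, threshold=0.5):
    """Return the ids of candidates whose similarity value is higher than the threshold."""
    return [uid for uid, s in similarities.items() if s > threshold]


# Hypothetical similarity values produced by the model for three candidates.
scores = {"u1": 0.91, "u2": 0.48, "u3": 0.66}
loose = select_similar_users(scores, threshold=0.5)   # larger audience
strict = select_similar_users(scores, threshold=0.8)  # smaller, closer audience
```

Raising the threshold shrinks the set of similar users and lowering it enlarges the set, which is how the number of similar users is tuned.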
In one embodiment, the voting function in the random forest is:
Where H(x) is the voting function, that is, the vote obtained after the text feature vector x is input on whether x belongs to the reference user; x is the input text feature vector; h is a decision tree, t denotes the t-th tree, and the random forest contains T trees in total.
h_t(x) indicates that, after the text feature vector x is input, the t-th decision tree votes on x, finally yielding the voting result for the text collection.
In this embodiment, the random forest votes on the input text feature vector through the voting function to obtain the voting result on whether the text feature vector belongs to the text collection corresponding to the reference user. From these voting results the random forest can obtain the similarity between the corresponding candidate user and the reference user, and it can then be determined whether the candidate user is a similar user of the reference user.
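The voting formula is not reproduced in this text. A common form that fits the definitions of H(x), h, t and T is the average of the individual tree votes, H(x) = (1/T) Σ h_t(x) with each h_t(x) equal to 0 or 1. The sketch below assumes that form; the three toy "trees" are hypothetical rules, not trained decision trees.

```python
def vote(trees, x):
    """Average the 0/1 votes of T trees: H(x) = (1/T) * sum of h_t(x)."""
    if not trees:
        raise ValueError("the forest must contain at least one tree")
    return sum(h(x) for h in trees) / len(trees)


# Three hypothetical stand-ins for decision trees on a 2-dimensional feature vector.
trees = [
    lambda x: 1 if x[0] > 0.5 else 0,
    lambda x: 1 if x[1] > 0.5 else 0,
    lambda x: 1 if x[0] + x[1] > 1.2 else 0,
]
similarity = vote(trees, [0.9, 0.4])  # two of the three trees vote 1
```

Under this assumption the fraction of trees voting for the reference user's collection serves directly as the similarity value that is compared against the preset threshold.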
For a better understanding of the above method, an application example of the text-based similarity determination method of the present invention is described in detail below. Fig. 2 shows an application example of the text-based similarity determination method.
A DSP (Demand-Side Platform) is a demand-side platform that serves advertisers. The goal of a DSP is to bring in as many converted users as possible at the lowest possible cost. A DSP can be understood simply as a platform connecting advertisers and traffic providers, which can place advertisements for advertisers on various traffic sources (such as iQiyi and Toutiao).
There is a small number of active game seed users, and the cost of delivering game advertisements to this batch of seed users is relatively low. The DSP wishes to target delivery to this kind of user at scale, so similar users are selected from the non-seed users. The process of determining similar users is as follows:
1) Obtain the list of titles of the videos that seed users and non-seed users have watched, segment the video titles into words to obtain the words corresponding to all users, and remove the stop words from these words;
2) Calculate the word-frequency feature (word frequency) of each word corresponding to each user. Computing word frequencies for the seed users yields the text collection corresponding to the reference users; computing word frequencies for the non-seed users yields the full text collection of the non-reference users, and the text collection corresponding to the non-reference users can be obtained by sampling from this full collection;
3) For a given user, calculate the conditional probability feature of each word and map every word to its conditional probability. Sort these conditional probabilities and select the 20 ranked highest and the 20 ranked lowest; these 40 conditional probabilities and their corresponding words serve as the feature vector of the user. Determine the feature vectors of all users in this way;
4) Use the feature vectors of these users as the training set for training a random forest, and obtain the trained random forest model;
5) Classify the non-seed users with the trained model to obtain the similarity between each non-seed user and the seed users, and classify non-seed users whose similarity value is greater than 0.5 as similar users.
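Steps 1) to 3) can be sketched as follows. This is a simplified illustration under stated assumptions: whitespace splitting stands in for a real word segmenter, the stop-word list and conditional probabilities are hypothetical, and users with fewer than 2k words simply keep all of their words.

```python
from collections import Counter

STOP_WORDS = {"the", "of", "a"}  # hypothetical stop-word list


def user_words(titles):
    """Steps 1-2: split titles into words, drop stop words, count word frequency."""
    words = [w for t in titles for w in t.lower().split() if w not in STOP_WORDS]
    return Counter(words)


def feature_vector(word_counts, cond_prob, k=20):
    """Step 3: keep the k highest-ranked and k lowest-ranked conditional probabilities."""
    ranked = sorted(
        ((w, cond_prob[w]) for w in word_counts if w in cond_prob),
        key=lambda item: item[1],
        reverse=True,
    )
    if len(ranked) <= 2 * k:
        return ranked  # too few words to split into top and bottom groups
    return ranked[:k] + ranked[-k:]


# Hypothetical precomputed conditional probabilities for four words.
cond_prob = {"game": 0.8, "esports": 0.7, "best": 0.5, "drama": 0.3}
counts = user_words(["The best game of esports", "A drama"])
features = feature_vector(counts, cond_prob, k=1)
```

With k = 20 this yields the 40 word and probability pairs described in step 3); the resulting vectors would then be passed to a random forest implementation for steps 4) and 5).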
After expansion was actually performed on a batch of active game users, offline tests were run on several games actually delivered by the DSP. The control group is the traffic covered by the actual delivery, test group 1 is the traffic covered by the expanded similar users, and test group 2 is the traffic covered by the seed users. Compared with the control group, the CPA (the average cost consumed per conversion) of test group 1 was about 40% lower on average, and the CPA of test group 2 was about 70% lower on average; the number of converted users in test group 1 was 5 to 10 times that of test group 2. The test results show that the text-based similarity determination method of the embodiment of the present invention can accurately calculate the similarity between seed users and non-seed users, and accurately find similar users, knowing only the titles of the videos that users have watched.
It should be noted that for each method embodiment above-mentioned, describes, be all expressed as a series of for simplicity Combination of actions, but those skilled in the art should understand that, the present invention is not limited by the described action sequence, because according to According to the present invention, certain steps may be used other sequences or be carried out at the same time.
The identical thought of method is determined based on the text based similarity in above-described embodiment, and the present invention also provides bases In the similarity determining device of text, which can be used for executing above-mentioned text based similarity and determines method.For the ease of Illustrate, in the structural schematic diagram of text based similarity determining device embodiment, illustrate only and phase of the embodiment of the present invention The part of pass, it will be understood by those skilled in the art that the restriction of schematic structure not structure twin installation, may include than illustrating more More or less component either combines certain components or different components arrangement.
As described in Figure 3, text based similarity determining device include conditional probability computing module 310, fisrt feature to Acquisition module 320 and similarity value determining module 330 are measured, detailed description are as follows:
The conditional probability calculation module 310 is configured to obtain the historical web browsing record of a candidate user and obtain the text collection corresponding to the candidate user from the historical web browsing record, and to obtain the precomputed conditional probability that each text in the text collection falls into the text collection corresponding to the reference user.
The first feature vector acquisition module 320 is configured to obtain the first text feature vector corresponding to the candidate user according to the text collection corresponding to the candidate user and the conditional probability of each text therein.
The similarity value determination module 330 is configured to input the first text feature vector of the candidate user into a pre-trained random forest model and obtain the similarity value between the candidate user and the reference user according to the output of the random forest model.
In one embodiment, the text-based similarity determination device further includes: a sample user set construction module, configured to construct a sample user set, the sample user set including reference users and non-reference users; a second feature vector acquisition module, configured to obtain the historical web browsing record of each sample user in the sample user set and obtain the text collection corresponding to each sample user, calculate the conditional probability of each text in the text collection of each sample user, and obtain the second text feature vector corresponding to each sample user in the sample user set according to the text collection corresponding to each sample user and the conditional probability of each text therein; and a random forest training module, configured to select, from the second text feature vectors, the multiple text feature vectors ranked first and the multiple text feature vectors ranked last by conditional probability value, as the training text feature vector of the corresponding sample user.
In one embodiment, the random forest training module is further configured to sort the second text feature vectors by conditional probability value in descending order and in ascending order respectively, and to take the multiple top-ranked text feature vectors in each ordering as the training text feature vector of the corresponding sample user.
In one embodiment, the conditional probability calculation module 310 is further configured to obtain the words corresponding to the candidate user from the historical web browsing record, remove the stop words from the words, and obtain the text collection corresponding to the candidate user.
In one embodiment, the conditional probability calculation module 310 is further configured to obtain the word-frequency feature of each word in the text collection and to calculate, according to the word-frequency feature, the conditional probability that each word falls into the text collection corresponding to the reference user.
In one embodiment, the conditional probability calculation module 310 is further configured to calculate, by the following formula, the conditional probability that each word falls into the text collection corresponding to the reference user:
Where y is the text collection label: 0 denotes the text collection corresponding to the candidate user and 1 denotes the text collection corresponding to the reference user; i is the index of a word, denoting the i-th word, with n words in total; θ_yi is the frequency with which the i-th word occurs in text collection y; N denotes a count: N_yi is the number of times the i-th word occurs in text collection y, and N_y is the number of occurrences of all words in text collection y; α is a preset smoothing factor; λ_i is the probability that, given that the i-th word occurs, the word falls into the text collection corresponding to the reference user.
In one embodiment, the text-based similarity determination device further includes a similar user determination module, configured to determine that the candidate user is a similar user of the reference user if the similarity value corresponding to the candidate user is higher than a preset threshold.
In one embodiment, the voting function in the random forest is:
Where H(x) is the voting function; x is the input text feature vector; h is a decision tree, t denotes the t-th tree, and the random forest contains T trees in total.
It should be noted that the text-based similarity determination device of the present invention corresponds one-to-one with the text-based similarity determination method of the present invention. The technical features and beneficial effects described in the embodiments of the method apply equally to the embodiments of the device; for details, refer to the description in the method embodiments of the present invention, which is not repeated here.
In addition, in the embodiments of the text-based similarity determination device exemplified above, the logical division of the program modules is only an example. In practical applications, the above functions may be assigned to different program modules as needed, for example because of the configuration requirements of the corresponding hardware or for convenience of software implementation; that is, the internal structure of the text-based similarity determination device is divided into different program modules to complete all or part of the functions described above.
Those of ordinary skill in the art will understand that all or part of the flows in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and sold or used as an independent product. More specific examples (a non-exhaustive list) of computer-readable media include the following: an electrical connection (electronic device) with one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and portable compact disc read-only memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that each part of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware that is stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following techniques well known in the art: a discrete logic circuit with logic gates for implementing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gates, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.
It should be noted that the terms "first/second/third" used in the embodiments of the present invention merely distinguish similar objects and do not represent a specific ordering of the objects. It is understood that, where permitted, "first/second/third" may be interchanged in specific order or sequence. It should be understood that the objects distinguished by "first/second/third" are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in an order other than those illustrated or described herein.
The terms "include" and "have" in the embodiments of the present invention, and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or (module) units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
The technical features of the embodiments described above can be combined arbitrarily. To keep the description concise, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to be within the scope of this specification.
The embodiments described above express only several implementations of the present invention and should not be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be determined by the appended claims.

Claims (11)

1. A text-based similarity determination method, characterized by comprising the following steps:
obtaining the historical web browsing record of a candidate user, and obtaining the text collection corresponding to the candidate user according to the historical web browsing record; obtaining the precomputed conditional probability that each text in the text collection falls into the text collection corresponding to a reference user;
obtaining the first text feature vector corresponding to the candidate user according to the text collection corresponding to the candidate user and the conditional probability of each text therein;
inputting the first text feature vector of the candidate user into a pre-trained random forest model, and obtaining the similarity value between the candidate user and the reference user according to the output of the random forest model.
2. The text-based similarity determination method according to claim 1, characterized in that, before the step of inputting the first text feature vector of the candidate user into the pre-trained random forest model, the method further comprises:
constructing a sample user set, the sample user set including reference users and non-reference users;
obtaining the historical web browsing record of each sample user in the sample user set, and obtaining the text collection corresponding to each sample user; calculating the conditional probability of each text in the text collection of each sample user; obtaining the second text feature vector corresponding to each sample user in the sample user set according to the text collection corresponding to each sample user and the conditional probability of each text therein;
selecting multiple text feature vectors from the second text feature vectors as the training text feature vector of the corresponding sample user; and training the random forest model according to the training text feature vectors.
3. The text-based similarity determination method according to claim 2, characterized in that the step of selecting multiple text feature vectors from the second text feature vectors as the training text feature vector of the corresponding sample user comprises:
selecting, from the second text feature vectors, the multiple text feature vectors ranked first and the multiple text feature vectors ranked last by conditional probability value, as the training text feature vector of the corresponding sample user.
4. The text-based similarity determination method according to any one of claims 1 to 3, characterized in that the step of obtaining the text collection corresponding to the candidate user according to the historical web browsing record comprises:
obtaining the words corresponding to the candidate user according to the historical web browsing record, removing the stop words from the words, and obtaining the text collection corresponding to the candidate user.
5. The text-based similarity determination method according to claim 4, characterized in that, before the step of obtaining the precomputed conditional probability that each text in the text collection falls into the text collection corresponding to the reference user, the method further comprises:
obtaining the word-frequency feature of each word in the text collection, and calculating, according to the word-frequency feature, the conditional probability that each word falls into the text collection corresponding to the reference user.
6. The text-based similarity determination method according to claim 5, characterized in that the conditional probability that each word falls into the text collection corresponding to the reference user is calculated by the following formula:
Where y is the text collection label: 0 denotes the text collection corresponding to the candidate user and 1 denotes the text collection corresponding to the reference user; i is the index of a word, denoting the i-th word, with n words in total; θ_yi is the frequency with which the i-th word occurs in text collection y; N_yi is the number of times the i-th word occurs in text collection y, and N_y is the number of occurrences of all words in text collection y; α is a preset smoothing factor; λ_i is the probability that, given that the i-th word occurs, the word falls into the text collection corresponding to the reference user.
7. The text-based similarity determination method according to claim 1, characterized in that, after the step of inputting the first text feature vector of the candidate user into the pre-trained random forest model and obtaining the similarity value between the candidate user and the reference user according to the output of the random forest model, the method further comprises:
if the similarity value corresponding to the candidate user is higher than a preset threshold, the candidate user is a similar user of the reference user.
8. The text-based similarity determination method according to claim 1, characterized in that the voting function in the random forest is:
Where H(x) is the voting function; x is the input text feature vector; h is a decision tree, t denotes the t-th tree, and the random forest contains T trees in total.
9. A text-based similarity determination device, characterized by comprising:
a conditional probability calculation module, configured to obtain the historical web browsing record of a candidate user and obtain the text collection corresponding to the candidate user according to the historical web browsing record, and to obtain the precomputed conditional probability that each text in the text collection falls into the text collection corresponding to a reference user;
a first feature vector acquisition module, configured to obtain the first text feature vector corresponding to the candidate user according to the text collection corresponding to the candidate user and the conditional probability of each text therein;
a similarity value determination module, configured to input the first text feature vector of the candidate user into a pre-trained random forest model and obtain the similarity value between the candidate user and the reference user according to the output of the random forest model.
10. A computer-readable storage medium on which a computer program is stored, characterized in that, when executed by a processor, the program implements the text-based similarity determination method according to any one of claims 1 to 8.
11. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the text-based similarity determination method according to any one of claims 1 to 8 when executing the program.
CN201810015523.1A 2018-01-08 2018-01-08 Text-based similarity determination method and device and computer equipment Active CN108304490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810015523.1A CN108304490B (en) 2018-01-08 2018-01-08 Text-based similarity determination method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810015523.1A CN108304490B (en) 2018-01-08 2018-01-08 Text-based similarity determination method and device and computer equipment

Publications (2)

Publication Number Publication Date
CN108304490A true CN108304490A (en) 2018-07-20
CN108304490B CN108304490B (en) 2020-12-15

Family

ID=62868406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810015523.1A Active CN108304490B (en) 2018-01-08 2018-01-08 Text-based similarity determination method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN108304490B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492687A (en) * 2018-10-31 2019-03-19 北京字节跳动网络技术有限公司 Method and apparatus for handling information
CN110988317A (en) * 2019-11-27 2020-04-10 兰州大学第一医院 Detection method and system for biological samples in refrigerating chamber
CN111027994A (en) * 2018-10-09 2020-04-17 百度在线网络技术(北京)有限公司 Similar object determination method, device, equipment and medium
CN111310840A (en) * 2020-02-24 2020-06-19 北京百度网讯科技有限公司 Data fusion processing method, device, equipment and storage medium
CN111753763A (en) * 2020-06-28 2020-10-09 广联达科技股份有限公司 Method and device for identifying table content of construction drawing and computer equipment
CN111753079A (en) * 2019-03-11 2020-10-09 阿里巴巴集团控股有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN112287236A (en) * 2020-11-19 2021-01-29 每日互动股份有限公司 Text message pushing method and device, computer equipment and storage medium
CN112651439A (en) * 2020-12-25 2021-04-13 平安科技(深圳)有限公司 Material classification method and device, computer equipment and storage medium
CN112987940A (en) * 2021-04-27 2021-06-18 广州智品网络科技有限公司 Input method and device based on sample probability quantization and electronic equipment
CN113139034A (en) * 2020-01-17 2021-07-20 深圳市优必选科技股份有限公司 Statement matching method, statement matching device and intelligent equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101122909A (en) * 2006-08-10 2008-02-13 株式会社日立制作所 Text message indexing unit and text message indexing method
JP2010021761A (en) * 2008-07-10 2010-01-28 Nippon Hoso Kyokai <Nhk> Video image automatic recording control device
CN103116637A (en) * 2013-02-08 2013-05-22 无锡南理工科技发展有限公司 Text sentiment classification method facing Chinese Web comments
CN104142998A (en) * 2014-08-01 2014-11-12 中国传媒大学 Text classification method
US20170104752A1 (en) * 2015-10-13 2017-04-13 Fujitsu Limited Method of processing a ciphertext, apparatus, and storage medium
CN107341233A (en) * 2017-07-03 2017-11-10 北京拉勾科技有限公司 A kind of position recommends method and computing device

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027994A (en) * 2018-10-09 2020-04-17 百度在线网络技术(北京)有限公司 Similar object determination method, device, equipment and medium
CN109492687A (en) * 2018-10-31 2019-03-19 北京字节跳动网络技术有限公司 Method and apparatus for handling information
CN111753079A (en) * 2019-03-11 2020-10-09 阿里巴巴集团控股有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN110988317A (en) * 2019-11-27 2020-04-10 兰州大学第一医院 Detection method and system for biological samples in refrigerating chamber
CN110988317B (en) * 2019-11-27 2021-04-20 兰州大学第一医院 Detection method and system for biological samples in refrigerating chamber
CN113139034A (en) * 2020-01-17 2021-07-20 深圳市优必选科技股份有限公司 Statement matching method, statement matching device and intelligent equipment
CN111310840A (en) * 2020-02-24 2020-06-19 北京百度网讯科技有限公司 Data fusion processing method, device, equipment and storage medium
CN111310840B (en) * 2020-02-24 2023-10-17 北京百度网讯科技有限公司 Data fusion processing method, device, equipment and storage medium
CN111753763A (en) * 2020-06-28 2020-10-09 广联达科技股份有限公司 Method and device for identifying table content of construction drawing and computer equipment
CN112287236A (en) * 2020-11-19 2021-01-29 每日互动股份有限公司 Text message pushing method and device, computer equipment and storage medium
CN112651439A (en) * 2020-12-25 2021-04-13 平安科技(深圳)有限公司 Material classification method and device, computer equipment and storage medium
CN112651439B (en) * 2020-12-25 2023-12-22 平安科技(深圳)有限公司 Material classification method, device, computer equipment and storage medium
CN112987940A (en) * 2021-04-27 2021-06-18 广州智品网络科技有限公司 Input method and device based on sample probability quantization and electronic equipment

Also Published As

Publication number Publication date
CN108304490B (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN108304490A (en) Text based similarity determines method, apparatus and computer equipment
CN112581191A (en) Training method and device of behavior prediction model
Bruni et al. Distributional semantics from text and images
CN104809243B (en) It is a kind of that method is recommended based on the mixing excavated to user behavior composite factor
Zhou et al. A classification-based approach to question routing in community question answering
DeCoste Collaborative prediction using ensembles of maximum margin matrix factorizations
CN103914468B (en) A kind of method and apparatus of impression information search
CN104111933B (en) Obtain business object label, set up the method and device of training pattern
US10348550B2 (en) Method and system for processing network media information
Chang et al. Searching persuasively: Joint event detection and evidence recounting with limited supervision
CN112434151A (en) Patent recommendation method and device, computer equipment and storage medium
CN106372249A (en) Click rate estimating method and device and electronic equipment
CN103116588A (en) Method and system for personalized recommendation
CN105095187A (en) Search intention identification method and device
CN106339383A (en) Method and system for sorting search
CN106105096A (en) System and method for continuous social communication
CN108427708A (en) Data processing method, device, storage medium and electronic device
CN107451148A (en) Video classification method and device and electronic equipment
CN102053971A (en) Recommending method and equipment for sequencing-oriented collaborative filtering
CN105022754A (en) Social network based object classification method and apparatus
CN112380453B (en) Article recommendation method and device, storage medium and equipment
CN111973996A (en) Game resource putting method and device
CN111612519B (en) Method, device and storage medium for identifying potential customers of financial products
CN102428467A (en) Similarity-Based Feature Set Supplementation For Classification
CN106919588A (en) A kind of application program search system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant