CN106294295A

CN106294295A - Article similarity recognition method based on word frequency

Info

Publication number: CN106294295A
Application number: CN201610653494.2A
Authority: CN
Inventors: 张俤
Original assignee: Chengdu Light Horse Network Technology Co Ltd
Current assignee: Chengdu Light Horse Network Technology Co Ltd
Priority date: 2016-08-10
Filing date: 2016-08-10
Publication date: 2017-01-04
Anticipated expiration: 2036-08-10
Also published as: CN106294295B

Abstract

The invention provides an article similarity recognition method based on word frequency. The method comprises: performing dimensionality reduction and mapping on web page feature vectors to obtain a similarity represented by hash values; calculating the difference degree of each term based on its word frequency in the web pages; obtaining web page weights from the difference degree; and recommending similar web pages according to the product of the candidate recommendation similarity and the web page weights. For large-scale datasets, the method checks similar data quickly and efficiently, rapidly mines valuable information, and improves the user experience of search engines.

Description

Article similarity recognition method based on word frequency

Technical field

The present invention relates to natural language processing, and in particular to an article similarity recognition method based on word frequency.

Background art

With the rapid development of Internet technology and related industries, data is growing at an unprecedented scale. This growth brings momentum but also challenges. Finding valuable resources in massive Internet data and recommending similar content according to a user's search is an important task of big-data text processing. For near-duplicate detection of web pages, the space complexity and time complexity of the algorithm must be reduced as far as possible to meet user demand. Existing recommendation methods based on text similarity have the following shortcomings: when the data scale is very large, generating and computing web page feature values takes a long time; in specialized fields, term weights rely too heavily on a base corpus; and short-text similarity is poorly discriminated.

Summary of the invention

To solve the above problems of the prior art, the present invention proposes an article similarity recognition method based on word frequency, comprising:

performing dimensionality reduction and mapping on the feature vectors of web pages X and Y to obtain a similarity sim(X, Y) represented by hash values;

calculating the difference degree WD(w) of each term w based on the word frequency of w in the web pages;

obtaining the web page weights of web pages X and Y from the difference degree WD(w);

recommending similar web pages according to the product of the candidate recommendation similarity sim(X, Y) and the web page weights of web pages X and Y.

Preferably, performing dimensionality reduction and mapping on the feature vectors of web pages X and Y to obtain the similarity sim(X, Y) represented by hash values further comprises:

obtaining and calculating whole-sentence feature values, taking the whole sentence as the unit within a web page, and then calculating similarity using edit distance; mapping a multidimensional feature vector into a reduced-dimension vector space and generating an x-dimensional feature value from the reduced vector, where x > 1 and each dimension takes the value 1 or -1; weighting each feature item across the x dimensions; finally mapping the weight of each dimension of the x-dimensional vector to 0 or 1 according to a predefined rule, and concatenating these binary digits to obtain the x-bit hash value of the web page vector.

Preferably, the difference degree WD(w) of a term w, calculated from the word frequency of each term w in the web pages, is expressed as:

WD(w) = \left| 1 - \sum_{p \in P} \left( \frac{FP(p, w)}{\sum_{w \in T} FP(p, w)} \right)^{2} \right|^{2}

where P is the set of all web pages crawled during collection, T is the set of all terms, and FP(p, w) is the word frequency of term w in web page p;

Obtaining the web page weights of web pages X and Y from the difference degree is specifically:

IM(p) = \sum_{w \in T} \sum_{p \in P} \left( \frac{FP(p, w)}{\sum_{w \in T} FP(p, w)} \right) \times WD(w).

Preferably, recommending similar web pages according to the product of the candidate recommendation similarity sim(X, Y) and the web page weights of web pages X and Y is expressed as:

Based on the web page similarity sim(X, Y), calculate the candidate recommendation similarity with web page weights, sim(X, Y) × IM(X) × IM(Y), and retain for recommendation the web page results whose final similarity exceeds a predetermined threshold Φ and whose number of visits exceeds a threshold α.

Compared with the prior art, the present invention has the following advantages:

The present invention proposes an article similarity recognition method based on word frequency which, for large-scale datasets, checks similar data quickly and efficiently, rapidly mines valuable information, and improves the user experience of search engines.

Brief description of the drawings

Fig. 1 is a flowchart of the article similarity recognition method based on word frequency according to an embodiment of the present invention.

Detailed description of the invention

A detailed description of one or more embodiments of the present invention is provided below, together with the accompanying drawing illustrating the principle of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses many alternatives, modifications and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for exemplary purposes, and the invention may be practiced according to the claims without some or all of them.

One aspect of the present invention provides an article similarity recognition method based on word frequency. Fig. 1 is a flowchart of the method according to an embodiment of the present invention.

For near-duplicate detection of web pages, the present invention iterates over the terms in the user's search text. Starting from a predefined set of clusters, with the texts in each cluster and the word frequency of each term in the cluster as initial conditions, the search text is segmented into words and indexed. Then, for the texts of each cluster in the training set, the number of texts in which a feature word's frequency exceeds a threshold is counted. The feature value of each term in each cluster is calculated and stored in the web page feature set, which completes the extraction of text features. After the feature values of a web page are obtained, they are sorted as keys and an index is built. The existing web page library is queried with the whole-sentence feature values of the page to be analyzed to retrieve candidate pages. Finally, similarity is computed between the candidate pages and the page to be analyzed, and the result determines whether the page to be analyzed is recommended to the user.

The present invention first defines a feature extraction strategy based on the crawled web page data source, covering page structure, position information, extraction flow, rule fallback, output results and so on. Page preprocessing is then performed to determine which page content to obtain and to discard term attributes unrelated to the extracted information. The required data items are obtained according to the extraction strategy and saved in an XML document. Feature vectors are obtained from the XML document by feature extraction and then clustered. The clustered documents are stored by cluster in the corresponding database.

The feature extraction process further comprises:

A predefined set of clusters {c_1, c_2, …, c_m} is given, where each cluster c_j contains texts (d_{j1}, d_{j2}, …, d_{jn}) and each text d_j contains terms (t_1, t_2, …, t_k). MM is the threshold word frequency with which a term t_k occurs in cluster c_j, and NM is the number of feature words to be selected.

(1) Segment the text into words and build an index over the text collection; initialize the feature set S as empty;

(2) Iterate over the terms in the index file;

(3) Calculate DF(t_k, c_i), the number of texts in each cluster of the training set in which term t_k occurs at least MM times;

(4) Calculate the feature frequency FF and the average word frequency AN of t_k relative to each cluster:

FF = \frac{\sum_{k=1}^{n} tf_{ik}}{\sum_{i=1}^{m} \sum_{k=1}^{n} tf_{ik}}

where tf_{ik} is the word frequency with which feature t occurs in text d_{ik};

AN = \frac{\sum_{k=1}^{n} tf_{ik}}{n}

(5) Calculate the feature weight MI(t_k, c_i) of t_k in each cluster:

MI(t_k, c_i) = FF × AN × log(P_m(t_k, c_i) / (P(c_i) P_m(t_k)))

where P_m(t_k, c_i) = DF(t_k, c_i) / DF(t_k)

P(c_i) = n / N

P_m(t_k) = DF(t_k) / N

where DF(t_k) is the number of texts in the whole training set in which the word frequency of feature t_k reaches the minimum value, and N is the total number of texts in the training set.

(6) Select the text feature word with the largest MI value and add it to set S as the first feature word; then select the next text feature word on the principle of minimal interdependence with the terms already in S;

(7) Repeat step (6) until the number of feature words reaches the threshold NM.
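
Steps (3) to (5) above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `mi_weights`, the input layout (a dict mapping each cluster to a list of already-segmented texts) and the choice to return a weight of 0 when no text in a cluster reaches the threshold are all assumptions made for illustration. The indices in the FF formula are ambiguous in the source, so the sketch reads FF as the share of the term's total frequency that falls in cluster c_i.

```python
import math
from collections import Counter


def mi_weights(clusters, term, mm):
    """Return {cluster_id: MI(term, cluster)} for one term.

    clusters: {cluster_id: [list_of_tokens, ...]} (hypothetical layout)
    mm: threshold word frequency MM from the text
    """
    # Word frequency of the term in every text of every cluster.
    counts = {c: [Counter(doc)[term] for doc in docs]
              for c, docs in clusters.items()}
    total_docs = sum(len(tfs) for tfs in counts.values())   # N
    total_tf = sum(sum(tfs) for tfs in counts.values())     # denominator of FF
    # DF(t_k): texts in the whole training set reaching the threshold.
    df_total = sum(1 for tfs in counts.values() for tf in tfs if tf >= mm)

    weights = {}
    for c, tfs in counts.items():
        n = len(tfs)                                        # texts in cluster
        df_c = sum(1 for tf in tfs if tf >= mm)             # DF(t_k, c_i)
        if df_c == 0 or total_tf == 0:
            # Assumption: the source does not say what to do here.
            weights[c] = 0.0
            continue
        ff = sum(tfs) / total_tf                            # FF
        an = sum(tfs) / n                                   # AN
        p_joint = df_c / df_total                           # P_m(t_k, c_i)
        p_c = n / total_docs                                # P(c_i)
        p_t = df_total / total_docs                         # P_m(t_k)
        weights[c] = ff * an * math.log(p_joint / (p_c * p_t))
    return weights
```

A term concentrated in one cluster receives a larger weight there than in clusters where it is rare, which is what the greedy selection in steps (6) and (7) relies on.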

Alternatively, for web pages that have an abstract, feature extraction uses the following method, which has higher accuracy. The specific steps are:

(1) Filter out the information at the head and tail of the web page text that is unrelated to feature extraction, obtaining the denoised web page text;

(2) Obtain the Chinese word segmentation results of the abstract and of the body content respectively;

(3) Classify the Chinese word segmentation results of the abstract and the body content by part of speech; after classification, perform predicate extraction and content-word recognition on the part-of-speech classification results of the body content and the abstract;

(4) According to a preset merging rule set, merge the part-of-speech classification result of the web page text after predicate extraction with the content-word recognition result of the web page text, obtaining the merged result of the original text; merge the part-of-speech classification result of the abstract after predicate extraction with the content-word recognition result of the abstract, obtaining the merged result of the abstract;

(5) Perform unit merging on the merged result of the web page text and the merged result of the abstract, obtaining the information-unit merging result of the web page text and the unit merging result of the abstract;

(6) Cluster the unit merging result of the web page text, and obtain the feature extraction result of the clustered web page text according to a feature rule set. The feature rule set consists of a weight allocation strategy, sentence segmentation rules for the unit merging result of the web page text, atomic-sentence segmentation rules, voice extraction rules and tone recognition rules.

The clustering process further comprises:

(6.1) Perform dimensionality reduction on the input web page text content to obtain pairs of each feature word in the web page text and its word frequency, denoted <word, value>;

(6.2) Sort the pairs in lexicographic order and build an index according to the sorted order;

(6.3) Establish a correspondence between the index and the feature words, that is, convert each pair <word, value> of a feature word and its frequency into a correspondence between an index and its word frequency, denoted as the vector <index, value>;

(6.4) Define the round counter t and the maximum number of rounds t_max, and initialize t = 0. In round t, obtain n index vectors from the index vector set <index, value>, denoted N^(t) = {N_1^(t), N_2^(t), …, N_n^(t)}, where N_i^(t) is the i-th index vector <index_i^(t), value_i^(t)> of round t. Calculate the regularized similarity between the i-th index vector N_i^(t) and the j-th index vector N_j^(t) of round t: Nsim(i, j) = N_j^(t) · N_i^(t);

(6.5) Denote the weights of the n index vectors N^(t) of round t as WEN^(t) = {WEN_1^(t), WEN_2^(t), …, WEN_n^(t)}, where WEN_i^(t) is the weight of the i-th index vector N_i^(t) of round t; initialize WEN_i^(t) = 1. Calculate the similarity distance matrix S^(t)(i, j) between the i-th index vector N_i^(t) and the j-th index vector N_j^(t) of round t:

S^(t)(i, j) = (1 + WEN_i^(t) / WEN_j^(t)) / Nsim(i, j)

(6.6) Pass S^(t)(i, j) of round t to the Affinity Propagation algorithm and cluster the n index vectors N^(t) of round t, obtaining the m_t preliminary cluster centers of round t, denoted C^(t) = {C_1^(t), C_2^(t), …, C_{m_t}^(t)}. Increase t by 1 and check whether t = t_max holds; if so, perform step 2.11; otherwise obtain the n index vectors N^(t) = {N_1^(t), N_2^(t), …, N_n^(t)} of round t from the index vector set <index, value>;

(6.7) Append the m_{t-1} cluster centers C^(t-1) of round t-1 to the n index vectors N^(t) of round t, thereby obtaining n + m_{t-1} index vectors; assign the updated n + m_{t-1} index vectors N^(t)' to the index vectors N^(t) of round t, and return to step 6.5 to continue in sequence, thereby obtaining the m_t final cluster centers C^(t) of round t;

(6.8) Obtain the cluster centers of each round, completing the clustering.
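
The matrix computed in steps (6.4) and (6.5) can be sketched as follows. This is only an illustration under stated assumptions: the helper names `normalize`, `nsim` and `distance_matrix` are hypothetical, index vectors are represented as sparse dicts mapping index to value, "regularized similarity" is read as the dot product of unit-length vectors, and pairs with zero similarity are given an infinite distance because the source does not say how to handle division by zero. The Affinity Propagation step itself is not shown.

```python
import math


def normalize(vec):
    """Scale a sparse vector {index: value} to unit length (assumption)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else dict(vec)


def nsim(a, b):
    """Nsim(i, j): dot product of two sparse vectors."""
    return sum(v * b[k] for k, v in a.items() if k in b)


def distance_matrix(vectors, weights=None):
    """S(i, j) = (1 + WEN_i / WEN_j) / Nsim(i, j), with WEN_i = 1 by default."""
    n = len(vectors)
    weights = weights or [1.0] * n
    unit = [normalize(v) for v in vectors]
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sim = nsim(unit[i], unit[j])
            # Assumption: vectors sharing no index are infinitely far apart.
            s[i][j] = (1 + weights[i] / weights[j]) / sim if sim > 0 else math.inf
    return s
```

With all weights equal to 1, two vectors pointing in the same direction get the smallest possible value, 2, and the value grows as the vectors diverge.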

After the feature values are obtained, the similarity calculation of the present invention obtains and calculates whole-sentence feature values, taking the whole sentence as the unit, and then calculates similarity using edit distance. A multidimensional feature vector is mapped into a reduced-dimension vector space, and an x-dimensional feature value (x > 1) is generated from the reduced vector, each dimension taking the value 1 or -1. Each feature item is weighted across the x dimensions. Finally, the weight of each dimension of the x-dimensional vector is mapped to 0 or 1 according to a predefined rule, and these binary digits are concatenated to obtain the x-bit hash value of the web page vector. Similarity detection then proceeds as follows:

Step 1: Initialize an x-dimensional vector v to 0 and an x-bit binary number fbin to 0.

Step 2: For each sentence s_i in the whole-sentence set SP, use the SHA1 hash algorithm to obtain an x-bit hash value.

Step 3: Define the function g(h_j(s_i)):

g(h_j(s_i)) = \begin{cases} 1 & h_j(s_i) = 1 \\ -1 & h_j(s_i) = 0 \end{cases}

where h_j(s_i) is the binary value of the j-th bit of the hash of s_i. Let v_j denote the j-th dimension of vector v; for j from 1 to x, calculate the weight v_j:

v_j = v_j + W(s_i) × g(h_j(s_i))

where W(s_i) is the weight of sentence s_i.

Step 4: If unprocessed sentences remain in the set SP, jump to Step 2 and iterate; otherwise go to Step 5.

Step 5: Let fbin_j denote the j-th bit of fbin; for j from 1 to x, if v_j > 0 then fbin_j = 1; if v_j ≤ 0 then fbin_j = 0.

Step 6: Take the resulting binary sequence fbin as the feature value of the current whole sentence. Then, for given web pages X and Y, combine the feature values of their whole sentences into the whole-sentence feature value sets S_X and S_Y respectively. Let |S_X| and |S_Y| denote the number of elements in each set and |S_X ∩ S_Y| the number of approximate sentences in the two sets, and calculate the similarity of web pages X and Y:

sim(X, Y) = |S_X ∩ S_Y| / (|S_X| + |S_Y| - |S_X ∩ S_Y|)

The criterion for approximate sentences is as follows: if the similarity measure computed from the respective feature values of two whole sentences a and b is higher than a predefined threshold η, the two whole sentences are judged to be approximate sentences.

Step 7: If sim(X, Y) > λ (a preset similarity threshold), web pages X and Y are judged to be similar; otherwise they are judged dissimilar.
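
Steps 1 to 7 above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation. Several points are assumptions: SHA1 produces 160 bits, so the sketch keeps only the leading `bits` bits to obtain an x-bit hash; the items fed into the accumulation are passed as (string, weight) pairs; and, because the exact criterion for approximate sentences is not recoverable from the text, the hypothetical helper `bit_agreement` uses the fraction of matching fingerprint bits compared against η. The function names are likewise hypothetical.

```python
import hashlib


def sentence_fingerprint(weighted_items, bits=64):
    """Steps 1-5: accumulate signed weights per bit, then binarize."""
    v = [0.0] * bits                                   # Step 1
    for text, weight in weighted_items:
        digest = hashlib.sha1(text.encode("utf-8")).digest()
        # Assumption: truncate the 160-bit SHA1 digest to `bits` bits.
        h = int.from_bytes(digest, "big") >> (160 - bits)
        for j in range(bits):                          # Step 3
            v[j] += weight if (h >> j) & 1 else -weight
    fbin = 0
    for j in range(bits):                              # Step 5
        if v[j] > 0:
            fbin |= 1 << j
    return fbin


def bit_agreement(a, b, bits=64):
    """Fraction of equal bits between two fingerprints (assumed criterion)."""
    return 1 - bin(a ^ b).count("1") / bits


def page_similarity(sx, sy, eta=0.9, bits=64):
    """Step 6: Jaccard-style similarity over sets of sentence fingerprints."""
    common = sum(1 for a in sx
                 if any(bit_agreement(a, b, bits) > eta for b in sy))
    common = min(common, len(sx), len(sy))             # keep the ratio <= 1
    union = len(sx) + len(sy) - common
    return common / union if union else 0.0


def is_similar(sx, sy, lam=0.5, eta=0.9):
    """Step 7: compare sim(X, Y) with the threshold lambda."""
    return page_similarity(sx, sy, eta) > lam
```

Because near-identical sentences differ in only a few fingerprint bits, comparing fingerprints avoids comparing raw sentence text pairwise, which is the point of the dimensionality reduction described above.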

In the search-engine web page recommendation process, the present invention uses different methods to recommend web pages with different numbers of visits.

For web pages whose number of visits exceeds the predetermined threshold α, recommendation to the user is completed by the following method. The specific steps are:

1.1 For each user u in the user set U, find the similar users u′, where users who have browsed the same web page are regarded as similar users. For the terms t browsed by each similar user u′, assign weights according to the sequence number of each term. For each term, calculate the total weight:

Wgh(t_i) = θ × Fr(t_i) + ζ × Se(t_i);

where Fr(t_i) is the number of times all users used the term to browse web pages, Se(t_i) is the browsing order of the term, and θ and ζ are adjustment coefficients satisfying θ + ζ = 1;

1.2 Sort the terms by total weight in descending order and merge synonymous terms; finally, recommend to user u the web pages corresponding to a predetermined number of terms with the largest weights.
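
The total weight in step 1.1 and the ranking in step 1.2 can be sketched as below. The function names, the dict-based inputs and the default θ = 0.6 are assumptions for illustration; synonym merging is not shown because the text does not specify how synonyms are identified.

```python
def total_weights(frequency, order_score, theta=0.6):
    """Wgh(t) = theta * Fr(t) + zeta * Se(t), with theta + zeta = 1.

    frequency: {term: Fr(term)}, order_score: {term: Se(term)}
    """
    zeta = 1 - theta
    return {t: theta * frequency[t] + zeta * order_score.get(t, 0.0)
            for t in frequency}


def top_terms(weights, k):
    """Step 1.2: terms sorted by total weight, descending; keep the first k."""
    return sorted(weights, key=weights.get, reverse=True)[:k]
```

For example, `top_terms(total_weights({"a": 10, "b": 2}, {"a": 1, "b": 3}, theta=0.5), 1)` selects the term `"a"`, whose pages would then be recommended.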

For web pages whose number of visits is below the predetermined threshold α, find the web pages that are most similar to the current page and have the most visits, and recommend to the user the terms with the larger total weights in the pages so found. The specific steps are:

2.1 Evaluate the difference degree of a term w by the following method:

WD(w) = \left| 1 - \sum_{p \in P} \left( \frac{FP(p, w)}{\sum_{w \in T} FP(p, w)} \right)^{2} \right|^{2}

where P is the set of all web pages crawled during collection, T is the set of all terms, and FP(p, w) is the word frequency of term w in web page p.

2.2 A web page containing terms with a higher difference degree has a higher web page weight. Calculate the web page weight as follows:

IM(p) = \sum_{w \in T} \sum_{p \in P} \left( \frac{FP(p, w)}{\sum_{w \in T} FP(p, w)} \right) \times WD(w)

Then, based on the aforementioned web page similarity sim(X, Y), calculate the candidate recommendation similarity with web page weights, sim(X, Y) × IM(X) × IM(Y), and retain for recommendation the web page results whose final similarity exceeds the predetermined threshold Φ and whose number of visits exceeds the threshold α.
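
The difference degree, the web page weight and the final score can be sketched as follows. The names and the nested-dict layout for FP are hypothetical. One interpretive assumption is made and should be read as such: the IM(p) formula as printed sums over all pages in P, which would give every page the same weight, so the sketch evaluates the inner term only for the page p whose weight is being computed.

```python
def difference_degree(fp, term):
    """WD(w) = |1 - sum_p (FP(p, w) / sum_w FP(p, w))^2|^2.

    fp: {page: {term: count}} (hypothetical layout for FP)
    """
    total = 0.0
    for counts in fp.values():
        page_total = sum(counts.values())
        if page_total:
            total += (counts.get(term, 0) / page_total) ** 2
    return abs(1 - total) ** 2


def page_weight(fp, page, wd):
    """IM(p), restricted to the terms of page p (assumption, see lead-in)."""
    counts = fp[page]
    page_total = sum(counts.values())
    if not page_total:
        return 0.0
    return sum(c / page_total * wd[t] for t, c in counts.items())


def candidate_score(sim_xy, im_x, im_y):
    """Candidate recommendation similarity sim(X, Y) * IM(X) * IM(Y)."""
    return sim_xy * im_x * im_y
```

A pair of pages would then be kept for recommendation only if `candidate_score(...)` exceeds Φ and the visit count exceeds α, as stated above.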

Further optionally, for the above web page weights, a term semantic similarity quadtree may be used, and the result is then combined with the original similarity sim(X, Y) in a weighted sum. The term semantic similarity quadtree contains leaf nodes and non-leaf nodes. In a leaf node, all terms whose similarity exceeds the threshold Φ are arranged in descending order and stored in sequence in the leaf node, while the term count information is stored in the non-leaf nodes. When calculating the semantic similarity between text feature word vectors, if the features w_{ik} and w_{jl} of some dimension of the feature word vectors v_i and v_j satisfy condition 1 or 2 below, the similarity result of the text feature word vectors v_i and v_j is weighted.

Condition 1: If w_{jl} belongs to the descending term queue of some leaf node in the quadtree and w_{ik} does not belong to that queue, then, according to the similarity between w_{ik} and the other terms in that queue, determine the ordinal position of w_{ik} in the descending term queue containing w_{jl}.

Condition 2: If neither w_{ik} nor w_{jl} belongs to the descending term queue of any leaf node in the quadtree, and the similarity values of w_{ik} and w_{jl} to both the most similar and the least similar text feature word in the descending term queue of a leaf node are all less than a threshold Φ, then create a new branch and insert w_{ik} and w_{jl} into the text feature word queue of the leaf node of this branch.

After the term semantic similarity quadtree has been built, start from each term in v_i, find the term w_{jl} in v_j that is most similar to it, and record the similarity between the two terms. Repeat this search for the other terms in v_i until every term in v_i has found its most similar term in v_j. Sum the similarities between the terms so obtained and divide by the number of terms in v_i to obtain the similarity sim(v_i, v_j) of v_i and v_j. Then calculate the mean of sim(v_i, v_j) and sim(v_j, v_i) as the semantic similarity of vectors v_i and v_j. Weight this semantic similarity to finally obtain the weighted semantic similarity.
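
The best-match averaging described in the last paragraph can be sketched as below. The sketch deliberately leaves out the quadtree and scans all term pairs directly; the term-level similarity function `term_sim` is a caller-supplied assumption, since the text does not define how two terms are compared, and the function names are hypothetical.

```python
def directed_similarity(vi, vj, term_sim):
    """sim(v_i, v_j): mean, over terms of v_i, of the best match in v_j."""
    if not vi or not vj:
        return 0.0
    return sum(max(term_sim(a, b) for b in vj) for a in vi) / len(vi)


def semantic_similarity(vi, vj, term_sim):
    """Mean of sim(v_i, v_j) and sim(v_j, v_i), as described above."""
    return (directed_similarity(vi, vj, term_sim)
            + directed_similarity(vj, vi, term_sim)) / 2
```

Averaging the two directions makes the measure symmetric, since sim(v_i, v_j) alone depends on which vector is scanned first. The quadtree in the text serves to avoid the full pairwise scan used here.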

In summary, the present invention proposes an article similarity recognition method based on word frequency which, for large-scale datasets, checks similar data quickly and efficiently, rapidly mines valuable information, and improves the user experience of search engines.

Obviously, those skilled in the art should understand that each module or step of the present invention described above may be implemented by a general-purpose computing system. They may be concentrated in a single computing system or distributed over a network formed by multiple computing systems. Optionally, they may be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The present invention is therefore not limited to any specific combination of hardware and software.

It should be understood that the above specific embodiments of the present invention serve only to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent replacement or improvement made without departing from the spirit and scope of the present invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or the equivalents of such scope and boundaries.

Claims

1. An article similarity recognition method based on word frequency, characterized by comprising:

recommending similar web pages according to the product of the candidate recommendation similarity sim(X, Y) and the web page weights of web pages X and Y.

2. The method according to claim 1, characterized in that performing dimensionality reduction and mapping on the feature vectors of web pages X and Y to obtain the similarity sim(X, Y) represented by hash values further comprises:

obtaining and calculating whole-sentence feature values, taking the whole sentence as the unit within a web page, and then calculating similarity using edit distance; mapping a multidimensional feature vector into a reduced-dimension vector space and generating an x-dimensional feature value from the reduced vector, where x > 1 and each dimension takes the value 1 or -1; weighting each feature item across the x dimensions; finally mapping the weight of each dimension of the x-dimensional vector to 0 or 1 according to a predefined rule, and concatenating these binary digits to obtain the x-bit hash value of the web page vector.

3. The method according to claim 2, characterized in that calculating the difference degree WD(w) of a term w based on the word frequency of each term w in the web pages is expressed as:

WD(w) = \left| 1 - \sum_{p \in P} \left( \frac{FP(p, w)}{\sum_{w \in T} FP(p, w)} \right)^{2} \right|^{2}

where P is the set of all web pages crawled during collection, T is the set of all terms, and FP(p, w) is the word frequency of term w in web page p;

IM(p) = \sum_{w \in T} \sum_{p \in P} \left( \frac{FP(p, w)}{\sum_{w \in T} FP(p, w)} \right) \times WD(w).

4. The method according to claim 3, characterized in that recommending similar web pages according to the product of the candidate recommendation similarity sim(X, Y) and the web page weights of web pages X and Y is expressed as:

based on the web page similarity sim(X, Y), calculating the candidate recommendation similarity with web page weights, sim(X, Y) × IM(X) × IM(Y), and retaining for recommendation the web page results whose final similarity exceeds a predetermined threshold Φ and whose number of visits exceeds a threshold α.