CN103970789A

CN103970789A - Algorithm for correlation between words

Info

Publication number: CN103970789A
Application number: CN201310040098.9A
Authority: CN
Inventors: 尹科
Original assignee: BEIJING INFCN INFORMATION TECHNOLOGY Co Ltd
Current assignee: BEIJING INFCN INFORMATION TECHNOLOGY Co Ltd
Priority date: 2013-02-01
Filing date: 2013-02-01
Publication date: 2014-08-06

Abstract

The invention relates to an algorithm for the degree of correlation between words. The algorithm comprises the following steps: generating, from the relevance values and correlation distance values of all document records in a database, a keyword net formed by the keywords of all the document records; and, after a user inputs a search term, searching the keyword net for the document records whose keywords contain the search term, and outputting the names of the document records containing those keywords in order of the degree of correlation between the search term and the keywords. Because the keyword net is built from the keywords of all document records and results are output in order of correlation, the efficiency of retrieving document records is greatly improved, the user is given document record names ranked by correlation, and the user's retrieval experience is improved.

Description

An algorithm for the degree of correlation between words

Technical field

The invention belongs to the technical field of information retrieval, and specifically relates to an algorithm for the degree of correlation between words.

Background technology

In data retrieval, a search is generally performed by entering a keyword in a search box. In journal article databases in particular, after the keyword is entered the system automatically outputs, according to its own rules, the journals or papers corresponding to the entered keyword for the user to choose from. This retrieval method has greatly improved retrieval efficiency, but it still cannot fully meet users' retrieval needs. It provides only the documents directly related to the entered keyword and cannot offer further documents that have some correlation with that keyword. When the user's keyword is inaccurate, the relevant document records often cannot be found, and the user has to revise the keyword and search repeatedly before the retrieval goal can be reached. If the user also needs peripheral related documents, different keywords must be chosen and searched separately; the corresponding resources cannot be provided at once, which is cumbersome for the user and lowers retrieval efficiency.

Summary of the invention

The object of the invention is to overcome the above deficiencies of the prior art by providing an algorithm for the degree of correlation between words.

The invention is realized as follows: an algorithm for the degree of correlation between words, comprising the following steps:

Generating, from the relevance values and correlation distance values of all document records in a database, a keyword net formed by the keywords of all the document records;

After a user inputs a search term, searching the keyword net for the document records whose keywords contain the search term, and outputting the names of the document records containing those keywords in order of the degree of correlation between the search term and the keywords.

The keyword net is generated by the following steps:

According to the formula: relevance value of document record A and document record B = (number of keywords shared by document record A and document record B)² ÷ (number of keywords of document record A × number of keywords of document record B), calculate the relevance value between every document record in the database and each of the other document records;

According to the formula: correlation distance between document record A and document record B = 1 − relevance value of document record A and document record B, calculate the correlation distance value between every document record in the database and each of the other document records;

From the correlation distance values between every document record in the database and the other document records, form a keyword net that takes the keywords of all the document records as nodes and comprises N layers of keyword nodes.

The correlation distance value between each document record and the other document records is the distance between the corresponding keyword nodes in the keyword net.
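The two formulas above can be restated compactly in symbols. The notation below is introduced here for illustration only and does not appear in the original: K_A and K_B are the keyword sets of document records A and B, R is the relevance value and D the correlation distance. Following the worked example in the embodiment (1 × 1 / (5 × 3)), the denominator is taken to be the product of the two records' keyword counts.

```latex
R(A,B) = \frac{\lvert K_A \cap K_B \rvert^{2}}{\lvert K_A \rvert \,\lvert K_B \rvert},
\qquad
D(A,B) = 1 - R(A,B)
```

Since the number of shared keywords cannot exceed the keyword count of either record, |K_A ∩ K_B|² ≤ |K_A| · |K_B|, so both R and D lie between 0 and 1, which agrees with the ranges stated in the embodiment.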

The step of searching the keyword net for the document records whose keywords contain the search term, and outputting the names of the document records containing those keywords in order of the degree of correlation between the search term and the keywords, is as follows:

Calculate the relevance value between the search term and each keyword that contains the search term;

Output the document records containing the keywords in order of the relevance values;

The computing formula is as follows:

Relevance value of search term and keyword = mean of the N-layer keyword node distances × square root of the keyword's number of occurrences.

The invention generates, from the relevance values and correlation distance values of all document records in a database, a keyword net formed by the keywords of all the document records. After the user inputs a search term, the document records whose keywords contain the search term are looked up in the keyword net, the degree of correlation between the search term and each keyword is quantified, and the names of the document records containing those keywords are output in order of that degree of correlation. This greatly improves the efficiency of retrieving document records, gives the user document record names ranked by correlation, is convenient to use, and improves the user's retrieval experience.

Brief description of the drawings

Fig. 1 is a flow chart of keyword-based document record retrieval provided by an embodiment of the invention;

Fig. 2 is a schematic diagram of the structure of the keyword net provided by an embodiment of the invention;

Fig. 3 is a radar chart of the degree of correlation between a search term and keywords provided by an embodiment of the invention;

Fig. 4 is a schematic diagram of a document record retrieval example provided by an embodiment of the invention.

Embodiment

Specific embodiments of the invention are described in detail below with reference to the drawings and examples.

As is well known, several keywords are chosen for every document in a database to indicate the information most closely tied to that document. The keywords within one document are themselves correlated to some degree; and when different documents are related in what they describe, their keywords are also correlated, and some keywords may even be identical. The invention exploits this property of the keywords of the document records in a database: by quantifying the relations between keywords it builds a keyword net, and fast retrieval is performed through this keyword net.

Fig. 1 shows the flow of the algorithm for the degree of correlation between words provided by an embodiment of the invention; for ease of explanation, only the parts relevant to the embodiment are shown.

Referring to Fig. 1, the algorithm for the degree of correlation between words described in the embodiment comprises the following steps:

S101: generating, from the relevance values and correlation distance values of all document records in a database, a keyword net formed by the keywords of all the document records;

S102: after a user inputs a search term, searching the keyword net for the document records whose keywords contain the search term, and outputting the names of the document records containing those keywords in order of the degree of correlation between the search term and the keywords.

In the embodiment of the invention, the keyword net is generated by the following steps:

In the embodiment of the invention, the correlation distance values between every document record in the database and the other document records are used to form a keyword net that takes the keywords of all the document records as nodes and comprises N layers of keyword nodes.

In the embodiment of the invention, the step of searching the keyword net for the document records whose keywords contain the search term, and outputting the names of the document records containing those keywords in order of the degree of correlation between the search term and the keywords, is as follows:

The computing formula is as follows:

The invention is described in detail below with reference to a specific embodiment.

Prepare keyword data for a number of documents, as follows, where each record holds the keywords of one document:

Suppose there are two document records, A and B. The degree of correlation between the two records is quantified with the following formula:

Relevance value of document record A and document record B = (number of keywords shared by document record A and document record B)² ÷ (number of keywords of document record A × number of keywords of document record B). With this formula, the relevance value between every document record in the database and each of the other document records is calculated.

The calculated relevance value of document record A and document record B lies between 0 and 1;

If the keywords of document record A and document record B match completely, the relevance value is 1; if none of the keywords match, the relevance value is 0.

From this formula, the correlation distance between documents A and B can be defined as: correlation distance between document record A and document record B = 1 − relevance value of document record A and document record B. With this formula, the correlation distance value between document record A and document record B is calculated; this value lies between 0 and 1. As can be seen, the greater the degree of correlation between two document records, the shorter the distance between them. If the keywords of document record A and document record B match completely, the distance is 0; if none of the keywords match, the distance is 1.

For example: [document record: 36/10000], keywords: Angiogenesis, microvessel density, vascular endothelial growth factor, lymphatic metastasis, immunohistochemistry;

[document record: 52/10000], keywords: neovascular glaucoma, vascular endothelial growth factor, diabetes iris.

As can be seen, document records 36 and 52 share only one keyword, "vascular endothelial growth factor". Using the formulas above, the relevance value of record 36 and record 52 = 1 × 1 / (5 × 3) = 0.066667, and the distance between record 36 and record 52 = 1 − 0.066667 = 0.933333.
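The calculation for records 36 and 52 can be reproduced with a short Python sketch. The function names `relevance` and `distance` are chosen here for illustration and are not part of the original; treating a record with no keywords as unrelated to everything is likewise an assumption, since the source does not cover that case.

```python
def relevance(keywords_a, keywords_b):
    """Relevance value: (shared keywords)^2 / (count of A * count of B)."""
    a, b = set(keywords_a), set(keywords_b)
    if not a or not b:
        # Assumption: a record without keywords is unrelated to any other record.
        return 0.0
    shared = len(a & b)
    return shared * shared / (len(a) * len(b))


def distance(keywords_a, keywords_b):
    """Correlation distance: 1 - relevance value."""
    return 1.0 - relevance(keywords_a, keywords_b)


# Keyword lists of records 36 and 52, copied from the example above.
record_36 = [
    "Angiogenesis",
    "microvessel density",
    "vascular endothelial growth factor",
    "lymphatic metastasis",
    "immunohistochemistry",
]
record_52 = [
    "neovascular glaucoma",
    "vascular endothelial growth factor",
    "diabetes iris",
]

print(round(relevance(record_36, record_52), 6))  # 0.066667
print(round(distance(record_36, record_52), 6))   # 0.933333
```

Using sets means a keyword repeated within one record is counted once, which is a design choice of this sketch rather than something the source specifies.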

After the correlation distances between the document records have been generated by the above algorithm, the keyword net is drawn, as shown in Fig. 2. In Fig. 2, each circle containing a three-digit number represents a keyword node, and the keyword nodes are interconnected to form the keyword net. Each circle is a node and represents one record; the number in the circle is the number of the record the node represents. Records are numbered starting from 1, so the numbers are not necessarily three digits. The net is generated from the distances between records: records at a distance of 1 are not connected, and records with 1 > distance > 0 are connected. During processing, the algorithm calculates the distance between each record and every other record, draws the connections, and finally forms the overall connected graph of all the records.
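The connection rule just described (connect two records when 1 > distance > 0, leave them unconnected when the distance is 1) can be sketched as follows. This is a minimal illustration, not the patent's implementation: `build_net`, the adjacency-dictionary layout and the three sample records are all hypothetical, and every record is assumed to have at least one keyword.

```python
from itertools import combinations


def distance(kw_a, kw_b):
    """Correlation distance between two non-empty keyword sets."""
    shared = len(kw_a & kw_b)
    return 1.0 - shared * shared / (len(kw_a) * len(kw_b))


def build_net(records):
    """Map each record id to {neighbour id: distance}.

    records: dict of record id -> non-empty set of keywords.
    """
    net = {record_id: {} for record_id in records}
    for a, b in combinations(records, 2):
        d = distance(records[a], records[b])
        # Rule from the text: connect only when 1 > distance > 0.
        # Records with identical keyword sets (distance 0) are not covered
        # by that rule and are left unconnected here, which is an assumption.
        if 0.0 < d < 1.0:
            net[a][b] = d
            net[b][a] = d
    return net


# Hypothetical sample data: records 1 and 2 share one keyword, record 3 shares none.
sample = {
    1: {"keyword a", "keyword b"},
    2: {"keyword b", "keyword c"},
    3: {"keyword d"},
}
print(build_net(sample))
```

Comparing every pair of records takes time quadratic in the number of records; for the millions of records mentioned later in the description, an index from keyword to records would probably be needed to avoid comparing pairs that share nothing.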

On the basis of the keyword net, once a search term is given, the keyword nodes containing the search term can be located in the keyword net and their keywords listed; the other keyword nodes directly associated with the located nodes are then found, which yields other words related to the given search term. By analogy, continuing along the directly associated keyword nodes to the subsequent, indirectly associated keyword nodes yields still more words related to the search term.
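Following direct associations first and indirect ones afterwards amounts to a breadth-first traversal of the net. The sketch below groups the nodes by layer around a starting node. The function name `layers` and the adjacency-list layout are assumptions; the node numbers are the ones used later in the description, but the individual edges leading to 747, 763 and 630 are invented for illustration, since the source does not say which layer-1 node each is attached to.

```python
def layers(net, start, max_layer):
    """Return [layer-1 nodes, layer-2 nodes, ...] around `start`.

    net: dict of node -> list of directly connected nodes.
    """
    seen = {start}
    frontier = [start]
    result = []
    for _ in range(max_layer):
        next_frontier = []
        for node in frontier:
            for neighbour in net[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        if not next_frontier:
            break  # no further layers are reachable
        result.append(next_frontier)
        frontier = next_frontier
    return result


# Hypothetical edges around node 629.
example_net = {
    629: [614, 734, 883, 915],
    614: [629, 747],
    734: [629, 763],
    883: [629, 630],
    915: [629],
    747: [614],
    763: [734],
    630: [883],
}
print(layers(example_net, 629, 2))  # [[614, 734, 883, 915], [747, 763, 630]]
```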

Then, by calculating the degree of correlation between the search term and the keywords that contain it, the degree of correlation between the search term and each keyword found is determined, and the document records containing those keywords are output in order of the degree of correlation for the user to choose from. The method of calculating the degree of correlation is described below.

Suppose the given search term is C1. To quantify the degree of correlation between the search term C1 and a keyword C2 in the keyword net, the distances between the nodes of the keyword C2 that contains the search term C1, together with the number of times the keyword C2 occurs, are used in the following formula to calculate the relevance value between the search term C1 and the keyword C2:

Relevance value of search term C1 and keyword C2 = mean of the N-layer node distances × square root of the number of occurrences of keyword C2. The N-layer nodes are all the nodes in the Nth layer counted from the central point: each direct connection counts as one layer, and a connection N levels away is an N-layer connection. For example, for the node numbered 629, its layer-1 nodes are those directly connected to it, namely 614, 734, 883 and 915; its layer-2 nodes are those directly connected to the layer-1 nodes, such as 747, 763 and 630. The distance between any two nodes can be calculated as described earlier, so the mean of the N-layer node distances is the mean distance from the central node to all the nodes in layer N; for instance, the mean distance from the four nodes 614, 734, 883 and 915 to the central node 629 is the mean of the layer-1 node distances. The number of occurrences of C2 is the total number of times the keyword C2 appears in the nodes 614, 734, 883 and 915; this is the number of times the keyword C2 occurs in the layer-1 nodes.
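A minimal sketch of this layer-level calculation is given below. It reads the formula as (mean distance) × √(occurrence count), which is an interpretation of the translated text, and it counts an occurrence once for each layer node whose keyword set contains the keyword. The function name `layer_relevance`, the distance values and the keyword sets are all hypothetical.

```python
import math


def layer_relevance(centre_distances, layer_keywords, keyword):
    """Relevance value of a keyword within one layer.

    centre_distances: node -> distance from the central node, for the nodes of layer N.
    layer_keywords:   node -> set of keywords, for the same nodes.
    """
    mean_distance = sum(centre_distances.values()) / len(centre_distances)
    occurrences = sum(
        1 for node in centre_distances if keyword in layer_keywords[node]
    )
    # Assumed reading of the formula: mean distance * sqrt(occurrence count).
    return mean_distance * math.sqrt(occurrences)


# Hypothetical distances from central node 629 to its layer-1 nodes.
distances_to_629 = {614: 0.75, 734: 0.5, 883: 0.75, 915: 0.5}
# Hypothetical keyword sets of the same nodes.
keywords_of = {
    614: {"C2", "other keyword"},
    734: {"C2"},
    883: {"C2"},
    915: {"C2"},
}
print(layer_relevance(distances_to_629, keywords_of, "C2"))  # 1.25
```

Here the mean distance is 0.625 and C2 occurs in four nodes, so the value is 0.625 × 2 = 1.25. Unlike the record-to-record relevance value, nothing in the formula keeps this quantity within 0 to 1.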

In practice, keywords in the same layer have different relevance values. If the keyword C2 appears in several different layers of the keyword net, the largest of its relevance values is taken as its degree of correlation with the search term.
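Taking the maximum over layers and then ordering the keywords by the result could look like the sketch below. The function name `rank_keywords` and the per-layer scores are hypothetical; the scores are assumed to have been computed already with the relevance formula given earlier.

```python
def rank_keywords(scores_by_layer):
    """Rank keywords by their largest per-layer relevance value.

    scores_by_layer: keyword -> {layer number: relevance value in that layer}.
    Returns (keyword, best value) pairs, highest value first.
    """
    best = {
        keyword: max(per_layer.values())
        for keyword, per_layer in scores_by_layer.items()
    }
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-layer relevance values for three keywords.
scores = {
    "C2": {1: 1.25, 3: 0.9},
    "C3": {2: 1.4},
    "C4": {1: 0.6, 2: 0.7},
}
print(rank_keywords(scores))  # [('C3', 1.4), ('C2', 1.25), ('C4', 0.7)]
```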

With the method of the embodiment, when a user enters a search term, then in addition to the search results the user also learns the other keywords related to the search term and can use them to find the corresponding document records. As shown in Fig. 3, for the given search term "hamburger", calculation yields the related keywords: "snack food" at the nearest distance; "KFC" (Kentucky Fried Chicken) and "McDonald's" at the second-nearest distance; "loanword", "French fries" and "Sudan Red" at the third-nearest distance; and, further away, "food safety". From "food safety", the words related to it, such as "Food Safety Law" and "Sanlu", can be obtained in turn, finally forming a six-layer radar chart, that is, N = 6.

As another example, a retrieval platform for journal article literature data contains the keyword net of the embodiment, generated from the keywords of the millions of journal article records in the database. When the user enters a search term to retrieve journal article data from the database, the embodiment searches the keyword net for the keywords that contain the search term, and outputs and displays the journal articles found through those keywords. As shown in Fig. 4A, the user selects the database (the periodical database) and enters the search term "education" in the search bar; the embodiment then uses the keyword net to find the journal articles corresponding to the keywords containing "education", as shown in Fig. 4B, including journals or papers related to "IT education", "education ideas", "education education ideas" and so on. This greatly facilitates the user's searching and improves retrieval efficiency.
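The lookup step in this example, finding the keywords that contain the search term and the records they belong to, can be sketched as a simple substring match. The function name `find_records`, the record titles and the keyword lists are hypothetical; a production system over millions of records would presumably use an index rather than a linear scan.

```python
def find_records(records, term):
    """Map each keyword containing `term` to the titles of the records that carry it.

    records: dict of record title -> list of keywords.
    """
    hits = {}
    for title, keywords in records.items():
        for keyword in keywords:
            if term in keyword:
                hits.setdefault(keyword, []).append(title)
    return hits


# Hypothetical journal article records.
articles = {
    "Paper 1": ["IT education", "curriculum"],
    "Paper 2": ["education ideas"],
    "Paper 3": ["hamburger"],
}
print(find_records(articles, "education"))
# {'IT education': ['Paper 1'], 'education ideas': ['Paper 2']}
```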

The above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art may make improvements and refinements without departing from the principle of the invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the invention.

Claims

1. An algorithm for the degree of correlation between words, characterized in that it comprises the following steps:

2. The algorithm for the degree of correlation between words according to claim 1, characterized in that the keyword net is generated by the following steps:

3. The algorithm for the degree of correlation between words according to claim 2, characterized in that the correlation distance value between each document record and the other document records is the distance between the corresponding keyword nodes in the keyword net.

4. The algorithm for the degree of correlation between words according to claim 3, characterized in that the step of searching the keyword net for the document records whose keywords contain the search term, and outputting the names of the document records containing those keywords in order of the degree of correlation between the search term and the keywords, is as follows:

The computing formula is as follows: