CN103970789A - Algorithm for correlation between words - Google Patents
- Publication number
- CN103970789A (application CN201310040098.9A)
- Authority
- CN
- China
- Prior art keywords
- keyword
- document
- records
- record
- correlation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to an algorithm for the correlation between words. The algorithm comprises the following steps: generating, from the relevance values and correlation distances of all document records in a database, a keyword net formed by the keywords of all the document records; and, after a user inputs a search term, searching the keyword net for the document records whose keywords contain the search term and outputting the names of those document records in order of the correlation between the search term and the keywords. Because document records are located through the keyword net and their names are presented in order of correlation, the efficiency of searching document records is improved, the results are easier to use, and the user's search experience is improved.
Description
Technical field
The invention belongs to the technical field of information retrieval, and specifically relates to an algorithm for the degree of correlation between words.
Background art
Data retrieval is usually performed by typing a keyword into a search box. In journal and paper databases in particular, once a keyword is entered, the system automatically outputs, according to its own rules, the journals or papers that correspond to that keyword for the user to choose from. This method has greatly improved retrieval efficiency, but it still does not fully meet users' needs. It returns only the documents that match the entered keyword and cannot offer further documents that are related to that keyword to some degree. When the user's keyword is imprecise, the relevant document records often cannot be found, and the user has to revise the keyword and search repeatedly before the search may succeed. If the user also needs peripheral related documents, different keywords must be chosen and searched separately. The corresponding resources cannot be provided in a single query, which is cumbersome for the user and lowers retrieval efficiency.
Summary of the invention
The object of the invention is to overcome the above deficiencies by providing an algorithm for the degree of correlation between words.
The invention is implemented as an algorithm for the degree of correlation between words, comprising the following steps:
Generating, from the relevance values and correlation distance values of all document records in a database, a keyword net formed by the keywords of all the document records;
After a user inputs a search term, searching the keyword net for the document records whose keywords contain the search term, and outputting the names of the document records containing those keywords in order of the degree of correlation between the search term and the keywords.
The keyword net is generated as follows:
According to the formula: relevance value of document record A and document record B = (number of keywords shared by document record A and document record B) squared ÷ (number of keywords of document record A × number of keywords of document record B), calculating the relevance value between each document record in the database and every other document record;
According to the formula: distance between document record A and document record B = 1 - relevance value of document record A and document record B, calculating the correlation distance value between each document record in the database and every other document record;
Forming, from the correlation distance values between each document record in the database and the other document records, a keyword net that takes the keywords of all the document records as nodes and comprises N layers of keyword nodes.
The correlation distance value between each document record and every other document record is the distance between the corresponding keyword nodes in the keyword net.
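The two document-level formulas can also be written symbolically. The notation below is an assumption introduced only for illustration and does not appear in the source: K_A and K_B denote the keyword sets of document records A and B. The denominator uses the keyword counts of both records, which agrees with the worked calculation 1*1/(5*3) given later in the description.

```latex
\begin{align}
\operatorname{rel}(A,B) &= \frac{\lvert K_A \cap K_B \rvert^{2}}{\lvert K_A \rvert \, \lvert K_B \rvert},
  \qquad 0 \le \operatorname{rel}(A,B) \le 1 \\
d(A,B) &= 1 - \operatorname{rel}(A,B)
\end{align}
```

With this notation, identical keyword sets give rel = 1 and d = 0, and disjoint keyword sets give rel = 0 and d = 1.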
The step of searching the keyword net for the document records whose keywords contain the search term, and outputting the names of the document records containing those keywords in order of the degree of correlation between the search term and the keywords, is as follows:
Calculating the relevance value between the search term and each keyword that contains the search term;
Outputting the document records containing those keywords in order of relevance value;
The calculation formula is as follows:
Relevance value of search term and keyword = mean distance of the N-th layer keyword nodes × square root of the number of occurrences of the keyword.
The invention generates, from the relevance values and correlation distance values of all document records in a database, a keyword net formed by the keywords of all the document records. After a user inputs a search term, the keyword net is searched for the document records whose keywords contain the search term, the degree of correlation between the search term and each keyword is quantified, and the names of the document records containing those keywords are output in order of that degree of correlation. This improves the efficiency of retrieving document records and presents the user with document record names ranked by correlation, which makes the results easier to use and improves the user's retrieval experience.
Brief description of the drawings
Fig. 1 is a flow chart of keyword-based document record retrieval provided by an embodiment of the invention;
Fig. 2 is a schematic diagram of the composition of the keyword net provided by an embodiment of the invention;
Fig. 3 is a radar chart of the degree of correlation between a search term and keywords provided by an embodiment of the invention;
Fig. 4 is a schematic diagram of a document record retrieval example provided by an embodiment of the invention.
Detailed description of the embodiments
Specific embodiments of the invention are described in detail below with reference to the drawings and examples.
As is well known, each document in a database is assigned several keywords that indicate the information most closely related to it. The keywords of a single document are themselves correlated with one another. When different documents are related in content, their keywords are also correlated, and some keywords may even be identical. The invention exploits this property of the keywords of the document records in a database: by quantifying the relationships between keywords it builds a keyword net, and it uses this keyword net to achieve fast retrieval.
Fig. 1 shows the flow of the algorithm for the degree of correlation between words provided by an embodiment of the invention. For ease of explanation, only the parts relevant to the embodiment are shown.
Referring to Fig. 1, the algorithm for the degree of correlation between words according to the embodiment comprises the following steps:
S101: generating, from the relevance values and correlation distance values of all document records in a database, a keyword net formed by the keywords of all the document records;
S102: after a user inputs a search term, searching the keyword net for the document records whose keywords contain the search term, and outputting the names of the document records containing those keywords in order of the degree of correlation between the search term and the keywords.
In the embodiment, the keyword net is generated as follows:
According to the formula: relevance value of document record A and document record B = (number of keywords shared by document record A and document record B) squared ÷ (number of keywords of document record A × number of keywords of document record B), calculating the relevance value between each document record in the database and every other document record;
According to the formula: distance between document record A and document record B = 1 - relevance value of document record A and document record B, calculating the correlation distance value between each document record in the database and every other document record;
In the embodiment, a keyword net that takes the keywords of all the document records as nodes and comprises N layers of keyword nodes is formed from the correlation distance values between each document record in the database and the other document records.
The correlation distance value between each document record and every other document record is the distance between the corresponding keyword nodes in the keyword net.
In the embodiment, the step of searching the keyword net for the document records whose keywords contain the search term, and outputting the names of the document records containing those keywords in order of the degree of correlation between the search term and the keywords, is as follows:
Calculating the relevance value between the search term and each keyword that contains the search term;
Outputting the document records containing those keywords in order of relevance value;
The calculation formula is as follows:
Relevance value of search term and keyword = mean distance of the N-th layer keyword nodes × square root of the number of occurrences of the keyword.
The invention is described in detail below with reference to a specific example.
Keyword data is prepared for a number of documents, where each record holds the keywords of one document:
Take two records, document record A and document record B. The degree of correlation between the two records is quantified with the following formula:
Relevance value of document record A and document record B = (number of keywords shared by document record A and document record B) squared ÷ (number of keywords of document record A × number of keywords of document record B). This formula is used to calculate the relevance value between each document record in the database and every other document record:
The relevance value calculated for document record A and document record B lies between 0 and 1;
If the keywords of document record A and document record B match completely, the relevance value is 1; if none of the keywords match, the relevance value is 0.
From this, the correlation distance between document records A and B can be defined as: correlation distance between document record A and document record B = 1 - relevance value of document record A and document record B. The correlation distance value calculated with this formula lies between 0 and 1. The greater the degree of correlation between two document records, the smaller the distance between them. If the keywords of document record A and document record B match completely, the distance is 0; if none of the keywords match, the distance is 1.
For example, [document record: 36/10000], keywords: angiogenesis, microvessel density, vascular endothelial growth factor, lymphatic metastasis, immunohistochemistry;
[document record: 52/10000], keywords: neovascular glaucoma, vascular endothelial growth factor, diabetes iris.
As can be seen, records 36 and 52 share only one keyword, "vascular endothelial growth factor". Using the formulas above, the relevance value of record 36 and record 52 = 1 × 1 / (5 × 3) = 0.066667, and the distance between record 36 and record 52 = 1 - 0.066667 = 0.933333.
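The calculation for records 36 and 52 can be reproduced with a short sketch. The function names `relevance` and `distance` are illustrative and not taken from the source, and the handling of a record with no keywords is an assumption, since the source does not cover that case.

```python
def relevance(keywords_a, keywords_b):
    """Relevance value of two document records: shared^2 / (|A| * |B|)."""
    a, b = set(keywords_a), set(keywords_b)
    if not a or not b:
        return 0.0  # assumption: a record without keywords matches nothing
    shared = len(a & b)
    return shared * shared / (len(a) * len(b))


def distance(keywords_a, keywords_b):
    """Correlation distance of two document records: 1 - relevance value."""
    return 1.0 - relevance(keywords_a, keywords_b)


# Keyword lists of the two records quoted in the text.
record_36 = ["angiogenesis", "microvessel density",
             "vascular endothelial growth factor",
             "lymphatic metastasis", "immunohistochemistry"]
record_52 = ["neovascular glaucoma",
             "vascular endothelial growth factor", "diabetes iris"]

print(round(relevance(record_36, record_52), 6))  # 0.066667
print(round(distance(record_36, record_52), 6))   # 0.933333
```

The boundary cases stated in the text also hold in this sketch: two identical keyword lists give a relevance value of 1, and two lists with no keyword in common give 0.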
After the correlation distances between the document records have been generated with the above algorithm, the keyword net is drawn, as shown in Fig. 2. In Fig. 2, each circle containing a three-digit number is a keyword node, and the keyword nodes are interconnected to form the keyword net. Each circle is one node and represents one record; the number in the circle is the number of the record that the node represents. Records are numbered from 1, so the number is not necessarily three digits. The net is generated from the distances between records: records at distance 1 are not connected, and records whose distance satisfies 0 < distance < 1 are connected. During processing, the algorithm calculates the distance between each record and every other record, draws the connections, and finally forms the overall connected graph of all records.
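A minimal sketch of this construction is given below, assuming the net is stored as an adjacency dictionary. The name `build_keyword_net` and the toy records are hypothetical. The connection rule 0 < distance < 1 is applied literally as stated; the source does not say how a pair at distance 0 (identical keyword sets) is handled, so the sketch leaves such pairs unconnected.

```python
from itertools import combinations


def distance(keywords_a, keywords_b):
    """Correlation distance: 1 - shared^2 / (|A| * |B|)."""
    a, b = set(keywords_a), set(keywords_b)
    if not a or not b:
        return 1.0  # assumption: empty keyword lists are unrelated to everything
    shared = len(a & b)
    return 1.0 - shared * shared / (len(a) * len(b))


def build_keyword_net(records):
    """Return {record_id: {neighbour_id: distance}} for all connected pairs."""
    net = {record_id: {} for record_id in records}
    for id_a, id_b in combinations(records, 2):
        d = distance(records[id_a], records[id_b])
        if 0.0 < d < 1.0:  # connection rule as stated in the text
            net[id_a][id_b] = d
            net[id_b][id_a] = d
    return net


# Hypothetical toy data, not taken from the source.
records = {
    1: ["hamburger", "snack food"],
    2: ["snack food", "french fries"],
    3: ["food safety", "french fries"],
    4: ["graph theory"],
}
net = build_keyword_net(records)
print(net[2])  # {1: 0.75, 3: 0.75}
```

Comparing every pair of records is quadratic in the number of records; the source does not describe any optimisation for large databases, so none is assumed here.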
On the basis of the keyword net, given a search term, the keyword nodes containing that term can be located in the keyword net and their keywords listed. The other keyword nodes directly connected to those nodes are then found, which yields further words related to the given search term. Likewise, continuing from the directly connected keyword nodes to the subsequent, indirectly connected keyword nodes yields still more words related to the search term.
Then, by calculating the degree of correlation between the search term and each of the keywords that contain it, the degree of correlation between the search term and each keyword found is determined, and the document records containing those keywords are output in order of degree of correlation for the user to choose from. The calculation method is described below.
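The walk from directly connected nodes to indirectly connected ones can be sketched as a breadth-first traversal that groups nodes by the number of connections separating them from a starting node. The function name `layers_from` is illustrative. The node numbers are the ones the description uses for Fig. 2, but the links between layer-1 and layer-2 nodes are hypothetical because the source does not list them. The sketch also assumes that each node belongs to the first layer in which it is reached.

```python
def layers_from(net, center, max_layer):
    """Group nodes by hop count from `center`, up to `max_layer` hops."""
    layers = {}
    seen = {center}
    frontier = [center]
    for depth in range(1, max_layer + 1):
        next_frontier = []
        for node in frontier:
            for neighbour in net.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        if not next_frontier:
            break  # no nodes farther away than the previous layer
        layers[depth] = next_frontier
        frontier = next_frontier
    return layers


# Adjacency lists; the wiring beyond node 629 is a hypothetical example.
net = {
    629: [614, 734, 883, 915],
    614: [629, 747],
    734: [629, 763],
    883: [629, 630],
    915: [629],
    747: [614],
    763: [734],
    630: [883],
}
print(layers_from(net, 629, 2))
# {1: [614, 734, 883, 915], 2: [747, 763, 630]}
```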
Suppose the given search term is C1. To quantify the degree of correlation between C1 and a keyword C2 in the keyword net, the distances between the nodes of the keyword C2 that contains C1 and the number of times C2 occurs are used in the following formula, which gives the relevance value between search term C1 and keyword C2:
Relevance value of search term C1 and keyword C2 = mean distance of the N-th layer nodes × square root of the number of occurrences of keyword C2. The N-th layer nodes are all nodes in the N-th layer counted from the center node: each direct connection counts as one layer, and a connection N levels away is an N-th layer connection. For example, for node 629, the layer-1 nodes are those directly connected to it, namely 614, 734, 883 and 915; its layer-2 nodes are those directly connected to the layer-1 nodes, such as 747, 763 and 630. The distance between any two nodes can be calculated as described above, so the mean distance of the N-th layer nodes is the mean of the distances from the center node to all nodes in layer N. The mean distance from nodes 614, 734, 883 and 915 to center node 629 is thus the mean layer-1 node distance. The number of occurrences of C2 is the total number of times keyword C2 appears in nodes 614, 734, 883 and 915; this is the number of occurrences of keyword C2 in the layer-1 nodes.
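The formula can be sketched for a single layer as follows. The function names and the keyword lists attached to nodes 629, 614, 734, 883 and 915 are hypothetical; only the node numbers come from the text. The sketch reads "mean distance of the N-th layer nodes" as the mean record-to-record distance between the center node and each node of that layer, which is one interpretation of the description.

```python
import math


def record_distance(keywords_a, keywords_b):
    """Correlation distance: 1 - shared^2 / (|A| * |B|)."""
    a, b = set(keywords_a), set(keywords_b)
    if not a or not b:
        return 1.0  # assumption: empty keyword lists are unrelated to everything
    shared = len(a & b)
    return 1.0 - shared * shared / (len(a) * len(b))


def layer_relevance(records, center, layer_nodes, keyword):
    """Mean center-to-layer distance times sqrt(occurrences of `keyword`)."""
    if not layer_nodes:
        return 0.0
    mean_distance = sum(
        record_distance(records[center], records[node]) for node in layer_nodes
    ) / len(layer_nodes)
    occurrences = sum(records[node].count(keyword) for node in layer_nodes)
    return mean_distance * math.sqrt(occurrences)


# Hypothetical keyword lists for the center node and its layer-1 nodes.
records = {
    629: ["hamburger", "snack food"],
    614: ["snack food", "KFC"],
    734: ["snack food", "McDonald's"],
    883: ["snack food", "french fries"],
    915: ["snack food", "KFC"],
}
layer_1 = [614, 734, 883, 915]
print(layer_relevance(records, 629, layer_1, "snack food"))  # 1.5
```

In this toy data every layer-1 record is at distance 0.75 from node 629 and "snack food" occurs four times, so the value is 0.75 × 2 = 1.5. A keyword that does not occur in the layer gets 0.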
In a concrete application, keywords in the same layer have different relevance values. If keyword C2 appears in several layers of the keyword net, the largest of its relevance values is taken as its degree of correlation with the search term.
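Choosing the largest value across layers is a one-line reduction. In the sketch below, `best_layer` is an illustrative name and the per-layer values are hypothetical numbers, not results from the source.

```python
def best_layer(relevance_by_layer):
    """Return (layer, value) for the layer with the largest relevance value."""
    return max(relevance_by_layer.items(), key=lambda item: item[1])


# Hypothetical relevance values of one keyword in layers 1, 2 and 4.
per_layer = {1: 0.75, 2: 1.5, 4: 1.06}
layer, value = best_layer(per_layer)
print(layer, value)  # 2 1.5
```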
With the method of the embodiment, a user who searches with a term obtains not only the search results but also the other keywords related to that term, and can use those keywords to find the corresponding document records. As shown in Fig. 3, for the given search term "hamburger", calculation yields the related keywords: "snack food" at the nearest distance; "KFC" (Kentucky Fried Chicken) and "McDonald's" at the second-nearest distance; "loanword", "French fries" and "Sudan Red" at the third-nearest distance; and, farther away, "food safety". Starting from "food safety", the words related to it, such as "Food Safety Law" and "Sanlu", can then be obtained, finally forming a six-layer radar chart (N = 6).
As another example, a retrieval platform for journal and paper literature data contains a keyword net generated, as described in the embodiment, from the keywords of millions of journal and paper records in its database. When a user enters a search term to retrieve journal and paper data from the database, the embodiment looks up in the keyword net the keywords that contain the term and displays the journal papers associated with those keywords. As shown in Fig. 4A, the user selects the desired database (the periodical database) and enters the search term "education" in the search field. Using the keyword net, the embodiment then finds the journal papers whose keywords contain "education", as shown in Fig. 4B, including journals or papers with keywords such as "IT education", "education ideas" and "education education ideas". This makes retrieval considerably more convenient for the user and improves retrieval efficiency.
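The lookup in this example can be sketched as a filter over record keywords followed by a sort. Everything named here is hypothetical: the function `search`, the paper names, and the relevance values, which stand in for values computed with the formula given earlier. The sketch assumes that "contains the search term" means substring containment and that results are listed from the largest relevance value to the smallest.

```python
def search(records, term, keyword_relevance):
    """Return record names whose keywords contain `term`, most relevant first."""
    hits = []
    for name, keywords in records.items():
        matching = [k for k in keywords if term in k]
        if matching:
            # Score a record by its best-matching keyword (an assumption).
            score = max(keyword_relevance.get(k, 0.0) for k in matching)
            hits.append((score, name))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [name for _, name in hits]


# Hypothetical records and precomputed term-to-keyword relevance values.
records = {
    "Paper A": ["IT education", "curriculum"],
    "Paper B": ["education ideas"],
    "Paper C": ["graph theory"],
}
keyword_relevance = {"IT education": 0.9, "education ideas": 1.2}
print(search(records, "education", keyword_relevance))
# ['Paper B', 'Paper A']
```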
The invention generates, from the relevance values and correlation distance values of all document records in a database, a keyword net formed by the keywords of all the document records. After a user inputs a search term, the keyword net is searched for the document records whose keywords contain the search term, the degree of correlation between the search term and each keyword is quantified, and the names of the document records containing those keywords are output in order of that degree of correlation. This improves the efficiency of retrieving document records and presents the user with document record names ranked by correlation, which makes the results easier to use and improves the user's retrieval experience.
The above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.
Claims (4)
1. An algorithm for the degree of correlation between words, characterized in that it comprises the following steps:
Generating, from the relevance values and correlation distance values of all document records in a database, a keyword net formed by the keywords of all the document records;
After a user inputs a search term, searching the keyword net for the document records whose keywords contain the search term, and outputting the names of the document records containing those keywords in order of the degree of correlation between the search term and the keywords.
2. The algorithm for the degree of correlation between words according to claim 1, characterized in that the keyword net is generated as follows:
According to the formula: relevance value of document record A and document record B = (number of keywords shared by document record A and document record B) squared ÷ (number of keywords of document record A × number of keywords of document record B), calculating the relevance value between each document record in the database and every other document record;
According to the formula: distance between document record A and document record B = 1 - relevance value of document record A and document record B, calculating the correlation distance value between each document record in the database and every other document record;
Forming, from the correlation distance values between each document record in the database and the other document records, a keyword net that takes the keywords of all the document records as nodes and comprises N layers of keyword nodes.
3. The algorithm for the degree of correlation between words according to claim 2, characterized in that the correlation distance value between each document record and every other document record is the distance between the corresponding keyword nodes in the keyword net.
4. The algorithm for the degree of correlation between words according to claim 3, characterized in that the step of searching the keyword net for the document records whose keywords contain the search term, and outputting the names of the document records containing those keywords in order of the degree of correlation between the search term and the keywords, is as follows:
Calculating the relevance value between the search term and each keyword that contains the search term;
Outputting the document records containing those keywords in order of relevance value;
The calculation formula is as follows:
Relevance value of search term and keyword = mean distance of the N-th layer keyword nodes × square root of the number of occurrences of the keyword.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310040098.9A CN103970789A (en) | 2013-02-01 | 2013-02-01 | Algorithm for correlation between words |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103970789A true CN103970789A (en) | 2014-08-06 |
Family
ID=51240301
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310040098.9A Pending CN103970789A (en) | 2013-02-01 | 2013-02-01 | Algorithm for correlation between words |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103970789A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1389811A (en) * | 2002-02-06 | 2003-01-08 | 北京造极人工智能技术有限公司 | Intelligent search method of search engine |
CN101517603A (en) * | 2006-04-27 | 2009-08-26 | 盖亚软件知识产权有限公司 | Content delivery system and method therefor |
CN101930447A (en) * | 2009-12-31 | 2010-12-29 | 北京中加国道科技有限公司 | Retrieval system for network academic resources |
Non-Patent Citations (1)
Title |
---|
李信利: "基于关键词聚类的论文相似性检索", 《2005通信理论与技术新进展-第十届全国青年通信学术会议论文集》 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
CB02 | Change of applicant information | Address after: 100190, room 2509, block B, century trade building, building 1, Zhongguancun East Road, No. 66, Haidian District East Road, Beijing, China; Applicant after: Beijing Fusen software Limited by Share Ltd; Address before: 100190, room 2509, block B, century trade building, building 1, Zhongguancun East Road, No. 66, Haidian District East Road, Beijing, China; Applicant before: Beijing INFCN Information Technology Co., Ltd. |
COR | Change of bibliographic data | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20140806 |