CN103970789A - Algorithm for correlation between words - Google Patents

Publication number: CN103970789A
Authority: CN (China)
Prior art keywords: keyword, document, records, record, correlation
Legal status: Pending
Application number: CN201310040098.9A
Other languages: Chinese (zh)
Inventor: 尹科
Current Assignee: BEIJING INFCN INFORMATION TECHNOLOGY Co Ltd
Original Assignee: BEIJING INFCN INFORMATION TECHNOLOGY Co Ltd
Priority date: 2013-02-01
Filing date: 2013-02-01
Publication date: 2014-08-06
Application filed by BEIJING INFCN INFORMATION TECHNOLOGY Co Ltd
Priority to CN201310040098.9A
Publication of CN103970789A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an algorithm for correlation between words. The algorithm comprises the following steps: generating a keyword net, formed by the keywords of all document records, from the correlation values and correlation distances of all the document records in a database; and, after a user inputs a search term, searching the keyword net for the document records whose keywords include the search term and outputting the names of those document records in order of the correlation between the search term and the keywords. Because matching records are looked up in the keyword net and their names are returned ranked by correlation, retrieval of document records becomes more efficient, the results are easier to use, and the user's search experience is improved.

Description

An algorithm for correlation between words
Technical field
The invention belongs to the technical field of information retrieval and specifically relates to an algorithm for correlation between words.
Background art
When retrieving data, a user generally enters a keyword in a search box. In journal article databases in particular, once a keyword is entered, the system automatically outputs, according to its own rules, the journals or papers that correspond to the keyword, for the user to choose from. This retrieval method has greatly improved retrieval efficiency, but it still does not fully meet users' needs. It returns only the documents directly related to the entered keyword and cannot offer further documents that have some relation to that keyword. When the user's keyword is inaccurate, the relevant document records often cannot be found, and the user must revise the keyword and search repeatedly before the retrieval goal can be reached. If the user also needs peripheral related documents, different keywords must be chosen and searched separately, because the corresponding resources cannot all be provided in one pass. This is cumbersome for the user, and retrieval efficiency is low.
Summary of the invention
The object of the invention is to overcome the above shortcomings of the prior art by providing an algorithm for correlation between words.
The invention is realized as follows: an algorithm for correlation between words, comprising the following steps:
generating, from the correlation values and correlation distance values of all document records in a database, a keyword net formed by the keywords of all the document records;
after a user inputs a search term, searching the keyword net for the document records whose keywords include the search term, and outputting the names of the document records that include those keywords in order of the correlation between the search term and the keywords.
The keyword net is generated as follows:
According to the formula: correlation value of document record A and document record B = (number of keywords shared by record A and record B)^2 ÷ (number of keywords of record A × number of keywords of record B), calculate the correlation value between every document record in the database and each other document record;
According to the formula: distance between document record A and document record B = 1 - correlation value of document record A and document record B, calculate the correlation distance value between every document record in the database and each other document record;
According to the correlation distance values between every document record in the database and the other document records, form a keyword net that takes the keywords of all the document records as nodes and comprises N layers of keyword nodes.
The correlation distance value between each document record and another document record is the distance between the corresponding keyword nodes in the keyword net.
The step of searching the keyword net for the document records whose keywords include the search term, and outputting the names of the document records that include those keywords in order of the correlation between the search term and the keywords, is as follows:
calculating the correlation value between the search term and each keyword that includes the search term;
outputting the document records that include the keywords in order of the correlation values;
The calculation formula is as follows:
correlation value of search term and keyword = mean of the N-layer keyword node distances × square root of the number of occurrences of the keyword.
By generating, from the correlation values and correlation distance values of all document records in a database, a keyword net formed by the keywords of all the document records, the invention can, after a user inputs a search term, search the keyword net for the document records whose keywords include the search term, quantify the correlation between the search term and the keywords, and output the names of the document records that include those keywords in order of that correlation. This greatly improves the efficiency of retrieving document records and gives the user a list of document record names ranked by correlation, which is convenient to use and improves the user's retrieval experience.
Brief description of the drawings
Fig. 1 is a flow chart of the keyword-based document record retrieval provided by an embodiment of the invention;
Fig. 2 is a schematic diagram of the composition of the keyword net provided by an embodiment of the invention;
Fig. 3 is a radar chart of the correlation between a search term and keywords provided by an embodiment of the invention;
Fig. 4 is a schematic diagram of a document record retrieval example provided by an embodiment of the invention.
Detailed description of the embodiments
Specific embodiments of the invention are described in detail below with reference to the drawings and examples.
As is well known, each document in a database is assigned several keywords that indicate the information most closely related to it. The keywords within one document are themselves correlated. When different documents are related in what they describe, their keywords are also correlated, and some keywords may even be identical. The invention exploits this property of the keywords of document records in a database: by quantifying the relationships between keywords it builds a keyword net, and it uses this keyword net for fast retrieval.
Fig. 1 shows the flow of the algorithm for correlation between words provided by an embodiment of the invention. For ease of explanation, only the parts relevant to the embodiment are shown.
Referring to Fig. 1, the algorithm for correlation between words described in the embodiment comprises the following steps:
S101: generating, from the correlation values and correlation distance values of all document records in a database, a keyword net formed by the keywords of all the document records;
S102: after a user inputs a search term, searching the keyword net for the document records whose keywords include the search term, and outputting the names of the document records that include those keywords in order of the correlation between the search term and the keywords.
In the embodiment, the keyword net is generated as follows:
According to the formula: correlation value of document record A and document record B = (number of keywords shared by record A and record B)^2 ÷ (number of keywords of record A × number of keywords of record B), calculate the correlation value between every document record in the database and each other document record;
According to the formula: distance between document record A and document record B = 1 - correlation value of document record A and document record B, calculate the correlation distance value between every document record in the database and each other document record;
In the embodiment, the correlation distance values between every document record in the database and the other document records are used to form a keyword net that takes the keywords of all the document records as nodes and comprises N layers of keyword nodes.
The correlation distance value between each document record and another document record is the distance between the corresponding keyword nodes in the keyword net.
In the embodiment, the step of searching the keyword net for the document records whose keywords include the search term, and outputting the names of the document records that include those keywords in order of the correlation between the search term and the keywords, is as follows:
calculating the correlation value between the search term and each keyword that includes the search term;
outputting the document records that include the keywords in order of the correlation values;
The calculation formula is as follows:
correlation value of search term and keyword = mean of the N-layer keyword node distances × square root of the number of occurrences of the keyword.
The invention is described in detail below with reference to a specific example.
Prepare keyword data for a number of documents as follows, where each record holds the keywords of one document:
Take two document records, A and B, and quantify their correlation with the following formula:
correlation value of document record A and document record B = (number of keywords shared by record A and record B)^2 ÷ (number of keywords of record A × number of keywords of record B). With this formula, calculate the correlation value between every document record in the database and each other document record;
The correlation value of record A and record B calculated in this way lies between 0 and 1;
If the keywords of record A and record B match completely, the correlation value is 1; if none of the keywords match, the correlation value is 0.
From this formula, the correlation distance between records A and B can be defined. The correlation distance formula is: correlation distance between document record A and document record B = 1 - correlation value of document record A and document record B. This formula gives the correlation distance value between record A and record B, which lies between 0 and 1. As can be seen, the greater the correlation between two document records, the smaller the distance between them. If the keywords of record A and record B match completely, the distance is 0; if none of them match, the distance is 1.
For example, [document record: 36/10000], keywords: angiogenesis, microvessel density, vascular endothelial growth factor, lymphatic metastasis, immunohistochemistry;
[document record: 52/10000], keywords: neovascular glaucoma, vascular endothelial growth factor, diabetes iris.
As can be seen, records 36 and 52 have only one keyword in common, "vascular endothelial growth factor". Applying the formulas above, the correlation of record 36 and record 52 = 1×1/(5×3) = 0.066667, and the distance between record 36 and record 52 = 1 - 0.066667 = 0.933333.
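The arithmetic for records 36 and 52 can be checked with a short Python sketch. The function names `correlation` and `correlation_distance` are illustrative and do not come from the patent; treating keywords as a set, and returning 0 for a record with no keywords, are assumptions of this sketch.

```python
def correlation(keywords_a, keywords_b):
    """(shared keyword count)^2 / (keyword count of A * keyword count of B)."""
    a, b = set(keywords_a), set(keywords_b)
    if not a or not b:
        return 0.0  # assumption: a record without keywords correlates with nothing
    shared = len(a & b)
    return shared * shared / (len(a) * len(b))


def correlation_distance(keywords_a, keywords_b):
    """Distance = 1 - correlation, so it also lies between 0 and 1."""
    return 1.0 - correlation(keywords_a, keywords_b)


record_36 = [
    "angiogenesis",
    "microvessel density",
    "vascular endothelial growth factor",
    "lymphatic metastasis",
    "immunohistochemistry",
]
record_52 = [
    "neovascular glaucoma",
    "vascular endothelial growth factor",
    "diabetes iris",
]

print(f"{correlation(record_36, record_52):.6f}")           # 0.066667
print(f"{correlation_distance(record_36, record_52):.6f}")  # 0.933333
```

Identical keyword lists give a correlation of 1 and a distance of 0, and lists with nothing in common give 0 and 1, which matches the boundary cases stated in the text.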
After the correlation distances between the document records have been generated by the above algorithm, the keyword net is drawn, as shown in Fig. 2. In Fig. 2, each circle containing a three-digit number represents a keyword node, and the keyword nodes are interconnected to form the keyword net. Each circle is one node and represents one record; the number in the circle is the number of the record that the node represents. Records are numbered from 1, so the numbers are not in fact necessarily three digits. The net is generated from the distances between records: records at distance 1 are not connected, and records with 1 > distance > 0 are connected. In processing, the algorithm calculates the distance between each record and every other record, draws the connections, and finally forms the overall connected graph of all records.
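The construction of the net described above can be sketched in Python as a pairwise pass over all records. This is a minimal illustration, not the patent's implementation: `build_keyword_net`, the adjacency-dictionary layout and the toy records are assumptions. The sketch follows the stated rule literally (connect only when 1 > distance > 0), so two records with identical keyword sets (distance 0), a case the text does not address, are left unconnected here.

```python
from itertools import combinations


def correlation_distance(keywords_a, keywords_b):
    """1 - (shared count)^2 / (count of A * count of B)."""
    a, b = set(keywords_a), set(keywords_b)
    if not a or not b:
        return 1.0  # assumption: records without keywords are unrelated
    shared = len(a & b)
    return 1.0 - shared * shared / (len(a) * len(b))


def build_keyword_net(records):
    """records: {record_id: [keywords]} -> {record_id: {neighbour_id: distance}}."""
    net = {record_id: {} for record_id in records}
    for id_a, id_b in combinations(records, 2):
        d = correlation_distance(records[id_a], records[id_b])
        if 0.0 < d < 1.0:  # distance 1 means no shared keyword, so no edge
            net[id_a][id_b] = d
            net[id_b][id_a] = d
    return net


# Hypothetical toy data; record numbering starts at 1 as in the text.
toy_records = {
    1: ["angiogenesis", "vascular endothelial growth factor"],
    2: ["vascular endothelial growth factor", "neovascular glaucoma"],
    3: ["immunohistochemistry"],
}
net = build_keyword_net(toy_records)
print(net)
```

Comparing every pair costs time quadratic in the number of records; for a large database an inverted index from keyword to record numbers would avoid comparing records that share nothing, but that optimisation is outside what the text describes.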
On the basis of the keyword net, given a search term, the keyword nodes that contain the search term can be located in the net to obtain the matching keywords. The other keyword nodes directly associated with a located node are then found, which yields other words related to the given search term. By analogy, continuing from the directly associated keyword nodes to the subsequent, indirectly associated keyword nodes yields further words related to the search term.
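The lookup and the step-by-step expansion described above amount to a breadth-first traversal of the net. The Python sketch below is illustrative only: `nodes_containing`, `layers_from`, substring matching of the search term, and the toy adjacency are assumptions. The node numbers are borrowed from the Fig. 2 discussion, but which layer-1 node links to which layer-2 node is made up for the example.

```python
def nodes_containing(records, term):
    """Record ids that have at least one keyword containing the search term."""
    return [rid for rid, kws in records.items() if any(term in kw for kw in kws)]


def layers_from(net, centre, max_layers):
    """Breadth-first expansion: layer 1 is directly linked, layer 2 is one step further, and so on."""
    seen = {centre}
    frontier = [centre]
    layers = []
    for _ in range(max_layers):
        next_frontier = []
        for node in frontier:
            for neighbour in net.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        if not next_frontier:
            break  # nothing further is reachable
        layers.append(next_frontier)
        frontier = next_frontier
    return layers


# Hypothetical adjacency lists (record id -> directly connected record ids).
toy_net = {
    629: [614, 734, 883, 915],
    614: [629, 747],
    734: [629, 763],
    883: [629, 630],
    915: [629],
    747: [614],
    763: [734],
    630: [883],
}
print(layers_from(toy_net, 629, 2))
# [[614, 734, 883, 915], [747, 763, 630]]
```

The keywords of the records in each returned layer are then the directly and indirectly related words for the search term.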
Next, the correlation between the search term and each of the found keywords that include it is calculated, and the document records that include those keywords are output in order of correlation for the user to choose from. The correlation calculation method is described below.
Suppose the given search term is C1. To quantify the correlation between C1 and a keyword C2 in the keyword net, two quantities are used: the distances between the nodes of the keyword C2 that includes C1, and the number of times C2 occurs. The correlation value between search term C1 and keyword C2 is then calculated with the following formula:
correlation value of search term C1 and keyword C2 = mean of the N-layer node distances × square root of the number of occurrences of keyword C2. The N-layer nodes are all nodes in the N-th layer counted from the central point: each direct connection counts as one layer, and a connection N levels away is an N-layer connection. For example, for node 629, the layer-1 nodes are the nodes directly connected to it, namely 614, 734, 883 and 915; its layer-2 nodes are the nodes directly connected to the layer-1 nodes, namely 747, 763, 630 and so on. The distance between two nodes can be calculated as described earlier, so the mean of the N-layer node distances is the mean distance from the central node to all nodes in layer N; for instance, the mean distance from the four nodes 614, 734, 883 and 915 to the central node 629 is the mean layer-1 node distance. The number of occurrences of C2 is the total number of times keyword C2 appears in the nodes 614, 734, 883 and 915; this is the number of occurrences of C2 in layer 1.
In practice, keywords in the same layer have different correlation values. If keyword C2 appears in several layers of the keyword net, the largest of its correlation values is taken as its correlation with the search term.
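The per-layer formula and the take-the-maximum rule can be sketched in Python as follows. Several points are assumptions of this sketch rather than statements of the patent: the formula is read as (mean distance) × √(occurrences), not as the square root of the whole product; the distance from the central node to a layer-N node is taken to be the direct record-to-record correlation distance; an occurrence is counted once per record whose keyword list contains C2 exactly; and the function names and toy records are hypothetical.

```python
import math


def correlation_distance(keywords_a, keywords_b):
    """1 - (shared count)^2 / (count of A * count of B)."""
    a, b = set(keywords_a), set(keywords_b)
    if not a or not b:
        return 1.0
    shared = len(a & b)
    return 1.0 - shared * shared / (len(a) * len(b))


def layer_correlation(records, centre, layer_nodes, keyword):
    """Mean centre-to-node distance over one layer times sqrt(keyword occurrences in that layer)."""
    if not layer_nodes:
        return 0.0
    mean_distance = sum(
        correlation_distance(records[centre], records[node]) for node in layer_nodes
    ) / len(layer_nodes)
    occurrences = sum(1 for node in layer_nodes if keyword in records[node])
    return mean_distance * math.sqrt(occurrences)


def best_correlation(records, centre, layers, keyword):
    """If the keyword shows up in several layers, keep the largest per-layer value."""
    return max(
        (layer_correlation(records, centre, nodes, keyword) for nodes in layers),
        default=0.0,
    )


# Hypothetical keyword lists for the node numbers used in the text.
toy_records = {
    629: ["k0", "k1", "k2", "k3"],
    614: ["k0", "C2"],
    734: ["k1", "C2"],
    883: ["k2", "other"],
    915: ["k3", "C2"],
    747: ["C2", "z"],
}
layers = [[614, 734, 883, 915], [747]]
print(layer_correlation(toy_records, 629, layers[0], "C2"))
print(best_correlation(toy_records, 629, layers, "C2"))
```

In the toy data each layer-1 record shares one of the centre's four keywords, so every layer-1 distance is 1 - 1/(4×2) = 0.875, and C2 occurs in three of the four layer-1 records.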
With the method described in the embodiment, when a user enters a search term, the user obtains not only the search results but also the other keywords related to the search term, and can use those keywords to find the corresponding document records. As shown in Fig. 3, for the given search term "hamburger", the calculation yields the keywords related to "hamburger": "snack food" at the nearest distance; "KFC" and "McDonald's" at the second-nearest distance; "loanword", "French fries" and "Sudan red" at the third-nearest distance; and "food safety" farther away. From "food safety", the words related to it, such as "Food Safety Law" and "Sanlu", can be obtained in turn, finally forming a six-layer radar chart, that is, N = 6.
As another example, a retrieval platform for journal article literature data includes a keyword net generated, as described in the embodiment, from the keywords of millions of journal article records in the database. When a user enters a search term to retrieve journal article literature data from the database, the embodiment searches the keyword net for the keywords that include the search term and displays the journal articles associated with those keywords. As shown in Fig. 4A, after the user selects the corresponding database (the journal database) and enters the search term "education" in the search box, the embodiment uses the keyword net to find the journal papers whose keywords include "education", as shown in Fig. 4B. These include journals or papers related to "IT education", "education ideas" and "education education ideas". This makes retrieval much more convenient for the user and improves retrieval efficiency.
By generating, from the correlation values and correlation distance values of all document records in a database, a keyword net formed by the keywords of all the document records, the invention can, after a user inputs a search term, search the keyword net for the document records whose keywords include the search term, quantify the correlation between the search term and the keywords, and output the names of the document records that include those keywords in order of that correlation. This greatly improves the efficiency of retrieving document records and gives the user a list of document record names ranked by correlation, which is convenient to use and improves the user's retrieval experience.
The above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principles of the invention, and these improvements and modifications shall also fall within the scope of protection of the invention.

Claims (4)

1. An algorithm for correlation between words, characterized by comprising the following steps:
generating, from the correlation values and correlation distance values of all document records in a database, a keyword net formed by the keywords of all the document records;
after a user inputs a search term, searching the keyword net for the document records whose keywords include the search term, and outputting the names of the document records that include those keywords in order of the correlation between the search term and the keywords.
2. The algorithm for correlation between words according to claim 1, characterized in that the keyword net is generated as follows:
according to the formula: correlation value of document record A and document record B = (number of keywords shared by record A and record B)^2 ÷ (number of keywords of record A × number of keywords of record B), calculating the correlation value between every document record in the database and each other document record;
according to the formula: distance between document record A and document record B = 1 - correlation value of document record A and document record B, calculating the correlation distance value between every document record in the database and each other document record;
according to the correlation distance values between every document record in the database and the other document records, forming a keyword net that takes the keywords of all the document records as nodes and comprises N layers of keyword nodes.
3. The algorithm for correlation between words according to claim 2, characterized in that the correlation distance value between each document record and another document record is the distance between the corresponding keyword nodes in the keyword net.
4. The algorithm for correlation between words according to claim 3, characterized in that the step of searching the keyword net for the document records whose keywords include the search term, and outputting the names of the document records that include those keywords in order of the correlation between the search term and the keywords, is as follows:
calculating the correlation value between the search term and each keyword that includes the search term;
outputting the document records that include the keywords in order of the correlation values;
the calculation formula being as follows:
correlation value of search term and keyword = mean of the N-layer keyword node distances × square root of the number of occurrences of the keyword.
CN201310040098.9A 2013-02-01 2013-02-01 Algorithm for correlation between words Pending CN103970789A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310040098.9A CN103970789A (en) 2013-02-01 2013-02-01 Algorithm for correlation between words

Publications (1)

Publication Number Publication Date
CN103970789A true CN103970789A (en) 2014-08-06

Family

ID=51240301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310040098.9A Pending CN103970789A (en) 2013-02-01 2013-02-01 Algorithm for correlation between words

Country Status (1)

Country Link
CN (1) CN103970789A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1389811A (en) * 2002-02-06 2003-01-08 北京造极人工智能技术有限公司 Intelligent search method of search engine
CN101517603A (en) * 2006-04-27 2009-08-26 盖亚软件知识产权有限公司 Content delivery system and method therefor
CN101930447A (en) * 2009-12-31 2010-12-29 北京中加国道科技有限公司 Retrieval system for network academic resources

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李信利: "基于关键词聚类的论文相似性检索", 《2005通信理论与技术新进展-第十届全国青年通信学术会议论文集》 *


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
CB02 Change of applicant information

Address after: 100190, room 2509, block B, century trade building, building 1, Zhongguancun East Road, No. 66, Haidian District East Road, Beijing, China

Applicant after: Beijing Fusen software Limited by Share Ltd

Address before: 100190, room 2509, block B, century trade building, building 1, Zhongguancun East Road, No. 66, Haidian District East Road, Beijing, China

Applicant before: Beijing INFCN Information Technology Co., Ltd.

COR Change of bibliographic data
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140806
