CN102789479A - Vocabulary relevance calculating method on basis of semantic analysis of search result - Google Patents
- Publication number
- CN102789479A
- Authority
- CN
- China
- Prior art keywords
- vocabulary
- search
- semantic analysis
- record
- retrieval
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention belongs to the technical field of computational linguistics and specifically relates to a word relatedness calculation method based on semantic analysis of search results. The method comprises the following steps: following a defined search strategy, automatically submitting a retrieval command to an Internet search engine to obtain search results; and applying Web page information extraction and text semantic analysis to compute the degree of word co-occurrence, thereby obtaining the relatedness of the words. The invention avoids building and maintaining a local knowledge base system, its relatedness results can reflect influence along the time dimension, and it also handles mixed-type words containing digits well. The method suits any application that needs the semantic relatedness of words.
Description
Technical field
The invention belongs to the technical field of computational linguistics and specifically relates to a method for quantitatively computing the relatedness of words.
Background art
With the development of Web 2.0 applications and the steady emergence of blogs, online forums, and interactive social media, large volumes of text, such as news reports, product introductions, and product reviews, are produced every day. Whether the task is detecting hot topics in news reports or automatically analyzing opinions in product reviews, a key technical problem is how to compute the relatedness of two words effectively. Computing word relatedness has therefore become a basic problem in processing online text.
Several methods for computing word relatedness exist. The first is a statistical method based on a large-scale corpus: a set of feature words is chosen, each word is represented as a point in the vector space defined by those feature words, and relatedness is computed with a similarity measure such as the cosine of the angle between vectors [6,7]. The second uses a semantic dictionary: words are represented as semantic vectors according to the organizational structure of the dictionary, and the relatedness of the given words is computed from those vectors; commonly used dictionaries include WordNet and HowNet [3,5]. A third method is based on LSA (latent semantic analysis), which maps words into a lower-dimensional semantic space and computes relatedness there; it relies on the SVD (singular value decomposition) of a matrix. More recently, as online encyclopedias such as Wikipedia have matured, using such knowledge bases to compute word relatedness has also attracted attention; these methods explicitly represent a text or word as a weighted vector in the space of Wikipedia concepts [1,4].
Although these methods compute word relatedness well in some settings, in practice they require extensive feature-space construction, the knowledge base must be maintained and updated, and the relatedness computation depends heavily on the knowledge representation and on the structure of the knowledge base, so practical results are often unsatisfactory. Specifically, the existing problems are as follows:
1. Constructing the semantic vector of a word. The elements of a semantic vector are words selected from a semantic dictionary or corpus that form a set of words closely related to the word being evaluated. Feature analysis and computation must be carried out over a large body of text. For Chinese in particular, word segmentation is also required, which easily introduces errors into the vector computation for mixed-type words such as "113 meters hurdles".
2. The need to build a huge knowledge base system. Statistical relatedness methods require building and maintaining a huge knowledge base system, which adds overhead for data storage and retrieval.
3. Adaptability to different knowledge base systems. The knowledge base system is the foundation of the relatedness computation, but current methods rely mainly on the English Wikipedia, and their concept feature extraction is not compatible with Chinese words. After switching to another system, the computation of the semantic vector must be redefined, which limits the practical value of these methods to some extent.
It follows that, when computing the semantic relatedness of words, addressing the practical problems of knowledge base construction and maintenance and improving the ability to handle different types of words are both important for applying relatedness computation methods.
Summary of the invention
The object of the invention is to address the deficiencies of existing word relatedness calculation methods by proposing a word relatedness calculation method that combines search engine technology with text semantic analysis.
In the proposed method, a retrieval command is automatically submitted to an Internet search engine according to a defined search strategy, the search results are obtained, and Web page information extraction and text semantic analysis are applied to compute the degree of word co-occurrence, which yields the relatedness of the words.
The specific steps of the proposed word relatedness calculation method are as follows:
(1) set the two words w1 and w2 whose relatedness is to be computed, and the record-count threshold ξ;
(2) depending on whether the words are Chinese or English, generate a retrieval command conforming to www.bing.com, specified as a site-restricted search with the scope set to en.wikipedia.org or baike.baidu.com;
(3) automatically establish an HTTP (Hypertext Transfer Protocol) network connection and send the retrieval command to the Bing search system over this connection;
(4) receive and process the returned search result, which is HTML (Hypertext Markup Language) text; after the records on one page have been processed, automatically process the search records of the next page, until all search records have been processed or a specified number of records has been reached; use Web information extraction technology [2] to obtain the search records on the page automatically, and collect word frequency information from the summary text of each search record;
(5) compute the relatedness of the two words from the collected word frequency information, and display the relevant information.
The flow of the invention is shown in Fig. 1.
In the present invention, step (2) uses the Bing search engine as the means of obtaining the contextual information of a word, and uses en.wikipedia.org or baike.baidu.com as the English or Chinese knowledge base, respectively.
In the present invention, step (3) establishes an HTTP network connection, builds a retrieval command in the required format, and sends it to the Bing search system over this connection.
In the present invention, step (4) extracts each record from the result page, extracts the summary text of the record, and splits the text at the separator "..." into several segments. Word frequency statistics are then collected for each segment.
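The per-segment counting described for step (4) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `count_in_summary` is hypothetical, and plain case-sensitive substring matching is an assumption, since the text only says that a word "occurs" in a segment.

```python
def count_in_summary(summary, w1, w2, sep="..."):
    """Return the (T1, T2, TC) increments contributed by one result summary.

    The summary is split at the separator; each segment adds at most 1 to
    each counter. Substring matching is an assumption of this sketch.
    """
    t1 = t2 = tc = 0
    for segment in summary.split(sep):
        has_w1 = w1 in segment
        has_w2 = w2 in segment
        t1 += has_w1              # bool counts as 0 or 1
        t2 += has_w2
        tc += has_w1 and has_w2   # both words in the same segment
    return t1, t2, tc


summary = "A doctor and a nurse work together ... The nurse left ... Nothing here"
print(count_in_summary(summary, "doctor", "nurse"))
```

Counting once per segment rather than once per occurrence keeps TC no larger than T1 or T2, which bounds the resulting relatedness value.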
In the present invention, step (4) decides whether to fetch more records according to whether the condition endRec < TotalRec and the condition Trec < ξ (the number of processed records is below the configured record-count threshold) both hold. Here TotalRec is the total number of records in the search result, endRec is the record number at which the current page ends, and Trec is the number of records processed so far.
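The two conditions can be written as a small predicate. The function name `should_fetch_more` and the snake_case parameter names are illustrative assumptions; they stand for endRec, TotalRec, Trec, and ξ in the text.

```python
def should_fetch_more(end_rec, total_rec, t_rec, xi):
    """Return True when another result page should be requested."""
    more_pages_exist = end_rec < total_rec   # current page is not the last one
    need_more_records = t_rec < xi           # record-count threshold not yet reached
    return more_pages_exist and need_more_records


print(should_fetch_more(10, 200, 10, 60))   # more pages exist, threshold not reached
print(should_fetch_more(60, 200, 60, 60))   # threshold reached
print(should_fetch_more(35, 35, 35, 60))    # last page reached
```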
In the present invention, step (5) computes the relatedness of the two words w1 and w2 with the following formula:
R(w1, w2) = TC * 2 / (T1 + T2)
where T1 is the number of times w1 occurs, T2 is the number of times w2 occurs, and TC is the number of times both occur together.
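As a quick illustration of the formula, the sketch below evaluates it for made-up counts. The function name `relatedness` is hypothetical, and returning `None` when neither word was found is an assumption of this sketch, since the source does not define that case. The expression has the same form as the Dice coefficient.

```python
def relatedness(t1, t2, tc):
    """R(w1, w2) = TC * 2 / (T1 + T2), or None when both counts are zero."""
    if t1 + t2 == 0:
        return None  # undefined in the source; handled here by assumption
    return tc * 2 / (t1 + t2)


# Illustrative counts only, not taken from the patent.
print(relatedness(40, 50, 30))
```

Because a co-occurrence is counted only when both words appear, TC never exceeds T1 or T2, so the value stays between 0 and 1.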
In the present invention, a training set is constructed and the Pearson correlation coefficient between the computed relatedness results and the annotated results is calculated, which determines the record-count threshold needed in the computation.
The present invention has substantive features and represents notable progress. (1) It uses a search engine system and avoids building a large local knowledge base system. Existing Wikipedia-based relatedness methods need to download the entire content of the website, which creates new problems of storage space and of maintaining and updating the system. The present invention relies on the term-matching capability of search engines, so it obtains the information needed for relatedness computation without building and maintaining a similar local information base. Moreover, by changing the search scope, relatedness can readily be computed over different knowledge base systems, so the method suits both English and Chinese words. (2) No complex semantic analysis is required: word relatedness is obtained from a relatively simple word co-occurrence computation, without constructing semantic vectors, and the method applies to mixed-type words such as those containing digits. For Chinese words, no word segmentation or similar processing is needed. (3) Because search engine systems automatically refresh recently modified website content, a relatedness method based on search engine results reflects the relatedness of two words along the time dimension more effectively; that is, it is time-aware.
The present invention uses search engine technology and simple semantic analysis to establish a word relatedness calculation method. It avoids building and maintaining a local knowledge base system, its results can reflect influence along the time dimension, and it also handles mixed-type words containing digits well. The method suits any application that needs the semantic relatedness of words.
Description of drawings
Fig. 1 is a flowchart of the present invention.
Embodiment
(1) Set the two words w1 and w2 whose relatedness is to be computed, and the record-count threshold ξ.
(2) If the words are Chinese, first encode them in UTF-8. Specify the search scope: en.wikipedia.org or baike.baidu.com. From this information, generate the retrieval command for the Bing search engine (www.bing.com).
(3) Establish an HTTP network connection and send the retrieval command to the Bing search system over this connection.
(4) Initialize the variables: T1=0, T2=0, TC=0, Trec=0.
(5) Receive and process the result returned by the search engine. Extract content from the HTML text of the page. Use the regular expression "[0-9 ;]+-[0-9 ,]+bar result ([0-9 ,]+bar) altogether" to extract the total number of result records shown on the page and the record-number range of the current page, and store them in the three variables TotalRec, beginRec, and endRec.
(6) Using the list separator between search records, "<li class="sa_wr"><div class="sa_cc">", locate and extract each record on the current page, and extract its summary text. Split the text at the separator "..." into several segments, and collect word frequency statistics for each segment.
If w1 occurs in a segment, the count T1 of w1 is increased by 1; if w2 occurs in a segment, the count T2 of w2 is increased by 1; if w1 and w2 both occur in the same segment, the co-occurrence count TC is increased by 1.
The number of processed records Trec is increased by 1.
(7) Check whether the condition Trec < ξ holds. If it does not, go to step (9).
(8) Check whether the condition endRec < TotalRec holds, that is, whether pages remain after the current one. If it holds, generate the retrieval command for the next page, send it to the Bing search system, and repeat steps (5), (6), (7), and (8). Otherwise, display the prompt "Insufficient information; relatedness cannot be computed." and end the process.
(9) Compute the relatedness of the two words with the following formula:
R(w1, w2) = TC * 2 / (T1 + T2)
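The control flow of steps (4) through (9) can be sketched end to end as below. To keep the sketch runnable without network access, the result pages are passed in as lists of summary strings; this stand-in, the function name `relatedness_from_pages`, and the use of `None` for the "insufficient information" outcome are all assumptions of the illustration, and the HTTP request and HTML extraction are omitted.

```python
def relatedness_from_pages(pages, total_rec, w1, w2, xi, sep="..."):
    """Walk result pages until xi records are processed, then compute R.

    pages: iterable of pages, each a list of summary strings (hypothetical
    stand-in for the records extracted from each Bing result page).
    Returns None when the results run out before xi records are processed.
    """
    t1 = t2 = tc = t_rec = end_rec = 0            # step (4)
    for summaries in pages:                       # step (5): one page at a time
        end_rec += len(summaries)
        for summary in summaries:                 # step (6): per-record counting
            for segment in summary.split(sep):
                has_w1, has_w2 = w1 in segment, w2 in segment
                t1 += has_w1
                t2 += has_w2
                tc += has_w1 and has_w2
            t_rec += 1
        if t_rec >= xi:                           # step (7): Trec < xi is false
            if t1 + t2 == 0:
                return None                       # neither word seen (assumption)
            return tc * 2 / (t1 + t2)             # step (9)
        if end_rec >= total_rec:                  # step (8): last page reached
            return None                           # insufficient information
    return None                                   # supplied pages exhausted


pages = [["doctor and nurse ... nurse only", "doctor only"], ["doctor with nurse"]]
print(relatedness_from_pages(pages, total_rec=3, w1="doctor", w2="nurse", xi=3))
print(relatedness_from_pages(pages, total_rec=3, w1="doctor", w2="nurse", xi=5))
```

Note that the threshold is checked after a whole page has been processed, as in the steps above, so the number of processed records can slightly exceed ξ.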
Method for setting the threshold ξ: first determine a word set containing a number of word pairs together with their annotated relatedness values X. For these word pairs, select different search scopes and adjust the value of the threshold ξ to obtain the computed relatedness values Y, then calculate the Pearson correlation coefficient r of the two sets X and Y:
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
where n is the number of elements in each set, x_i and y_i are the i-th elements of X and Y, and \bar{x} and \bar{y} are their means. The value of r lies in [-1, +1]. When r reaches a reasonable range (generally r greater than 0.4), the selected parameter ξ is acceptable.
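A minimal sketch of this selection procedure follows, assuming the relatedness values for each candidate ξ have already been computed. The names `pearson` and `acceptable_thresholds`, the dictionary layout, and all sample numbers are hypothetical; only the standard Pearson definition and the 0.4 cutoff come from the text.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def acceptable_thresholds(annotated, computed_by_xi, min_r=0.4):
    """Return the candidate xi values whose results correlate above min_r."""
    return [xi for xi, ys in sorted(computed_by_xi.items())
            if pearson(annotated, ys) > min_r]


# Made-up data: annotated relatedness X and computed relatedness Y per xi.
annotated = [0.9, 0.1, 0.5, 0.7]
computed_by_xi = {20: [0.2, 0.8, 0.5, 0.4], 60: [0.8, 0.2, 0.4, 0.6]}
print(acceptable_thresholds(annotated, computed_by_xi))
```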
As the implementation above shows, the present invention introduces simple semantic processing based on search engine results, obtaining the information needed for relatedness computation without building and maintaining a similar local information base. A method based on search engine results reflects the relatedness of two words along the time dimension more effectively and also applies to mixed-type words containing digits. Using the Pearson correlation calculation to determine the maximum number of records to retrieve allows word relatedness to be computed more reasonably against the configured threshold.
Concrete example:
Suppose the two words to be evaluated are "doctor" and "nurse", with the English Wikipedia, en.wikipedia.org, as the search scope. The following initial retrieval command is generated automatically, and the search is carried out over the HTTP connection.
http://cn.bing.com/search?q=site%3aen.wikipedia.org+doctor+%26+nurse&qs=n&pq=site%3aen.wikipedia.org+doctor+%26+nurse&sc=0-0&sp=-1&sk=&first=1
By automatically increasing the final parameter first=1 in the command, a relatedness of 0.6613 is obtained after 60 records.
As another example, suppose the two words to be evaluated are "计算机" and "电脑" (both meaning "computer"), with Baidu Baike, baike.baidu.com, as the search scope. The following initial retrieval command is generated automatically, and the search is carried out over the HTTP connection.
http://cn.bing.com/search?q=site%3abaike.baidu.com+%22%e8%ae%a1%e7%ae%97%e6%9c%ba%22+%26+%22%e7%94%b5%e8%84%91%22&qs=n&pq=site%3abaike.baidu.com+%22%e8%ae%a1%e7%ae%97%e6%9c%ba%22+%26+%22%e7%94%b5%e8%84%91%22&sc=0-0&sp=-1&sk=&first=1
By automatically increasing the final parameter first=1 in the command, a relatedness of 0.9111 is obtained after 140 records are retrieved.
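The query strings in the two example commands can be reproduced with Python's standard URL encoding, as sketched below. The function name `build_bing_query_url` and the `quote_terms` flag are hypothetical. The sketch keeps only the `q` and `first` parameters; the other parameters in the example URLs (`qs`, `pq`, `sc`, `sp`, `sk`) are omitted, and whether the search engine requires them, or still accepts this URL form, is not established here. The step by which `first` grows per page is not stated in the source, so it is left to the caller.

```python
from urllib.parse import quote_plus


def build_bing_query_url(site, w1, w2, first=1, quote_terms=False):
    """Build a site-restricted query URL in the style of the examples.

    quote_terms wraps each word in double quotes, as the Chinese example does.
    Non-ASCII words are percent-encoded as UTF-8 by quote_plus.
    """
    if quote_terms:
        w1, w2 = f'"{w1}"', f'"{w2}"'
    query = quote_plus(f"site:{site} {w1} & {w2}")
    return f"http://cn.bing.com/search?q={query}&first={first}"


print(build_bing_query_url("en.wikipedia.org", "doctor", "nurse"))
print(build_bing_query_url("baike.baidu.com", "计算机", "电脑", quote_terms=True))
```

`quote_plus` emits uppercase hexadecimal escapes (for example `%3A`), whereas the example commands use lowercase (`%3a`); percent-escapes are case-insensitive, so the two forms are equivalent.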
References:
[1] E. Gabrilovich, S. Markovitch. Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.
[2] X. W. Ji, J. P. Zeng, S. Y. Zhang, C. R. Wu. Tag Tree Template for Web Information and Schema Extraction. Expert Systems With Applications, 2010, 37(12): 8492-8498.
[3] A. Budanitsky, G. Hirst. Evaluating WordNet-based Measures of Lexical Semantic Relatedness. Computational Linguistics, 2006, 32(1): 13-47.
[4] M. Strube, S. P. Ponzetto. WikiRelate! Computing Semantic Relatedness Using Wikipedia. In AAAI'06, 2006.
[5] Jiang Min, Xiao Shibin, Wang Hongwei, Shi Shuicai. An improved word semantic similarity computation based on HowNet. Journal of Chinese Information Processing, 2008, No. 5.
[6] Lu Song. Unsupervised acquisition of word relatedness knowledge in natural language and construction of a balanced classifier. PhD dissertation, Institute of Computing Technology, Chinese Academy of Sciences, 2001.
[7] Liu Qun, Li Sujian. Word semantic similarity computation based on HowNet [J]. Chinese Computational Linguistics, 2002, 7(2): 59-76.
Claims (5)
1. A word relatedness calculation method based on semantic analysis of search results, characterized in that the specific steps are as follows:
(1) set the two words w1 and w2 whose relatedness is to be computed, and the record-count threshold ξ;
(2) depending on whether the words are Chinese or English, generate a retrieval command conforming to www.bing.com, specified as a site-restricted search with the scope set to en.wikipedia.org or baike.baidu.com;
(3) automatically establish an HTTP network connection and send the retrieval command to the Bing search system over this connection;
(4) receive and process the returned search result, which is HTML text; after the records on one page have been processed, automatically process the search records of the next page, until all search records have been processed or a specified number of records has been reached; use Web information extraction technology to obtain the search records on the page automatically, and collect word frequency information from the summary text of each search record;
(5) compute the relatedness of the two words from the collected word frequency information, and display the relevant information.
2. The word relatedness calculation method based on semantic analysis of search results according to claim 1, characterized in that the word frequency information in step (4) is collected as follows: extract each record from the result page, extract its summary text, and split the text at the separator "..." into several segments; collect word frequency statistics for each segment.
3. The word relatedness calculation method based on semantic analysis of search results according to claim 1, characterized in that, in step (4), whether to fetch more records is decided according to whether the condition endRec < TotalRec and the condition Trec < ξ (the number of processed records is below the configured record-count threshold) both hold; here TotalRec is the total number of records in the search result, endRec is the record number at which the current page ends, and Trec is the number of records processed so far.
4. The word relatedness calculation method based on semantic analysis of search results according to claim 1, characterized in that, in step (5), the relatedness of the two words w1 and w2 is computed with the following formula:
R(w1, w2) = TC * 2 / (T1 + T2)
where T1 is the number of times w1 occurs, T2 is the number of times w2 occurs, and TC is the number of times both occur together.
5. The word relatedness calculation method based on semantic analysis of search results according to claim 1, characterized in that a training set is constructed and the Pearson correlation coefficient between the computed relatedness results and the annotated results is calculated, which determines the record-count threshold ξ needed in the computation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012101884759A CN102789479A (en) | 2012-06-08 | 2012-06-08 | Vocabulary relevance calculating method on basis of semantic analysis of search result |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012101884759A CN102789479A (en) | 2012-06-08 | 2012-06-08 | Vocabulary relevance calculating method on basis of semantic analysis of search result |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102789479A true CN102789479A (en) | 2012-11-21 |
Family
ID=47154882
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2012101884759A Pending CN102789479A (en) | 2012-06-08 | 2012-06-08 | Vocabulary relevance calculating method on basis of semantic analysis of search result |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102789479A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102184233A (en) * | 2011-05-12 | 2011-09-14 | 西北工业大学 | Query-result-based semantic correlation degree computing method |
-
2012
- 2012-06-08 CN CN2012101884759A patent/CN102789479A/en active Pending
Non-Patent Citations (2)
Title |
---|
史天艺: ""基于维基百科的搜索引擎检索结果聚类"", 《中国优秀硕士学位论文全文数据库(信息科技辑)》, 15 December 2011 (2011-12-15), pages 138 - 2061 * |
沙芸等: ""基于词间语义相关度的搜索结果聚类算法"", 《郑州大学学报(理学版)》, vol. 41, no. 1, 31 March 2009 (2009-03-31), pages 73 - 76 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104317846A (en) * | 2014-10-13 | 2015-01-28 | 安徽华贞信息科技有限公司 | Semantic analysis and marking method and system |
CN105335504A (en) * | 2015-10-29 | 2016-02-17 | 成都博睿德科技有限公司 | Information retrieval method based on natural language |
CN105335505A (en) * | 2015-10-29 | 2016-02-17 | 成都博睿德科技有限公司 | Information searching method based on natural language |
CN109299292A (en) * | 2018-11-26 | 2019-02-01 | 广西财经学院 | Text searching method based on the mixing extension of matrix weights correlation rule front and back pieces |
CN109299278A (en) * | 2018-11-26 | 2019-02-01 | 广西财经学院 | Based on confidence level-related coefficient frame mining rule former piece text searching method |
CN109299278B (en) * | 2018-11-26 | 2022-02-15 | 广西财经学院 | Text retrieval method based on confidence coefficient-correlation coefficient framework mining rule antecedent |
CN109299292B (en) * | 2018-11-26 | 2022-02-15 | 广西财经学院 | Text retrieval method based on matrix weighted association rule front and back part mixed expansion |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102200975B (en) | Vertical search engine system using semantic analysis | |
US20080033932A1 (en) | Concept-aware ranking of electronic documents within a computer network | |
EP2307951A1 (en) | Method and apparatus for relating datasets by using semantic vectors and keyword analyses | |
Li et al. | Bursty event detection from microblog: a distributed and incremental approach | |
Ngonga Ngomo et al. | Scms–semantifying content management systems | |
Nguyen et al. | A math-aware search engine for math question answering system | |
CN104281702A (en) | Power keyword segmentation based data retrieval method and device | |
Nasution et al. | Social network extraction based on Web: 3. The integrated superficial method | |
CN102789479A (en) | Vocabulary relevance calculating method on basis of semantic analysis of search result | |
Srinivas et al. | A weighted tag similarity measure based on a collaborative weight model | |
Kwatra et al. | Extractive and abstractive summarization for hindi text using hierarchical clustering | |
Yu et al. | Role-explicit query identification and intent role annotation | |
Antunes et al. | Semantic features for context organization | |
Ji et al. | Big data summarization using semantic feture for iot on cloud | |
Mohd et al. | Sumdoc: a unified approach for automatic text summarization | |
Kannan et al. | Text document clustering using statistical integrated graph based sentence sensitivity ranking algorithm | |
Sharma et al. | Keyword Extraction Using Graph Centrality and WordNet | |
Jayabharathy et al. | Multi-document update summarisation using co-related terms for scientific articles and news group | |
Tsapatsoulis | Web image indexing using WICE and a learning-free language model | |
Layfield et al. | Experiments with document retrieval from small text collections using latent semantic analysis or term similarity with query coordination and automatic relevance feedback | |
Liang et al. | Multilingual information retrieval and smart news feed based on big data | |
Wang et al. | Multi-document summarization via LDA and density peaks based sentence-level clustering | |
She et al. | Deep neural semantic network for keywords extraction on short text | |
Liu | Automatic keyword extraction based on dependency parsing and BERT semantic weighting | |
Yang et al. | Sentence similarity on structural representations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20121121 |