CN103678642A - Concept semantic similarity measurement method based on search engine - Google Patents

Concept semantic similarity measurement method based on search engine

Info

Publication number
CN103678642A
Authority
CN
China
Prior art keywords
concept
search
search engine
semantic similarity
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310713182.2A
Other languages
Chinese (zh)
Inventor
徐峥
齐力
梅林
胡传平
支凤麟
梁辰
骆祥峰
魏晓
张顺香
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Third Research Institute of the Ministry of Public Security
Original Assignee
Third Research Institute of the Ministry of Public Security
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Third Research Institute of the Ministry of Public Security
Priority to CN201310713182.2A
Publication of CN103678642A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a concept semantic similarity measurement method based on a search engine. The new method integrates page counts, semantic snippets, and the number of displayed search results. It effectively removes noise and redundancy from search engine data and thereby addresses the shortcomings of the prior art.

Description

A concept semantic similarity measurement method based on a search engine
Technical field
The present invention relates to the field of data mining, and specifically to a method for measuring the semantic similarity of concepts.
Background
In Web mining, information retrieval, and natural language processing, accurately measuring the semantic similarity between concepts is an important problem. Web mining applications such as community extraction, relation detection, and concept disambiguation require the semantic similarity between concepts or entities to be measured accurately. In information retrieval, a central problem is returning a set of documents that are semantically related to the user's query. For natural language processing tasks such as word sense disambiguation, textual entailment, and automatic text summarization, efficiently estimating the semantic similarity between words is vital.
Previous research includes many Web-based semantic similarity measures, which fall mainly into the following three categories:
(1) Measures based on the number of web pages returned by a search engine: the larger the returned count, the greater the similarity between the concepts.
(2) Measures that download a number of top-ranked documents and then apply text processing techniques to them. These measures rest on the assumption that similar contexts imply similar meanings, that is, words appearing in similar lexical environments are semantically close.
(3) Measures that combine (1) and (2).
In summary, these methods measure the semantic similarity of concepts, but among the subjective and objective methods for measuring association, few remove the noise and redundancy of web page snippets.
Many different concept semantic similarity measures have been proposed. They fall mainly into two groups: taxonomy-based methods and Web-based methods. Taxonomy-based methods compute semantic similarity using information theory and hierarchical taxonomies. Web-based methods, in contrast, treat the Web as a dynamic corpus that is updated in real time and compute semantic similarity from that corpus.
Information content can be used to evaluate concept semantic similarity. The information content of a concept C is the negative log-likelihood of C, where the likelihood is the probability that C occurs; similarity software built on a lexical thesaurus uses this idea to measure the semantic similarity of a pair of concepts. The taxonomic distance between two words, however, is a more natural and direct way to measure semantic similarity: the shorter the path from one word to the other, the more similar they are. Taking link type, depth, and density into account, formulas based on edge density, edge depth, and edge strength are also a good way to measure concept semantic similarity. Models that combine information content with the distance between two words can measure concept semantic similarity, as can vector space models and random walks. Earlier work has also explored defining semantic similarity over multiple information resources, which consist of structured semantic information from lexical taxonomies and the information content of corpora; to assess the effectiveness of these resources, many techniques using the various possible resources have been implemented. However, new words are constantly created, and new senses are assigned to existing words. Manually maintaining a thesaurus to capture these new words and senses is costly, if it is possible at all, which makes taxonomy-based methods inflexible for Web-related tasks.
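The information content definition mentioned above can be written compactly. The symbols IC(C) and P(C) are notation introduced here for illustration; the text states the definition only in words.

```latex
\mathrm{IC}(C) = -\log P(C)
```

Here P(C) stands for the probability that concept C occurs in a corpus, so a rarer concept has a higher information content.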
Unlike taxonomy-based methods, the pointwise mutual information (PMI) method identifies synonyms using the hit counts returned by a Web search engine. The co-occurrence double check method treats the Web as a continuously updated corpus; its core is the ranking algorithm of the search engine. A similarity kernel function can define the semantic similarity of concepts retrieved through Google; it is used to suggest related queries to search engine users in a large-scale system. A corpus-based method called second-order co-occurrence PMI computes the semantic similarity of two target words, using mutual information to sort the important neighboring words of the two target words. The page counts and snippets provided by a Web search engine can also be used to measure semantic similarity. That method relies on grammatical patterns extracted automatically from snippets: 200 patterns are extracted from the 900 top-ranked snippets, selected from 4562471 unique patterns. Because the top-ranked patterns change over time, regenerating the large set of unique patterns makes the method very time-consuming, so pattern extraction strongly affects it.
In summary, existing Web-based semantic similarity measures lack a mechanism for handling the noise and redundancy in Web data.
Summary of the invention
Existing semantic similarity measures cannot handle the noise and redundancy in Web data. To address this problem, the object of the present invention is to provide a concept semantic similarity measurement method based on a search engine that effectively removes the noise and redundancy present in search engine data.
To achieve this object, the present invention adopts the following technical solution:
A concept semantic similarity measurement method based on a search engine, comprising the following steps:
(1) Page counting: search for the relevant concepts with a search engine and obtain the corresponding page counts;
(2) Semantic snippets: obtain, by searching with the search engine, the semantic snippets that contain all the concepts, and calculate the proportion of such snippets among all semantic snippets returned by the search;
(3) Number of displayed search results: search with the search engine, display the results found, and obtain the number of results displayed;
(4) Calculate the concept semantic similarity from the results of steps (1) to (3).
In a preferred embodiment of the present invention, step (1) searches with the search engine for the concepts p and q whose similarity is to be measured, and also for p ∧ q, which denotes the co-occurrence of concept p and concept q.
Further, step (2) searches for p ∧ q with the search engine, queries the pages returned, and calculates the proportion of the n top-ranked snippets in which p and q co-occur, denoted SS(p ∧ q).
Further, step (3) uses the "repeat search" interface provided by the search engine to omit entries that are similar to search results already displayed.
Further, step (4) uses the results of steps (2) and (3) to remove noise and redundancy, respectively, from the page counts returned in step (1), and applies the pointwise mutual information method to the processed results to measure semantic similarity.
The measurement method of the above solution integrates page counts, semantic snippets, and the number of displayed search results into a new method. It removes the noise in Web snippets by requiring that concept p and concept q both appear in a sentence of a semantic snippet, that is, that they co-occur in one sentence. It removes the redundancy in Web snippets by using the "repeat search" interface provided by the search engine to omit entries similar to search results already displayed. The solution thus effectively removes the noise and redundancy present in search engine data and greatly improves the efficiency and precision of concept semantic similarity measurement.
Brief description of the drawings
The present invention is further described below with reference to the drawing and specific embodiments.
Fig. 1 is a schematic block diagram of an implementation of the present invention.
Detailed description
To make the technical means, inventive features, objects, and effects of the present invention easy to understand, the invention is further described below with reference to the drawing.
The object of the present invention is to provide a method that calculates concept semantic similarity while removing the noise and redundancy of web page snippets. To this end, the method provided by the invention is as follows:
Semantic similarity is the degree of match between information concepts represented in a computer-processable form. The present invention provides a search-engine-based method for measuring concept semantic similarity. The method mainly comprises three steps: A, page counting; B, semantic snippet processing; C, counting the displayed search results.
For page counting in step A, the corresponding concepts are searched with a Web search engine and the page counts it returns are recorded. Specifically, the Web search engine is used to search for concept p, concept q, and p ∧ q (the co-occurrence of p and q): searching for p gives the total number of search results N(p), searching for q gives N(q), and searching for p ∧ q gives N(p ∧ q).
Applying the pointwise mutual information method to the page counts obtained gives PMI(p, q), a measure of the similarity between concept p and concept q. PMI(p, q) is computed as follows: multiply the total number of web pages of the search engine, N (N = 10^11), by the ratio of the number of pages in which p and q co-occur to the product of the page counts of p and q; take the logarithm of the result; and divide it by the logarithm of N. That is, PMI(p, q) = log(N * N(p ∧ q) / (N(p) * N(q))) / log N.
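As a minimal sketch, the PMI computation described above can be written in Python. The function name `pmi`, the parameter names, and the choice to return 0 for non-positive counts are assumptions made for illustration; the total page count N is taken as 10^11, which is how the value given in the text is read here.

```python
import math


def pmi(n_p, n_q, n_pq, total_pages=10**11):
    """Normalized PMI of concepts p and q from search engine page counts.

    n_p, n_q: page counts for p and q; n_pq: page count for the co-occurrence query.
    total_pages: N, the assumed total number of pages of the search engine.
    """
    # Assumption: the logarithm is undefined for non-positive counts, so score 0.
    if min(n_p, n_q, n_pq) <= 0:
        return 0.0
    # log(N * N(p and q) / (N(p) * N(q))) divided by log(N)
    return math.log(total_pages * n_pq / (n_p * n_q)) / math.log(total_pages)
```

With this normalization, counts that co-occur exactly as often as independence would predict score 0, and higher co-occurrence gives a larger value.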
However, computing the measure directly from the raw page counts cannot remove the noise and redundancy present in search engine data.
For this reason, the solution removes the noise and the redundancy in search engine data through step B (semantic snippet processing) and step C (counting the displayed search results), respectively.
First, semantic snippet processing. A semantic snippet is a passage, returned by a Web search, whose semantic content is similar to the searched content.
In this solution, concept p and concept q must both appear in one sentence; that is, p and q co-occur within a single sentence, such as a declarative or exclamatory sentence. If p and q do not appear in the same sentence, the information returned by the search engine may concern only p or only q, or it may contain both p and q without relating them. Requiring p and q to appear together in a sentence of a semantic snippet therefore allows the number of pages in which p and q co-occur, as used in the PMI(p, q) formula, to be calculated accurately.
Specifically, the pages returned for the query p ∧ q are examined, and the proportion of the n top-ranked snippets in which p and q co-occur is calculated and denoted SS(p ∧ q). SS(p ∧ q) * N(p ∧ q) then replaces N(p ∧ q) in the PMI(p, q) formula, which removes the noise in Web snippets.
Because the solution is based on a search engine, the n snippets here are the search results that the search engine presents in snippet form after the user enters keywords. The user reads a snippet to judge whether it is the content needed and, if it meets expectations, clicks the snippet to open the corresponding web page.
The snippets returned for the entered keywords do not necessarily all contain both p and q, so the ratio calculated here is: (number of snippets containing both p and q) / (total number of snippets returned by the search engine).
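The ratio described above can be sketched as follows. The function name `snippet_ratio`, the punctuation-based sentence splitting, and the case-insensitive substring matching are simplifying assumptions that are not part of the source; a real system would likely need proper sentence segmentation and tokenization.

```python
import re


def snippet_ratio(snippets, p, q):
    """Fraction of the given top-n snippets in which p and q share a sentence."""
    if not snippets:
        return 0.0
    p, q = p.lower(), q.lower()
    hits = 0
    for snippet in snippets:
        # Assumption: sentences end at Western or Chinese terminal punctuation.
        sentences = re.split(r"[.!?。！？]+", snippet.lower())
        if any(p in s and q in s for s in sentences):
            hits += 1
    return hits / len(snippets)
```

For example, a snippet such as "Apple is a fruit. The iPhone is a phone." mentions both concepts but never in the same sentence, so it does not count toward the ratio.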
Next, counting the displayed search results. The Web search engine is used to search and to display the results found, and the number of results displayed is obtained; this number omits entries similar to search results already displayed. The number of displayed results is obtained through the "repeat search" interface provided by the Web search engine (such as Google), which omits entries similar to search results already displayed. If the "repeat search" interface is not used, the number of pages returned by the search engine reaches 1000, and the returned results do not necessarily correspond to the searched content. Using the displayed search results therefore improves the page counts used in the PMI(p, q) formula.
Specifically, the numbers of displayed search results for concept p, concept q, and p ∧ q are obtained and denoted R(p), R(q), and R(p ∧ q), respectively. R(p) * N(p), R(q) * N(q), and R(p ∧ q) * N(p ∧ q) then replace N(p), N(q), and N(p ∧ q), respectively, in the PMI(p, q) formula, which removes the redundancy in Web snippets.
The above solution is further illustrated below with a concrete measurement example.
The example is implemented on a concept semantic similarity measurement system, which mainly comprises a page counting module, a semantic snippet processing module, a displayed search result counting module, and a similarity calculation module. Each module implements the corresponding function described above.
Fig. 1 shows the process of measuring the semantic similarity of concept p and concept q with this system. The detailed process is as follows:
Step 1: The page counting module works with a Web search engine (here Google; the same applies below) to search for concept p, concept q, and p ∧ q, the co-occurrence of p and q.
Step 2: The page counting module searches for p with the Web search engine and obtains the total number of search results N(p); searches for q and obtains N(q); and searches for p ∧ q and obtains N(p ∧ q).
Step 3: Set the threshold α.
The threshold α is used for the comparison in Step 5; its value is set according to the specific requirements.
Step 4: The semantic snippet processing module determines the proportion of semantic snippets in which concept p and concept q co-occur in a sentence. From the results of the search for p ∧ q in Step 1, the module queries the pages returned and calculates the proportion of the n top-ranked snippets in which p and q co-occur, denoted SS(p ∧ q).
Step 5: The semantic snippet processing module compares the calculated ratio SS(p ∧ q) with the threshold α set earlier. If SS(p ∧ q) > α, proceed to Step 6; otherwise the semantic similarity of concept p and concept q is taken to be SPPMI(p, q) = 0.
Step 6: The displayed search result counting module works with the search engine to count the displayed search results for concept p, concept q, and p ∧ q, denoted R(p), R(q), and R(p ∧ q), respectively.
Step 7: The similarity calculation module receives the data produced by the page counting module, the semantic snippet processing module, and the displayed search result counting module, and from these data calculates N(p) * R(p), N(q) * R(q), and SS(p ∧ q) * N(p ∧ q) * R(p ∧ q).
Step 8: The similarity calculation module uses these results in the following formula to calculate the semantic similarity SPPMI(p, q) of concept p and concept q:
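$$\mathrm{SPPMI}(p,q)=\begin{cases}\dfrac{\log\left(N\cdot\dfrac{SS(p\wedge q)\,N(p\wedge q)\,R(p\wedge q)}{N(p)\,R(p)\cdot N(q)\,R(q)}\right)}{\log N}, & SS(p\wedge q)>\alpha\\[4pt] 0, & \text{otherwise}\end{cases}$$
(Formula image BDA0000442779110000061; the expression above is reconstructed from the PMI definition and the substitutions described in Steps 5 to 7.)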
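Steps 3 to 8 can be sketched in Python under one reading of the description: the PMI form given earlier, with N(p), N(q), and N(p ∧ q) replaced by the products from Step 7, and a score of 0 when SS(p ∧ q) does not exceed α. The function and parameter names are hypothetical, the formula is inferred from the prose rather than taken from the formula image, and N = 10^11 is the reading assumed here.

```python
import math


def sppmi(n_p, n_q, n_pq, ss_pq, r_p, r_q, r_pq, alpha, total_pages=10**11):
    """Sketch of SPPMI(p, q) as inferred from Steps 3 to 8.

    n_*: page counts N(.); ss_pq: snippet ratio SS(p and q);
    r_*: displayed-result counts R(.); alpha: threshold chosen by the user.
    """
    # Step 5: the similarity is 0 unless the snippet ratio exceeds the threshold.
    if ss_pq <= alpha:
        return 0.0
    # Step 7: adjusted counts that replace N(p), N(q), and N(p and q).
    adj_p = n_p * r_p
    adj_q = n_q * r_q
    adj_pq = ss_pq * n_pq * r_pq
    # Assumption: non-positive adjusted counts score 0 (logarithm undefined).
    if min(adj_p, adj_q, adj_pq) <= 0:
        return 0.0
    # Step 8 (assumed form): normalized PMI over the adjusted counts.
    return math.log(total_pages * adj_pq / (adj_p * adj_q)) / math.log(total_pages)
```

The threshold has no default value in this sketch because the text leaves its value to the specific requirements.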
The above shows and describes the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the claimed scope. The scope of protection claimed is defined by the appended claims and their equivalents.

Claims (5)

1. A concept semantic similarity measurement method based on a search engine, characterized in that the method comprises the following steps:
(1) Page counting: search for the relevant concepts with a search engine and obtain the corresponding page counts;
(2) Semantic snippets: obtain, by searching with the search engine, the semantic snippets that contain all the concepts, and calculate the proportion of such snippets among all semantic snippets returned by the search;
(3) Number of displayed search results: search with the search engine, display the results found, and obtain the number of results displayed;
(4) Calculate the concept semantic similarity from the results of steps (1) to (3).
2. The concept semantic similarity measurement method based on a search engine according to claim 1, characterized in that step (1) searches with the search engine for the concepts p and q whose similarity is to be measured, and also for p ∧ q, which denotes the co-occurrence of concept p and concept q.
3. The concept semantic similarity measurement method based on a search engine according to claim 1, characterized in that step (2) searches for p ∧ q with the search engine, queries the pages returned, and calculates the proportion of the n top-ranked snippets in which p and q co-occur, denoted SS(p ∧ q).
4. The concept semantic similarity measurement method based on a search engine according to claim 1, characterized in that step (3) uses the "repeat search" interface provided by the search engine to omit entries that are similar to search results already displayed.
5. The concept semantic similarity measurement method based on a search engine according to claim 1, characterized in that step (4) uses the results of steps (2) and (3) to remove noise and redundancy, respectively, from the page counts returned in step (1), and applies the pointwise mutual information method to the processed results to measure semantic similarity.
CN201310713182.2A 2013-12-20 2013-12-20 Concept semantic similarity measurement method based on search engine Pending CN103678642A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310713182.2A CN103678642A (en) 2013-12-20 2013-12-20 Concept semantic similarity measurement method based on search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310713182.2A CN103678642A (en) 2013-12-20 2013-12-20 Concept semantic similarity measurement method based on search engine

Publications (1)

Publication Number Publication Date
CN103678642A true CN103678642A (en) 2014-03-26

Family

ID=50316186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310713182.2A Pending CN103678642A (en) 2013-12-20 2013-12-20 Concept semantic similarity measurement method based on search engine

Country Status (1)

Country Link
CN (1) CN103678642A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105335504A (en) * 2015-10-29 2016-02-17 成都博睿德科技有限公司 Information retrieval method based on natural language
CN105335505A (en) * 2015-10-29 2016-02-17 成都博睿德科技有限公司 Information searching method based on natural language
CN107408156A (en) * 2015-03-09 2017-11-28 皇家飞利浦有限公司 For carrying out semantic search and the system and method for extracting related notion from clinical document
CN108917677A (en) * 2018-07-19 2018-11-30 福建天晴数码有限公司 Cube room inside dimension measurement method, storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107408156A (en) * 2015-03-09 2017-11-28 皇家飞利浦有限公司 For carrying out semantic search and the system and method for extracting related notion from clinical document
CN105335504A (en) * 2015-10-29 2016-02-17 成都博睿德科技有限公司 Information retrieval method based on natural language
CN105335505A (en) * 2015-10-29 2016-02-17 成都博睿德科技有限公司 Information searching method based on natural language
CN108917677A (en) * 2018-07-19 2018-11-30 福建天晴数码有限公司 Cube room inside dimension measurement method, storage medium
CN108917677B (en) * 2018-07-19 2020-03-17 福建天晴数码有限公司 Cubic room internal dimension measuring method and storage medium

Similar Documents

Publication Publication Date Title
US20210097238A1 (en) User keyword extraction device and method, and computer-readable storage medium
CN103514183B (en) Information search method and system based on interactive document clustering
WO2017166912A1 (en) Method and device for extracting core words from commodity short text
WO2021218322A1 (en) Paragraph search method and apparatus, and electronic device and storage medium
CN103279478B (en) A kind of based on distributed mutual information file characteristics extracting method
CN108132929A (en) A kind of similarity calculation method of magnanimity non-structured text
CN110309446A (en) The quick De-weight method of content of text, device, computer equipment and storage medium
CN107992480B (en) Method, device, storage medium and program product for realizing entity disambiguation
CN107967256B (en) Word weight prediction model generation method, position recommendation method and computing device
JP2017504105A5 (en)
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN112148843B (en) Text processing method and device, terminal equipment and storage medium
CN111930962A (en) Document data value evaluation method and device, electronic equipment and storage medium
Jain et al. Query2vec: An evaluation of NLP techniques for generalized workload analytics
CN110321466A (en) A kind of security information duplicate checking method and system based on semantic analysis
CN104182527A (en) Partial-sequence itemset based Chinese-English test word association rule mining method and system
CN103678642A (en) Concept semantic similarity measurement method based on search engine
CN112784062A (en) Idiom knowledge graph construction method and device
CN107832467A (en) A kind of microblog topic detecting method based on improved Single pass clustering algorithms
CN105164676A (en) Query features and questions
Zhang et al. Extracting focused locations for web pages
US11288266B2 (en) Candidate projection enumeration based query response generation
Chen et al. Finding keywords in blogs: Efficient keyword extraction in blog mining via user behaviors
WO2018205391A1 (en) Method, system and apparatus for evaluating accuracy of information retrieval, and computer-readable storage medium
CN110019556B (en) Topic news acquisition method, device and equipment thereof

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140326
