CN105335504A - Information retrieval method based on natural language - Google Patents
- Publication number
- CN105335504A
- Authority
- CN
- China
- Prior art keywords
- keyword
- semantic
- retrieval
- sim
- segmentation
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides an information retrieval method based on natural language. The method comprises: searching separately for each of a plurality of keywords input by a user, and computing the semantic similarity among the keywords from the number of documents in the search results. The disclosed natural language retrieval method requires no manual intervention, is easy to apply to work related to financial information retrieval, and can improve the accuracy of retrieval expansion tasks.
Description
Technical field
The present invention relates to natural language processing, and in particular to a natural language retrieval method.
Background art
Measuring the semantic similarity of keywords is an important problem in text search applications such as topic detection and query recommendation. With the rapid growth of the Web in recent years, keyword semantic similarity has also become increasingly important for Web tasks in the financial field. Existing financial search engines offer a series of related terms to help users find the results they want most, which improves both the search experience and retrieval efficiency. However, existing Web-based methods for computing keyword semantic similarity do not account for interference and duplication in the results returned by a search engine. Interference arises mainly when keywords appear in some documents at random, which reduces the accuracy of the retrieved-document count. Duplication arises because many documents are repeated, which makes the number of search results unreliable.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes an information retrieval method based on natural language, comprising:
searching separately for each of a plurality of keywords input by a user, and using the number of documents hit by the search results to calculate the semantic similarity between the keywords.
Preferably, the semantic similarity between keywords is calculated with the following formula:
$$\mathrm{Sim}(a,b)=\frac{N(a\cap b)}{N(a)+N(b)-N(a\cap b)}+\frac{N(a\cap b)}{\min(N(a),N(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)}{N(a)\cdot N(b)}}{\log N}$$
where Sim(a, b) is the semantic similarity measure between two different keywords a and b input by the user; N is the total number of documents in the search engine; N(x) is the number of documents the search engine returns for keyword x; and a ∩ b denotes the AND operation on keywords a and b, so that N(a ∩ b) is the number of documents returned for the query "a AND b".
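As an illustration only, take hypothetical counts (not from the patent) N = 10^6, N(a) = 1000, N(b) = 2000 and N(a ∩ b) = 500, and read the second term of the formula as the ratio N(a ∩ b)/min(N(a), N(b)). The three terms then evaluate as follows; the base of the logarithm cancels in the third term:

```latex
\begin{aligned}
\frac{N(a\cap b)}{N(a)+N(b)-N(a\cap b)} &= \frac{500}{1000+2000-500} = 0.2,\\
\frac{N(a\cap b)}{\min(N(a),N(b))} &= \frac{500}{1000} = 0.5,\\
\frac{\log\dfrac{N\cdot N(a\cap b)}{N(a)\cdot N(b)}}{\log N} &= \frac{\log 250}{\log 10^{6}} \approx 0.400,\\
\mathrm{Sim}(a,b) &\approx 0.2 + 0.5 + 0.400 = 1.100.
\end{aligned}
```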
Preferably, calculating the semantic similarity between keywords further comprises:
among the search results of the AND operation on keywords a and b, designating each result segment (snippet) in which a and b occur together in the same sentence as a semantic segment, and calculating the proportion of semantic segments among the first n segments, denoted K(a ∩ b), where n is a preset number of segments; and using N(a ∩ b) * K(a ∩ b) to calculate the similarity between the keywords:
$$\mathrm{Sim}_K(a,b)=\frac{N(a\cap b)\cdot K(a\cap b)}{N(a)+N(b)-N(a\cap b)\cdot K(a\cap b)}+\frac{N(a\cap b)\cdot K(a\cap b)}{\min(N(a),N(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)\cdot K(a\cap b)}{N(a)\cdot N(b)}}{\log N}$$
where Sim_K(a, b) is the semantic similarity measure, based on semantic segment information, between two different keywords a and b input by the user.
Preferably, the method further comprises:
presetting a threshold β for the proportion of semantic segments among the first n segments;
when K(a ∩ b) < β, Sim(a, b) = 0;
when K(a ∩ b) ≥ β,
$$\mathrm{Sim}(a,b)=\frac{N(a\cap b)\cdot R(a\cap b)\cdot K(a\cap b)}{N(a)\cdot R(a)+N(b)\cdot R(b)-N(a\cap b)\cdot R(a\cap b)\cdot K(a\cap b)}+\frac{N(a\cap b)\cdot R(a\cap b)\cdot K(a\cap b)}{\min(N(a)\cdot R(a),\,N(b)\cdot R(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)\cdot R(a\cap b)\cdot K(a\cap b)}{N(a)\cdot R(a)\cdot N(b)\cdot R(b)}}{\log N}$$
where R(a), R(b) and R(a ∩ b) are the duplicate-result counts obtained when searching for a, b and "a AND b", respectively.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes a natural language retrieval method that requires no manual intervention, is easy to apply to work related to financial information retrieval, and improves the accuracy of retrieval expansion tasks.
Brief description of the drawings
Fig. 1 is a flowchart of the information retrieval method based on natural language according to an embodiment of the present invention.
Detailed description of the embodiments
A detailed description of one or more embodiments of the present invention is provided below together with the accompanying drawing, which illustrates the principle of the invention. The invention is described in connection with such embodiments, but it is not limited to any embodiment. The scope of the invention is defined only by the claims, and the invention encompasses many alternatives, modifications and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for illustrative purposes, and the invention may be practiced according to the claims without some or all of them.
One aspect of the present invention provides an information retrieval method based on natural language. Fig. 1 is a flowchart of the method according to an embodiment of the invention. The invention calculates the semantic similarity between keywords by combining retrieved-document counts with search-result segments (snippets). The proposed method requires no manual intervention and is easy to apply to Web-related tasks such as retrieval suggestion. Requiring keywords to occur together in the same sentence removes interference, and the search engine's duplicate-result count removes duplication, so the similarity between words can be calculated effectively. The method can also improve the accuracy of retrieval expansion tasks.
In the present invention, the retrieved-document count for a keyword b is the number of documents that contain b. In the remainder of this description, N(b) denotes the number of documents the search engine returns for keyword b. The individual counts for words a and b are not sufficient to calculate their semantic similarity, however; the number of documents returned for the query "a AND b" must also be taken into account.
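The three counts can be illustrated with a small in-memory corpus standing in for a search engine. This is a minimal sketch under that assumption: `hit_counts` and the sample documents are hypothetical, and plain substring matching stands in for whatever matching a real engine performs.

```python
def hit_counts(docs, a, b):
    """Return (N(a), N(b), N(a AND b)) for a list of document strings."""
    n_a = sum(1 for d in docs if a in d)
    n_b = sum(1 for d in docs if b in d)
    # "a AND b": documents that contain both keywords.
    n_ab = sum(1 for d in docs if a in d and b in d)
    return n_a, n_b, n_ab


# Hypothetical corpus, for illustration only.
docs = ["gold price rises", "gold and silver", "silver coin", "oil futures"]
print(hit_counts(docs, "gold", "silver"))  # (2, 2, 1)
```

With a real search engine, the three values would instead come from the result totals reported for the queries "a", "b" and "a AND b".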
Specifically, the present invention calculates the semantic similarity of keywords with the following formula:
$$\mathrm{Sim}(a,b)=\frac{N(a\cap b)}{N(a)+N(b)-N(a\cap b)}+\frac{N(a\cap b)}{\min(N(a),N(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)}{N(a)\cdot N(b)}}{\log N}$$
where N is the total number of documents in the search engine.
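The formula can be sketched in a few lines of Python. This is an illustrative reading rather than a reference implementation: it treats the second term as the ratio N(a ∩ b)/min(N(a), N(b)), the function name `sim` and its parameters are hypothetical, and returning 0 when any count is zero is an added assumption, since the logarithm is undefined there.

```python
import math


def sim(n_a, n_b, n_ab, n_total):
    """Similarity of keywords a and b from document counts.

    n_a, n_b  -- documents returned for "a" and for "b"
    n_ab      -- documents returned for "a AND b"
    n_total   -- total number of documents in the search engine (N)
    """
    if n_a <= 0 or n_b <= 0 or n_ab <= 0:
        return 0.0  # assumption: no co-occurrence means zero similarity
    jaccard_term = n_ab / (n_a + n_b - n_ab)
    overlap_term = n_ab / min(n_a, n_b)
    # Log ratio of observed to expected co-occurrence, scaled by log N.
    log_term = math.log(n_total * n_ab / (n_a * n_b)) / math.log(n_total)
    return jaccard_term + overlap_term + log_term


# Hypothetical counts, for illustration only.
print(round(sim(1000, 2000, 500, 10**6), 4))  # 1.0997
```

The first two terms have the form of the Jaccard and overlap coefficients computed on hit counts, and the third has the form of pointwise mutual information divided by log N.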
Computing semantic similarity from retrieved-document counts alone ignores the interference and duplication present in Web data. To improve accuracy, the effect of two keywords co-occurring at random and of heavily duplicated documents must be reduced, which means revising the N(a ∩ b) term in the count-based method. Along with its search results, a search engine also returns result segments (snippets): short texts, usually no more than 30 words, that carry important semantic information.
A result segment in which words a and b occur together in the same sentence is called a semantic segment. Within a segment, a run of text ending in a full stop counts as one sentence. Semantic segments capture useful semantic relations between a and b, so they can be used to judge whether the two keywords merely appear in a text document at random.
A search engine provides a link to each result, but because the number of documents is huge and grows quickly, analysing every search result directly is very difficult. Search engines do, however, provide a function that removes duplicate results: to keep the results relevant, the engine omits some search results that are very similar to one another. The search engine's duplicate-result count can therefore be used to remove duplication.
The present invention therefore refines the keyword semantic similarity by combining the retrieved-document count, semantic segments and the duplicate-result count.
Mode 1: the semantic similarity between keywords is determined by the retrieved-document count and semantic segments. The main steps are as follows:
1) search for "a", "b" and "a AND b" separately in a search engine;
2) obtain N(a), N(b) and N(a ∩ b);
3) in the results for "a AND b", calculate the proportion of semantic segments among the first n segments, denoted K(a ∩ b), where n is a preset number of segments; for example, if 40 of the first 100 segments of the search results are semantic segments in which a and b appear in the same sentence, then K(a ∩ b) = 40/100 = 40%;
4) replace N(a ∩ b) with N(a ∩ b) * K(a ∩ b) when calculating the similarity between the keywords.
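Step 3 can be sketched as follows. The sketch assumes the snippets are already available as a list of strings; `semantic_ratio` is a hypothetical name, keywords are matched by plain substring search, and sentences are split on "." or "。" only. When fewer than n snippets are available, it divides by the number actually present, which is also an assumption.

```python
import re


def semantic_ratio(snippets, a, b, n=100):
    """Fraction of the first n snippets in which a and b share a sentence."""
    top = snippets[:n]
    if not top:
        return 0.0

    def is_semantic(snippet):
        # A sentence is taken to be text ending in a full stop ("." or "。").
        sentences = re.split(r"[.。]", snippet)
        return any(a in s and b in s for s in sentences)

    return sum(is_semantic(s) for s in top) / len(top)


# Hypothetical snippets, for illustration only.
snippets = [
    "stock and bond prices rose. weather was mild",  # same sentence
    "stock fell. bond rose.",                        # different sentences
    "bond yields track stock markets",               # same sentence
]
print(semantic_ratio(snippets, "stock", "bond"))  # 2 of 3 snippets qualify
```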
This mode uses semantic segments to revise the N(a ∩ b) term of the count-based method, which removes interference. The revised formula is:
$$\mathrm{Sim}_K(a,b)=\frac{N(a\cap b)\cdot K(a\cap b)}{N(a)+N(b)-N(a\cap b)\cdot K(a\cap b)}+\frac{N(a\cap b)\cdot K(a\cap b)}{\min(N(a),N(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)\cdot K(a\cap b)}{N(a)\cdot N(b)}}{\log N}$$
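A minimal sketch of the Mode 1 calculation, in which the co-occurrence count is scaled by K(a ∩ b) before it enters each term. The name `sim_k` is hypothetical, the second term is read as a ratio over min(N(a), N(b)), and the zero guard is an added assumption.

```python
import math


def sim_k(n_a, n_b, n_ab, k_ab, n_total):
    """Mode 1 similarity: N(a AND b) is replaced by N(a AND b) * K(a AND b)."""
    effective_ab = n_ab * k_ab
    if n_a <= 0 or n_b <= 0 or effective_ab <= 0:
        return 0.0  # assumption: the logarithm is undefined here
    return (
        effective_ab / (n_a + n_b - effective_ab)
        + effective_ab / min(n_a, n_b)
        + math.log(n_total * effective_ab / (n_a * n_b)) / math.log(n_total)
    )


# Hypothetical counts with K = 40%, for illustration only.
print(round(sim_k(1000, 2000, 500, 0.4, 10**6), 4))  # 0.6048
```

With K(a ∩ b) = 1 the result equals the count-only formula, so the semantic-segment ratio can only lower the score.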
Mode 2: the semantic similarity between keywords is determined jointly by the retrieved-document count and the duplicate-result count. The main steps are as follows:
1) search for "a", "b" and "a AND b" separately in a search engine;
2) obtain N(a), N(b) and N(a ∩ b);
3) obtain the duplicate-result counts for the searches "a", "b" and "a AND b", denoted R(a), R(b) and R(a ∩ b);
4) replace N(a), N(b) and N(a ∩ b) with N(a) * R(a), N(b) * R(b) and N(a ∩ b) * R(a ∩ b), respectively, to reduce the duplication in Web data.
The revised formula is:
$$\mathrm{Sim}_R(a,b)=\frac{N(a\cap b)\cdot R(a\cap b)}{N(a)\cdot R(a)+N(b)\cdot R(b)-N(a\cap b)\cdot R(a\cap b)}+\frac{N(a\cap b)\cdot R(a\cap b)}{\min(N(a)\cdot R(a),\,N(b)\cdot R(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)\cdot R(a\cap b)}{N(a)\cdot R(a)\cdot N(b)\cdot R(b)}}{\log N}$$
Mode 3: the semantic similarity between two keywords is determined jointly by Mode 1 and Mode 2, that is, both semantic segments and the duplicate-result count are taken into account.
A threshold β is preset for the proportion of semantic segments among the first n segments.
When K(a ∩ b) < β, Sim(a, b) = 0.
When K(a ∩ b) ≥ β,
$$\mathrm{Sim}(a,b)=\frac{N(a\cap b)\cdot R(a\cap b)\cdot K(a\cap b)}{N(a)\cdot R(a)+N(b)\cdot R(b)-N(a\cap b)\cdot R(a\cap b)\cdot K(a\cap b)}+\frac{N(a\cap b)\cdot R(a\cap b)\cdot K(a\cap b)}{\min(N(a)\cdot R(a),\,N(b)\cdot R(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)\cdot R(a\cap b)\cdot K(a\cap b)}{N(a)\cdot R(a)\cdot N(b)\cdot R(b)}}{\log N}$$
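The combined mode can be sketched as a single function with a threshold check. Several points here are assumptions rather than statements from the text: the second branch is read as applying when K(a ∩ b) is at least β, the second term is read as a ratio, and the R values are simply taken as caller-supplied multipliers because the text does not spell out how they are scaled. The name `sim_mode3` and all sample numbers are hypothetical.

```python
import math


def sim_mode3(n_a, n_b, n_ab, r_a, r_b, r_ab, k_ab, n_total, beta):
    """Similarity using both the semantic-segment ratio K and the R factors."""
    if k_ab < beta:
        return 0.0  # too few semantic segments: treat the pair as unrelated
    eff_a = n_a * r_a
    eff_b = n_b * r_b
    eff_ab = n_ab * r_ab * k_ab
    if eff_a <= 0 or eff_b <= 0 or eff_ab <= 0:
        return 0.0  # assumption: the logarithm is undefined here
    return (
        eff_ab / (eff_a + eff_b - eff_ab)
        + eff_ab / min(eff_a, eff_b)
        + math.log(n_total * eff_ab / (eff_a * eff_b)) / math.log(n_total)
    )


# Hypothetical inputs, for illustration only.
print(sim_mode3(1000, 2000, 500, 1, 1, 1, 0.4, 10**6, beta=0.5))  # 0.0
print(round(sim_mode3(1000, 2000, 500, 1, 1, 1, 0.4, 10**6, beta=0.3), 4))
```

Setting `k_ab=1.0` and `beta=0.0` reduces the function to the Mode 2 calculation, and setting all three R values to 1 reduces it to Mode 1 with a threshold.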
In summary, the present invention proposes a natural language retrieval method that requires no manual intervention, is easy to apply to work related to financial information retrieval, and improves the accuracy of retrieval expansion tasks.
Obviously, those skilled in the art should understand that each module or step of the present invention described above can be implemented with a general-purpose computing system. The modules or steps can be concentrated on a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The present invention is thus not restricted to any specific combination of hardware and software.
It should be understood that the above embodiments of the present invention serve only to illustrate or explain the principle of the invention and do not limit it. Any modification, equivalent replacement or improvement made without departing from the spirit and scope of the invention shall therefore fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or the equivalents of such scope and boundaries.
Claims (4)
1. An information retrieval method based on natural language, characterized by comprising:
searching separately for each of a plurality of keywords input by a user, and using the number of documents hit by the search results to calculate the semantic similarity between the keywords.
2. The method according to claim 1, characterized in that the semantic similarity between keywords is calculated with the following formula:
$$\mathrm{Sim}(a,b)=\frac{N(a\cap b)}{N(a)+N(b)-N(a\cap b)}+\frac{N(a\cap b)}{\min(N(a),N(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)}{N(a)\cdot N(b)}}{\log N}$$
where Sim(a, b) is the semantic similarity measure between two different keywords a and b input by the user; N is the total number of documents in the search engine; N(x) is the number of documents the search engine returns for keyword x; and a ∩ b denotes the AND operation on keywords a and b, so that N(a ∩ b) is the number of documents returned for the query "a AND b".
3. The method according to claim 2, characterized in that calculating the semantic similarity between keywords further comprises:
among the search results of the AND operation on keywords a and b, designating each result segment (snippet) in which a and b occur together in the same sentence as a semantic segment, and calculating the proportion of semantic segments among the first n segments, denoted K(a ∩ b), where n is a preset number of segments; and using N(a ∩ b) * K(a ∩ b) to calculate the similarity between the keywords:
$$\mathrm{Sim}_K(a,b)=\frac{N(a\cap b)\cdot K(a\cap b)}{N(a)+N(b)-N(a\cap b)\cdot K(a\cap b)}+\frac{N(a\cap b)\cdot K(a\cap b)}{\min(N(a),N(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)\cdot K(a\cap b)}{N(a)\cdot N(b)}}{\log N}$$
where Sim_K(a, b) is the semantic similarity measure, based on semantic segment information, between two different keywords a and b input by the user.
4. The method according to claim 3, characterized by further comprising:
presetting a threshold β for the proportion of semantic segments among the first n segments;
when K(a ∩ b) < β, Sim(a, b) = 0;
when K(a ∩ b) ≥ β,
$$\mathrm{Sim}(a,b)=\frac{N(a\cap b)\cdot R(a\cap b)\cdot K(a\cap b)}{N(a)\cdot R(a)+N(b)\cdot R(b)-N(a\cap b)\cdot R(a\cap b)\cdot K(a\cap b)}+\frac{N(a\cap b)\cdot R(a\cap b)\cdot K(a\cap b)}{\min(N(a)\cdot R(a),\,N(b)\cdot R(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)\cdot R(a\cap b)\cdot K(a\cap b)}{N(a)\cdot R(a)\cdot N(b)\cdot R(b)}}{\log N}$$
where R(a), R(b) and R(a ∩ b) are the duplicate-result counts obtained when searching for a, b and "a AND b", respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510716439.9A CN105335504A (en) | 2015-10-29 | 2015-10-29 | Information retrieval method based on natural language |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510716439.9A CN105335504A (en) | 2015-10-29 | 2015-10-29 | Information retrieval method based on natural language |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105335504A true CN105335504A (en) | 2016-02-17 |
Family
ID=55286031
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510716439.9A Pending CN105335504A (en) | 2015-10-29 | 2015-10-29 | Information retrieval method based on natural language |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105335504A (en) |
2015
- 2015-10-29 CN CN201510716439.9A patent/CN105335504A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100042576A1 (en) * | 2008-08-13 | 2010-02-18 | Siemens Aktiengesellschaft | Automated computation of semantic similarity of pairs of named entity phrases using electronic document corpora as background knowledge |
CN102789479A (en) * | 2012-06-08 | 2012-11-21 | 复旦大学 | Vocabulary relevance calculating method on basis of semantic analysis of search result |
CN103678642A (en) * | 2013-12-20 | 2014-03-26 | 公安部第三研究所 | Concept semantic similarity measurement method based on search engine |
Non-Patent Citations (2)
Title |
---|
ANJA THEOBALD等: "Semantic Similarity Search on Semistructured Data with the XXL Search Engine", 《INFORMATION RETRIEVAL JOURNAL》 * |
陈海燕: "基于搜索引擎的词汇语义相似度计算方法", 《计算机科学》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109002464A (en) * | 2017-06-06 | 2018-12-14 | 万事达卡国际公司 | The method and system suggested using the automatic report analysis of dialog interface and distribution |
CN109002464B (en) * | 2017-06-06 | 2023-01-06 | 万事达卡国际公司 | Method and system for automatic report analysis and distribution of suggestions using a conversational interface |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110874531B (en) | Topic analysis method and device and storage medium | |
CN103678576B (en) | The text retrieval system analyzed based on dynamic semantics | |
CN106649818B (en) | Application search intention identification method and device, application search method and server | |
Mitra | Exploring session context using distributed representations of queries and reformulations | |
CN107102981B (en) | Word vector generation method and device | |
US10268758B2 (en) | Method and system of acquiring semantic information, keyword expansion and keyword search thereof | |
CN102479191B (en) | Method and device for providing multi-granularity word segmentation result | |
US10095784B2 (en) | Synonym generation | |
CN101872351B (en) | Method, device for identifying synonyms, and method and device for searching by using same | |
CN110134760A (en) | A kind of searching method, device, equipment and medium | |
US10528662B2 (en) | Automated discovery using textual analysis | |
CN101297291A (en) | Suggesting and refining user input based on original user input | |
CN105589894B (en) | Document index establishing method and device and document retrieval method and device | |
CN108228612B (en) | Method and device for extracting network event keywords and emotional tendency | |
CN105404677A (en) | Tree structure based retrieval method | |
CN105224624A (en) | A kind of method and apparatus realizing down the quick merger of row chain | |
CN110705285B (en) | Government affair text subject word library construction method, device, server and readable storage medium | |
Blanco et al. | Overview of NTCIR-13 Actionable Knowledge Graph (AKG) Task. | |
US10671810B2 (en) | Citation explanations | |
JP2012079029A (en) | Suggestion query extracting apparatus, method, and program | |
CN105426490A (en) | Tree structure based indexing method | |
CN105335504A (en) | Information retrieval method based on natural language | |
CN105426551A (en) | Classical Chinese searching method and device | |
Martins et al. | Modeling temporal evidence from external collections | |
CN112507181B (en) | Search request classification method, device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20160217 |