CN105335504A - Information retrieval method based on natural language - Google Patents

Info

Publication number
CN105335504A
Authority
CN
China
Prior art keywords
keyword
semantic
retrieval
sim
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510716439.9A
Other languages
Chinese (zh)
Inventor
李垚霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Boruide Science & Technology Co Ltd
Original Assignee
Chengdu Boruide Science & Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Boruide Science & Technology Co Ltd
Priority to CN201510716439.9A
Publication of CN105335504A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an information retrieval method based on natural language. The method comprises the following steps: retrieving each of a plurality of keywords input by a user, and computing the semantic similarity among the keywords from the number of documents in the retrieval results. The disclosed natural language retrieval method requires no manual intervention, is easy to apply to work related to financial information retrieval, and can improve the accuracy of retrieval expansion tasks.

Description

An information retrieval method based on natural language
Technical field
The present invention relates to natural language processing, and in particular to a natural language retrieval method.
Background
Measuring the semantic similarity of keywords is an important problem in text search applications such as topic detection and query recommendation. With the rapid development of the Web in recent years, computing keyword semantic similarity has also become increasingly important in Web-related tasks in the financial field. Existing finance-related search engines offer a set of related terms to help users find the results they most want, which improves the search experience and retrieval efficiency. Keyword semantic similarity therefore plays an important role in financial information retrieval. However, existing Web-based methods for computing keyword semantic similarity do not account for noise and duplication in the results returned by a search engine. Noise arises mainly when keywords appear in a document by chance, which makes the retrieved document count less accurate. Large numbers of duplicate documents make the result count unreliable.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes an information retrieval method based on natural language, comprising:
Retrieving each of a plurality of keywords input by a user, and using the number of documents hit by the retrieval results to compute the semantic similarity between the keywords.
Preferably, the semantic similarity between keywords is computed with the following formula:
$$\mathrm{Sim}(a,b)=\frac{N(a\cap b)}{N(a)+N(b)-N(a\cap b)}+\frac{N(a\cap b)}{\min(N(a),N(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)}{N(a)\,N(b)}}{\log N}$$
where Sim(a,b) denotes the semantic similarity measure between two different keywords a and b input by the user; N is the number of documents in the search engine; N(x) denotes the number of documents the search engine returns for keyword x; and a∩b denotes the AND of keywords a and b, so that N(a∩b) is the number of documents returned for the query "a AND b".
Preferably, computing the semantic similarity between keywords further comprises:
In the retrieval results of the AND query on keywords a and b, designating as a semantic segment each result segment in which a and b occur together in the same sentence, and computing the proportion of semantic segments among the first n segments, denoted K(a∩b), where n is a preset number of segments; then using N(a∩b)·K(a∩b) to compute the similarity between the keywords:
$$\mathrm{Sim}_K(a,b)=\frac{N(a\cap b)\,K(a\cap b)}{N(a)+N(b)-N(a\cap b)\,K(a\cap b)}+\frac{N(a\cap b)\,K(a\cap b)}{\min(N(a),N(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)\,K(a\cap b)}{N(a)\,N(b)}}{\log N}$$
where Sim_K(a,b) denotes the semantic similarity measure between the two different keywords a and b based on semantic segment information.
Preferably, the method further comprises:
Presetting a threshold β for the proportion of semantic segments among the first n segments;
when K(a∩b) < β, Sim(a,b) = 0;
when K(a∩b) ≥ β,
$$\mathrm{Sim}(a,b)=\frac{N(a\cap b)R(a\cap b)K(a\cap b)}{N(a)R(a)+N(b)R(b)-N(a\cap b)R(a\cap b)K(a\cap b)}+\frac{N(a\cap b)R(a\cap b)K(a\cap b)}{\min(N(a)R(a),\,N(b)R(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)R(a\cap b)K(a\cap b)}{N(a)R(a)\,N(b)R(b)}}{\log N}$$
where R(a), R(b) and R(a∩b) are the duplicate-result quantities obtained when searching for a, b and "a AND b", respectively.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes a natural language retrieval method that requires no manual intervention, is easy to apply to work related to financial information retrieval, and improves the accuracy of retrieval expansion tasks.
Brief description of the drawings
Fig. 1 is a flowchart of the information retrieval method based on natural language according to an embodiment of the present invention.
Detailed description
A detailed description of one or more embodiments of the present invention is provided below together with the accompanying drawing that illustrates the principle of the invention. The invention is described in connection with such embodiments but is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention covers many alternatives, modifications and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for purposes of example, and the invention may be practiced according to the claims without some or all of them.
One aspect of the present invention provides an information retrieval method based on natural language. Fig. 1 is a flowchart of the method according to an embodiment of the present invention. The invention computes the semantic similarity between keywords by combining retrieved document counts with result segments. The proposed method requires no manual intervention and is easy to apply to Web-related work such as retrieval suggestion. It uses co-occurrence of the keywords in the same sentence to remove noise and the search engine's duplicate-result quantity to remove duplication, so that the similarity between words can be computed effectively. The proposed method can also improve the accuracy of retrieval expansion tasks.
In the present invention, the retrieved document count is the number of documents that contain a search keyword. In the remainder of this description, N(b) denotes the number of documents the search engine returns for keyword b. The separate document counts of words a and b are not sufficient to compute their semantic similarity, so the document count of the query "a AND b" must also be included.
Specifically, the present invention computes the semantic similarity of keywords with the following formula:
$$\mathrm{Sim}(a,b)=\frac{N(a\cap b)}{N(a)+N(b)-N(a\cap b)}+\frac{N(a\cap b)}{\min(N(a),N(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)}{N(a)\,N(b)}}{\log N}$$
where N is the number of documents in the search engine.
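The document-count formula can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation: the function name `web_similarity` and its parameter names are hypothetical, the second term is read as the ratio N(a∩b)/min(N(a),N(b)), and returning 0 for counts that would make a term undefined is an added assumption.

```python
import math


def web_similarity(n_a: float, n_b: float, n_ab: float, n_total: float) -> float:
    """Sum of the three terms of Sim(a, b) computed from document counts.

    n_a, n_b  -- documents returned for keyword a and for keyword b
    n_ab      -- documents returned for the query "a AND b"
    n_total   -- number of documents in the search engine (N)
    """
    # Assumption: treat undefined cases (zero counts, log of 0, N <= 1) as similarity 0.
    if n_a <= 0 or n_b <= 0 or n_ab <= 0 or n_total <= 1:
        return 0.0
    union = n_a + n_b - n_ab
    if union <= 0:  # can only happen with inconsistent counts
        return 0.0
    jaccard_term = n_ab / union
    overlap_term = n_ab / min(n_a, n_b)
    log_term = math.log(n_total * n_ab / (n_a * n_b)) / math.log(n_total)
    return jaccard_term + overlap_term + log_term


if __name__ == "__main__":
    # Illustrative counts only; they are not taken from the patent.
    print(web_similarity(n_a=1000, n_b=500, n_ab=100, n_total=10**6))
```

The counts would normally come from three separate queries ("a", "b", "a AND b") against the same search engine, so that all three are measured over the same index.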
Computing semantic similarity from retrieved document counts alone ignores the noise and duplication present in Web data. To improve accuracy, the effects of two keywords co-occurring by chance and of heavily duplicated documents must be reduced, so the N(a∩b) term in the document-count-based method needs to be corrected. When a search engine returns results it also returns result segments, which are short texts, usually of no more than 30 words, that carry important semantic information.
A result segment in which words a and b occur together in the same sentence is called a semantic segment. Within a segment, a span of text ending in a full stop is called a sentence. Semantic segments capture useful semantic relations between a and b, and can therefore be used to judge whether two keywords appear in a document merely by chance.
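The semantic-segment test described above can be sketched as a small Python helper. The names `is_semantic_segment` and `semantic_ratio` are hypothetical, and several details are assumptions rather than part of the source: keywords are matched as case-insensitive substrings, sentences are split on both the Western full stop "." and the Chinese full stop "。", and when fewer than n segments are available the ratio is taken over the segments actually present.

```python
import re
from typing import Sequence


def is_semantic_segment(segment: str, a: str, b: str) -> bool:
    """True if keywords a and b occur together in at least one sentence of the segment."""
    a, b = a.lower(), b.lower()
    # Assumption: a sentence is any span delimited by "." or "。".
    sentences = re.split(r"[.。]", segment.lower())
    return any(a in sentence and b in sentence for sentence in sentences)


def semantic_ratio(segments: Sequence[str], a: str, b: str, n: int = 100) -> float:
    """Proportion of semantic segments among the first n result segments, K(a AND b)."""
    top = segments[:n]
    if not top:
        return 0.0
    hits = sum(1 for segment in top if is_semantic_segment(segment, a, b))
    return hits / len(top)


if __name__ == "__main__":
    # Made-up snippets for illustration only.
    snippets = [
        "Bond yields rose as interest rates climbed. Stocks fell.",
        "Interest rates were flat. Bond markets closed early.",
    ]
    print(semantic_ratio(snippets, "bond", "interest rates", n=100))
```

Splitting on every "." is deliberately naive: decimals and abbreviations would be cut into separate sentences, so a real system would likely need a proper sentence splitter.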
A search engine provides a link for each result, but because the number of documents is huge and grows quickly, analysing every search result directly is very difficult. Search engines provide a function that removes duplicate results: to keep results highly relevant, the engine omits results that are very similar to one another. The search engine's duplicate-result quantity can therefore be used to remove duplication.
The present invention further computes the semantic similarity of keywords by combining the retrieved document count, semantic segments and the duplicate-result quantity.
Mode 1: the semantic similarity between keywords is determined by the retrieved document count and semantic segments. The main steps are as follows:
1) Search for "a", "b" and "a AND b" separately in a search engine;
2) Obtain N(a), N(b) and N(a∩b);
3) In the results for "a AND b", compute the proportion of semantic segments among the first n segments, denoted K(a∩b), where n is a preset number of segments. For example, if 40 of the first 100 result segments are semantic segments in which a and b appear in the same sentence, then K(a∩b) = 40/100 = 40%;
4) Replace N(a∩b) with N(a∩b)·K(a∩b) when computing the similarity between the keywords.
This mode uses semantic segments to correct N(a∩b) in the document-count-based method, which removes noise. The corrected formula is:
$$\mathrm{Sim}_K(a,b)=\frac{N(a\cap b)\,K(a\cap b)}{N(a)+N(b)-N(a\cap b)\,K(a\cap b)}+\frac{N(a\cap b)\,K(a\cap b)}{\min(N(a),N(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)\,K(a\cap b)}{N(a)\,N(b)}}{\log N}$$
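Mode 1 can be sketched as follows, reusing the example from step 3 in which 40 of the first 100 segments are semantic segments. The function name `sim_k` and the document counts in the usage example are hypothetical; the second term is again read as a ratio over min(N(a), N(b)), and returning 0 for undefined cases is an added assumption.

```python
import math


def sim_k(n_a: float, n_b: float, n_ab: float, k_ab: float, n_total: float) -> float:
    """Sim_K(a, b): the document-count formula with N(a AND b) scaled by K(a AND b)."""
    effective_ab = n_ab * k_ab  # step 4: replace N(a AND b) with N(a AND b) * K(a AND b)
    # Assumption: undefined cases (zero counts, N <= 1) give similarity 0.
    if n_a <= 0 or n_b <= 0 or effective_ab <= 0 or n_total <= 1:
        return 0.0
    union = n_a + n_b - effective_ab
    if union <= 0:
        return 0.0
    return (
        effective_ab / union
        + effective_ab / min(n_a, n_b)
        + math.log(n_total * effective_ab / (n_a * n_b)) / math.log(n_total)
    )


if __name__ == "__main__":
    k = 40 / 100  # 40 semantic segments among the first 100, as in step 3
    # Document counts below are illustrative only.
    print(sim_k(n_a=1000, n_b=500, n_ab=100, k_ab=k, n_total=10**6))
```

Because K(a∩b) is at most 1, the corrected score never exceeds the uncorrected one for the same counts, which matches the stated aim of discounting chance co-occurrence.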
Mode 2: the semantic similarity between keywords is determined jointly by the retrieved document count and the duplicate-result quantity. The main steps are as follows:
1) Search for "a", "b" and "a AND b" separately in a search engine;
2) Obtain N(a), N(b) and N(a∩b);
3) Obtain the duplicate-result quantities for the searches "a", "b" and "a AND b", denoted R(a), R(b) and R(a∩b);
4) Replace N(a), N(b) and N(a∩b) with N(a)·R(a), N(b)·R(b) and N(a∩b)·R(a∩b), respectively, to reduce the duplication in Web data.
The corrected formula is:
$$\mathrm{Sim}_R(a,b)=\frac{N(a\cap b)R(a\cap b)}{N(a)R(a)+N(b)R(b)-N(a\cap b)R(a\cap b)}+\frac{N(a\cap b)R(a\cap b)}{\min(N(a)R(a),\,N(b)R(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)R(a\cap b)}{N(a)R(a)\,N(b)R(b)}}{\log N}$$
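Mode 2 can be sketched in the same style. The text does not specify how R is obtained from a search engine, so this sketch simply takes R(a), R(b) and R(a∩b) as caller-supplied multiplicative factors, which is how the formula uses them. The function name `sim_r` is hypothetical, and the zero return for undefined cases is an added assumption.

```python
import math


def sim_r(
    n_a: float, n_b: float, n_ab: float,
    r_a: float, r_b: float, r_ab: float,
    n_total: float,
) -> float:
    """Sim_R(a, b): every document count is scaled by its duplicate-result factor R."""
    # Step 4: replace N(x) with N(x) * R(x) for a, b and "a AND b".
    eff_a, eff_b, eff_ab = n_a * r_a, n_b * r_b, n_ab * r_ab
    # Assumption: undefined cases (zero counts, N <= 1) give similarity 0.
    if eff_a <= 0 or eff_b <= 0 or eff_ab <= 0 or n_total <= 1:
        return 0.0
    union = eff_a + eff_b - eff_ab
    if union <= 0:
        return 0.0
    return (
        eff_ab / union
        + eff_ab / min(eff_a, eff_b)
        + math.log(n_total * eff_ab / (eff_a * eff_b)) / math.log(n_total)
    )


if __name__ == "__main__":
    # Illustrative values only: the joint query is assumed to keep half its results.
    print(sim_r(1000, 500, 100, r_a=1.0, r_b=1.0, r_ab=0.5, n_total=10**6))
```

With all three factors equal to 1 the function reduces to the uncorrected document-count formula, which is a convenient sanity check.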
Mode 3: the semantic similarity between two keywords is determined jointly by Mode 1 and Mode 2, that is, it takes into account both semantic segments and the duplicate-result quantity.
A threshold β is preset for the proportion of semantic segments among the first n segments;
when K(a∩b) < β, Sim(a,b) = 0;
when K(a∩b) ≥ β,
$$\mathrm{Sim}(a,b)=\frac{N(a\cap b)R(a\cap b)K(a\cap b)}{N(a)R(a)+N(b)R(b)-N(a\cap b)R(a\cap b)K(a\cap b)}+\frac{N(a\cap b)R(a\cap b)K(a\cap b)}{\min(N(a)R(a),\,N(b)R(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)R(a\cap b)K(a\cap b)}{N(a)R(a)\,N(b)R(b)}}{\log N}$$
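Mode 3 can be sketched by combining both corrections behind the threshold β. The function name `sim_combined` is hypothetical. The sketch assumes the intended reading of the two conditions is "K(a∩b) < β gives 0, otherwise apply the formula", and it treats R as a caller-supplied factor and undefined cases as similarity 0.

```python
import math


def sim_combined(
    n_a: float, n_b: float, n_ab: float,
    r_a: float, r_b: float, r_ab: float,
    k_ab: float, n_total: float, beta: float,
) -> float:
    """Mode 3: threshold on K(a AND b), then apply both the R and the K corrections."""
    if k_ab < beta:
        # Too few semantic segments: co-occurrence is treated as chance.
        return 0.0
    eff_a, eff_b = n_a * r_a, n_b * r_b
    eff_ab = n_ab * r_ab * k_ab
    # Assumption: undefined cases (zero counts, N <= 1) give similarity 0.
    if eff_a <= 0 or eff_b <= 0 or eff_ab <= 0 or n_total <= 1:
        return 0.0
    union = eff_a + eff_b - eff_ab
    if union <= 0:
        return 0.0
    return (
        eff_ab / union
        + eff_ab / min(eff_a, eff_b)
        + math.log(n_total * eff_ab / (eff_a * eff_b)) / math.log(n_total)
    )


if __name__ == "__main__":
    # Illustrative values only; beta = 0.2 is an arbitrary choice for the demo.
    print(sim_combined(1000, 500, 100, 1.0, 1.0, 1.0, k_ab=0.4, n_total=10**6, beta=0.2))
    print(sim_combined(1000, 500, 100, 1.0, 1.0, 1.0, k_ab=0.1, n_total=10**6, beta=0.2))
```

The threshold acts as a hard filter: keyword pairs whose joint results rarely place both words in the same sentence are scored 0 regardless of how large their joint document count is.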
In summary, the present invention proposes a natural language retrieval method that requires no manual intervention, is easy to apply to work related to financial information retrieval, and improves the accuracy of retrieval expansion tasks.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing system. They can be concentrated on a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The invention is thus not restricted to any specific combination of hardware and software.
It should be understood that the above embodiments of the present invention serve only to illustrate or explain the principle of the invention and do not limit it. Any modification, equivalent replacement or improvement made without departing from the spirit and scope of the invention shall therefore fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims or the equivalents of such scope and boundaries.

Claims (4)

1. An information retrieval method based on natural language, characterized by comprising:
retrieving each of a plurality of keywords input by a user, and using the number of documents hit by the retrieval results to compute the semantic similarity between the keywords.
2. The method according to claim 1, characterized in that the semantic similarity between keywords is computed with the following formula:
$$\mathrm{Sim}(a,b)=\frac{N(a\cap b)}{N(a)+N(b)-N(a\cap b)}+\frac{N(a\cap b)}{\min(N(a),N(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)}{N(a)\,N(b)}}{\log N}$$
where Sim(a,b) denotes the semantic similarity measure between two different keywords a and b input by the user; N is the number of documents in the search engine; N(x) denotes the number of documents the search engine returns for keyword x; and a∩b denotes the AND of keywords a and b, so that N(a∩b) is the number of documents returned for the query "a AND b".
3. The method according to claim 2, characterized in that computing the semantic similarity between keywords further comprises:
in the retrieval results of the AND query on keywords a and b, designating as a semantic segment each result segment in which a and b occur together in the same sentence, and computing the proportion of semantic segments among the first n segments, denoted K(a∩b), where n is a preset number of segments; and using N(a∩b)·K(a∩b) to compute the similarity between the keywords:
$$\mathrm{Sim}_K(a,b)=\frac{N(a\cap b)\,K(a\cap b)}{N(a)+N(b)-N(a\cap b)\,K(a\cap b)}+\frac{N(a\cap b)\,K(a\cap b)}{\min(N(a),N(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)\,K(a\cap b)}{N(a)\,N(b)}}{\log N}$$
where Sim_K(a,b) denotes the semantic similarity measure between the two different keywords a and b based on semantic segment information.
4. The method according to claim 3, characterized by further comprising:
presetting a threshold β for the proportion of semantic segments among the first n segments;
when K(a∩b) < β, Sim(a,b) = 0;
when K(a∩b) ≥ β,
$$\mathrm{Sim}(a,b)=\frac{N(a\cap b)R(a\cap b)K(a\cap b)}{N(a)R(a)+N(b)R(b)-N(a\cap b)R(a\cap b)K(a\cap b)}+\frac{N(a\cap b)R(a\cap b)K(a\cap b)}{\min(N(a)R(a),\,N(b)R(b))}+\frac{\log\dfrac{N\cdot N(a\cap b)R(a\cap b)K(a\cap b)}{N(a)R(a)\,N(b)R(b)}}{\log N}$$
where R(a), R(b) and R(a∩b) are the duplicate-result quantities obtained when searching for a, b and "a AND b", respectively.
CN201510716439.9A 2015-10-29 2015-10-29 Information retrieval method based on natural language Pending CN105335504A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510716439.9A CN105335504A (en) 2015-10-29 2015-10-29 Information retrieval method based on natural language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510716439.9A CN105335504A (en) 2015-10-29 2015-10-29 Information retrieval method based on natural language

Publications (1)

Publication Number Publication Date
CN105335504A true CN105335504A (en) 2016-02-17

Family

ID=55286031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510716439.9A Pending CN105335504A (en) 2015-10-29 2015-10-29 Information retrieval method based on natural language

Country Status (1)

Country Link
CN (1) CN105335504A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100042576A1 (en) * 2008-08-13 2010-02-18 Siemens Aktiengesellschaft Automated computation of semantic similarity of pairs of named entity phrases using electronic document corpora as background knowledge
CN102789479A (en) * 2012-06-08 2012-11-21 Fudan University Vocabulary relevance calculating method on basis of semantic analysis of search result
CN103678642A (en) * 2013-12-20 2014-03-26 Third Research Institute of the Ministry of Public Security Concept semantic similarity measurement method based on search engine

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
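ANJA THEOBALD et al.: "Semantic Similarity Search on Semistructured Data with the XXL Search Engine", Information Retrieval Journal *
CHEN Haiyan (陈海燕): "Method for computing lexical semantic similarity based on search engines" (基于搜索引擎的词汇语义相似度计算方法), Computer Science (《计算机科学》) *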

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109002464A (en) * 2017-06-06 2018-12-14 Mastercard International Incorporated Method and system for automatic report analysis and distribution of suggestions using a conversational interface
CN109002464B (en) * 2017-06-06 2023-01-06 Mastercard International Incorporated Method and system for automatic report analysis and distribution of suggestions using a conversational interface

Similar Documents

Publication Publication Date Title
CN106649818B (en) Application search intention identification method and device, application search method and server
CN110874531B (en) Topic analysis method and device and storage medium
Mitra Exploring session context using distributed representations of queries and reformulations
CN107102981B (en) Word vector generation method and device
US10268758B2 (en) Method and system of acquiring semantic information, keyword expansion and keyword search thereof
CN102479191B (en) Method and device for providing multi-granularity word segmentation result
US10095784B2 (en) Synonym generation
CN108280114B (en) Deep learning-based user literature reading interest analysis method
US10528662B2 (en) Automated discovery using textual analysis
US9697287B2 (en) Detection and handling of aggregated online content using decision criteria to compare similar or identical content items
CN105589894B (en) Document index establishing method and device and document retrieval method and device
CN105224624A (en) A kind of method and apparatus realizing down the quick merger of row chain
Blanco et al. Overview of NTCIR-13 Actionable Knowledge Graph (AKG) Task.
CN105404677A (en) Tree structure based retrieval method
JP2012079029A (en) Suggestion query extracting apparatus, method, and program
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
US10671810B2 (en) Citation explanations
CN105426490A (en) Tree structure based indexing method
CN112287102A (en) Data mining method and device
CN105335504A (en) Information retrieval method based on natural language
CN112507181B (en) Search request classification method, device, electronic equipment and storage medium
CN105335505A (en) Information searching method based on natural language
Lingwal Noise reduction and content retrieval from web pages
CN105426551A (en) Classical Chinese searching method and device
CN105808607A (en) Generation method and device of document index

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160217