CN105335504A - Information retrieval method based on natural language

Info

Publication number: CN105335504A
Application number: CN201510716439.9A
Authority: CN
Inventors: 李垚霖
Original assignee: Chengdu Boruide Science & Technology Co Ltd
Current assignee: Chengdu Boruide Science & Technology Co Ltd
Priority date: 2015-10-29
Filing date: 2015-10-29
Publication date: 2016-02-17

Abstract

The invention provides an information retrieval method based on natural language. The method comprises separately retrieving a plurality of keywords input by a user, and computing the semantic similarity between the keywords from the number of documents hit by the retrieval results. The disclosed natural language retrieval method requires no manual intervention, is easy to apply to work related to financial information retrieval, and can improve the accuracy of retrieval expansion tasks.

Description

An information retrieval method based on natural language

Technical field

The present invention relates to natural language processing, and in particular to a natural language retrieval method.

Background art

Measuring the semantic similarity of keywords is an important problem in text search applications such as topic detection and query recommendation. With the rapid development of the network in recent years, computing keyword semantic similarity has also become increasingly important in many Web-related tasks in the financial field. Existing finance-related search engines provide a series of related terms to help users find the results they want most, thereby improving the search experience and retrieval efficiency. In the financial information field, the computation of keyword semantic similarity likewise plays an important role. However, existing Web-based methods for computing keyword semantic similarity do not consider the interference and duplication in the results returned by a search engine. Interference arises mainly because keywords appear in some documents at random, which reduces the accuracy of the retrieved document counts. The many duplicate documents make the number of search results unreliable.

Summary of the invention

To solve the above problems of the prior art, the present invention proposes an information retrieval method based on natural language, comprising:

Retrieving each of a plurality of keywords input by a user, and using the number of documents hit by the retrieval results to calculate the semantic similarity between the keywords.

Preferably, the semantic similarity between keywords is calculated using the following formula:

$$
\mathrm{Sim}(a,b) = \frac{N(a \cap b)}{N(a) + N(b) - N(a \cap b)} + \frac{N(a \cap b)}{\min(N(a),\, N(b))} + \frac{\log \dfrac{N \cdot N(a \cap b)}{N(a) \cdot N(b)}}{\log N}
$$

where Sim(a, b) is the semantic similarity measure between two different keywords a and b input by the user; N is the number of documents in the search engine; N(x) denotes the number of documents returned when the search engine retrieves keyword x; and a ∩ b is the result of the AND operation on keywords a and b, so that N(a ∩ b) is the number of documents returned for the query "a AND b".
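The formula above can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `sim`, the parameter names, and the zero score returned when the logarithm would be undefined are assumptions, and the middle term is read as a ratio over min(N(a), N(b)).

```python
import math


def sim(n_a: float, n_b: float, n_ab: float, n_total: float) -> float:
    """Sim(a, b) as the sum of three terms computed from document counts.

    n_a, n_b  -- N(a), N(b): documents returned for each keyword
    n_ab      -- N(a ∩ b): documents returned for the query "a AND b"
    n_total   -- N: number of documents in the search engine (must exceed 1)
    """
    if n_a <= 0 or n_b <= 0 or n_ab <= 0:
        # Assumption, not stated in the source: with no co-occurrence the
        # logarithm is undefined, so the score is taken to be 0.
        return 0.0
    union_term = n_ab / (n_a + n_b - n_ab)
    # Assumption: the middle term is read as a ratio over min(N(a), N(b)).
    overlap_term = n_ab / min(n_a, n_b)
    log_term = math.log(n_total * n_ab / (n_a * n_b)) / math.log(n_total)
    return union_term + overlap_term + log_term


# Illustrative counts only (not taken from the source).
score = sim(n_a=1000, n_b=500, n_ab=200, n_total=1_000_000)
print(round(score, 4))
```

When two keywords co-occur exactly as often as chance would predict, N(a ∩ b) = N(a) * N(b) / N, the logarithmic term is zero and only the two ratio terms contribute.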

Preferably, calculating the semantic similarity between keywords further comprises:

Among the results retrieved with the AND operation on keywords a and b, a result segment in which a and b occur together in the same statement is called a semantic segment. The proportion of semantic segments among the first n segments is calculated and denoted K(a ∩ b), where n is a preset number of segments. N(a ∩ b) * K(a ∩ b) is then used to calculate the similarity between the keywords:

$$
\mathrm{Sim}_K(a,b) = \frac{N(a \cap b)\,K(a \cap b)}{N(a) + N(b) - N(a \cap b)\,K(a \cap b)} + \frac{N(a \cap b)\,K(a \cap b)}{\min(N(a),\, N(b))} + \frac{\log \dfrac{N \cdot N(a \cap b)\,K(a \cap b)}{N(a) \cdot N(b)}}{\log N}
$$

where Sim_K(a, b) is the semantic similarity measure, based on semantic segment information, between two different keywords a and b input by the user.
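Read next to the count-based formula, the segment-based measure appears to be the same expression with the co-occurrence count rescaled. A compact restatement follows; the shorthand N_K is introduced here only for illustration and does not appear in the source, and the middle term is assumed to be a ratio over min(N(a), N(b)).

```latex
N_K = N(a \cap b)\, K(a \cap b), \qquad 0 \le K(a \cap b) \le 1

\mathrm{Sim}_K(a,b) = \frac{N_K}{N(a) + N(b) - N_K}
  + \frac{N_K}{\min(N(a),\, N(b))}
  + \frac{\log \dfrac{N \cdot N_K}{N(a) \cdot N(b)}}{\log N}
```

Because K(a ∩ b) is a proportion, N_K never exceeds N(a ∩ b), so co-occurrences that are not backed by same-statement segments are discounted.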

Preferably, the method further comprises:

Presetting a threshold β for the proportion of semantic segments among the first n segments;

When K(a ∩ b) < β, Sim(a, b) = 0;

When K(a ∩ b) ≥ β,

$$
\mathrm{Sim}(a,b) = \frac{N(a \cap b)R(a \cap b)K(a \cap b)}{N(a)R(a) + N(b)R(b) - N(a \cap b)R(a \cap b)K(a \cap b)} + \frac{N(a \cap b)R(a \cap b)K(a \cap b)}{\min(N(a)R(a),\, N(b)R(b))} + \frac{\log \dfrac{N \cdot N(a \cap b)R(a \cap b)K(a \cap b)}{N(a)R(a) \cdot N(b)R(b)}}{\log N}
$$

where R(a), R(b) and R(a ∩ b) are the duplicate-result quantities reported when searching for a, b and "a AND b", respectively.

Compared with the prior art, the present invention has the following advantages:

The present invention proposes a natural language retrieval method that requires no manual intervention, is easy to apply to work related to financial information retrieval, and improves the accuracy of retrieval expansion tasks.

Brief description of the drawings

Fig. 1 is a flowchart of the information retrieval method based on natural language according to an embodiment of the present invention.

Detailed description

A detailed description of one or more embodiments of the present invention is provided below together with the accompanying drawing illustrating the principles of the invention. The invention is described in connection with such embodiments, but it is not limited to any embodiment. The scope of the invention is defined only by the claims, and the invention encompasses many alternatives, modifications and equivalents. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of them.

One aspect of the present invention provides an information retrieval method based on natural language. Fig. 1 is a flowchart of the information retrieval method based on natural language according to an embodiment of the present invention. The invention calculates the semantic similarity between keywords by combining retrieved document counts with search-result segments. The proposed method requires no manual intervention and is easy to apply to network-related work such as retrieval suggestion. Co-occurrence of keywords in the same sentence is used to remove interference, and the duplicate-result quantity of the search engine is used to remove duplication, so that the similarity between words can be calculated effectively. At the same time, the proposed method can improve the accuracy of retrieval expansion tasks.

In the present invention, the retrieved document count refers to the number of documents that contain a search keyword b. In the remainder of this description, N(b) denotes the number of documents returned when the search engine retrieves keyword b. However, the individual document counts of words a and b are not sufficient to calculate their semantic similarity; the document count for the query "a AND b" must also be included.

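Specifically, the present invention calculates the semantic similarity of keywords using the following formula:

$$
\mathrm{Sim}(a,b) = \frac{N(a \cap b)}{N(a) + N(b) - N(a \cap b)} + \frac{N(a \cap b)}{\min(N(a),\, N(b))} + \frac{\log \dfrac{N \cdot N(a \cap b)}{N(a) \cdot N(b)}}{\log N}
$$

where N is the number of documents in the search engine.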

Using retrieved document counts alone to compute semantic similarity ignores the interference and duplication present in network data. To improve the accuracy of the similarity calculation, the effect of two keywords co-occurring at random, and of documents being heavily duplicated, must be reduced. The N(a ∩ b) part of the document-count-based method therefore needs to be revised. When a search engine returns search results it also returns result segments (snippets), which are usually short texts of no more than 30 words and which provide very important semantic information.

A result segment in which words a and b occur together in the same statement is called a semantic segment. Within a segment, a run of text ending with a full stop is called a statement. Semantic segments capture useful semantic relations between a and b, and can therefore be used to judge whether two keywords appear in a text document merely at random.
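The definition of a semantic segment can be sketched as a small check. The function names, the case-insensitive substring matching, and the acceptance of both "." and "。" as full stops are assumptions made for illustration; the source specifies only that a statement ends with a full stop.

```python
import re


def split_statements(segment: str) -> list[str]:
    """Split a result segment into statements, each ending at a full stop."""
    # Assumption: both the Western "." and the CJK "。" count as full stops.
    parts = re.split(r"[.。]", segment)
    return [part.strip() for part in parts if part.strip()]


def is_semantic_segment(segment: str, a: str, b: str) -> bool:
    """True if keywords a and b occur together in at least one statement."""
    a, b = a.lower(), b.lower()
    return any(
        a in statement.lower() and b in statement.lower()
        for statement in split_statements(segment)
    )


# Both keywords in one statement: counts as a semantic segment.
same = is_semantic_segment(
    "Bond yields rose as inflation climbed. Stocks fell.", "yields", "inflation"
)
# Keywords in different statements: does not count.
apart = is_semantic_segment(
    "Bond yields rose. Inflation climbed.", "yields", "inflation"
)
```

Splitting on every full stop is deliberately naive; it would also break decimals such as "3.5", so a production tokenizer would need a more careful rule.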

A search engine provides a link for each result, but because the number of documents is huge and grows quickly, analysing every search result directly is very difficult. Search engines also provide a function for removing duplicate results: to keep the relevance of the results high, they omit search results that are very similar to one another. The duplicate-result quantity reported by the search engine can be used to remove duplication.

The present invention further calculates the semantic similarity of keywords by combining the retrieved document count, semantic segments and the duplicate-result quantity.

Mode 1: the semantic similarity between keywords is determined by the retrieved document count and semantic segments. The main steps are as follows:

1) Search for "a", "b" and "a AND b" separately in a search engine;

2) Obtain N(a), N(b) and N(a ∩ b);

3) In the results for "a AND b", calculate the proportion of semantic segments among the first n segments, denoted K(a ∩ b), where n is a preset number of segments. For example, if 40 of the first 100 segments of the search results are semantic segments in which a and b appear in the same sentence, then K(a ∩ b) = 40/100 = 40%;

4) Replace N(a ∩ b) with N(a ∩ b) * K(a ∩ b) when calculating the similarity between the keywords.

By using semantic segments to revise N(a ∩ b) in the document-count-based method, this mode removes interference. The revised formula is:

$$
\mathrm{Sim}_K(a,b) = \frac{N(a \cap b)\,K(a \cap b)}{N(a) + N(b) - N(a \cap b)\,K(a \cap b)} + \frac{N(a \cap b)\,K(a \cap b)}{\min(N(a),\, N(b))} + \frac{\log \dfrac{N \cdot N(a \cap b)\,K(a \cap b)}{N(a) \cdot N(b)}}{\log N}
$$
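Steps 3 and 4 of Mode 1 can be sketched as follows. The helper names, the handling of result lists shorter than n, and the zero score for an undefined logarithm are assumptions; the middle term is again read as a ratio over min(N(a), N(b)). The per-segment flags are assumed to come from some same-statement check performed beforehand.

```python
import math


def segment_ratio(is_semantic: list[bool], n: int) -> float:
    """K(a ∩ b): share of semantic segments among the first n result segments."""
    head = is_semantic[:n]
    # Assumption: if fewer than n segments came back, divide by what is available.
    return sum(head) / len(head) if head else 0.0


def sim_k(n_a: float, n_b: float, n_ab: float, k: float, n_total: float) -> float:
    """Sim_K(a, b): the count-based score with N(a ∩ b) scaled by K(a ∩ b)."""
    n_ab_k = n_ab * k
    if n_a <= 0 or n_b <= 0 or n_ab_k <= 0:
        return 0.0  # assumption: an undefined logarithm is scored as 0
    return (
        n_ab_k / (n_a + n_b - n_ab_k)
        + n_ab_k / min(n_a, n_b)  # assumption: middle term read as a ratio
        + math.log(n_total * n_ab_k / (n_a * n_b)) / math.log(n_total)
    )


# The worked example from step 3: 40 semantic segments among the first 100.
flags = [True] * 40 + [False] * 60
k = segment_ratio(flags, n=100)
# The document counts below are illustrative, not from the source.
score = sim_k(n_a=1000, n_b=500, n_ab=200, k=k, n_total=1_000_000)
```

With K(a ∩ b) = 1 the function reduces to the uncorrected count-based score, and lowering K lowers every one of the three terms.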

Mode 2: the semantic similarity between keywords is determined jointly by the retrieved document count and the duplicate-result quantity. The main steps are as follows:

1) Search for "a", "b" and "a AND b" separately in a search engine;

2) Obtain N(a), N(b) and N(a ∩ b);

3) Obtain the duplicate-result quantities reported when searching for "a", "b" and "a AND b", denoted R(a), R(b) and R(a ∩ b);

4) Replace N(a), N(b) and N(a ∩ b) with N(a) * R(a), N(b) * R(b) and N(a ∩ b) * R(a ∩ b) respectively, to reduce the duplication in network data.

The formula revised according to this mode is:

$$
\mathrm{Sim}_R(a,b) = \frac{N(a \cap b)R(a \cap b)}{N(a)R(a) + N(b)R(b) - N(a \cap b)R(a \cap b)} + \frac{N(a \cap b)R(a \cap b)}{\min(N(a)R(a),\, N(b)R(b))} + \frac{\log \dfrac{N \cdot N(a \cap b)R(a \cap b)}{N(a)R(a) \cdot N(b)R(b)}}{\log N}
$$
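As an illustration of how the Mode 2 correction changes the score, take hypothetical values that do not come from the source: N = 10^6, N(a) = 1000, N(b) = 500, N(a ∩ b) = 200, R(a) = 0.5, R(b) = 0.8 and R(a ∩ b) = 0.5. Treating R as a fractional factor is itself an assumption, since the text does not state its scale, and the second term is read as a ratio over the minimum.

```latex
\begin{aligned}
N(a)R(a) &= 500, \qquad N(b)R(b) = 400, \qquad N(a \cap b)R(a \cap b) = 100 \\
\mathrm{Sim}_R(a,b) &= \frac{100}{500 + 400 - 100} + \frac{100}{\min(500,\, 400)}
  + \frac{\log \dfrac{10^{6} \cdot 100}{500 \cdot 400}}{\log 10^{6}} \\
&= 0.125 + 0.25 + \frac{\log 500}{\log 10^{6}} \approx 0.125 + 0.25 + 0.450 = 0.825
\end{aligned}
```

With every R equal to 1, the same counts give approximately 0.988, so under these assumed values the duplicate correction lowers the score.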

Mode 3: the semantic similarity between two keywords is determined jointly by Mode 1 and Mode 2; that is, both semantic segments and the duplicate-result quantity are considered. With β denoting the preset threshold for the proportion of semantic segments among the first n segments:

When K(a ∩ b) < β, Sim(a, b) = 0;

When K(a ∩ b) ≥ β,

$$
\mathrm{Sim}(a,b) = \frac{N(a \cap b)R(a \cap b)K(a \cap b)}{N(a)R(a) + N(b)R(b) - N(a \cap b)R(a \cap b)K(a \cap b)} + \frac{N(a \cap b)R(a \cap b)K(a \cap b)}{\min(N(a)R(a),\, N(b)R(b))} + \frac{\log \dfrac{N \cdot N(a \cap b)R(a \cap b)K(a \cap b)}{N(a)R(a) \cdot N(b)R(b)}}{\log N}
$$
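Mode 3 can be sketched as one gated function. Two readings are assumed here and are not confirmed by the source text: the second branch applies when K(a ∩ b) ≥ β (the text prints "<" for both branches), and the middle term is a ratio over the smaller corrected count. The function and parameter names are illustrative, and R is treated as a multiplicative factor supplied by the caller.

```python
import math


def sim_combined(
    n_a: float, n_b: float, n_ab: float,
    r_a: float, r_b: float, r_ab: float,
    k: float, beta: float, n_total: float,
) -> float:
    """Mode 3: scale counts by R, scale the joint count by K, gate on beta."""
    if k < beta:
        # Too few semantic segments: the co-occurrence is treated as random.
        return 0.0
    a_eff = n_a * r_a            # N(a) * R(a)
    b_eff = n_b * r_b            # N(b) * R(b)
    ab_eff = n_ab * r_ab * k     # N(a ∩ b) * R(a ∩ b) * K(a ∩ b)
    if a_eff <= 0 or b_eff <= 0 or ab_eff <= 0:
        return 0.0  # assumption: an undefined logarithm is scored as 0
    return (
        ab_eff / (a_eff + b_eff - ab_eff)
        + ab_eff / min(a_eff, b_eff)  # assumption: middle term read as a ratio
        + math.log(n_total * ab_eff / (a_eff * b_eff)) / math.log(n_total)
    )


# Illustrative inputs only (not taken from the source).
below = sim_combined(1000, 500, 200, 0.5, 0.8, 0.5, k=0.1, beta=0.3, n_total=1_000_000)
above = sim_combined(1000, 500, 200, 0.5, 0.8, 0.5, k=0.4, beta=0.3, n_total=1_000_000)
```

Setting every R to 1 recovers Mode 1, and setting K to 1 with β at 0 recovers Mode 2, which is one way to see Mode 3 as the combination of the two.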

In summary, the present invention proposes a natural language retrieval method that requires no manual intervention, is easy to apply to work related to financial information retrieval, and improves the accuracy of retrieval expansion tasks.

Obviously, those skilled in the art should understand that each module or step of the present invention described above can be implemented with a general-purpose computing system. They can be concentrated on a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. Thus, the present invention is not restricted to any specific combination of hardware and software.

It should be understood that the above embodiments of the present invention are intended only to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or the equivalents of such scope and boundaries.

Claims

1. An information retrieval method based on natural language, characterized by comprising:

2. The method according to claim 1, characterized in that the semantic similarity between keywords is calculated using the following formula:

3. The method according to claim 2, characterized in that calculating the semantic similarity between keywords further comprises:

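$$
\mathrm{Sim}_K(a,b) = \frac{N(a \cap b)\,K(a \cap b)}{N(a) + N(b) - N(a \cap b)\,K(a \cap b)} + \frac{N(a \cap b)\,K(a \cap b)}{\min(N(a),\, N(b))} + \frac{\log \dfrac{N \cdot N(a \cap b)\,K(a \cap b)}{N(a) \cdot N(b)}}{\log N}
$$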

4. The method according to claim 3, characterized by further comprising:

When K(a ∩ b) < β, Sim(a, b) = 0;

When K(a ∩ b) ≥ β,

$$
\mathrm{Sim}(a,b) = \frac{N(a \cap b)R(a \cap b)K(a \cap b)}{N(a)R(a) + N(b)R(b) - N(a \cap b)R(a \cap b)K(a \cap b)} + \frac{N(a \cap b)R(a \cap b)K(a \cap b)}{\min(N(a)R(a),\, N(b)R(b))} + \frac{\log \dfrac{N \cdot N(a \cap b)R(a \cap b)K(a \cap b)}{N(a)R(a) \cdot N(b)R(b)}}{\log N}
$$