CN103823857B

CN103823857B - Space information searching method based on natural language processing

Info

Publication number: CN103823857B
Application number: CN201410059272.9A
Authority: CN
Inventors: 吴朝晖; 高啸; 柳云超; 陈华钧; 郑国轴; 杨建华
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2014-02-21
Filing date: 2014-02-21
Publication date: 2017-02-01
Anticipated expiration: 2034-02-21
Also published as: CN103823857A

Abstract

The invention discloses a spatial information retrieval method based on natural language processing. The method comprises the following steps: (1) performing word segmentation on an index document and changing the weights of the words obtained by segmentation, to obtain a weighted index document; (2) receiving a query statement input by a user, performing word segmentation on the query statement and changing the weights of the words obtained by segmentation, to obtain a weighted query statement; and (3) searching for the weighted query statement in the weighted index document. The method uses natural language processing tools, applying word segmentation and named entity recognition to the field of spatial information retrieval, and thereby improves the retrieval results.

Description

Space information retrieval method based on natural language processing

Technical field

The present invention relates to retrieval technology and natural language processing technology, and more particularly to a spatial information retrieval method based on natural language processing.

Background technology

Natural language processing is one of the important directions in the field of artificial intelligence. It mainly studies the theories and methods that allow people and computers to communicate in natural language. Natural language processing is a discipline that combines computer science, mathematics and linguistics. In the 1990s the field of natural language understanding and processing changed enormously: systems were required to handle truly large-scale text and to extract useful information from natural language text. These requirements drove the development of real large-scale corpora and the compilation of large, information-rich dictionaries, which in turn greatly facilitated low-level applications such as word segmentation and part-of-speech tagging.

A search engine is a system that collects information from the Internet according to a certain strategy using specific computer programs, organizes and processes that information, provides retrieval services to users, and presents the information relevant to the user's search. Search engines include full-text indexes, directory indexes, meta search engines, vertical search engines, aggregated search engines, portal search engines, free link lists, and so on.

The work of a modern search engine can be divided into three stages: the collection stage, the preprocessing stage and the query stage. For retrieval in a vertical domain, the collection stage is relatively simple and generally only requires unifying the format of the metadata. The preprocessing stage, also called the index construction stage, is the most complex stage of a search engine, and most ranking algorithms are applied here. First, the search engine cleans the data to be indexed, performing operations such as word segmentation and stop-word removal. Then comes the most important step, building the inverted index. An inverted index entry consists of a word together with the frequency and positions at which the word occurs in each document; it is equivalent to building a dictionary over all the data, so that the relevant documents can be located quickly from a word. The query stage is the stage in which the search engine is actually used, and all interaction with the user takes place in it. The search engine cleans the user input, again with word segmentation and stop-word removal, then substitutes the terms to be retrieved into the inverted index and the scoring formula, and returns the results after ranking.
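The inverted index described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_inverted_index`, the stop-word list and the two pre-segmented sample documents are hypothetical and do not come from the patent.

```python
from collections import defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "a"}

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}, skipping stop words."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            if tok in STOP_WORDS:
                continue
            index[tok].setdefault(doc_id, []).append(pos)
    return index

# Documents are assumed to be segmented into words already.
docs = {
    "d1": ["map", "of", "the", "west", "lake"],
    "d2": ["lake", "water", "quality", "of", "the", "lake"],
}
index = build_inverted_index(docs)
```

The frequency of a term in a document is the length of its position list, so `index["lake"]["d2"]` holds both the positions and, implicitly, the count.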

Natural language processing and retrieval meet at many points, and these combined techniques are widely used in both academia and industry. They include word segmentation, keyword extraction and semantic retrieval.

Summary of the invention

The invention provides a method for optimizing spatial information retrieval based on natural language processing. Its purpose is to improve the results of spatial information retrieval by means of natural language processing algorithms.

A spatial information retrieval method based on natural language processing, comprising:

Step 1, performing word segmentation on the index document and changing the weight of each word obtained by segmentation, to obtain the index document after weight change;

Step 2, receiving a query statement input by the user, performing word segmentation on the query statement and changing the weight of each word obtained by segmentation, to obtain the query statement after weight change;

Step 3, searching for the query statement after weight change in the index document after weight change.

Here, the index document is the text stored in advance on the retrieval platform, and the query statement is the text the user enters when searching. During retrieval, the query statement entered by the user is matched against the index documents, and the matching text is output as the retrieval result. Changing the weights of the words in the index document and the query statement increases the weights of words that express spatial information, which improves retrieval accuracy.

In step 1, the index document is segmented with a global linear model, and in step 2 the query statement is segmented with the same global linear model.

A global linear model models the target sequence on the basis of the observation sequence and solves the sequence labeling problem. It combines the advantages of discriminative and generative models: it takes into account the transition probabilities between context labels, and it performs global parameter optimization and decoding over the whole sequence.

The global linear model is built as follows:

Step 1-1, labeling the corpus, so that in the labeled corpus each character corresponds to one label;

Step 1-2, training a model with preset feature templates and the labeled corpus, to obtain the global linear model.

For the rule-based machine learning, the invention uses a large number of word segmentation samples for geospatial data. These samples contain natural language sentences about spatial information that have already been segmented. Some of the sample sentences come from open-source sample libraries; the others are sentences about spatial geographic information that were labeled manually. Together these sample sentences form the corpus. Labeling the corpus facilitates the subsequent word segmentation.

In step 1-2, the model is trained as follows:

Step 1-21, applying the feature templates to the labeled corpus to generate a feature list for each character;

Step 1-22, extracting the features in each feature list and building the model from the features and their weights, where the initial value of each weight is 0;

Step 1-23, predicting all characters in the labeled corpus with the model, and handling the prediction result for each character as follows:

If the prediction is correct, the next character is predicted;

If the prediction is wrong, the feature weights are updated with an online update algorithm to obtain a new model, and the character is predicted again with the new model, until the prediction is correct or the number of weight updates exceeds a preset value.

The features represent the part of speech of a word; the feature templates contain the part of speech of the word and the part of speech of the previous word. Prediction can be done in many ways. For example, with Viterbi prediction, the error between the predicted value and the actual value of a character is compared with a threshold to judge whether the character was predicted correctly.
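Steps 1-21 and 1-22 can be pictured with a small sketch. The patent does not list its actual feature templates, so the templates below (current character, neighbouring characters, previous label) and the names `char_features` and `score` are assumptions chosen only to show how a feature list and a zero-initialized weight table fit together.

```python
def char_features(chars, i, prev_label):
    """Apply hypothetical feature templates to the character at position i."""
    prev_c = chars[i - 1] if i > 0 else "<s>"
    next_c = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return [
        f"C0={chars[i]}",      # current character
        f"C-1={prev_c}",       # previous character
        f"C+1={next_c}",       # next character
        f"L-1={prev_label}",   # label of the previous character
    ]

def score(weights, feats, label):
    """Linear score of a candidate label: sum of (feature, label) weights."""
    return sum(weights.get((f, label), 0.0) for f in feats)

weights = {}  # every weight implicitly starts at 0, as in step 1-22
feats = char_features("现代化", 0, "<s>")
```

Before any training, every candidate label scores 0, so the first predictions are arbitrary; the weight updates of step 1-23 are what separate the labels.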

In step 1 and step 2, word segmentation is performed as follows:

Step a, inputting the text into the global linear model; the global linear model applies the feature templates to the text and computes the feature list corresponding to the text according to the weights;

Step b, obtaining all possible label combinations from the feature list with a dynamic programming algorithm, and finding the optimal label combination with a backtracking algorithm;

Step c, dividing the text into words according to the optimal label combination;

Here, the text in steps a to c is the index document in step 1 or the query statement in step 2.

Because each character corresponds to one label, the optimal label combination gives the most probable division position for every character in the text, and the text is divided into words (segmented) according to it.

The dynamic programming algorithm is the Viterbi algorithm.

The Viterbi algorithm takes the whole context into account in an optimal way, and therefore yields a better segmentation result.
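A minimal sketch of Viterbi decoding with backtracking over the four labels b, m, e and s is given below. It is not the patent's implementation: the per-character scores in `emit` are hand-set hypothetical numbers standing in for the scores a trained model would produce, and the transition table only encodes which label pairs are legal.

```python
LABELS = ["b", "m", "e", "s"]
NEG_INF = float("-inf")

# Legal label transitions, each with score 0 (assumed for illustration).
TRANS = {(p, t): 0.0 for p, t in [
    ("b", "m"), ("b", "e"), ("m", "m"), ("m", "e"),
    ("e", "b"), ("e", "s"), ("s", "b"), ("s", "s"),
]}

def viterbi(emit, trans=TRANS):
    """emit[i][t] is the score of label t at position i.
    Returns the highest-scoring legal label sequence."""
    best = [{t: emit[0].get(t, 0.0) for t in LABELS}]
    back = [{}]
    for i in range(1, len(emit)):
        best.append({})
        back.append({})
        for t in LABELS:
            # Best previous label for reaching t at position i.
            prev = max(LABELS, key=lambda p: best[i - 1][p] + trans.get((p, t), NEG_INF))
            best[i][t] = best[i - 1][prev] + trans.get((prev, t), NEG_INF) + emit[i].get(t, 0.0)
            back[i][t] = prev
    # Backtrack from the best final label to recover the whole path.
    path = [max(LABELS, key=lambda t: best[-1][t])]
    for i in range(len(emit) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Hypothetical scores for the three characters of "现代化".
emit = [{"b": 2.0, "s": 1.0}, {"m": 2.0, "e": 1.0}, {"e": 2.0, "s": 1.0}]
labels = viterbi(emit)
```

Because each cell stores only the best way to reach a label, the cost grows linearly with the sentence length instead of with the number of possible label combinations.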

In step 1 and step 2, keyword extraction is used to change the weights of words, so that the weights of keywords increase.

Here, a keyword is a word that contains spatial information.

Keyword extraction is performed with the TextRank algorithm.

The TextRank algorithm uses a graph propagation model similar to Google's PageRank and extracts keywords well.
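The idea behind TextRank can be sketched as PageRank over a word co-occurrence graph. The version below is a simplified illustration, not the patent's code: the window size, damping factor, iteration count and the sample word list are all assumptions.

```python
from collections import defaultdict

def textrank(words, window=2, damping=0.85, iterations=50):
    """Score words by PageRank over an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        # Link each word to the distinct words that follow it within the window.
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
    return scores

# A hypothetical, already segmented sentence.
words = ["西湖", "位于", "杭州", "西湖", "景区", "杭州", "西湖"]
scores = textrank(words)
```

Words that co-occur with many other words, here the place names, end up with higher scores, and those scores could then be used to raise the corresponding term weights.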

In step 1 and step 2, a named entity recognition method is used to change the weight of each word obtained by segmentation and to increase the weight of the spatial information nouns in the text. The text is the index document in step 1 and the query statement in step 2.

The named entity recognition method identifies the nouns that express spatial information in the text, so that the retrieval results are more concentrated in the spatial information domain, which improves retrieval efficiency.
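The weighting step can be illustrated as follows. The patent does not specify the recognizer or the size of the weight increase, so this sketch substitutes a small hypothetical gazetteer (`PLACE_NAMES`) for a trained named entity recognizer and uses an assumed boost factor of 2.

```python
# Hypothetical gazetteer standing in for a trained named entity recognizer.
PLACE_NAMES = {"杭州", "西湖", "浙江大学"}

def weight_terms(words, boost=2.0, recognizer=None):
    """Return term weights, boosting words recognized as spatial entities."""
    is_spatial = recognizer or (lambda w: w in PLACE_NAMES)
    weights = {}
    for w in words:
        # Each occurrence adds 1.0, or `boost` for a spatial entity.
        weights[w] = weights.get(w, 0.0) + (boost if is_spatial(w) else 1.0)
    return weights

term_weights = weight_terms(["杭州", "的", "西湖", "西湖"])
```

The same function would be applied to both the index document and the query statement, so that spatial terms dominate the weighted vectors on both sides of the match.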

The method of the invention uses natural language processing tools and applies word segmentation and named entity recognition to the field of spatial information retrieval, which improves the retrieval results.

Brief description of the drawings

Fig. 1 is a schematic diagram of word segmentation with the Viterbi algorithm in one embodiment of the invention;

Fig. 2 shows the result of Chinese word segmentation in the present embodiment of the invention;

Fig. 3 is a flow chart of the method of the invention.

Specific embodiment

Specific embodiments of the invention are described below with reference to the accompanying drawings. Note that the embodiments described here are for illustration only and do not limit the invention.

As shown in Fig. 3, the steps of the embodiment of the invention are as follows:

Both the segmentation of the index document in step 1 and the segmentation of the query statement in step 2 are performed with the global linear model.

The global linear model is built as follows:

Step 1-2, training a model with preset feature templates and the labeled corpus, to obtain the global linear model. The model is trained as follows:

Step 1-21, applying the feature templates to the labeled corpus to generate a feature list for each character. Chinese characters are taken as the example,

Step 1-23, predicting each character in the labeled corpus with the model:

If the prediction is wrong, the feature weights are updated with an online update algorithm to obtain a new model, and step 1-23 is repeated until the prediction is correct or the number of weight updates exceeds a preset value.

In the embodiment of the invention, characters are predicted with the Viterbi algorithm, and the error between the predicted value of a character and the sample value is used to judge whether the prediction is accurate. If the prediction is wrong, that is, the predicted label differs from the actual label, the parameters are at fault for this character and need to be updated. The specific update algorithm is the online passive-aggressive algorithm;

The algorithm terminates when the error of the loop iteration falls below the set threshold, or when the set number of iterations is exceeded.
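One way to picture a passive-aggressive weight update is sketched below. The patent names the algorithm but gives no formulas, so this follows the basic, unbounded passive-aggressive step (step size equal to the hinge loss divided by the squared norm of the feature difference); the function name `pa_update` and the feature strings are hypothetical.

```python
from collections import Counter

def pa_update(weights, gold_feats, pred_feats, cost=1.0):
    """One passive-aggressive step: shift weights toward the gold features.
    Returns the step size tau (0.0 when no update is needed)."""
    diff = Counter(gold_feats)
    diff.subtract(Counter(pred_feats))
    diff = {f: v for f, v in diff.items() if v != 0}
    if not diff:
        return 0.0  # prediction already equals the gold labeling
    margin = sum(weights.get(f, 0.0) * v for f, v in diff.items())
    loss = max(0.0, cost - margin)                    # hinge loss
    tau = loss / sum(v * v for v in diff.values())    # smallest sufficient step
    for f, v in diff.items():
        weights[f] = weights.get(f, 0.0) + tau * v
    return tau

weights = {}
gold = ["C0=现|b", "L-1=<s>|b"]   # features of the correct label
pred = ["C0=现|s", "L-1=<s>|s"]   # features of the wrongly predicted label
tau = pa_update(weights, gold, pred)
```

The update is "passive" when the margin is already large enough (the step size is 0) and "aggressive" otherwise, changing the weights by exactly the amount needed to satisfy the margin on the current example.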

After training ends, the resulting global linear model can be used for prediction. There are many concrete prediction methods; a commonly used one is dynamic programming. As shown in Figure 2, we use a dynamic programming algorithm that infers the label of the current state from the label of the previous state, and finally finds the optimal path with a backtracking algorithm and returns it.

The index document in step 1 and the query statement in step 2 are segmented as follows:

Step a, inputting the text into the global linear model; the global linear model applies the feature templates to the text and computes the feature list corresponding to the text according to the weights.

Step b, obtaining all possible label combinations from the feature list with a dynamic programming algorithm, and finding the optimal label combination with a backtracking algorithm.

In the present embodiment of the invention, the dynamic programming algorithm is the Viterbi algorithm. Fig. 1 is a schematic diagram of the label-based segmentation method, in which the Viterbi algorithm selects the optimal label combination. Fig. 2 shows a labeled sentence, taking Chinese word segmentation as the example. Each character in the sentence (including punctuation marks) corresponds to one label. In the labeled corpus there are only four possible labels: s marks a single-character word, b marks the beginning of a word, m marks the middle of a word, and e marks the end of a word. In the example above, the sentence is divided into:

In the sentence, the character " " forms a word on its own, so it is labeled s. "现代化" (modernization) is a three-character word: "现" corresponds to b and marks the beginning of the word; "代" corresponds to m and marks the middle of the word, which has not yet ended; "化" corresponds to e and marks the end of the word.

Step c, dividing the text into words according to the optimal label combination.
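Step c can be sketched as a single pass over the characters and their labels, closing a word whenever an e or s label is seen. The function name `labels_to_words` and the sample sentence are illustrative assumptions and are not taken from the figures.

```python
def labels_to_words(chars, labels):
    """Split a character sequence into words using b/m/e/s labels."""
    words, current = [], ""
    for ch, label in zip(chars, labels):
        current += ch
        if label in ("e", "s"):   # a word ends here
            words.append(current)
            current = ""
    if current:                   # tolerate a sequence that ends mid-word
        words.append(current)
    return words

segmented = labels_to_words("现代化建设", ["b", "m", "e", "b", "e"])
```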

After segmentation, the weight of each word is changed so that later retrieval can be carried out according to the word weights, which improves retrieval efficiency and accuracy. The weights can be changed by keyword extraction with the TextRank algorithm. In the embodiment of the invention, the weights are changed by named entity recognition: the words that express spatial information in the segmented text are given increased weight, which makes the retrieval more targeted at the specialized domain.

Step 3, searching for the query statement after weight change in the index document after weight change.

After the index document has been weighted, two sentences with higher similarity obtain a higher weight in retrieval and therefore rank nearer the top of the search results. The similarity is computed as follows:

$$\mathrm{sim}(d,q)=\operatorname{cosine}(\vec{d},\vec{q})=\frac{\vec{d}\cdot\vec{q}}{|\vec{d}|\times|\vec{q}|}$$

where $\vec{d}$ represents the index document and $\vec{q}$ represents the query statement. The similarity between the two is computed with the cosine formula, and the weight information is already contained in $\vec{d}$ and $\vec{q}$. Increasing the weights of keywords gives index documents with high similarity a higher score, so that higher-scoring index documents rank nearer the top of the retrieval results, which improves retrieval accuracy.
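The similarity formula can be worked through with sparse weighted term vectors. In this sketch the vectors are plain dictionaries from term to weight, and the sample weights (with the place name 西湖 boosted to 2.0) are hypothetical.

```python
import math

def cosine(d, q):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm = math.sqrt(sum(w * w for w in d.values())) * math.sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0

query = {"西湖": 2.0, "地图": 1.0}
doc1 = {"西湖": 2.0, "地图": 1.0, "的": 1.0}
doc2 = {"杭州": 2.0, "地图": 1.0}

# Rank documents by similarity to the query, highest first.
ranking = sorted(["doc1", "doc2"],
                 key=lambda name: cosine({"doc1": doc1, "doc2": doc2}[name], query),
                 reverse=True)
```

Here `doc1` shares the boosted spatial term with the query and scores about 0.91, while `doc2` shares only the unboosted term and scores 0.2, so `doc1` ranks first.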

The invention combines word segmentation and named entity recognition, applying natural language processing to retrieval in the field of spatial geographic information, and can effectively improve the results of spatial geographic information retrieval.

Claims

1. A spatial information retrieval method based on natural language processing, characterized by comprising:

Step 1, performing word segmentation on the index document and changing the weight of each word obtained by segmentation, to obtain the index document after weight change;

Step 2, receiving a query statement input by the user, performing word segmentation on the query statement and changing the weight of each word obtained by segmentation, to obtain the query statement after weight change;

Step 3, searching for the query statement after weight change in the index document after weight change;

In step 1, the index document is segmented with a global linear model, and in step 2 the query statement is segmented with the global linear model;

The global linear model is built as follows:

Step 1-2, training a model with preset feature templates and the labeled corpus, to obtain the global linear model;

Word segmentation is performed as follows:

Step a, inputting the text into the global linear model; the global linear model applies the feature templates to the text and computes the feature list corresponding to the text according to the weights;

Step b, obtaining all possible label combinations from the feature list with a dynamic programming algorithm, and finding the optimal label combination with a backtracking algorithm;

2. The spatial information retrieval method based on natural language processing as claimed in claim 1, characterized in that in step 1-2 the model is trained as follows:

Step 1-22, extracting the features in each feature list and building the model from the features and their weights, where the initial value of each weight is 0;

Step 1-23, predicting all characters in the labeled corpus with the model, and handling the prediction result for each character as follows:

If the prediction is wrong, the feature weights are updated with an online update algorithm to obtain a new model, and the character is predicted again with the new model, until the prediction is correct or the number of weight updates exceeds a preset value.

3. The spatial information retrieval method based on natural language processing as claimed in claim 1, characterized in that in step b the dynamic programming algorithm is the Viterbi algorithm.

4. The spatial information retrieval method based on natural language processing as claimed in claim 1, characterized in that in step 1 and step 2 keyword extraction is used to change the weights of words, so that the weights of keywords increase.

5. The spatial information retrieval method based on natural language processing as claimed in claim 4, characterized in that keyword extraction is performed with the TextRank algorithm.

6. The spatial information retrieval method based on natural language processing as claimed in claim 1, characterized in that in step 1 and step 2 a named entity recognition method is used to change the weight of each word obtained by segmentation and to increase the weight of the spatial information nouns in the text, the text being the index document in step 1 and the query statement in step 2.