CN103823857A - Space information searching method based on natural language processing

Info

Publication number: CN103823857A
Application number: CN201410059272.9A
Authority: CN
Inventors: 吴朝晖; 高啸; 柳云超; 陈华钧; 郑国轴; 杨建华
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2014-02-21
Filing date: 2014-02-21
Publication date: 2014-05-28
Anticipated expiration: 2034-02-21
Also published as: CN103823857B

Abstract

The invention discloses a spatial information search method based on natural language processing. The method comprises the following steps: (1) performing word segmentation on an index document and changing the weights of the words obtained by segmentation, to obtain a weighted index document; (2) receiving a query statement input by a user, performing word segmentation on the query statement, and changing the weights of the words obtained by segmentation, to obtain a weighted query statement; and (3) searching the weighted index document with the weighted query statement. The method uses natural language processing tools, applying word segmentation and named entity recognition to the field of spatial information retrieval, and thereby improves the retrieval results.

Description

Spatial information search method based on natural language processing

Technical field

The present invention relates to retrieval technology and natural language processing technology, and in particular to a spatial information search method based on natural language processing.

Background technology

Natural language processing is an important direction in the field of artificial intelligence. It mainly studies the theories and methods that allow people and computers to communicate in natural language, and it is a discipline that combines computer science, mathematics and linguistics. In the 1990s, the field of natural language understanding and processing changed considerably: systems were required to process real, large-scale text and to extract useful information from natural language text. Driven by these requirements, large-scale corpora and information-rich large-scale dictionaries were developed, which greatly facilitated low-level applications such as word segmentation and part-of-speech tagging.

A search engine is a system that gathers information from the Internet with specific computer programs according to a certain strategy, organizes and processes that information, provides a retrieval service to users, and displays the information relevant to a user's search. Search engines include full-text indexes, directory indexes, meta search engines, vertical search engines, aggregated search engines, portal search engines, free link lists, and so on.

The work of a modern search engine can be divided into three stages: a collection stage, a preprocessing stage and a query stage. For retrieval in a vertical domain, the collection stage is comparatively simple and usually requires only simple uniform formatting of the metadata. The preprocessing stage, also called the index construction stage, is the most complex stage of a search engine, and most ranking algorithms are applied in this stage. First, the search engine cleans the data to be indexed, performing operations such as word segmentation and stop-word removal. Then comes the most important step, building the inverted index. An inverted index entry consists of a word together with the frequency and positions at which that word occurs in the documents; it is equivalent to building a dictionary over all the data, so that relevant documents can be located quickly from a word. The query stage is the stage in which the search engine is actually used, and all interaction with the user takes place in it. The search engine cleans the user's input, again with operations such as word segmentation and stop-word removal, then substitutes the terms to be retrieved into the inverted index and the scoring formula, and returns the results after ranking.
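The inverted index described above can be illustrated with a minimal Python sketch. The function name `build_inverted_index`, the sample documents and the stop-word set are assumptions made for illustration only; the documents are assumed to have been segmented into words already.

```python
from collections import defaultdict

def build_inverted_index(docs, stop_words=frozenset()):
    """Map each word to {doc_id: [positions]} for documents that are already segmented."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            if tok in stop_words:
                continue  # stop words are cleaned out before indexing
            index[tok].setdefault(doc_id, []).append(pos)
    return dict(index)

# Hypothetical, already segmented sample documents.
docs = {
    "d1": ["west", "lake", "in", "hangzhou"],
    "d2": ["hangzhou", "road", "map"],
}
index = build_inverted_index(docs, stop_words={"in"})
```

In this sketch the frequency of a word in a document is simply the length of its position list, so a lookup such as `index["hangzhou"]` yields both the documents containing the word and where it occurs.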

Natural language processing and retrieval intersect at many points, including word segmentation, keyword extraction and semantic retrieval, all of which are widely used in academia and industry.

Summary of the invention

The invention provides a method for optimizing spatial information retrieval based on natural language processing; its purpose is to use natural language processing algorithms to improve the effectiveness of spatial information retrieval.

A spatial information search method based on natural language processing, comprising:

Step 1, performing word segmentation on the index document and changing the weight of each word obtained by segmentation, to obtain the weighted index document;

Step 2, receiving a query statement input by the user, performing word segmentation on the query statement and changing the weight of each word obtained by segmentation, to obtain the weighted query statement;

Step 3, retrieving the weighted query statement in the weighted index document.

Here, the index document refers to text stored in advance on the retrieval platform, and the query statement refers to the text that the user inputs when searching. During retrieval, the query statement input by the user is matched against the index documents, and the matching text is output as the retrieval result. By changing the weight of each word in the index document and the query statement, the weights of words that represent spatial information are increased, which improves retrieval accuracy.

In step 1, a global linear model is used to segment the index document, and in step 2, a global linear model is used to segment the query statement.

A global linear model models the target sequence on the basis of the observation sequence and solves the sequence labeling problem. It takes into account considerations from both discriminative and generative models, considers the transition probabilities between contextual labels, and performs global parameter optimization and decoding in sequential form.

The global linear model is established as follows:

Step 1-1, annotating the training corpus, so that each character in the annotated corpus corresponds to one label;

Step 1-2, performing model training with preset feature templates and the annotated corpus, to obtain the global linear model.

With regard to rule-based machine learning, the present invention uses a large number of word segmentation samples for spatial geographic data; these samples contain natural language sentences about spatial information that have already been segmented into words. The sample sentences come partly from open-source sample libraries and partly from sentences that were manually annotated for spatial geographic information. Together they form the training corpus. Annotating the corpus facilitates the subsequent word segmentation.
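As an illustration of step 1-1, the following Python sketch turns an already segmented sentence into per-character labels. It assumes the four-label scheme (S, B, M, E) that the embodiment describes; the function name `words_to_tags` and the sample sentence are hypothetical.

```python
def words_to_tags(words):
    """Convert a segmented sentence into (character, label) pairs."""
    pairs = []
    for word in words:
        if not word:
            continue  # ignore empty strings
        if len(word) == 1:
            pairs.append((word, "S"))  # single-character word
        else:
            pairs.append((word[0], "B"))  # beginning of the word
            pairs.extend((ch, "M") for ch in word[1:-1])  # middle characters
            pairs.append((word[-1], "E"))  # end of the word
    return pairs

# Hypothetical sample sentence, already divided into words.
tagged = words_to_tags(["杭州", "西湖", "的", "地图"])
```

Each character of the corpus ends up with exactly one label, which is the form of training data that step 1-2 expects.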

In step 1-2, model training proceeds as follows:

Step 1-21, applying the feature templates to the annotated corpus, to generate a feature list for each character;

Step 1-22, extracting the features from each feature list and building the model from the features and their weights, where the initial value of each weight is 0;

Step 1-23, using the model to predict every character in the annotated corpus, and handling the prediction for each character as follows:

If the prediction is correct, proceed to predict the next character;

If the prediction is wrong, update the feature weights with an online update algorithm to obtain a new model, and predict the character again with the new model, until the prediction is correct or the number of weight updates exceeds a preset value.

A feature represents the part of speech of a word; the feature templates include the part of speech of the current word and the part of speech of the previous word. Many prediction methods are available; for example, the Viterbi algorithm can be used for prediction, and the error between the predicted value and the actual value of a character is compared with a threshold to judge whether the character was predicted correctly.
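The training loop of steps 1-21 to 1-23 can be sketched roughly as follows. This is a simplified illustration, not the method as claimed: it assumes hypothetical character-based templates instead of part-of-speech features, predicts each character greedily rather than with the Viterbi algorithm, and uses a plain additive (perceptron-style) update as a stand-in for the online update algorithm. All names are assumptions.

```python
from collections import defaultdict

TAGS = ("S", "B", "M", "E")

def features(chars, i, tag):
    """Apply hypothetical templates (current, previous, next character) to position i."""
    prev_ch = chars[i - 1] if i > 0 else "<s>"
    next_ch = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return [f"cur={chars[i]}|{tag}", f"prev={prev_ch}|{tag}", f"next={next_ch}|{tag}"]

def predict(weights, chars, i):
    """Pick the label whose features have the highest total weight."""
    return max(TAGS, key=lambda t: sum(weights[f] for f in features(chars, i, t)))

def train(corpus, max_updates=10):
    weights = defaultdict(float)  # every weight starts at 0
    for chars, gold_tags in corpus:
        for i, gold in enumerate(gold_tags):
            for _ in range(max_updates):  # cap on the number of weight updates
                guess = predict(weights, chars, i)
                if guess == gold:
                    break  # correct prediction: move on to the next character
                for f in features(chars, i, gold):
                    weights[f] += 1.0  # reward features of the correct label
                for f in features(chars, i, guess):
                    weights[f] -= 1.0  # penalize features of the wrong label
    return weights

# Hypothetical one-sentence annotated corpus.
corpus = [(list("杭州西湖"), ["B", "E", "B", "E"])]
weights = train(corpus)
```

The `max_updates` parameter plays the role of the preset limit on weight updates mentioned in step 1-23.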

In step 1 and step 2, word segmentation is performed as follows:

Step a, inputting the text into the global linear model, which applies the feature templates to the text and computes the feature list corresponding to the text according to the weights;

Step b, using a dynamic programming algorithm to obtain all possible label combinations from the feature list, and using a backtracking algorithm to find the optimal label combination;

Step c, dividing the text into words according to the optimal label combination;

Here, the text in steps a to c is the index document in step 1 or the query statement in step 2.

Because each character corresponds to one label, the optimal label combination represents the most probable word boundaries in the text, so the text is divided into words (segmented) according to the optimal label combination.
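Step c can be illustrated with a short Python sketch that cuts the text after every character labeled S or E. The four labels are those described in the embodiment; the function name `tags_to_words` is an assumption.

```python
def tags_to_words(chars, tags):
    """Divide a character sequence into words according to its label sequence."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("S", "E"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # keep a trailing unfinished word instead of dropping it
        words.append(current)
    return words

words = tags_to_words("杭州西湖的地图", ["B", "E", "B", "E", "S", "B", "E"])
```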

The dynamic programming algorithm is the Viterbi algorithm.

The Viterbi algorithm takes the whole context into account when searching for the optimum, and therefore yields better segmentation results.
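A minimal sketch of Viterbi decoding with backtracking over the four labels is given below. It assumes that each character already has a per-label score (for example, the sum of its feature weights) and that label transitions are restricted to those the labeling scheme permits; the score values and all names are hypothetical.

```python
TAGS = ("S", "B", "M", "E")
# Label transitions permitted by the labeling scheme (assumed hard constraint).
ALLOWED = {("S", "S"), ("S", "B"), ("B", "M"), ("B", "E"),
           ("M", "M"), ("M", "E"), ("E", "S"), ("E", "B")}
NEG_INF = float("-inf")

def viterbi(emissions):
    """emissions: one {label: score} dict per character. Returns the best label path."""
    if not emissions:
        return []
    # A sentence can only start with a single-character word or a word beginning.
    best = [{t: (emissions[0].get(t, 0.0) if t in ("S", "B") else NEG_INF) for t in TAGS}]
    back = [{}]
    for scores in emissions[1:]:
        row, ptr = {}, {}
        for t in TAGS:
            prev_score, prev_tag = max(
                (best[-1][p], p) for p in TAGS if (p, t) in ALLOWED
            )
            row[t] = prev_score + scores.get(t, 0.0)
            ptr[t] = prev_tag  # remember where the best score came from
        best.append(row)
        back.append(ptr)
    # The last character must close a word; then follow the pointers backwards.
    path = [max(("S", "E"), key=lambda t: best[-1][t])]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical scores for a three-character input.
emissions = [{"B": 2.0, "S": 1.0}, {"M": 1.0, "E": 0.5, "S": 0.9}, {"E": 2.0, "S": 0.5}]
best_path = viterbi(emissions)
```

Because the best score for each label at each position is kept, the optimal combination is found without enumerating every label sequence explicitly.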

In step 1 and step 2, keyword extraction is used to change the weights of words, so that the weights of keywords are increased.

Here, a keyword is a word that contains spatial information.

The TextRank algorithm is used for keyword extraction.

The TextRank algorithm uses a graph propagation model similar to Google's PageRank and extracts keywords well.
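A compact sketch of TextRank-style keyword scoring is shown below. It assumes an undirected co-occurrence graph built with a sliding window over the segmented words and a PageRank-like iteration; the window size, damping factor, iteration count and sample words are illustrative assumptions rather than values from the invention.

```python
from collections import defaultdict

def textrank(tokens, window=2, damping=0.85, iterations=30):
    """Score words by iterating a PageRank-like update over a co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, tok in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != tok:
                neighbors[tok].add(other)  # undirected edge between co-occurring words
                neighbors[other].add(tok)
    scores = {tok: 1.0 for tok in neighbors}
    for _ in range(iterations):
        scores = {
            tok: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[tok])
            for tok in neighbors
        }
    return scores

# Hypothetical segmented text.
tokens = ["西湖", "位于", "杭州", "西湖", "景区", "地图", "西湖"]
scores = textrank(tokens)
```

Words with the highest scores would be treated as keywords, and their weights could then be increased as described above.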

In step 1 and step 2, a named entity recognition method is used to change the weight of each word obtained by segmentation, increasing the weights of spatial information nouns in the text; the text is the index document in step 1 and the query statement in step 2.

Using named entity recognition to identify the nouns that represent spatial information in the text concentrates the retrieval results in the spatial information domain, which improves retrieval efficiency.

The method of the invention uses natural language processing tools and applies word segmentation and named entity recognition to the field of spatial information retrieval, optimizing the retrieval results.
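The weight change based on named entity recognition can be sketched as follows. A small hypothetical place-name set stands in for a trained named entity recognizer here, and the boost factor of 2.0 is an arbitrary assumption; the text does not specify either.

```python
# Hypothetical gazetteer standing in for a trained named entity recognizer.
SPATIAL_TERMS = {"杭州", "西湖", "浙江"}

def weight_tokens(tokens, is_spatial=lambda t: t in SPATIAL_TERMS, boost=2.0):
    """Sum a per-occurrence weight for each word, boosting spatial-information nouns."""
    weights = {}
    for tok in tokens:
        weights[tok] = weights.get(tok, 0.0) + (boost if is_spatial(tok) else 1.0)
    return weights

doc_weights = weight_tokens(["杭州", "的", "地图", "杭州"])
```

The same function would be applied to the segmented index document in step 1 and to the segmented query statement in step 2, with `is_spatial` replaced by the output of an actual recognizer.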

Brief description of the drawings

Fig. 1 is a schematic diagram of word segmentation with the Viterbi algorithm in one embodiment of the invention;

Fig. 2 is a schematic diagram of the result of Chinese word segmentation in the current embodiment of the invention;

Fig. 3 is a flow chart of the method of the invention.

Detailed description of the embodiments

Specific embodiments of the invention are described below with reference to the accompanying drawings. It should be noted that the embodiments described here are for illustration only and do not limit the invention.

As shown in Fig. 3, the steps of the embodiment of the invention are as follows:

Here, both the segmentation of the index document in step 1 and the segmentation of the query statement in step 2 are performed with a global linear model.

The global linear model is established as follows:

Step 1-2, performing model training with preset feature templates and the annotated corpus, to obtain the global linear model. Model training proceeds as follows:

Step 1-21, applying the feature templates to the annotated corpus, to generate a feature list for each character. Taking a Chinese character as an example,

Step 1-23, using the model to predict each character in the annotated corpus:

If the prediction is correct, proceed to predict the next character;

If the prediction is wrong, update the feature weights with an online update algorithm to obtain a new model, and repeat step 1-23 until the prediction is correct or the number of weight updates exceeds a preset value.

In the embodiment of the invention, the Viterbi algorithm is used for character prediction, and whether a prediction is accurate is judged from the error between the predicted value and the sample value of the character. If the prediction is wrong, that is, the predicted label differs from the actual label, the parameters are faulty for this character and need to be updated; the specific update algorithm is the Online Passive-Aggressive algorithm.

The algorithm ends when the error value of the loop iteration falls below the set threshold or the set number of iterations is exceeded.
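One Online Passive-Aggressive update can be sketched as below. The sketch assumes the PA-I variant with a hinge loss of margin 1 and an aggressiveness parameter `C`; which variant and which constants the embodiment uses is not stated, and the feature strings are hypothetical.

```python
from collections import Counter

def pa_update(weights, gold_feats, pred_feats, C=1.0):
    """Apply one PA-I step to a {feature: weight} dict and return it."""
    diff = Counter(gold_feats)
    diff.subtract(Counter(pred_feats))  # feature vector of gold minus prediction
    diff = {f: v for f, v in diff.items() if v != 0}
    if not diff:
        return weights  # identical features: nothing to update
    margin = sum(weights.get(f, 0.0) * v for f, v in diff.items())
    loss = max(0.0, 1.0 - margin)  # hinge loss with margin 1
    tau = min(C, loss / sum(v * v for v in diff.values()))  # step size capped at C
    for f, v in diff.items():
        weights[f] = weights.get(f, 0.0) + tau * v
    return weights

# Hypothetical features of the correct label and of the wrongly predicted label.
w = pa_update({}, ["cur=杭|B"], ["cur=杭|S"])
```

The update is passive when the loss is already zero and otherwise moves the weights just far enough, up to the cap `C`, to separate the correct label from the wrong prediction.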

After model training is finished, the resulting global linear model can be used for prediction. There are many specific prediction methods; a common one is dynamic programming. As shown in Fig. 2, a dynamic programming algorithm is used to infer the label of the current state from the label of the previous state, and a backtracking algorithm is finally used to find and return the optimal path.

The segmentation of the index document in step 1 and of the query statement in step 2 is performed as follows:

Step a, inputting the text into the global linear model, which applies the feature templates to the text and computes the feature list corresponding to the text according to the weights.

Step b, using a dynamic programming algorithm to obtain all possible label combinations from the feature list, and using a backtracking algorithm to find the optimal label combination.

In the current embodiment of the invention, the dynamic programming algorithm is the Viterbi algorithm. Fig. 1 is a schematic diagram of selecting the optimal label combination with the Viterbi algorithm, that is, of the label-based segmentation method. Taking Chinese word segmentation as an example, Fig. 2 shows an annotated sentence in which each character (including punctuation marks) corresponds to one label. In the annotated corpus there are only four possible labels: S denotes a single-character word, B denotes the beginning of a word, M denotes the middle of a word, and E denotes the end of a word. In the above example, the sentence is divided into:

In the sentence, the character " " forms a word on its own, so it is labeled S. The word "现代化" (modernization) consists of three characters: "现" corresponds to B, the beginning of the word; "代" corresponds to M, the middle of the word, meaning the word has not yet ended; and "化" corresponds to E, marking the end of the word.

Step c, dividing the text into words according to the optimal label combination.

After segmentation, the weight of each word is changed so that subsequent retrieval takes the word weights into account, which improves retrieval efficiency and accuracy. One way of changing word weights is keyword extraction with the TextRank algorithm. In the embodiment of the invention, named entity recognition is used to change the weights: the weights of words that represent spatial information in the segmented text are increased, which makes retrieval more targeted at the professional domain.

Step 3, retrieving the weighted query statement in the weighted index document.

After the index document and the query statement have been weighted, a document and a query that are more similar obtain a higher score at retrieval time and therefore rank higher in the search results. The similarity is computed as follows:

$$\mathrm{sim}(d,q)=\cos(\vec{d},\vec{q})=\frac{\vec{d}\cdot\vec{q}}{|\vec{d}|\times|\vec{q}|}$$

where $\vec{d}$ denotes the index document and $\vec{q}$ denotes the query statement. The similarity between the two is computed as the cosine of the angle between the two vectors, and the weight information is already contained in $\vec{d}$ and $\vec{q}$. By increasing the weights of keywords, index documents with high similarity obtain higher scores, so that higher-scoring index documents rank higher in the retrieval results, which improves retrieval accuracy.
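The cosine similarity between a weighted index document and a weighted query statement can be computed as in the following sketch. Both are assumed to be represented as dictionaries mapping words to weights; the function name and the sample weights are illustrative assumptions.

```python
import math

def cosine(d, q):
    """Cosine of the angle between two sparse weight vectors given as {word: weight}."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm = math.sqrt(sum(w * w for w in d.values())) * math.sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0  # an empty vector is given similarity 0

# Hypothetical weighted query and documents; spatial words carry weight 2.0.
query = {"杭州": 2.0, "地图": 1.0}
doc_a = {"杭州": 2.0, "地图": 1.0}
doc_b = {"北京": 2.0, "地图": 1.0}
ranked = sorted([("a", doc_a), ("b", doc_b)], key=lambda item: cosine(item[1], query), reverse=True)
```

Because the spatial word carries the larger weight, a document that shares it with the query scores well above a document that only shares an ordinary word.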

The invention combines word segmentation with named entity recognition and applies natural language processing to retrieval in the spatial geographic information domain, effectively improving the retrieval of spatial geographic information.

Claims

1. A spatial information search method based on natural language processing, characterized by comprising:

2. The spatial information search method based on natural language processing according to claim 1, characterized in that in step 1 a global linear model is used to segment the index document, and in step 2 a global linear model is used to segment the query statement.

3. The spatial information search method based on natural language processing according to claim 2, characterized in that the global linear model is established as follows:

4. The spatial information search method based on natural language processing according to claim 3, characterized in that in step 1-2 model training proceeds as follows:

If the prediction is correct, proceed to predict the next character;

5. The spatial information search method based on natural language processing according to claim 4, characterized in that in step 1 and step 2 word segmentation is performed as follows:

Step c, dividing the text into words according to the optimal label combination;

6. The spatial information search method based on natural language processing according to claim 5, characterized in that in step b the dynamic programming algorithm is the Viterbi algorithm.

7. The spatial information search method based on natural language processing according to claim 1, characterized in that in step 1 and step 2 keyword extraction is used to change the weights of words, so that the weights of keywords are increased.

8. The spatial information search method based on natural language processing according to claim 7, characterized in that the TextRank algorithm is used for keyword extraction.

9. The spatial information search method based on natural language processing according to claim 1, characterized in that in step 1 and step 2 a named entity recognition method is used to change the weight of each word obtained by segmentation, increasing the weights of spatial information nouns in the text, the text being the index document in step 1 and the query statement in step 2.