CN103823857B - Space information searching method based on natural language processing - Google Patents

Space information searching method based on natural language processing

Info

Publication number
CN103823857B
Authority
CN
China
Prior art keywords
weight
participle
word
natural language
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410059272.9A
Other languages
Chinese (zh)
Other versions
CN103823857A (en)
Inventor
吴朝晖
高啸
柳云超
陈华钧
郑国轴
杨建华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201410059272.9A
Publication of CN103823857A
Application granted
Publication of CN103823857B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a spatial information retrieval method based on natural language processing. The method comprises the following steps: (1) performing word segmentation on an index document and changing the weights of the words obtained by segmentation, to obtain a weighted index document; (2) receiving a query statement input by a user, performing word segmentation on the query statement and changing the weights of the words obtained by segmentation, to obtain a weighted query statement; and (3) searching for the weighted query statement in the weighted index document. The method uses natural language processing tools and applies word segmentation and named entity recognition to the field of spatial information retrieval, optimizing the retrieval effect.

Description

Spatial information retrieval method based on natural language processing
Technical Field
The present invention relates to retrieval technology and natural language processing technology, and more particularly to a spatial information retrieval method based on natural language processing.
Background
Natural language processing is an important direction in the field of artificial intelligence. It mainly studies the theories and methods that allow people and computers to communicate in natural language, and it is a discipline that integrates computer science, mathematics and linguistics. In the 1990s the field of natural language understanding and processing changed dramatically: systems were required to process truly large-scale text and to extract useful information from natural language text. These requirements drove the development of real large-scale corpora and the compilation of large-scale, information-rich dictionaries, which in turn greatly facilitated low-level applications such as word segmentation and part-of-speech tagging.
A search engine is a system that collects information from the Internet with specific computer programs according to a certain strategy, organizes and processes that information, provides a retrieval service to users, and presents the information relevant to a user's search. Search engines include full-text indexes, directory indexes, meta search engines, vertical search engines, aggregated search engines, portal search engines, free link lists, and so on.
The work of a modern search engine can be divided into three stages: collection, preprocessing and query. For retrieval in a vertical field, the collection stage is relatively simple and usually only requires unifying the format of the metadata. The preprocessing stage, also called the index construction stage, is the most complex stage of a search engine, and most ranking algorithms are applied here. First, the search engine cleans the data to be indexed, performing operations such as word segmentation and stop-word removal. Then comes the most important step, building the inverted index. An inverted index entry consists of a word together with the frequency and positions at which that word occurs in the documents; it is equivalent to building a dictionary over all the data, so that relevant documents can be looked up quickly by word. The query stage is when the search engine is actually used, and all interaction with the user takes place in this stage. The search engine cleans the user input, again using word segmentation, stop-word removal and similar operations, then feeds the terms to be retrieved into the inverted index and the scoring formula, and returns the results after ranking.
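As an illustration of the inverted index described above, the following minimal Python sketch maps each word to the documents and positions where it occurs. The function name `build_inverted_index`, the `STOP_WORDS` set and the sample documents are assumptions made for this example; they do not come from the patent, and the input is assumed to be already segmented.

```python
from collections import defaultdict

# Assumed stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "in"}

def build_inverted_index(docs):
    """Map each word to {doc_id: [positions]} for already-segmented documents."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            if tok in STOP_WORDS:
                continue  # stop words are dropped during cleaning
            index[tok].setdefault(doc_id, []).append(pos)
    return index

# Hypothetical, already-segmented index documents.
docs = {
    "d1": ["map", "of", "hangzhou"],
    "d2": ["hangzhou", "road", "network", "in", "hangzhou"],
}
index = build_inverted_index(docs)

# Frequency and positions of one word in every document that contains it.
for doc_id, positions in index["hangzhou"].items():
    print(doc_id, len(positions), positions)
```

Looking up a query term is then a single dictionary access, which is why the index can locate relevant documents quickly by word.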
Natural language processing and retrieval meet at many points, and the resulting techniques are widely used in both academia and industry, including word segmentation, keyword extraction and semantic retrieval.
Summary of the Invention
The present invention provides a spatial information retrieval optimization method based on natural language processing, whose purpose is to improve the effect of spatial information retrieval by using natural language processing algorithms.
A spatial information retrieval method based on natural language processing, comprising:
Step 1, performing word segmentation on the index document and changing the weight of each word obtained by segmentation, to obtain a weight-changed index document;
Step 2, receiving a query statement input by a user, performing word segmentation on the query statement and changing the weight of each word obtained by segmentation, to obtain a weight-changed query statement;
Step 3, retrieving the weight-changed query statement in the weight-changed index document.
Here, an index document is text stored in advance on the retrieval platform, and a query statement is the text a user enters when searching. During retrieval, the query statement entered by the user is matched against the index documents, and the matching text is output as the retrieval result. Changing the weight of each word in the index document and the query statement increases the weight of words that represent spatial information, which improves retrieval accuracy.
In step 1, the index document is segmented using a global linear model, and in step 2 the query statement is segmented using the global linear model.
The global linear model models the target sequence on the basis of the observation sequence and solves the sequence labeling problem. It combines the advantages of discriminative and generative models: it takes the transition probabilities between context tags into account, and it performs global parameter optimization and decoding over the whole sequence.
The global linear model is established as follows:
Step 1-1, annotating the corpus, so that each character in the annotated corpus corresponds to one tag;
Step 1-2, training a model using preset feature templates and the annotated corpus, to obtain the global linear model.
In terms of rule-based machine learning, the invention uses a large number of word segmentation samples for geospatial data; these samples contain natural language sentences about spatial information that have already been segmented. The sample sentences include, on the one hand, sentences from open-source sample libraries and, on the other hand, manually annotated sentences about spatial geographic information. Together these sample sentences form the corpus. Annotating the corpus facilitates the subsequent word segmentation.
In step 1-2, the model is trained as follows:
Step 1-21, applying the feature templates to the annotated corpus to generate a feature list for each character;
Step 1-22, extracting the features in each feature list and building a model from the features and their weights, where the initial value of each weight is 0;
Step 1-23, using the model to predict all characters in the annotated corpus, and handling the prediction result for each character as follows:
if the prediction is correct, proceeding to predict the next character;
if the prediction is wrong, updating the feature weights with an online update algorithm to obtain a new model, and predicting the character again with the new model, until the prediction is correct or the number of weight updates exceeds a preset value.
A feature represents the part of speech of a word; the feature templates contain the part of speech of the current word and the part of speech of the previous word. Prediction can be done in many ways. For example, the Viterbi algorithm can be used for prediction, and the error between the predicted value and the actual value of a character is compared with a threshold to determine whether the character was predicted correctly.
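The training loop of steps 1-21 to 1-23 can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the feature template in `features` (current, previous and next character) is hypothetical, prediction is a per-character argmax rather than Viterbi decoding, and the update is a basic multiclass passive-aggressive step. The one-sentence corpus is an assumed sample.

```python
from collections import defaultdict

TAGS = ["S", "B", "M", "E"]  # single, begin, middle, end

def features(chars, i):
    """Hypothetical feature template: current, previous and next character."""
    prev_c = chars[i - 1] if i > 0 else "<s>"
    next_c = chars[i + 1] if i + 1 < len(chars) else "</s>"
    return [f"cur={chars[i]}", f"prev={prev_c}", f"next={next_c}"]

def score(weights, feats, tag):
    return sum(weights[(f, tag)] for f in feats)

def predict(weights, feats):
    # Simplification: greedy per-character prediction instead of Viterbi.
    return max(TAGS, key=lambda t: score(weights, feats, t))

def train(corpus, max_updates=5, epochs=3):
    weights = defaultdict(float)  # step 1-22: every weight starts at 0
    for _ in range(epochs):
        for chars, gold_tags in corpus:
            for i, gold in enumerate(gold_tags):
                feats = features(chars, i)  # step 1-21
                for _ in range(max_updates):  # step 1-23
                    guess = predict(weights, feats)
                    if guess == gold:
                        break  # correct: move on to the next character
                    # Passive-aggressive step: close the margin violation.
                    loss = 1.0 - (score(weights, feats, gold)
                                  - score(weights, feats, guess))
                    tau = loss / (2.0 * len(feats))
                    for f in feats:
                        weights[(f, gold)] += tau
                        weights[(f, guess)] -= tau
    return weights

# Assumed sample: one annotated sentence, one tag per character.
corpus = [(list("在现代化战舰上"), list("SBMEBES"))]
weights = train(corpus)
print([predict(weights, features(corpus[0][0], i)) for i in range(7)])
```

The inner loop stops either when the character is predicted correctly or when `max_updates` is reached, mirroring the two termination conditions in step 1-23.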
In step 1 and step 2, word segmentation is performed as follows:
Step a, inputting the text into the global linear model; the global linear model applies the feature templates to the text and computes the feature list corresponding to the text according to the weights;
Step b, obtaining all possible tag combinations from the feature list with a dynamic programming algorithm, and finding the optimal tag combination with a backtracking algorithm;
Step c, dividing the text into words according to the optimal tag combination;
where the text in steps a to c is the index document in step 1 or the query statement in step 2.
Because each character corresponds to one tag, the optimal tag combination indicates the most likely division position for each character in the text, so the text can be divided into words (segmented) according to the optimal tag combination.
The dynamic programming algorithm is the Viterbi algorithm.
The Viterbi algorithm takes the whole context into account in the best way, and therefore yields better segmentation results.
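A minimal sketch of step b, assuming the model has already produced a per-character score for each tag. The `ALLOWED` transition table and the toy `emissions` values are assumptions for illustration; a trained model would supply real scores, possibly including learned transition weights.

```python
TAGS = ["S", "B", "M", "E"]
NEG_INF = float("-inf")

# Tag transitions that are valid in the S/B/M/E scheme.
ALLOWED = {
    "S": {"S", "B"},
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"S", "B"},
}

def viterbi(emissions):
    """emissions: one {tag: score} dict per character. Returns the best tag path."""
    n = len(emissions)
    if n == 0:
        return []
    best = [{t: NEG_INF for t in TAGS} for _ in range(n)]
    back = [{t: None for t in TAGS} for _ in range(n)]
    for t in ("S", "B"):  # a sentence can only start with S or B
        best[0][t] = emissions[0].get(t, 0.0)
    for i in range(1, n):
        for prev in TAGS:
            if best[i - 1][prev] == NEG_INF:
                continue
            for cur in ALLOWED[prev]:
                s = best[i - 1][prev] + emissions[i].get(cur, 0.0)
                if s > best[i][cur]:
                    best[i][cur] = s
                    back[i][cur] = prev
    # A sentence can only end with S or E; backtrack from the better one.
    last = max(("S", "E"), key=lambda t: best[n - 1][t])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path

# Toy scores for a three-character word: the middle character locally
# prefers S, but the globally best valid path is B, M, E.
emissions = [{"B": 2.0, "S": 1.0}, {"S": 1.6, "M": 1.5}, {"E": 2.0}]
print(viterbi(emissions))
```

In this toy input a per-character argmax would pick B, S, E, which is not a valid tag sequence; decoding over the whole sequence avoids that, which is the sense in which the whole context is taken into account.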
In step 1 and step 2, keyword extraction is used to change the weights of words, so that the weight of keywords is increased.
Here, a keyword is a word that contains spatial information.
Keyword extraction is performed using the TextRank algorithm.
The TextRank algorithm uses a graph propagation model similar to Google's PageRank and extracts keywords well.
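For illustration, a compact TextRank-style scorer can be written as below. It is a sketch under assumptions: the co-occurrence window, damping factor, iteration count and the sample token list are all chosen for this example and are not specified by the patent.

```python
from collections import defaultdict

def textrank(tokens, window=2, damping=0.85, iterations=30):
    """Score words by PageRank-style propagation over a co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, w in enumerate(tokens):
        # Link each word to the words that follow it within the window.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:
                neighbors[w].add(tokens[j])
                neighbors[tokens[j]].add(w)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            w: (1 - damping)
            + damping * sum(scores[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
    return scores

# Assumed sample of segmented text in which "lake" co-occurs with every other word.
tokens = ["west", "lake", "hangzhou", "lake", "map", "lake", "tour"]
scores = textrank(tokens, window=1)
print(max(scores, key=scores.get))
```

The highest-scoring words would then be treated as keywords and given a larger weight before indexing or querying.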
In step 1 and step 2, a named entity recognition method is used to change the weight of each word after segmentation, increasing the weight of spatial information nouns in the text; the text is the index document in step 1 and the query statement in step 2.
Named entity recognition identifies the nouns that represent spatial information in the text, so that retrieval results are more concentrated in the spatial information domain, which improves retrieval efficiency.
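The weight change can be pictured with the sketch below. The patent does not say how much the weight is increased or how the recognizer is implemented, so the `boost` factor, the term-frequency base weight and the small `GAZETTEER` set standing in for a real named entity recognizer are all assumptions.

```python
from collections import Counter

def boost_spatial_weights(tokens, is_spatial, boost=2.0):
    """Start from term frequency and multiply the weight of spatial terms."""
    weights = {tok: float(count) for tok, count in Counter(tokens).items()}
    for tok in weights:
        if is_spatial(tok):
            weights[tok] *= boost
    return weights

# Stand-in for a named entity recognizer: a tiny, hypothetical place-name list.
GAZETTEER = {"hangzhou", "west_lake"}

tokens = ["hangzhou", "road", "map", "hangzhou"]
weights = boost_spatial_weights(tokens, lambda tok: tok in GAZETTEER)
print(weights)
```

The same function can be applied to both the segmented index document and the segmented query statement, so that spatial terms dominate on both sides of the match.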
The method of the invention uses natural language processing tools and applies word segmentation and named entity recognition to the field of spatial information retrieval, optimizing the retrieval effect.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the method of performing word segmentation with the Viterbi algorithm in one embodiment of the invention;
Fig. 2 is a diagram showing the effect of Chinese word segmentation in the present embodiment of the invention;
Fig. 3 is a flow chart of the method of the invention.
Detailed Description
Specific embodiments of the invention are described below with reference to the drawings. It should be noted that the embodiments described here are for illustration only and do not limit the invention.
As shown in Fig. 3, the steps of the embodiment of the invention are as follows:
Step 1, performing word segmentation on the index document and changing the weight of each word obtained by segmentation, to obtain a weight-changed index document;
Step 2, receiving a query statement input by a user, performing word segmentation on the query statement and changing the weight of each word obtained by segmentation, to obtain a weight-changed query statement;
Here, both the segmentation of the index document in step 1 and the segmentation of the query statement in step 2 are performed with the global linear model.
The global linear model is established as follows:
Step 1-1, annotating the corpus, so that each character in the annotated corpus corresponds to one tag;
Step 1-2, training a model using preset feature templates and the annotated corpus, to obtain the global linear model. The model is trained as follows:
Step 1-21, applying the feature templates to the annotated corpus to generate a feature list for each character. Taking a single Chinese character as an example,
Step 1-22, extracting the features in each feature list and building a model from the features and their weights, where the initial value of each weight is 0;
Step 1-23, using the model to predict each character in the annotated corpus:
if the prediction is correct, proceeding to predict the next character;
if the prediction is wrong, updating the feature weights with the online update algorithm to obtain a new model, and repeating step 1-23 until the prediction is correct or the number of weight updates exceeds a preset value.
In the embodiment of the invention, each character is predicted with the Viterbi algorithm, and whether the prediction is accurate is judged from the error between the predicted value and the sample value of the character. If the prediction is wrong, that is, the predicted tag differs from the actual tag, the parameters are unsuitable for predicting this character and need to be updated; the specific update algorithm is the online passive-aggressive algorithm;
The algorithm terminates when the error of the loop iteration falls below the set threshold or the set number of iterations is exceeded.
After model training finishes, the resulting global linear model can be used for prediction. There are many concrete prediction methods; a commonly used one is dynamic programming. As shown in Fig. 2, a dynamic programming algorithm is used: the tag of the current state is inferred from the tag of the previous state, and finally the optimal path is found by a backtracking algorithm and returned.
The segmentation of the index document in step 1 and of the query statement in step 2 is performed as follows:
Step a, inputting the text into the global linear model; the global linear model applies the feature templates to the text and computes the feature list corresponding to the text according to the weights.
Step b, obtaining all possible tag combinations from the feature list with a dynamic programming algorithm, and finding the optimal tag combination with a backtracking algorithm.
In the present embodiment of the invention, the dynamic programming algorithm is the Viterbi algorithm. Fig. 1 is a schematic diagram of selecting the optimal tag combination with the Viterbi algorithm, that is, of the tagging-based segmentation method. Taking Chinese word segmentation as an example, Fig. 2 shows an annotated sentence in which each character (including punctuation marks) corresponds to one tag. In the annotated corpus there are only four possible tags: S denotes a single-character word, B the beginning of a word, M the middle of a word, and E the end of a word. In the example above, the sentence (given here as a word-by-word gloss) is divided into:
In | modernization | battleship | upper | , | or not there is | technology | simple | | post.
In the sentence, the first word ("in") is a single character that forms a word on its own, so it is tagged S. "Modernization" consists of three characters: the first corresponds to B, marking the beginning of the word; the second corresponds to M, marking the middle of the word, which has not yet ended; and the third corresponds to E, marking the end of the word.
Step c, dividing the text into words according to the optimal tag combination.
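Step c can be sketched as a single pass over the characters and their tags: a word is closed whenever the tag is S or E. The function name `tags_to_words` and the sample string are assumptions chosen for this illustration.

```python
def tags_to_words(chars, tags):
    """Cut a character sequence into words using S/B/M/E tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("S", "E"):  # S and E both close the current word
            words.append(current)
            current = ""
    if current:  # keep a trailing, unterminated word rather than dropping it
        words.append(current)
    return words

# Assumed sample: seven characters with one tag each.
print(tags_to_words("在现代化战舰上", list("SBMEBES")))
```

The final `if current` branch is a defensive choice for malformed tag sequences; with a valid optimal tag combination it is never taken.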
After segmentation is complete, the weight of each word is changed so that later retrieval can be carried out according to word weights, improving retrieval efficiency and accuracy. The weights can be changed by keyword extraction using the TextRank algorithm. In the embodiment of the invention, the weights are changed by named entity recognition: words that represent spatial information in the segmented text are given more weight, which makes retrieval more targeted to the specialized domain.
Step 3, retrieving the weight-changed query statement in the weight-changed index document.
After the index documents are weighted, two sentences with higher similarity obtain a higher weight in retrieval and are therefore ranked nearer the top of the search results. The similarity is computed as follows:
sim(d,q)=cosine(d,q)=(d·q)/(|d|×|q|)
where d denotes the index document and q denotes the query statement. The similarity between the two is computed with the cosine formula, and the weight information is already contained in d and q. Increasing the weight of keywords gives index documents with high similarity a higher score, so that higher-scoring index documents are ranked nearer the top of the retrieval results, which improves retrieval accuracy.
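A small worked example of this similarity, representing d and q as sparse dictionaries of word weights. The vectors below are assumed values: the word "hangzhou" is given weight 2 in the query and in the first document to mimic a boosted spatial term, while all other words keep weight 1.

```python
import math

def cosine(d, q):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

# Assumed weighted vectors; "hangzhou" carries a boosted weight of 2.
q = {"hangzhou": 2.0, "map": 1.0}
d1 = {"hangzhou": 2.0, "road": 1.0}
d2 = {"map": 1.0, "road": 1.0}

ranked = sorted({"d1": d1, "d2": d2}.items(),
                key=lambda item: cosine(item[1], q), reverse=True)
print([(name, round(cosine(vec, q), 3)) for name, vec in ranked])
```

With every weight equal to 1, both documents would score 0.5 against the query. After the boost, the document that shares the spatial term scores 0.8 and the other about 0.316, so the spatially relevant document ranks first.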
The invention combines word segmentation and named entity recognition and applies natural language processing to the retrieval of spatial geographic information, effectively improving the retrieval of spatial geographic information.

Claims (6)

1. A spatial information retrieval method based on natural language processing, characterized by comprising:
Step 1, performing word segmentation on the index document and changing the weight of each word obtained by segmentation, to obtain a weight-changed index document;
Step 2, receiving a query statement input by a user, performing word segmentation on the query statement and changing the weight of each word obtained by segmentation, to obtain a weight-changed query statement;
Step 3, retrieving the weight-changed query statement in the weight-changed index document;
wherein in step 1 the index document is segmented using a global linear model, and in step 2 the query statement is segmented using the global linear model;
the global linear model is established as follows:
Step 1-1, annotating the corpus, so that each character in the annotated corpus corresponds to one tag;
Step 1-2, training a model using preset feature templates and the annotated corpus, to obtain the global linear model;
word segmentation is performed as follows:
Step a, inputting the text into the global linear model; the global linear model applies the feature templates to the text and computes the feature list corresponding to the text according to the weights;
Step b, obtaining all possible tag combinations from the feature list with a dynamic programming algorithm, and finding the optimal tag combination with a backtracking algorithm;
Step c, dividing the text into words according to the optimal tag combination;
where the text in steps a to c is the index document in step 1 or the query statement in step 2.
2. The spatial information retrieval method based on natural language processing according to claim 1, characterized in that, in step 1-2, the model is trained as follows:
Step 1-21, applying the feature templates to the annotated corpus to generate a feature list for each character;
Step 1-22, extracting the features in each feature list and building a model from the features and their weights, where the initial value of each weight is 0;
Step 1-23, using the model to predict all characters in the annotated corpus, and handling the prediction result for each character as follows:
if the prediction is correct, proceeding to predict the next character;
if the prediction is wrong, updating the feature weights with an online update algorithm to obtain a new model, and predicting the character again with the new model, until the prediction is correct or the number of weight updates exceeds a preset value.
3. The spatial information retrieval method based on natural language processing according to claim 1, characterized in that, in step b, the dynamic programming algorithm is the Viterbi algorithm.
4. The spatial information retrieval method based on natural language processing according to claim 1, characterized in that, in step 1 and step 2, keyword extraction is used to change the weights of words, so that the weight of keywords is increased.
5. The spatial information retrieval method based on natural language processing according to claim 4, characterized in that keyword extraction is performed using the TextRank algorithm.
6. The spatial information retrieval method based on natural language processing according to claim 1, characterized in that, in step 1 and step 2, a named entity recognition method is used to change the weight of each word after segmentation, increasing the weight of spatial information nouns in the text; the text is the index document in step 1 and the query statement in step 2.
CN201410059272.9A 2014-02-21 2014-02-21 Space information searching method based on natural language processing Active CN103823857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410059272.9A CN103823857B (en) 2014-02-21 2014-02-21 Space information searching method based on natural language processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410059272.9A CN103823857B (en) 2014-02-21 2014-02-21 Space information searching method based on natural language processing

Publications (2)

Publication Number Publication Date
CN103823857A CN103823857A (en) 2014-05-28
CN103823857B true CN103823857B (en) 2017-02-01

Family

ID=50758921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410059272.9A Active CN103823857B (en) 2014-02-21 2014-02-21 Space information searching method based on natural language processing

Country Status (1)

Country Link
CN (1) CN103823857B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008166B (en) * 2014-05-30 2017-05-24 华东师范大学 Dialogue short text clustering method based on form and semantic similarity
CN104268144B (en) * 2014-08-12 2017-08-29 华东师范大学 A kind of building method of electronic health record query statement
CN106970922A (en) * 2016-01-14 2017-07-21 北大方正集团有限公司 Index establishing method, search method and directory system based on multi-field keyword
US10824630B2 (en) 2016-10-26 2020-11-03 Google Llc Search and retrieval of structured information cards
CN106372063A (en) * 2016-11-01 2017-02-01 上海智臻智能网络科技股份有限公司 Information processing method and device and terminal
CN108897861A (en) * 2018-07-01 2018-11-27 东莞市华睿电子科技有限公司 A kind of information search method
CN110705249B (en) * 2019-09-03 2023-04-11 东南大学 NLP library combined use method based on overlapping degree calculation
CN111259145B (en) * 2020-01-16 2023-05-12 广西计算中心有限责任公司 Text retrieval classification method, system and storage medium based on information data
US11537660B2 (en) * 2020-06-18 2022-12-27 International Business Machines Corporation Targeted partial re-enrichment of a corpus based on NLP model enhancements
CN112183087B (en) * 2020-09-27 2024-05-28 武汉华工安鼎信息技术有限责任公司 System and method for identifying sensitive text
CN113688213B (en) * 2021-02-09 2023-09-29 鼎捷软件股份有限公司 Application program interface service searching system and searching method thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530415A (en) * 2013-10-29 2014-01-22 谭永 Natural language search method and system compatible with keyword search
CN103544309A (en) * 2013-11-04 2014-01-29 北京中搜网络技术股份有限公司 Splitting method for search string of Chinese vertical search

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant