CN109271502A

CN109271502A - Method and device for classifying spatial query themes based on natural language processing

Info

Publication number: CN109271502A
Application number: CN201811116358.5A
Authority: CN
Inventors: 呙维; 赵雨慧; 李铭; 朱欣焰
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2018-09-25
Filing date: 2018-09-25
Publication date: 2019-01-25
Anticipated expiration: 2038-09-25
Also published as: CN109271502B

Abstract

The present invention provides a method and device for classifying spatial query themes based on natural language processing. In the method, the natural language input by a user is divided into a set of words by means of partition words, and the words in the set are then successively feature-matched and rearranged into a semantic sequence. According to the result of theme training, the sample closest to the input natural language is then found and its theme is returned, thereby classifying the spatial query theme of the natural language. The technical effect of improving the accuracy of theme extraction is achieved.

Description

Method and device for classifying spatial query themes based on natural language processing

Technical field

The present invention relates to the technical field of natural language processing, and in particular to a method and device for classifying spatial query themes based on natural language processing.

Background technique

With the rapid development of the new-generation information technology industry, personal intelligent assistants have become popular applications for improving quality of life. Based on user input, a personal intelligent assistant can carry out operational commands through natural language understanding and automated information processing. Natural language processing is a subdiscipline of artificial intelligence; it comprises the theory and technology of processing human language by machine, treating language as the object of computation and studying the corresponding algorithms. Its purpose is to let humans interact with computer systems in natural language, so that information can be managed more conveniently and effectively. From the late 1990s to the beginning of the 21st century, it was gradually recognized that natural language processing cannot be carried out successfully with rule-based methods alone or with statistics-based methods alone. Example-based and rule-based corpus techniques subsequently emerged.

Existing natural language processing techniques mainly fall into two major types, rule-based methods and probability-based methods, which can be further subdivided into methods based on Bayesian principles, methods based on hidden Markov models, discourse analysis methods, neural network methods, and so on. However, as the demand for information services keeps growing, the semantic understanding of natural language still suffers from problems such as difficult theme extraction and theme ambiguity.

With the continuous development of natural language understanding technology, research on place query languages has also made some progress and achieved many notable results. This mainly includes research on the lexical, syntactic and semantic content and methods of place query languages, and research on and refinement of the spatial relation semantics of place query languages. The theories and methods of natural language cognition and processing have become mature, but there is still little research on natural language in the spatial domain. In the course of implementing the present invention, the applicant found that existing methods mainly have the following two problems. First, because the lexicon, syntax and semantics of natural language are flexible and complex, ambiguity arises frequently during interpretation; most existing research is applicable only to geographic information systems in certain specific professional fields, and natural language has therefore long been regarded merely as a supplement to other spatial query modes. Second, research on and accumulation of domain knowledge for place query languages is limited, so the spatial lexicon has not been summarized systematically, the syntactic analysis of spatial queries is incomplete, and there are few existing research results on spatial semantics. At present, in natural language understanding technology, the extraction of spatial information themes produces many ambiguities and errors in the interpretation results.

It can be seen from the above that the methods in the prior art have the technical problem that theme extraction results are inaccurate.

Summary of the invention

In view of this, the present invention provides a method and device for classifying spatial query themes based on natural language processing, so as to solve, or at least partly solve, the technical problem that theme extraction results are inaccurate in the methods of the prior art.

A first aspect of the present invention provides a method for classifying spatial query themes based on natural language processing, comprising:

Step S1: dividing the natural language to be processed into a set of words based on preset partition words;

Step S2: performing feature matching between the words in the set of words and a conceptual lexicon constructed in advance, to obtain a word sequence corresponding to a preset structure;

Step S3: searching a theme training result set for the sample closest to the natural language to be processed, wherein the theme training result set is obtained by training natural language samples collected in advance with the word sequence, and each sample contains text and a query theme; returning the query theme contained in the sample, and taking the query theme as the classification result.

In one implementation, the preset partition words include: action verbs, prepositions, subjects, feature words and interrogatives.

In one implementation, the conceptual lexicon constructed in advance includes points of interest, service attributes, service attribute evaluations, spatial relations, action verbs, times, persons, place queries, evaluation queries and business queries.

In one implementation, the preset structure is "theme-verb-point of interest-verb-article", and step S2 specifically includes:

Step S2.1: performing feature matching between the words in the set of words and the conceptual lexicon constructed in advance, to obtain feature words;

Step S2.2: converting the feature words into a word sequence with the "theme-verb-point of interest-verb-article" structure.

In one implementation, the theme training result set is obtained by training the natural language samples collected in advance with the word sequence, specifically:

obtaining training samples containing theme information;

creating an Elasticsearch index and mapping for the training samples, wherein the mapping includes [first text, theme, ID, row], the first text is the word sequence in the form of a space-separated list, the theme is the theme number of the training sample, the ID is the ID of the training sample, and the row is the natural language to be processed;

traversing all mappings and replacing the words contained in the training samples with the feature words in the conceptual lexicon constructed in advance, to obtain a second text;

traversing all mappings, constructing [first text, second text, theme, ID] for each training sample, and inserting it into Elasticsearch for training, to obtain the theme training result set.
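By way of non-limiting illustration, the construction of the [first text, second text, theme, ID] records can be sketched in Python as follows. The field names (first_text, second_text, theme, id), the small lexicon and the function build_document are hypothetical, and the sketch only builds plain dictionaries rather than calling an Elasticsearch client. The index body is written in the typeless form accepted by recent Elasticsearch versions, using the text and keyword field types.

```python
# Hypothetical index body; the field names are illustrative and not taken from the source.
INDEX_BODY = {
    "mappings": {
        "properties": {
            "first_text": {"type": "text"},
            "second_text": {"type": "text"},
            "theme": {"type": "keyword"},
            "id": {"type": "keyword"},
        }
    }
}

# Hypothetical conceptual lexicon: word -> feature word (category label).
LEXICON = {"I": "subject", "want": "V", "buy": "V", "shoes": "goods"}


def build_document(sample_id, theme, words, lexicon):
    """Build one [first text, second text, theme, ID] record as a dict."""
    # First text: the word sequence as a space-separated list.
    first_text = " ".join(words)
    # Second text: each word replaced by its feature word; unknown words are kept.
    second_text = " ".join(lexicon.get(word, word) for word in words)
    return {
        "first_text": first_text,
        "second_text": second_text,
        "theme": theme,
        "id": sample_id,
    }


doc = build_document("s1", "2", ["I", "want", "buy", "shoes"], LEXICON)
print(doc)
```

Each such dictionary would then be inserted into the index as one document; how the insertion is performed depends on the client library in use and is left out of the sketch.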

In one implementation, the sample closest to the natural language to be processed is found according to a preset edit-distance algorithm.

In one implementation, before the words contained in the training samples are replaced with the feature words in the conceptual lexicon constructed in advance, the method further includes:

judging whether a word in the training sample corresponds to at least two categories of the conceptual lexicon constructed in advance;

if so, not replacing the word.

Based on the same inventive concept, a second aspect of the present invention provides a device for classifying spatial query themes based on natural language processing, comprising:

a language division module, configured to divide the natural language to be processed into a set of words based on preset partition words;

a feature matching module, configured to perform feature matching between the words in the set of words and a conceptual lexicon constructed in advance, to obtain a word sequence corresponding to a preset structure;

a theme classification module, configured to search a theme training result set for the sample closest to the natural language to be processed, wherein the theme training result set is obtained by training natural language samples collected in advance with the word sequence, and each sample contains text and a query theme; to return the query theme contained in the sample; and to take the query theme as the classification result.

The one or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:

In the method provided by the present invention, the natural language to be processed is divided into a set of words; the words in the set are then feature-matched against the conceptual lexicon constructed in advance to obtain a word sequence corresponding to the preset structure; and the sample closest to the natural language to be processed is then found in the theme training result set and its theme is returned. Because the natural language to be processed is split, feature-matched and then recombined into a semantic sequence, its complexity is reduced. Because the theme training result set is obtained by training natural language samples collected in advance with word sequences, inserting a large number of samples as word sequences for training gives the result set semantics and improves its accuracy. The sample closest to the language to be processed is then found in the theme training result set and the theme of that sample is taken as the classification result, which improves classification accuracy and solves the technical problem in the prior art that theme extraction results are inaccurate.

Brief description of the drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for classifying spatial query themes based on natural language processing in an embodiment of the present invention;

Fig. 2 is an example diagram of a partition word set in one application scenario;

Fig. 3 is an example diagram of a partition test performed on natural language;

Fig. 4 is an example diagram of the conceptual vocabulary corresponding to one application scenario;

Fig. 5 is an example diagram of test results after feature matching;

Fig. 6 is a diagram of theme training samples;

Fig. 7 is a schematic diagram of theme classification results;

Fig. 8 is a structural diagram of a device for classifying spatial query themes based on natural language processing in an embodiment of the present invention.

Specific embodiment

The embodiments of the present invention provide a method and device for classifying spatial query themes based on natural language processing, so as to alleviate the technical problem that theme extraction results are inaccurate in the methods of the prior art, thereby reducing ambiguity and improving the accuracy of theme extraction and classification.

To achieve the above technical effect, the general idea of the present invention is as follows:

The natural language to be processed is divided into a set of words by partition words; the words in the set are then successively feature-matched and rearranged into a semantic sequence to obtain a word sequence; the word sequence is used for theme training on a large number of samples to obtain a theme training result set; the sample closest to the natural language to be processed is then found in the theme training result set, and its theme is returned as the classification result, thereby classifying the spatial query theme of the natural language.

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

Embodiment one

This embodiment provides a method for classifying spatial query themes based on natural language processing. Referring to Fig. 1, the method includes:

Step S1 is performed first: the natural language to be processed is divided into a set of words based on preset partition words.

The preset partition words are obtained by analyzing and summarizing a large number of samples. In this embodiment, the preset partition words include action verbs, prepositions, subjects, feature words and interrogatives.

In a specific implementation, taking a shopping scenario as an example, the partition words involved are shown in Fig. 2, where L means that there are other words to the left of the current word in the sentence, R means that there are other words to the right of the current word, and LR means that there are other words on both the left and the right of the current word. Action verbs include sell, buy, see, etc.; prepositions include have, etc.; subjects include I, we, you, he, etc.; feature words include shop, store, restaurant, etc.; and interrogatives include where, what, etc.

Next, the natural language involved in the shopping scenario is split. As shown in Fig. 3, key denotes a recognized partition word and unknow denotes a word that was not recognized. For example, the sentence "will have a drink" is split into "drink" and "beverage", where "drink" is the recognized action verb. For the sentence "what is sold on the third floor", the resulting word set includes "third floor", "is", "sell", "what" and a sentence-final particle. As a further example, the sentence "I want to go to Nike to buy sports shoes" is converted by splitting into the set of words "I", "want", "Nike" and "buy sports shoes".
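By way of non-limiting illustration, the splitting of step S1 can be sketched in Python as follows. The sketch assumes whitespace-separated English tokens in place of the original text, and the partition word set and the function name split_by_partition_words are hypothetical; only the labels key and unknow follow the description of Fig. 3 above.

```python
# Hypothetical partition words for a shopping scenario (illustrative only).
PARTITION_WORDS = {"I", "want", "buy", "sell", "where", "what"}


def split_by_partition_words(sentence, partition_words):
    """Split a sentence into (text, label) chunks.

    Tokens found in partition_words are labelled "key"; each run of
    other tokens between two partition words is merged into one chunk
    labelled "unknow".
    """
    chunks = []
    pending = []  # tokens not yet recognized as partition words
    for token in sentence.split():
        if token in partition_words:
            if pending:
                chunks.append((" ".join(pending), "unknow"))
                pending = []
            chunks.append((token, "key"))
        else:
            pending.append(token)
    if pending:
        chunks.append((" ".join(pending), "unknow"))
    return chunks


for text, label in split_by_partition_words("I want buy sneakers", PARTITION_WORDS):
    print(label, text)
```

In this sketch the partition words act as delimiters, so any text between two of them stays together as a single unrecognized chunk that can be passed on to feature matching.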

Step S2 is then performed: feature matching is carried out between the words in the set of words and the conceptual lexicon constructed in advance, to obtain a word sequence corresponding to the preset structure.

The conceptual lexicon constructed in advance is obtained by analyzing and summarizing a large number of samples, and contains feature words.

In one embodiment, the conceptual lexicon constructed in advance includes points of interest, service attributes, service attribute evaluations, spatial relations, action verbs, times, persons, place queries, evaluation queries and business queries.

Specifically, in order to match the category of each word input by the user as far as possible, this embodiment builds a conceptual lexicon for the shopping scenario. Concept words are classified into POI, business terms, business evaluation, spatial relation, verbs, time, subject, "where" queries, "how" queries, goods queries, and so on. Fig. 4 shows an example of the conceptual vocabulary for the shopping scenario, in which points of interest (POI, Point Of Interest) include shop, store, restaurant, supermarket, etc.; service attributes include men's clothing, women's clothing, underwear, etc.; and service attribute evaluations include tasty, not very tasty, good to drink, etc. The entries are too numerous to list in full here.

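In one embodiment, step S2 specifically includes:

Step S2.1: performing feature matching between the words in the set of words and the conceptual lexicon constructed in advance, to obtain feature words;

Step S2.2: converting the feature words into a word sequence of the "theme-verb-point of interest-verb-article" form.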

Specifically, Fig. 5 shows an example of test results after feature matching, where SREL denotes the surrounding environment; GOODS denotes commodities or goods; unknow denotes unknown; CPOI denotes an important point of interest; ude denotes an auxiliary word; V denotes a verb; COMM denotes an attribute or evaluation; GOODSQRY denotes a query about commodities; COMMQRY denotes a query about attributes; LOCQRY denotes a query about location; and POI denotes a point of interest. The feature-matched words are reordered according to the "theme-verb-point of interest-verb-article" structure to obtain the sorted word sequence.
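By way of non-limiting illustration, feature matching and reordering can be sketched in Python as follows. The source does not state how the reordering is carried out, so the slot-filling strategy shown here is an assumption: each slot of the preset structure is filled with the first unused word carrying the matching tag, and empty slots are skipped. The lexicon contents, the tag spellings (subject, V, POI, goods) and the function names are hypothetical.

```python
# Hypothetical conceptual lexicon: word -> category tag (illustrative only).
LEXICON = {"I": "subject", "want": "V", "buy": "V", "Nike": "POI", "sneakers": "goods"}

# Preset structure "theme-verb-point of interest-verb-article" as a list of slots.
PRESET_STRUCTURE = ["subject", "V", "POI", "V", "goods"]


def match_features(words, lexicon):
    """Tag each word with its lexicon category, or "unknow" if absent."""
    return [(word, lexicon.get(word, "unknow")) for word in words]


def to_word_sequence(tagged, structure=PRESET_STRUCTURE):
    """Reorder tagged words to follow the slots of the preset structure.

    Assumed strategy: for each slot, take the first unused word with
    that tag; slots with no matching word are left out.
    """
    remaining = list(tagged)
    sequence = []
    for slot in structure:
        for index, (_, tag) in enumerate(remaining):
            if tag == slot:
                sequence.append(remaining.pop(index))
                break
    return sequence


tagged = match_features(["sneakers", "I", "want", "buy"], LEXICON)
sequence = to_word_sequence(tagged)
print("-".join(tag for _, tag in sequence))
print(" ".join(word for word, _ in sequence))
```

Words tagged unknow fit no slot and are dropped by this sketch; a fuller treatment might keep them for later inspection.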

Step S3 is then performed: the theme training result set is searched for the sample closest to the natural language to be processed, wherein the theme training result set is obtained by training natural language samples collected in advance with the word sequence, and each sample contains text and a query theme; the query theme contained in the sample is returned and taken as the classification result.

Specifically, the natural language samples collected in advance can be obtained by manually or automatically analyzing and summarizing a large number of samples from different environments or application scenarios, such as a shopping scenario or a tourism scenario. The theme training result set contains the classifications of the various themes.

The sample closest to the natural language to be processed can be found by comparing similarity with a preset method.

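In one embodiment, the theme training result set is obtained by training the natural language samples collected in advance with the word sequence, which specifically includes:

obtaining training samples containing theme information;

creating an Elasticsearch index and mapping for the training samples, wherein the mapping includes [first text, theme, ID, row], the first text is the word sequence in the form of a space-separated list, the theme is the theme number of the training sample, the ID is the ID of the training sample, and the row is the natural language to be processed;

traversing all mappings and replacing the words contained in the training samples with the feature words in the conceptual lexicon constructed in advance, to obtain a second text;

traversing all mappings, constructing [first text, second text, theme, ID] for each training sample, and inserting it into Elasticsearch for training, to obtain the theme training result set.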

Specifically, Elasticsearch is a search server based on Lucene. It provides a distributed, multi-user-capable full-text search engine, achieving real-time, stable, reliable and fast search. The training samples are the set of themes, and each theme has been divided into word sequences with the "theme-verb-POI-verb-article" structure. Each sentence (natural language) has one specific theme, which refers to the user's real purpose. For example, "I want to buy sports shoes" means that the user wants to obtain information about POIs related to sports shoes. After a sentence has been divided into a word sequence, because the sentences input by users (the natural language to be processed) are colloquial, the words need to be replaced with feature words from the conceptual lexicon constructed in advance. This standardizes the words, so that the nearest sample can be found more reliably, improving precision and accuracy.

In a specific implementation, Fig. 6 shows the theme training samples. By analyzing a large number of shopping consultation samples, the applicant summarized five themes: 1) querying POIs with a specified (special) business function; 2) navigating to a POI with a specified business function; 3) querying the evaluation information of a POI; 4) querying the business functions of a POI; 5) querying films. The input of the theme training stage is the training samples containing theme information, that is, the set of themes. Each theme contains multiple sentences that have already been split, and each sentence is input as a sample with a structure of the form (buy/V shoes/GOODS). Fig. 7 is a schematic diagram of theme classification results: "women's clothing on the second floor", "have lunch", "manicure" and the like are classified as querying POIs with a specific service function; "I want to drink a beverage", "I want to buy a wallet" and the like are classified as navigating to a POI with a specific service function; and questions about a named item, such as whether a named restaurant is tasty or how "Home Alone" is, are classified under a theme concerning a specific POI.

In addition, in the method provided by the present invention, the natural language to be processed contains spatial information. During training, a large number of samples containing spatial location information can therefore also be used for theme training, so that the method of the present invention can be applied to theme queries on natural language containing spatial information, improving the accuracy of theme extraction.

In one embodiment, the sample closest to the natural language to be processed is found according to a preset edit-distance algorithm.

Specifically, the preset edit-distance algorithm is a method provided in Elasticsearch, by which the sample closest to the natural language to be processed can be found.
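For illustration only, the nearest-sample lookup can be sketched with a plain Levenshtein edit distance computed over word sequences. This is an assumption about what the preset edit-distance algorithm computes; the sketch does not call Elasticsearch, and the sample records and the function names edit_distance and nearest_sample are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute x with y, or keep
            ))
        previous = current
    return previous[-1]


# Hypothetical theme training result set (illustrative records only).
SAMPLES = [
    {"first_text": "I want buy shoes",
     "theme": "navigate to a POI with a specified business function"},
    {"first_text": "third floor sell what",
     "theme": "query the business functions of a POI"},
]


def nearest_sample(query_words, samples):
    """Return the sample whose word sequence is closest to the query."""
    return min(samples,
               key=lambda s: edit_distance(query_words, s["first_text"].split()))


best = nearest_sample(["I", "want", "buy", "sneakers"], SAMPLES)
print(best["theme"])
```

Comparing whole words rather than characters means that "sneakers" and "shoes" count as a single substitution, which is why the first sample is the closer one in this example.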

In one embodiment, before the words contained in the training samples are replaced with the feature words in the conceptual lexicon constructed in advance, the method further includes:

judging whether a word in the training sample corresponds to at least two categories of the conceptual lexicon constructed in advance;

if so, not replacing the word.

Specifically, if a word in a training sample corresponds to multiple categories of the conceptual lexicon, the word is not replaced, in order to avoid errors.
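By way of non-limiting illustration, this check can be sketched in Python as follows, assuming the conceptual lexicon is held as a mapping from each word to the set of categories it belongs to. The lexicon contents and the function name replace_unambiguous are hypothetical.

```python
# Hypothetical lexicon: word -> set of categories it can belong to.
CATEGORIES = {
    "buy": {"V"},
    "shoes": {"goods"},
    "apple": {"goods", "POI"},  # ambiguous: belongs to two categories
}


def replace_unambiguous(words, categories):
    """Replace a word by its category only when exactly one category matches.

    Words with two or more categories, and words missing from the
    lexicon, are kept unchanged.
    """
    result = []
    for word in words:
        matches = categories.get(word, set())
        if len(matches) == 1:
            result.append(next(iter(matches)))
        else:
            result.append(word)
    return result


print(replace_unambiguous(["buy", "apple", "umbrella"], CATEGORIES))
```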

To illustrate the implementation of the method of the present invention more clearly, a specific example is described below.

Specifically, the natural language to be processed is "I want to buy sneakers." Step S1 is performed first: the sentence is split based on the preset partition words, giving the word set "I", "want", "buy", "sneakers". Step S2 is then performed: the word set is feature-matched against the conceptual lexicon constructed in advance, giving a word sequence of the form subject-V-V-goods. Step S3 is then performed: the theme training result set is queried for the sample closest to ["I" "want" "buy" "sneakers", subject-V-V-goods], for example ["I" "want" "buy" "shoes", subject-V-V-goods, "navigate to a shopping area with a specific function", "I buy shoes"], and "navigate to a shopping area with a specific function" is returned as the theme, that is, the classification result.

In the method for classifying spatial query themes based on natural language processing disclosed by the present invention, the natural language to be processed is split and recombined into a semantic sequence (word sequence) with the "theme-verb-POI-verb-article" structure, and theme training is performed on the samples collected in advance using these word sequences, so that a good theme extraction effect can be obtained.

Based on the same inventive concept, the present application also provides a device corresponding to the method of Embodiment one for classifying spatial query themes based on natural language processing; see Embodiment two for details.

Embodiment two

This embodiment provides a device for classifying spatial query themes based on natural language processing. Referring to Fig. 8, the device includes:

a language division module 801, configured to divide the natural language to be processed into a set of words based on preset partition words;

a feature matching module 802, configured to perform feature matching between the words in the set of words and a conceptual lexicon constructed in advance, to obtain a word sequence corresponding to a preset structure;

a theme classification module 803, configured to search a theme training result set for the sample closest to the natural language to be processed, wherein the theme training result set is obtained by training natural language samples collected in advance with the word sequence, and each sample contains text and a query theme; to return the query theme contained in the sample; and to take the query theme as the classification result.

In one embodiment, the preset partition words include: action verbs, prepositions, subjects, feature words and interrogatives.

In one embodiment, the preset structure is "theme-verb-point of interest-verb-article", and the feature matching module 802 is specifically configured to:

perform feature matching between the words in the set of words and the conceptual lexicon constructed in advance, to obtain feature words;

convert the feature words into a word sequence with the "theme-verb-point of interest-verb-article" structure.

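In one embodiment, the theme training result set is obtained by training the natural language samples collected in advance with the word sequence, specifically:

obtaining training samples containing theme information;

creating an Elasticsearch index and mapping for the training samples, wherein the mapping includes [first text, theme, ID, row], the first text is the word sequence in the form of a space-separated list, the theme is the theme number of the training sample, the ID is the ID of the training sample, and the row is the natural language to be processed;

traversing all mappings and replacing the words contained in the training samples with the feature words in the conceptual lexicon constructed in advance, to obtain a second text;

traversing all mappings, constructing [first text, second text, theme, ID] for each training sample, and inserting it into Elasticsearch for training, to obtain the theme training result set.

In one embodiment, the device provided in this embodiment further includes a judgment module, configured to, before the words contained in the training samples are replaced with the feature words in the conceptual lexicon constructed in advance:

judge whether a word in the training sample corresponds to at least two categories of the conceptual lexicon constructed in advance;

if so, not replace the word.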

The device described in Embodiment two of the present invention is the device used to implement the method of Embodiment one for classifying spatial query themes based on natural language processing. On the basis of the method described in Embodiment one, those skilled in the art can understand the specific structure and variations of the device, so the details are not repeated here. Any device used to implement the method of Embodiment one of the present invention falls within the scope of protection sought by the present invention.

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Obviously, those skilled in the art can make various changes and variations to the embodiments of the present invention without departing from the spirit and scope of the embodiments. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims

1. A method for classifying spatial query themes based on natural language processing, characterized by comprising:

Step S1: dividing the natural language to be processed into a set of words based on preset partition words;

Step S2: performing feature matching between the words in the set of words and a conceptual lexicon constructed in advance, to obtain a word sequence corresponding to a preset structure;

Step S3: searching a theme training result set for the sample closest to the natural language to be processed, wherein the theme training result set is obtained by training natural language samples collected in advance with the word sequence, and each sample contains text and a query theme; returning the query theme contained in the sample, and taking the query theme as the classification result.

2. The method according to claim 1, characterized in that the preset partition words include: action verbs, prepositions, subjects, feature words and interrogatives.

3. The method according to claim 1, characterized in that the conceptual lexicon constructed in advance includes points of interest, service attributes, service attribute evaluations, spatial relations, action verbs, times, persons, place queries, evaluation queries and business queries.

4. The method according to claim 1, characterized in that the preset structure is "theme-verb-point of interest-verb-article", and step S2 specifically includes:

Step S2.1: performing feature matching between the words in the set of words and the conceptual lexicon constructed in advance, to obtain feature words;

Step S2.2: converting the feature words into a word sequence with the "theme-verb-point of interest-verb-article" structure.

5. The method according to claim 4, characterized in that the theme training result set is obtained by training the natural language samples collected in advance with the word sequence, specifically:

obtaining training samples containing theme information;

creating an Elasticsearch index and mapping for the training samples, wherein the mapping includes [first text, theme, ID, row], the first text is the word sequence in the form of a space-separated list, the theme is the theme number of the training sample, the ID is the ID of the training sample, and the row is the natural language to be processed;

traversing all mappings and replacing the words contained in the training samples with the feature words in the conceptual lexicon constructed in advance, to obtain a second text;

traversing all mappings, constructing [first text, second text, theme, ID] for each training sample, and inserting it into Elasticsearch for training, to obtain the theme training result set.

6. The method according to claim 5, characterized in that the sample closest to the natural language to be processed is found according to a preset edit-distance algorithm.

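7. The method according to claim 5, characterized in that, before the words contained in the training samples are replaced with the feature words in the conceptual lexicon constructed in advance, the method further includes:

judging whether a word in the training sample corresponds to at least two categories of the conceptual lexicon constructed in advance;

if so, not replacing the word.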

8. A device for classifying spatial query themes based on natural language processing, characterized by comprising:

a language division module, configured to divide the natural language to be processed into a set of words based on preset partition words;

a feature matching module, configured to perform feature matching between the words in the set of words and a conceptual lexicon constructed in advance, to obtain a word sequence corresponding to a preset structure;

a theme classification module, configured to search a theme training result set for the sample closest to the natural language to be processed, wherein the theme training result set is obtained by training natural language samples collected in advance with the word sequence, and each sample contains text and a query theme; to return the query theme contained in the sample; and to take the query theme as the classification result.

9. The device according to claim 8, characterized in that the preset partition words include: action verbs, prepositions, subjects, feature words and interrogatives.