CN107944027A - Method and system for creating a semantic key index - Google Patents

Method and system for creating a semantic key index

Info

Publication number
CN107944027A
CN107944027A, CN201711320645.3A, CN201711320645A
Authority
CN
China
Prior art keywords
sentence
semantic
semantic key
key
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711320645.3A
Other languages
Chinese (zh)
Other versions
CN107944027B (en)
Inventor
俞凯
赵子健
吴学阳
林弘韬
刘轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Shanghai Jiaotong University
Suzhou Speech Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University and Suzhou Speech Information Technology Co Ltd
Priority to CN201711320645.3A
Publication of CN107944027A
Application granted
Publication of CN107944027B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

An embodiment of the present invention provides a method for creating a semantic key index. The method includes: in a dialogue corpus, analyzing the part of speech of each word in the input sentence and reply sentence of every sentence pair, and extracting from each sentence the semantic keys that satisfy a preset part-of-speech requirement; clustering all sentences in the dialogue corpus by semantic key, so that the sentences in each category correspond to the same semantic key; training a sequence-to-sequence model on the sentences of all categories in the dialogue corpus to obtain an encoder network that can map a sentence to a real-valued vector; encoding the sentences belonging to each category with the encoder network to obtain a set of real-valued vectors corresponding to each semantic key; and selecting multiple real-valued vectors from the set corresponding to each semantic key to form a memory matrix, then building a semantic key index over each semantic key and its memory matrix. An embodiment of the present invention also provides a system for creating a semantic key index. The sentences generated by embodiments of the present invention are diverse and directionally controllable.

Description

Method and system for creating a semantic key index
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a method and system for creating a semantic key index.
Background art
Advances in machine learning have accelerated the development of spoken dialogue systems. The interactive capability of spoken dialogue systems has brought more and more voice dialogue applications, among which the function that best embodies a spoken dialogue system is the chatbot: following the answering strategies set by its developers, a chatbot that receives an utterance finds an appropriate answer through an algorithm and outputs it as a reply.
In general, two major techniques are used for this algorithm today:
1. traditional retrieval-based models;
2. generative models based on deep learning.
A retrieval-based model builds a large-scale dialogue database mainly from dialogue sentences that have appeared in forums, microblogs, and websites. When the user provides a sentence, the system finds the most similar sentence in the corpus and directly outputs the reply associated with that most similar sentence back to the user.
However, a retrieval-based model must maintain a large-scale corpus, and the sentences the machine can reply with are limited to that corpus: a retrieved reply is necessarily a sentence already in the corpus, so no new replies can appear. Because a question-matching method is used to find the reply of a similar sentence, the generated replies are monotonous, and for the same input sentence the same reply is often generated. Meanwhile, the technique mainly matches shallow semantics such as single words and phrases in the user's sentence, so the semantics of the reply are difficult to extend.
A generative model based on deep learning requires a large-scale training corpus. Its basic flow is: the user provides a sentence; the system encodes the sentence into a real-valued vector with a recurrent neural network, then decodes it with another recurrent neural network to generate each word, and the words are combined into a sentence fed back to the user.
In the course of realizing the present invention, the inventors found at least the following problems in the related art:
With a generative model based on deep learning, the same input sentence can be answered in many ways in casual chat, but a basic sequence-to-sequence neural network model cannot produce multiple replies; it therefore tends to average those possible replies, so the final replies are mostly generic ones such as "yes" or "haha". A basic generative model also cannot control semantics or emotion; relying on its own learning ability alone, it can hardly learn distinct semantic and emotional information, so the emotion and topic of the reply content are hard to control. Meanwhile, for the same sentence, decoding can only generate similar replies, without diversity.
Summary of the invention
In order to at least solve the prior-art problem that generative models based on deep learning easily generate generic information, are hard to control in emotion and topic, and lack diversity, those skilled in the art would typically: change the training objective of the neural network to improve the diversity of the model's replies; add longer contextual information; introduce emotional polarity to generate replies with different moods; or use external knowledge about related entities as side information to improve the model's knowledgeability. The applicant unexpectedly found that by storing relevant semantic information in the form of real-valued vectors, a readable and writable external memory indexed by semantic keys can be built, which differs from the traditional form of directly using a knowledge base. At the same time, by specifying different semantic keys and corresponding mapping methods, the semantics of generation can be explicitly controlled, solving the above problems.
In a first aspect, an embodiment of the present invention provides a method for creating a semantic key index, including:
in a dialogue corpus, analyzing the part of speech of each word in the input sentence and reply sentence of every sentence pair, and extracting from each sentence the semantic keys that satisfy a preset part-of-speech requirement, where the preset parts of speech include at least one of: noun, adjective, verb;
clustering all sentences in the dialogue corpus by semantic key, so that the sentences in each category correspond to the same semantic key;
training a sequence-to-sequence model on the sentences of all categories in the dialogue corpus to obtain an encoder network that maps a sentence to a real-valued vector;
encoding the sentences belonging to each category with the encoder network to obtain a set of real-valued vectors corresponding to each semantic key;
selecting multiple real-valued vectors from the set corresponding to each semantic key to form a memory matrix, and building a semantic key index over each semantic key and its memory matrix.
In a second aspect, an embodiment of the present invention provides a method for generating a dialogue reply sentence, including:
extracting at least one semantic key from an input sentence;
expanding the at least one semantic key into multiple semantic key sets using preset semantic mapping relations;
for each semantic key in the multiple semantic key sets, indexing the memory matrix matched to that semantic key;
inputting at least the memory matrices into the network for decoding, and selecting one of the decoded sentences as the reply sentence.
In a third aspect, an embodiment of the present invention provides a system for creating a semantic key index, including:
a semantic key extraction program module, configured to, in a dialogue corpus, analyze the part of speech of each word in the input sentence and reply sentence of every sentence pair, and extract from each sentence the semantic keys that satisfy a preset part-of-speech requirement, where the preset parts of speech include at least one of: noun, adjective, verb;
a sentence clustering program module, configured to cluster all sentences in the dialogue corpus by semantic key, so that the sentences in each category correspond to the same semantic key;
an encoder network mapping program module, configured to train a sequence-to-sequence model on the sentences of all categories in the dialogue corpus to obtain an encoder network that can map a sentence to a real-valued vector;
a sentence encoding program module, configured to encode the sentences belonging to each category with the encoder network to obtain a set of real-valued vectors corresponding to each semantic key;
a semantic key index building program module, configured to select multiple real-valued vectors from the set corresponding to each semantic key to form a memory matrix, and to build a semantic key index over each semantic key and its memory matrix.
In a fourth aspect, an embodiment of the present invention provides a system for generating a dialogue reply sentence, including:
a semantic key extraction program module, configured to extract at least one semantic key from an input sentence;
a semantic key expansion program module, configured to expand the at least one semantic key into multiple semantic key sets using preset semantic mapping relations;
a memory matrix indexing program module, configured to, for each semantic key in the multiple semantic key sets, index the memory matrix matched to that semantic key;
a sentence reply program module, configured to input at least the memory matrices into the network for decoding, and to select one of the decoded sentences as the reply sentence.
In a fifth aspect, an electronic device is provided, including: at least one processor, and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the method for creating a semantic key index of any embodiment of the present invention.
In a sixth aspect, an embodiment of the present invention provides a storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the steps of the method for creating a semantic key index of any embodiment of the present invention are realized.
In a seventh aspect, an electronic device is provided, including: at least one processor, and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the method for generating a dialogue reply sentence of any embodiment of the present invention.
In an eighth aspect, an embodiment of the present invention provides a storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the steps of the method for generating a dialogue reply sentence of any embodiment of the present invention are realized.
The beneficial effects of embodiments of the present invention are as follows. By extracting the semantic key of each sentence in the dialogue corpus, classifying the sentences, determining the real-valued vectors of the sentences in each category's set, composing the real-valued vectors of each category into a corresponding memory matrix, and building an index with the corresponding semantic key, relevant semantic information is stored in the form of real-valued vectors, establishing a readable and writable storage device indexed by semantic keys. This is equivalent to materializing the corpus, so that the semantic information it contains can be applied conveniently, and semantic expansion and association are easier than with the corpus directly. At the same time, by specifying different semantic keys and corresponding mapping methods to control the semantics of generation, different semantic keys or different semantic mapping methods can be selected for the same utterance, thereby improving the diversity of the model's replies and controlling the direction of the dialogue's semantics, so as to improve the user experience.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of a method for creating a semantic key index provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a method for creating a semantic key index provided by another embodiment of the present invention;
Fig. 3 is a flowchart of a method for generating a dialogue reply sentence provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram, provided by an embodiment of the present invention, of the pipeline from building a semantic key index to generating a reply sentence;
Fig. 5 is a block diagram, provided by an embodiment of the present invention, of the pipeline from building a semantic key index to generating a reply sentence;
Fig. 6 is a table, provided by an embodiment of the present invention, of diversity and substance data for replies generated from the semantic key index;
Fig. 7 is a table, provided by an embodiment of the present invention, of controllable short-text dialogue examples generated from the semantic key index;
Fig. 8 is a table, provided by an embodiment of the present invention, of data for the ESED with topic-level external memory;
Fig. 9 is a table, provided by an embodiment of the present invention, comparing BLEU scores of the generated replies;
Fig. 10 is a table, provided by an embodiment of the present invention, of human evaluation results for the generated replies;
Fig. 11 is a table, provided by an embodiment of the present invention, of sentences generated using different semantic key mapping methods;
Fig. 12 is a table, provided by an embodiment of the present invention, of knowledge-grounded responses generated through the external memory;
Fig. 13 is a structural diagram of a system for creating a semantic key index provided by an embodiment of the present invention;
Fig. 14 is a structural diagram of a system for generating a dialogue reply sentence provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are part of the embodiments of the present invention rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, a flowchart of a method for creating a semantic key index provided by an embodiment of the present invention includes the following steps:
S11: in the dialogue corpus, analyze the part of speech of each word in the input sentence and reply sentence of every sentence pair, and extract from each sentence the semantic keys that satisfy a preset part-of-speech requirement, where the preset parts of speech include at least one of: noun, adjective, verb;
S12: cluster all sentences in the dialogue corpus by semantic key, so that the sentences in each category correspond to the same semantic key;
S13: train a sequence-to-sequence model on the sentences of all categories in the dialogue corpus to obtain an encoder network that maps a sentence to a real-valued vector;
S14: encode the sentences belonging to each category with the encoder network to obtain a set of real-valued vectors corresponding to each semantic key;
S15: select multiple real-valued vectors from the set corresponding to each semantic key to form a memory matrix, and build a semantic key index over each semantic key and its memory matrix.
In this embodiment, since every operation is based on the input sentences and reply sentences in the dialogue corpus, the collection of sentences for the dialogue corpus is particularly important. When gathering dialogue data, STC (short text conversation) data can be taken from social network platforms such as Weibo, Twitter, and forums (for example, a blogger posts on Weibo what they have seen, heard, or felt, and other users reply to that information; the blogger's post can be understood as the input sentence in the STC data, and the information replied by other users as the reply sentence). The STC data collected from major social network platforms are stored in the dialogue corpus. Data can also be obtained from instant messaging software: such software carries real-time communication between two parties, and two adjacent sentences generally correlate with each other, meeting the STC data standard.
For step S11, in the dialogue corpus, the part of speech of each word in the input sentence and reply sentence of every sentence pair is analyzed. For ease of description, only a few sentence pairs are chosen as examples, such as the following dialogue sentences:
1. "Which constellation do you like best?" "I like Capricorn best.";
2. "Which constellation do you like best?" "Because I am a Sagittarius, I like Sagittarius.";
3. "Zhou Jielun's movies are really good." "Zhou Jielun's songs are also nice to listen to."
First, these sentences are segmented, normalized, and given other linguistic processing; then, for each sentence, the semantic keys within it are found, where a semantic key can be understood as a head word of the sentence. A simple way to find them is, for example, to take the nouns, verbs, and adjectives that make up the sentence; of course, a more sophisticated and accurate method can also be set. Then, according to the semantic keys, the sentences in the corpus are aggregated into different categories.
For example, performing part-of-speech tagging on the above sentences with a tool such as LTP (Language Technology Platform) gives:
"Which constellation do you like best?"
Noun: constellation
Verb: like
"I like Capricorn best"
Noun: Capricorn
Verb: like
"Because I am a Sagittarius, I like Sagittarius"
Noun: Sagittarius
Verb: like
"Zhou Jielun's movies are really good"
Nouns: Zhou Jielun, movie
Adjective: good-looking
"Zhou Jielun's songs are also nice to listen to"
Nouns: Zhou Jielun, song
Adjective: pleasing to the ear
Each qualifying semantic key in each sentence is extracted. For example, taking nouns: the semantic key of "Which constellation do you like best?" is "constellation"; the semantic key of "I like Capricorn best" is "Capricorn"; the semantic key of "Because I am a Sagittarius, I like Sagittarius" is "Sagittarius"; and the semantic key of both "Zhou Jielun's movies are really good" and "Zhou Jielun's songs are also nice to listen to" is "Zhou Jielun".
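The noun-based extraction above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: a tiny hand-written POS lexicon stands in for a real tagger such as LTP, and the sentences are the English renderings of the examples above.

```python
# Minimal sketch of step S11: extract semantic keys by part of speech.
# The toy POS_LEXICON below is an assumption standing in for a real
# POS tagger; real systems would tag arbitrary vocabulary.

POS_LEXICON = {
    "constellation": "noun", "Capricorn": "noun", "Sagittarius": "noun",
    "movies": "noun", "songs": "noun", "like": "verb", "good": "adj",
}

def extract_semantic_keys(sentence, keep_pos=("noun",)):
    """Return the words whose (toy) POS tag satisfies the preset requirement."""
    tokens = sentence.replace("?", "").replace(".", "").split()
    return [t for t in tokens if POS_LEXICON.get(t) in keep_pos]

print(extract_semantic_keys("Which constellation do you like best?"))
# ['constellation']
```

Passing `keep_pos=("noun", "verb", "adj")` would widen the preset requirement to all three parts of speech mentioned in the text.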
During linguistic processing such as segmentation and normalization, the topic meaning of the input sentence can also be extracted. Using the extracted topic meaning as the semantic key in the subsequent process generally also gives better results.
For example: "I don't believe in constellations, and I don't even know what constellation I am."
If semantic keys are extracted from the nouns, verbs, and adjectives in the sentence, the output sentence guided by the semantic key will probably also be related to "constellation". If instead the topic meaning is used as the semantic key, a topic key such as "does not like constellations" can be extracted, and then in the subsequent process a reply such as "Do you believe in blood types?", unrelated to "constellation", may be obtained.
During the segmentation and normalization of these sentences, the different reply sentences corresponding to the same input sentence can also be grouped. For example:
1. "Which constellation do you like best?" "I like Capricorn best.";
2. "Which constellation do you like best?" "Because I am a Sagittarius, I like Sagittarius.";
3. "Which constellation do you like best?" "What is a constellation?";
4. "Which constellation do you like best?" "I don't know.";
5. "Which constellation do you like best?" "I love both Cancer and Gemini; if you insist on asking which I like best, I don't know."
In grouping, the sentence pairs can be classified by their output sentences. For example, sentences 1, 2, and 5, whose replies mention specific constellations, form one group, and sentences 3 and 4, with meaningless replies, form another. The meaningless reply sentences 3 and 4 can then be deleted, further simplifying the sentence pairs in the dialogue corpus and improving the efficiency of semantic key extraction.
At the same time, since the corresponding sentence pairs are grouped in advance, semantic key indexes of different categories can also be created for special needs. When there is a demand targeting a specific group, fewer dialogue sentences than in other categories are often involved, and the dialogue sentence pairs can be filtered by category. For example, to create a dialogue device for a children's early learning machine, only sentences about poems, nursery rhymes, and animation may be taken, and sentences that are popular or modern but unsuitable for children can be screened out in advance. This not only significantly improves the efficiency of creating the semantic key index, but also allows the semantic key index to be managed modularly and created separately for different categories, thereby facilitating later maintenance.
For step S12, the sentences in the dialogue corpus are clustered according to their semantic keys: using the semantic keys determined in step S11, the sentences are classified.
For example, "Capricorn" and "Sagittarius" belong to the category of constellations and are clustered under the unified key "constellation", giving:
clustered by the semantic key "constellation": "Which constellation do you like best?", "I like Capricorn best", and "Because I am a Sagittarius, I like Sagittarius" form one class;
clustered by the semantic key "Zhou Jielun": "Zhou Jielun's movies are really good" and "Zhou Jielun's songs are also nice to listen to" form one class.
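The clustering step can be sketched as below. The `CLUSTER_KEY` map that folds "Capricorn" and "Sagittarius" into "constellation" is a hand-written assumption standing in for whatever key-unification method (thesaurus, hypernym lookup, etc.) an implementation would actually use.

```python
# Sketch of step S12: cluster sentences by (unified) semantic key.
from collections import defaultdict

CLUSTER_KEY = {"Capricorn": "constellation", "Sagittarius": "constellation"}

def cluster_by_key(tagged_sentences):
    """tagged_sentences: list of (sentence, semantic_key) pairs."""
    clusters = defaultdict(list)
    for sentence, key in tagged_sentences:
        clusters[CLUSTER_KEY.get(key, key)].append(sentence)
    return dict(clusters)

pairs = [
    ("Which constellation do you like best?", "constellation"),
    ("I like Capricorn best", "Capricorn"),
    ("Because I am a Sagittarius, I like Sagittarius", "Sagittarius"),
    ("Zhou Jielun's movies are really good", "Zhou Jielun"),
    ("Zhou Jielun's songs are also nice to listen to", "Zhou Jielun"),
]
clusters = cluster_by_key(pairs)
print(sorted(clusters))                # ['Zhou Jielun', 'constellation']
print(len(clusters["constellation"]))  # 3
```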
For step S13, a model is trained on the sentences of all categories classified in step S12, yielding an encoder network that can map a sentence to a real-valued vector.
For step S14, the sentences belonging to the same category are encoded with the encoder network determined in step S13, yielding the set of real-valued vectors corresponding to each semantic key.
For example, the sentences under the semantic key "constellation" are encoded with the encoder network to obtain their corresponding real-valued vectors, and the sentences under the semantic key "Zhou Jielun" are encoded with the encoder network to obtain theirs.
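A minimal sketch of the sentence-to-vector mapping, under a strong simplification: a deterministic hashed bag-of-words encoder stands in for the trained sequence-to-sequence encoder network described in the text, since training such a network is out of scope for an illustration. Only the interface (sentence in, fixed-size real-valued vector out) mirrors the patent.

```python
# Stand-in for steps S13-S14: map each sentence of a category to a
# fixed-size real-valued vector via hashed bag-of-words (NOT a trained
# encoder; the dimensionality and hashing scheme are arbitrary choices).
import hashlib
import math

DIM = 8  # toy vector dimension; a real encoder would use hundreds of units

def encode(sentence):
    """Return a unit-norm, length-DIM real-valued vector for the sentence."""
    vec = [0.0] * DIM
    for word in sentence.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

category = ["I like Capricorn best",
            "Because I am a Sagittarius, I like Sagittarius"]
vector_set = [encode(s) for s in category]  # the per-key real-valued vector set
print(len(vector_set), len(vector_set[0]))  # 2 8
```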
Since the example dialogue corpus in step S11 has only a few sentence pairs, note that in practical applications the number of sentence pairs in the dialogue corpus can be very large, so the number of real-valued vectors in each set can also be very large.
For step S15, multiple vectors are selected from the set of real-valued vectors corresponding to each semantic key to form a memory matrix. For example, multiple real-valued vectors are selected from the set corresponding to the semantic key "constellation" to form a memory matrix S, and a semantic key index is built over the semantic key "constellation" and the matrix S. Through the semantic key, the corresponding matrix can be found. The resulting collection of semantic keys and their corresponding memory matrices can be stored on a storage medium to be called by other devices.
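The resulting index is essentially a lookup from semantic key to memory matrix. A toy sketch with made-up two-dimensional vectors, where the "select multiple vectors" step is simplified to taking the first k (a more faithful selection by distance to the center point is discussed below):

```python
# Sketch of step S15: build the semantic key index, i.e. a mapping from
# each semantic key to its memory matrix (a plain list of row vectors here).

def build_index(vector_sets, k=2):
    """vector_sets: dict semantic_key -> list of real-valued vectors."""
    return {key: vecs[:k] for key, vecs in vector_sets.items()}

vector_sets = {  # illustrative 2-D vectors, not real encoder output
    "constellation": [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]],
    "Zhou Jielun": [[0.9, 0.1], [0.8, 0.2]],
}
index = build_index(vector_sets, k=2)
print(len(index["constellation"]))  # 2 rows in the memory matrix
print(index["Zhou Jielun"][0])      # [0.9, 0.1]
```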
In this embodiment, as one implementation, training the sequence-to-sequence model on the sentences of all categories in the dialogue corpus includes: training the sequence-to-sequence model and its variants on the sentences of all categories in the dialogue corpus, where the sequence-to-sequence model and its variants include the attention model and the autoencoder model.
Such a model is characterized by using an encoder network to encode a sentence into a real-valued vector, and then using a decoder network to decode the real-valued vector into a sentence. When the sequence-to-sequence model is trained, the encoder network encodes the input sentence into a single sentence vector; when the attention model is trained, the encoder network encodes the input sentence into a set of vectors; when the autoencoder model is trained, the input sentence is reproduced as the reply.
It can be seen from this implementation that by extracting the semantic key of each sentence in the dialogue corpus, classifying the sentences, determining the real-valued vectors of the sentences in each category's set, composing the real-valued vectors of each category into a corresponding memory matrix, and building an index with the corresponding semantic key, relevant semantic information is stored in the form of real-valued vectors, establishing a readable and writable storage device indexed by semantic keys. This is equivalent to materializing the corpus, so that the semantic information it contains can be applied conveniently and semantic expansion and association are easier.
As shown in Fig. 2, a flowchart of a method for creating a semantic key index provided by another embodiment of the present invention includes the following steps:
S151: determine the center point of the real-valued vector set by a mean algorithm;
S152: determine the distance between each real-valued vector in the set and the center point;
S153: choose a preset number of real-valued vectors in ascending order of distance to form the memory matrix.
In this embodiment, in practical applications, the dialogue corpus can collect a large amount of dialogue sentences, so the number of sentence pairs in the corpus is very large. This results in a very large number of real-valued vectors in each set, which would make the memory matrix very large. Therefore, the more strongly associated sentences need to be filtered out by a selection over the real-valued vectors.
For step S151, the center point of each real-valued vector set is determined using the K-means algorithm.
For step S152, based on the center point of each real-valued vector set determined in step S151, the distance between that center point and each real-valued vector in the set is computed; common distance measures include the Euclidean and Manhattan distances.
For step S153, a preset number of real-valued vectors are selected in ascending order of their distance to the center point to form the memory matrix; the smaller the distance, the stronger the association, and the larger the distance, the weaker.
It can be seen from this embodiment that, when a semantic key's real-valued vector set contains too many vectors, the more strongly associated vectors are selected by their distance to the center point, reducing the size of the memory matrix while preserving the precision of the semantic key's real-valued vectors.
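Steps S151-S153 can be sketched in plain Python as follows. This is a minimal illustration, not the patent's implementation: a single mean step stands in for the K-means center computation, toy two-dimensional vectors stand in for sentence vectors, and Euclidean distance is used (the text also allows Manhattan distance).

```python
import math

def centroid(vectors):
    # S151: mean of the set, standing in for the center point from K-means.
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def euclidean(a, b):
    # S152: one of the two distance measures named above (the other is Manhattan).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_memory_matrix(vectors, k):
    # S153: keep the k vectors closest to the center (strongest association).
    c = centroid(vectors)
    ranked = sorted(vectors, key=lambda v: euclidean(v, c))
    return ranked[:k]

# Toy sentence vectors belonging to one semantic key's category.
vecs = [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [0.2, 0.0], [4.0, 4.0]]
memory = build_memory_matrix(vecs, 3)  # the two outliers are dropped
```

Swapping `euclidean` for a Manhattan distance (`sum(abs(x - y) ...)`) changes only the ranking function, which is why the patent treats the two as interchangeable here.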
One embodiment of the present invention provides a reply generation method for dialogue, whose flow chart comprises the following steps:
S21: extract at least one semantic key from the input sentence;
S22: using a preset semantic mapping relation, expand the at least one semantic key into a set of semantic keys;
S23: for each semantic key in the set, index the memory matrix matching that semantic key;
S24: input at least the memory matrix into the decoding network for decoding, and select one of the decoded sentences as the reply.
In the present embodiment, for step S21, the input sentence is first received; it may be a sentence entered by a user, or a sentence generated by a machine. Linguistic operations such as word segmentation and normalization are applied to the input sentence to extract at least one semantic key. For example, the input sentence is "Van Gogh's paintings are beautiful", and the extracted semantic key is "Van Gogh's paintings".
For step S22, using the preset semantic mapping relation, the semantic key "Van Gogh's paintings" is expanded into the semantic key set { "Sunflowers", "The Starry Night" }.
For step S23, for each semantic key in the set, the matching memory matrix is indexed. For example, the semantic key "Sunflowers" indexes its corresponding memory matrix in the external memory; the other semantic keys index their corresponding memory matrices in the same way.
For step S24, the memory matrices indexed in step S23 are input into the decoding network for decoding, producing several sentences, from which one is chosen as the reply.
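The four steps S21-S24 can be connected end to end as in the sketch below. Every component here is a hypothetical stand-in: the key extractor is a substring lookup (a real system would use segmentation and POS tagging), the expansion table and memory contents are invented for illustration, and the decoder is a lambda that merely reports which memory matrix it received.

```python
def extract_semantic_keys(sentence):
    # S21: stand-in extractor; a real system would segment and POS-tag.
    known = ["Van Gogh's paintings"]
    return [k for k in known if k in sentence]

# S22: a preset semantic mapping relation (hypothetical contents).
EXPANSION = {"Van Gogh's paintings": ["Sunflowers", "The Starry Night"]}

def expand(keys):
    return {k: EXPANSION.get(k, [k]) for k in keys}

# S23: semantic key -> memory matrix (rows are toy sentence vectors).
MEMORY = {
    "Sunflowers": [[0.9, 0.1]],
    "The Starry Night": [[0.1, 0.9]],
}

def generate_reply(sentence, decode):
    candidates = []
    for key, expanded in expand(extract_semantic_keys(sentence)).items():
        for out_key in expanded:
            matrix = MEMORY[out_key]                     # index the memory matrix
            candidates.append(decode(sentence, matrix))  # S24: decode a candidate
    return candidates[0] if candidates else ""

# A trivial stand-in decoder just echoes which memory it used.
reply = generate_reply("Van Gogh's paintings are beautiful",
                       decode=lambda s, m: f"reply using memory {m}")
```

The choice of which candidate becomes the final reply (here simply the first) is exactly the selection step S24 leaves open.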
For example: input "Van Gogh's paintings are beautiful."
The semantic key "Van Gogh's paintings" is expanded into: "Sunflowers", "The Starry Night".
When the expanded semantic key is "Sunflowers", the output is "Really beautiful, especially Van Gogh's sunflowers; the flowers in the painting are like burning flames, expressing the artist's fanatical passion for life."
When the expanded semantic key is "The Starry Night", the output is "Really beautiful; among them is Van Gogh's The Starry Night, symbolizing the human spirit of struggle."
When mapping a semantic key, the semantic keys it may be expanded to can be controlled. For example, regardless of which semantic key is extracted (say, "beautiful"), the expansion can be forced to "Leonardo da Vinci", so the output sentence might be "Leonardo da Vinci, like Van Gogh, was an outstanding artist."
It can be seen from this embodiment that the semantics of the generated reply can be controlled by specifying different semantic keys and their corresponding mapping methods. By selecting different semantic keys or different semantic mapping methods, the model can generate more diverse replies to the same utterance and steer the semantic direction of the dialogue, thereby improving the user experience.
In the present embodiment, as one implementation, inputting at least the memory matrix into the decoding network for decoding comprises:
inputting the memory matrix together with the sentence vector of the input sentence into the decoding network for decoding, where the sentence vector of the input sentence is pre-generated by feeding the input sentence into the encoding network.
In the present embodiment, the input sentence is also fed into the encoding network to determine its sentence vector, and then the memory matrix and the sentence vector of the input sentence are input into the decoding network together for decoding. For example, for the input "Van Gogh's paintings are beautiful", the extracted semantic key "beautiful" is expanded into the semantic key "depth". The expanded semantic key "depth" indexes its corresponding memory matrix in the external memory; meanwhile, the input sentence "Van Gogh's paintings are beautiful" is fed into the encoding network to determine its sentence vector. The memory matrix and the sentence vector of the input sentence are then decoded together, producing the corresponding sentence "Van Gogh's paintings also have great depth."
It can be seen from this embodiment that decoding the sentence vector of the input sentence jointly with the memory matrix makes the decoded sentence answer the input sentence more closely and keeps the generated sentence in context with the input, improving the intelligence of the reply as well as the user experience and user satisfaction.
As one implementation of the present embodiment, expanding the at least one semantic key into a set of semantic keys comprises expansion by: same-word mapping, synonym mapping, antonym mapping, or a knowledge graph.
In the present embodiment, for example, the input sentence is "Zhou Jielun's films are great to watch", and the extracted semantic keys are "Zhou Jielun" and "film". Different expansion methods can map out a variety of semantic keys: "Zhou Jielun", "Kun Ling", "film", "song", "album", and so on.
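The mapping strategies listed above can be sketched as a small dispatcher. The synonym and knowledge-graph tables here are invented placeholders; a real system would back them with a thesaurus or a knowledge graph as the text describes.

```python
# Hypothetical lookup tables standing in for a thesaurus and a knowledge graph.
SYNONYMS = {"film": ["movie"], "Zhou Jielun": ["Jay Chou"]}
KNOWLEDGE_GRAPH = {"Zhou Jielun": ["song", "album"], "film": ["premiere"]}

def map_semantic_key(key, strategy):
    # Same-word (identity), synonym, or knowledge-graph expansion, as listed above.
    if strategy == "same":
        return [key]
    if strategy == "synonym":
        return SYNONYMS.get(key, [key])
    if strategy == "knowledge":
        return KNOWLEDGE_GRAPH.get(key, [key])
    raise ValueError(strategy)

keys = ["Zhou Jielun", "film"]
expanded = sorted({out for k in keys
                   for s in ("same", "synonym", "knowledge")
                   for out in map_semantic_key(k, s)})
```

Because each strategy is a separate branch, forcing a manual output key (the "Leonardo da Vinci" example above) is just one more branch that ignores `key` and returns the specified word.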
From the different mapped semantic keys, sentences on various aspects can be obtained at decoding time, such as:
"Zhou Jielun's acting is excellent."
"Kun Ling also appeared on set during the filming."
"This film of Zhou Jielun's premiered on January 19, 2012."
"Zhou Jielun's songs are also good to listen to."
"The theme song of this film was sung by Zhou Jielun himself, and an album of the same name was released."
It can be seen from this embodiment that by mapping out different semantic keys, decoding yields sentences on various aspects, improving the diversity of the replies and thereby the user's experience.
As one implementation of the present embodiment, after a round of dialogue is completed, the input sentence and reply sentence of that round are added to the corpus as a dialogue pair, so as to expand the semantic key index.
In the present embodiment, since the input sentence comes from outside and the reply sentence is generated by decoding, the pair may not yet exist in the dialogue corpus; therefore the input sentence and reply sentence of the current round are added to the corpus as a dialogue sentence pair, expanding the semantic key index.
It can be seen from this embodiment that the semantic key index is updated by collecting the completed dialogue of each round; the updated semantic key index is richer and more complete.
The above method of creating a semantic key index and the dialogue reply generation method can be run and optimized independently of each other, or combined into an integrated system. As shown in Fig. 4, the system can be applied to robot chit-chat, where the first sentence of a sentence pair is called the post and the responding sentence is called the comment.
The external semantic memory of the system is a tensor constructed as a set of matrix tables. Each matrix corresponds to a certain semantic key and represents the possible output sentences for that key. Each row vector of a matrix is formed from one sentence, and all row vectors in one matrix represent the same specific semantic key. In generating the output dialogue sentence, a semantic key is extracted from the input sentence, and the data stored in the external memory together with a sequence-to-sequence model are used to generate the final output sentence. By manipulating the semantic key, the topic or semantics of the output sentence can be guided.
The external-semantic-memory-guided sequence-to-sequence learning framework consists of three parts: an encoder E, a decoder D, and an external semantic memory M; Fig. 5 depicts its overall framework and data flow.
The external semantic memory is incorporated into the encoder-decoder framework; it can be regarded as being built with massive data (possibly unsupervised) outside the training dataset. The external semantic memory M can thus be viewed as an explicit store of external knowledge, which enables the encoder-decoder model to use knowledge beyond the post-comment pairs. The use of the external semantic memory resembles the way we converse: in dialogue, we usually attend to some specific semantic keywords of a sentence (such as keywords or topics), then, according to our goals or background knowledge, think of related reply keywords or reply topic directions, and finally form an output sentence from the combined semantics to reply.
We simulate this dialogue process with the external semantic memory module. First, an index is built from the semantic keys and the external semantic memory. When an input sentence arrives, a response is produced: the semantic key of the input sentence is extracted, the output semantic key is obtained according to the mapping relation, the memory is addressed with the obtained output semantic key, and finally the decoder model generates the output sentence based on the input sentence and the sentences retrieved from the external memory.
The overall framework of the encoder-decoder with external memory is formulated as follows.
Suppose x1:T = {x1, x2, ..., xT} represents the words of the post, where T is the post length. In the equation p, c1:T = E(x1:T), the encoder module E receives the word embeddings and produces a dense representation p of the input sentence together with a set of context vectors c1:T = {c1, c2, ..., cT} (abbreviated c). Here we simply use a GRU (gated recurrent unit) recurrent neural network as the encoder.
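A toy version of the encoder E can make the two outputs concrete. This is a single-unit (scalar-state) GRU with hand-picked weights, assumed for illustration only; the actual model uses a full GRU layer with 800-dimensional hidden states, and here p is simply taken to be the final hidden state.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, w):
    # Single-unit GRU cell; w = (wz, uz, wr, ur, wh, uh) are scalar weights.
    wz, uz, wr, ur, wh, uh = w
    z = sigmoid(wz * x + uz * h)                 # update gate
    r = sigmoid(wr * x + ur * h)                 # reset gate
    h_tilde = math.tanh(wh * x + uh * (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde

def encode(embeddings, w):
    # Returns the dense representation p (here: final state) and the
    # context vectors c_1..c_T (one hidden state per input word).
    h, contexts = 0.0, []
    for x in embeddings:
        h = gru_step(x, h, w)
        contexts.append(h)
    return h, contexts

p, c = encode([0.5, -1.0, 2.0], w=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))
```

The point of keeping both outputs is visible in the later steps: p drives the memory addressing, while c1:T feeds the decoder's context.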
We extract semantic keys from the input sentence. A typical semantic key is a word or phrase; when extracting semantic keys from the input sentence, the LTP toolkit with POS (part-of-speech) tagging can be used. Usually several semantic keys can be extracted from an input sentence; the extracted keys are called input semantic keys. We can choose one input semantic key, and different methods, such as iterative or random selection, can be used to make the choice. After one input semantic key is chosen, by selecting different input semantic keys we can generate output sentences with different semantic preferences. That is, once an input semantic key k(i) is found, it is converted by a semantic key mapper S into an output semantic key k(o): k(o) = S(k(i)).
The semantic key mapper S can be a very simple mapping, for example directly using the input semantic key as the output semantic key, i.e. k(o) = k(i). Alternatively, it can be a very complicated mapping, for example using words derived from synonyms or from a knowledge graph. It is even possible, regardless of what the input semantic key is, to directly use a specified word as the output semantic key, forcing that specified output semantic key to be used when generating the output sentence. This makes the semantic key mapper controllable.
After the output semantic key is determined, we use it to build the index for the external semantic memory. As depicted in Fig. 5, the memory module M consists of k storage units, where k is the number of all semantic keys. Each semantic key corresponds to a storage unit address or index, and each storage unit is an L×D matrix associated with a certain semantic key, where L is the number of sentence rows in the storage unit and D is the sentence dimension. A reader R reads the selected storage unit m to produce the external memory context vector r, where r = R(m, p).
Inspired by the neural Turing machine, we apply content-based addressing to the matrix of the selected storage unit. Given a post representation p, the reader R returns a weighted sum over the rows of the matrix m, r = Σl wl · m(l), where m(l) denotes the l-th row of m and the weights are given by the sharpened softmax wl = exp(β · cos(p, m(l))) / Σj exp(β · cos(p, m(j))). The coefficient β controls the sharpness of the weight vector; to ensure sharpness, the value of β is set to 100 here.
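Under the reconstruction above (cosine similarity sharpened by β inside a softmax, then a weighted sum of rows), the reader R can be sketched directly. The two-row memory and the post vector are toy values; with β = 100 the addressing becomes nearly hard, which is the intended effect of the large sharpness coefficient.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def read(memory, p, beta=100.0):
    # Content-based addressing: softmax over the similarity of p to each
    # memory row, sharpened by beta, then a weighted sum of the rows.
    scores = [beta * cosine(row, p) for row in memory]
    m = max(scores)                         # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memory[0])
    r = [sum(w * row[d] for w, row in zip(weights, memory)) for d in range(dim)]
    return r, weights

memory = [[1.0, 0.0], [0.0, 1.0]]           # two stored sentence vectors
r, w = read(memory, p=[0.9, 0.1])           # p is far closer to row 0
```

With a small β the same code would blend the rows softly, so β trades off between retrieving one representative sentence and averaging over several.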
After the memory context vector r is obtained, we replicate it once for each word in the input sentence and append the copies to the original contexts c1:T, forming a new joint context vector set Ĉ = {c1, ..., cT, r, ..., r}. During decoding, the decoder D uses the context vectors from the post sentence and the external semantic memory to produce the response output (denoted y), i.e. the reply is y = D(p, Ĉ).
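The joint-context construction is a simple concatenation, sketched below. The replication count (once per input word) follows the description above; this is an assumption about the exact layout, since the original formula is not legible in the source.

```python
def joint_context(contexts, r):
    # Append the memory context vector r to the encoder contexts,
    # replicated once per input word, giving {c_1..c_T, r, ..., r}.
    return contexts + [r] * len(contexts)

c = [[0.1], [0.2], [0.3]]       # toy encoder contexts c_1..c_3
joined = joint_context(c, r=[0.9])
```

The decoder then attends over `joined` exactly as it would over the encoder contexts alone, which is why the memory can be added without changing the decoder architecture.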
The overall process of external-semantic-memory-guided sequence-to-sequence learning is summarized in the following algorithm. Note that the construction of the external memory is independent of the training of the encoder-decoder; given the external memory, the encoder-decoder model can be trained with an ordinary gradient descent algorithm. The generation algorithm of the external-semantic-memory-guided sequence-to-sequence model is as follows:
1. Receive the input dialogue sentence;
2. Convert the dialogue sentence into a word embedding sequence x1:T;
3. Extract the input semantic key k(i);
4. Map the input semantic key to obtain the output semantic key k(o) = S(k(i));
5. Find the associated storage unit matrix m via the output semantic key;
6. Encode the input dialogue sentence with the encoder: p, c1:T = E(x1:T);
7. Read the memory context vector from the external semantic memory: r = R(m, p);
8. Append r to the original encoder context vectors: Ĉ = {c1, ..., cT, r, ..., r};
9. Decode to generate the output reply sentence: y = D(p, Ĉ).
Compared with other generative models, the overall system can generate richer, more specific, and more knowledgeable replies, and can control the semantic direction of the dialogue to a certain extent. Moreover, the system can be used in robot systems with a chit-chat function, improving user satisfaction, increasing the number of interactions between users and the robot, and improving user approval. Applying the system to other similar dialogue interaction tasks can likewise improve the overall intelligence of the system and user satisfaction.
The overall framework of the method has been described above; the structure of the external semantic memory is detailed below.
The content of each storage unit is a matrix containing the embeddings of the K representative comment sentences corresponding to a certain semantic key. A "semantic key" can be represented by a keyword or by a vector of topic meaning. In the keyword-based representation, the keyword itself serves as the index to the unit's vectors. As in the formula k(o) = S(k(i)) above, the output semantic key is obtained by mapping the input semantic key through the semantic key mapper S.
The external semantic memory is built independently of the encoder-decoder training process. This separation allows us to add information from outside the encoder-decoder training data. Depending on the data used to construct the external memory, the encoder can be determined in two different ways.
1. Encoder-decoder
Given parallel STC data, a standard sequence-to-sequence model can be trained in advance. Here, the data may differ from the data used to build the external semantic memory, incorporating external knowledge. Once training is complete, the decoder is discarded, and only the encoder is kept for transforming the dialogue data.
2. Autoencoder
When the given data are not sentence pairs (e.g. news, novels, or other text material), an autoencoder can be used to train the transformation.
To build the external semantic memory, we obtain the semantic keys of each dialogue sentence and then aggregate and classify the dialogue sentences by semantic key. Once grouping is complete, we build a fixed-size storage unit for each semantic key, i.e. we select the k most representative sentences from all sentences in each group. To select the most representative sentences, all sentences in the same storage unit must be semantically analyzed to obtain their projection vectors in the space; the projection vectors are then clustered with the K-means algorithm; finally, for each cluster, the sentence whose projection vector is nearest to the cluster center is selected. When the number of sentences in a storage unit is less than k, the unit is padded with zero vectors.
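The fixed-size storage units, including the zero-vector padding rule, can be sketched as follows. The clustering step is elided for brevity: where the text runs K-means per group and keeps the sentence nearest each cluster center, this sketch simply keeps the first k vectors, so only the grouping and padding behavior is faithful.

```python
def pad_unit(vectors, k):
    # Fixed-size storage unit: at most k sentence vectors, padded with
    # zero vectors when the group holds fewer than k sentences.
    dim = len(vectors[0])
    unit = vectors[:k]
    while len(unit) < k:
        unit.append([0.0] * dim)
    return unit

def build_memory(grouped, k):
    # grouped: semantic key -> list of sentence projection vectors.
    # A real build would run K-means inside each group and keep the
    # sentence nearest each cluster center; here we keep the first k.
    return {key: pad_unit(vs, k) for key, vs in grouped.items()}

mem = build_memory(
    {"sunflower": [[1.0, 0.0]],
     "film": [[0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]},
    k=2)
```

Padding to a fixed L×D shape is what lets every storage unit be addressed by the same reader R regardless of how many sentences its group contained.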
Training with the constructed external semantic memory and the encoder-decoder can be divided into two parts: data preprocessing and end-to-end encoder-decoder training.
1. Data preprocessing: as described above, one sentence may be put into several semantic key groups in the memory construction. The training data are processed by the same rule. Since the training corpus consists of post-comment sentence pairs, we filter the training data according to semantic keys. This has two benefits: (1) it is more convenient for mini-batch training; (2) some generic responses may be filtered out.
The semantic keys of the input sentences are extracted first, and these input semantic keys are then converted into the related output semantic keys by the semantic key mapper. The training pairs are then divided into different key groups according to the output semantic key. Note that one input sentence may correspond to several output sentences; by applying grouping, we classify pairs according to their output sentences. Conversely, an output sentence that is related to many input sentences is likely very generic and without real meaning; such meaningless sentences are removed.
2. End-to-end training and generation: as described above, during training the external semantic memory is fixed with respect to the training process, so it is essentially an additional input to the decoder. The rest of the model is a standard encoder-decoder structure, as shown in Fig. 5. We randomly select one output semantic key from the training data and then construct the output sentence. The objective function of the encoder-decoder is defined as follows:
The objective function is the likelihood of the reply given the post and the selected memory unit, and is fully differentiable. In the generation process, semantic keys are first extracted from the input sentence; if there are several input semantic keys, they are used in turn to generate several output sentence candidates. The candidates are then ranked by their likelihood to produce the final top-1 or top-n outputs.
To verify the method, we used about 4.2 million STC pairs from Weibo for encoder-decoder training. The LTP toolkit was used for word segmentation and POS tagging. The dataset was filtered by the preprocessing method described above, reducing its size to 1.5 million. All the data were used for constructing the external memory, while only 1.2 million were used in training, to show the potential of exploiting an external corpus. 1000 pairs were extracted from the dataset as the test set for objective evaluation.
We built three sequence generation systems: NRM (neural responding machine, standard sequence-to-sequence training), MrRNN (multiresolution recurrent neural network), and the proposed ESED (external-semantic-memory-guided encoder-decoder model). For our model, we used three methods to build the external memory: S2S (sequence-to-sequence model), Atten (neural attention model), and AutoED (auto-encoder), where AutoED is trained only on comments (which can be regarded as using an unsupervised big-data non-parallel corpus).
The encoder and decoder of ESED have identical structure: a single-layer GRU with 400-dimensional word embeddings and 800-dimensional hidden state vectors. Nouns and adjectives are used as semantic keys. All parameters are initialized uniformly between -0.05 and 0.05. The initial learning rate of the Adam (a method for stochastic optimization) optimizer is 0.0004, and the mini-batch size is set to 64. For NRM, we use the same word embedding and hidden state dimensions as in our model. For MrRNN, we replace the LSTM (long short-term memory) with an HRED (hierarchical recurrent encoder-decoder) model, and likewise use nouns and adjectives as semantic keys. All these models were trained for several epochs to solve the complex problem as well as possible. Note that for ESED, several output sentence candidates are generated in turn from the multiple input semantic keys; we rank the candidates in descending order of semantic key length, because we believe longer words may contain more information.
1. Diversity and substantiveness analysis
Here we propose two objective analysis metrics for STC: diversity and substantiveness. They can be computed easily and directly from the generated sentences, without reference sentences. This avoids the problem that coverage of the references may be very low in chit-chat.
Diversity reflects the richness of the words in the generated sentences. It is defined over all test sentences as the number of unique words divided by the total number of words: Diversity = N_unique / N_total. The underlying assumption is that the richer the generated vocabulary, the more diverse the generated comments. It is easy to see from this metric that frequently used generic replies reduce the diversity measure.
Substantiveness reflects the substantive information contained in the generated replies, defined as the number of meaningful entities. We do not use NER (named entity recognition) tools to extract entities, because such tools usually only recognize person, place, and organization names, which is rather limited. Instead, we use a maximum string matching algorithm based on a Wikipedia entity table to identify meaningful entities. We define substantiveness as the average number of entity words per sentence over the test set: Substance = N_entity_words / N_sentences.
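Both metrics, as defined above, can be computed in a few lines. The sentences and the entity table below are invented toy data, and single-word membership lookup stands in for the maximum string matching over a Wikipedia entity table described in the text.

```python
def diversity(sentences):
    # Number of unique words divided by total words over all generated sentences.
    words = [w for s in sentences for w in s.split()]
    return len(set(words)) / len(words)

def substantiveness(sentences, entity_table):
    # Average number of entity words per sentence; per-word lookup stands in
    # for maximum string matching against a Wikipedia-derived entity table.
    hits = sum(sum(1 for w in s.split() if w in entity_table)
               for s in sentences)
    return hits / len(sentences)

replies = ["i do not know",
           "van gogh painted sunflowers",
           "the starry night is by van gogh"]
d = diversity(replies)
s = substantiveness(replies, entity_table={"van", "gogh", "sunflowers",
                                           "starry", "night"})
```

Note how the generic first reply contributes words but no entities, dragging both scores down, which is exactly the behavior the two metrics are meant to expose.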
For each sentence in the test set, we compute the two metrics for the top-1 and top-5 outputs. The results are shown in Table 1 of Fig. 6. We observe that ESED consistently outperforms the other two models in both diversity and substantiveness, and the difference is statistically significant for the top-1 results. For the top-5 results, the drop in diversity for ESED is smaller than for the NRM model, showing that ESED can produce more diverse and richer output sentences. The decrease in entities from top-1 to top-5 may be because the information contained in the semantic keys decreases (longer semantic keys are ranked higher).
We believe the above performance gains of ESED are mainly due to the use of semantic keys. In particular, when ESED's training data are divided into post-comment groups sharing the same semantic key, generic reply sentences are effectively removed. To illustrate this, Table 2 in Fig. 7 gives some examples. From these examples it can be seen that NRM and MrRNN tend toward reasonable, safe but meaningless answers, whereas ESED can use different semantic keys to produce more specific and diverse comments. The ESED model in Table 1 of Fig. 6 uses words (nouns, adjectives, idioms) as semantic keys; it is also meaningful to compare this word-level semantic key with a topic-level semantic key.
To consider other semantic key types, we experimented with topics as semantic keys. The topic model we use is intended to suit short texts: compared with LDA (latent Dirichlet allocation), BTM (biterm topic model, a short-text topic model) is more effective for short-text modeling. In addition, the model should cluster the text with appropriate granularity: neither too general (uninformative) nor too specific (overfitting). Therefore, we choose different numbers of topics (200 and 500) in the experiments; the detailed training scheme and settings are otherwise identical to those above. We first train the topic model as described, and assign each sentence the unique topic of maximum probability. We assume a common identity mapper is used for topics, so we only need to estimate one topic model, and the semantic key is simply the topic index. To investigate the influence of the number of topics, we set it to 200 and 500 respectively. Table 3 in Fig. 8 shows the diversity and substantiveness at the topic level; for comparison, the corresponding word-level results are also listed.
We observe that the topic-level external memory performs much worse than the word-level memory. We believe this is largely because the semantics of a topic are so vague that the semantics of the generated output sentence cannot be controlled well. In addition, there is no clear relation between two topics, so it is also difficult to define a meaningful semantic mapper. Moreover, since each sentence can have only one topic, sharing comments with the same topic in ESED training becomes harder; indeed, we observe that the amount of topic-level data available for training ESED is much smaller. All of these cause the degradation seen in Table 3 of Fig. 8. In the remainder of this paper, we focus only on word-level external memory.
2. Objective evaluation of reply quality
We use the BLEU (a method for automatic evaluation of machine translation) score as a reference-based objective measure of comment quality. The BLEU score was first proposed to assess the quality of machine translation; it has recently also been used as an objective indicator of answer quality in dialogue systems. The results are shown in Table 4 of Fig. 9.
We observe that for both the top-1 and top-5 generated comments, ESED obtains better BLEU scores. Although some argue that BLEU scores correlate poorly with human judgment when used to evaluate dialogue systems, the consistent performance gains still demonstrate the validity of the proposed ESED method.
3. Human evaluation of reply quality
We also conducted a human evaluation to compare the comment quality of the different models in the top-1 setting. We follow the evaluation criteria of the STC-2 challenge; the appropriateness of a reply is judged by the following four criteria:
(1) Fluency: the comment is acceptable as natural language text;
(2) Coherence: the comment should be logically connected and relevant to the original post;
(3) Self-sufficiency: an evaluator can judge whether the comment is appropriate by reading the comment alone, without reading anything else;
(4) Substantiveness: the comment provides new information beyond the post.
If (1) or (2) is not met, the comment is labeled "L0"; if (3) or (4) is not met, the comment is labeled "L1"; otherwise it is labeled "L2". To highlight the generic comment problem, we add a special label "LD" to mark comments that are reasonable but generic. When computing the average score, "L2" counts as 2 points, "L1" as 1 point, and "L0" as 0 points. In particular, we also count "LD" as 0 points, to motivate systems to produce diverse, informative feedback rather than generic comments. The final average score is the mean of all annotators' scores over all test sentences.
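The scoring scheme just described (L2 = 2, L1 = 1, L0 = 0, and LD deliberately penalized to 0) reduces to a small lookup and mean. The label list below is invented illustrative data.

```python
# LD (reasonable but generic) is scored 0, same as L0, by design.
SCORES = {"L2": 2, "L1": 1, "L0": 0, "LD": 0}

def average_score(labels):
    # Mean over all annotator labels for all test sentences.
    return sum(SCORES[l] for l in labels) / len(labels)

avg = average_score(["L2", "L1", "LD", "L0", "L2"])  # toy annotator labels
```

Scoring LD as 0 rather than, say, 1 is what makes a system that emits safe generic replies rank no better than one whose replies are outright broken, which is the stated intent.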
Our test set consists of 100 sentences from the training set. Six annotators scored the generated answers according to the above criteria. The results are shown in Table 5 of Fig. 10: for LD, L0, L1, and L2, the numbers are the percentages of generated sentences, and Ave is the average score. Clearly, the ESED model produces far fewer generic replies than the baseline NRM model. ESED has a larger LD percentage than MrRNN, probably because MrRNN tends to produce diverse but incomprehensible answers. In general, the ESED model can generate more coherent and more informative comments, which people appreciate; this is also consistent with the conclusion of Table 1 in Fig. 6.
4. Semantic key guidance analysis
The effectiveness of the ESED method on STC has been shown above. Those experiments all employed a common identity semantic key mapping function, with the richness and diversity of the generated comments coming from the multiple semantic keys extracted from the same post. However, as mentioned above, a major advantage of ESED is that the semantics of the generation can be controlled by manipulating the semantic mapper.
(1) Controllable semantic key mapping methods: besides using the identity function as the mapper, there are many other automatic methods for mapping the input semantic key. The idea is to find words or phrases related to the input semantic key; semantic relation mappings (such as synonyms or antonyms) or data-driven mappings (such as word embedding neighborhoods) can be used. Besides automatic methods, the mapping can also be controlled manually by setting the output semantic key. Table 6 in Fig. 11 shows the output sentences generated for the same input sentence with different semantic mapping functions. It can be seen that with synonym or antonym mapping functions, the generated sentences are semantically close to the input but different in content; this rich semantics indicates the advantage of using external memory. We also use randomly selected semantic keys to show the possibilities of guiding the generated comments; as shown in Table 6 of Fig. 11, most of these comments maintain semantic relevance and topic consistency. It should be understood, however, that not every semantic mapping function yields an obvious effect.
(2) Knowledge import through external storage: in addition to feeding back diversified information, our model can also use external storage to introduce new knowledge into the generated comments. Since the content of the memory storage unit is a matrix of top-K comment sentence embeddings built during external semantic memory construction, the matrix may contain knowledge related to the corresponding semantic keyword. The embedded knowledge can be associated with words that co-occur in the training data of the external memory, or with related entities. Table 7 in Figure 12 gives some examples.
As can be seen from the table, external information can be introduced by manually specifying a particular semantic key, with essentially no impact on the fluency and consistency of the comment. For example, given the semantic key "Song", the model exhibits the ability to associate and generates the comment "Zhou Jielun's songs are also good", using the knowledge that Zhou Jielun is a film star and singer.
In summary, by combining external semantic memory with the encoder-decoder framework, the method proposed herein greatly alleviates the generic invalid-answer problem and generates more diversified and specific responses. Both objective evaluation and human assessment show its advantages. The separation of external storage construction from neural network training also makes it possible to use non-parallel corpora. Moreover, the semantics of the generated output can be controlled by manipulating the semantic mapper, which means richer new sentences can be produced.
Figure 13 is a structural diagram of a system for creating a semantic key index provided by an embodiment of the present invention. The technical solution of this embodiment is applicable to devices that create a semantic key index. The system can perform the method for creating a semantic key index described in any of the above embodiments, and is configured in a terminal.
The system for creating a semantic key index provided in this embodiment includes: a semantic key extraction program module 11, a sentence clustering program module 12, a coding network mapping program module 13, a sentence coding program module 14, and a semantic key index establishment program module 15.
The semantic key extraction program module 11 is used to analyze, in a dialogue corpus, the part of speech of each word in the input sentence and the reply sentence of each sentence pair, and to extract each semantic key in each sentence that meets a preset part-of-speech requirement, wherein the preset part of speech includes at least one of a noun, an adjective, and a verb. The sentence clustering program module 12 is used to cluster all sentences in the dialogue corpus according to the semantic keys, with the sentences of each category corresponding to the same semantic key. The coding network mapping program module 13 is used to train a sequence-to-sequence model with the sentences of all categories in the dialogue corpus, obtaining a coding network that can map a sentence into a real-valued vector. The sentence coding program module 14 is used to encode sentences belonging to the same category with the coding network, obtaining a real-valued vector set corresponding to each semantic key. The semantic key index establishment program module 15 is used to select multiple real-valued vectors from the real-valued vector set corresponding to each semantic key to form a memory matrix, and to establish the semantic key index over each semantic key and each memory matrix.
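As a rough illustration of modules 11 and 12, the following sketch filters words by part of speech and groups sentences under each shared semantic key; the toy `POS` lexicon stands in for a real part-of-speech tagger, which the embodiment does not specify.

```python
# Toy POS lexicon standing in for a real part-of-speech tagger.
POS = {"movie": "noun", "great": "adj", "watch": "verb",
       "the": "det", "a": "det", "song": "noun"}
CONTENT_POS = {"noun", "adj", "verb"}  # the preset part-of-speech requirement

def extract_semantic_keys(sentence):
    """Module 11: keep the words whose POS is noun, adjective, or verb."""
    return [w for w in sentence.split() if POS.get(w) in CONTENT_POS]

def cluster_by_key(sentences):
    """Module 12: group sentences under each semantic key they contain."""
    clusters = {}
    for s in sentences:
        for k in extract_semantic_keys(s):
            clusters.setdefault(k, []).append(s)
    return clusters
```

Each resulting cluster corresponds to one semantic key, and its sentences are later encoded into the real-valued vector set for that key.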
Further, the semantic key index establishment program module is used to:
determine the center point of the real-valued vector set by a mean algorithm;
determine the distance between each real-valued vector in the set and the center point;
select a preset number of real-valued vectors in ascending order of distance to form the memory matrix.
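The three steps above (centroid by mean, distance to the centroid, ascending-order selection) can be sketched as one small function; the `build_memory_matrix` name and the plain-list vectors are illustrative assumptions.

```python
def build_memory_matrix(vectors, k):
    """Pick the k vectors nearest the cluster centroid, nearest first."""
    dim = len(vectors[0])
    # Center point of the real-valued vector set, by the mean algorithm.
    center = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    def dist(v):
        # Euclidean distance of a vector to the center point.
        return sum((a - b) ** 2 for a, b in zip(v, center)) ** 0.5

    # Select a preset number of vectors in ascending order of distance.
    return sorted(vectors, key=dist)[:k]
```

In practice the vectors would be the coding-network embeddings of one semantic key's sentence cluster, and k the preset memory size.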
Figure 14 is a structural diagram of a dialogue reply sentence generation system provided by an embodiment of the present invention. The technical solution of this embodiment is applicable to devices that generate dialogue reply sentences. The system can perform the dialogue reply sentence generation method described in any of the above embodiments, and is configured in a terminal.
The dialogue reply sentence generation system provided in this embodiment includes: a semantic key extraction program module 21, a semantic key expansion program module 22, a memory matrix index program module 23, and a sentence reply program module 24.
The semantic key extraction program module 21 is used to extract at least one semantic key from the input sentence. The semantic key expansion program module 22 is used to expand the at least one semantic key into a set of multiple semantic keys by means of a preset semantic mapping relation. The memory matrix index program module 23 is used to index, for each semantic key in the set of multiple semantic keys, the memory matrix matching that semantic key. The sentence reply program module 24 is used to input at least the memory matrix into the coding network for decoding, and to select one of the decoded sentences as the reply sentence.
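The cooperation of modules 21 through 24 can be sketched end to end as follows; the extractor, mapper, index, and decoder are passed in as caller-supplied stand-ins, since their concrete implementations are described by the other modules above rather than fixed here.

```python
def generate_reply(sentence, extract, expand, index, decode):
    """Toy flow of modules 21-24; the four callables are stand-ins."""
    keys = extract(sentence)                    # module 21: semantic keys
    key_set = expand(keys)                      # module 22: expanded key set
    matrices = [index[k] for k in key_set if k in index]  # module 23: lookup
    candidates = [decode(m, sentence) for m in matrices]  # module 24: decode
    return candidates[0] if candidates else None  # pick one decoded sentence
```

A real system would rank the candidates instead of taking the first, but the module boundaries are the same.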
Further, inputting at least the memory matrix into the coding network for decoding includes:
inputting the memory matrix together with the sentence vector of the input sentence into the coding network for decoding, wherein the sentence vector of the input sentence is pre-generated by importing the input sentence into the coding network.
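A minimal sketch of assembling that decoder input, assuming the pre-generated sentence vector and the memory matrix rows are simply concatenated into one input sequence (the embodiment does not spell out the exact fusion, so this is an illustrative choice):

```python
def build_decoder_input(sentence_vector, memory_matrix):
    """Prepend the pre-generated sentence vector to the memory rows."""
    return [sentence_vector] + list(memory_matrix)
```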
Further, after a round of dialogue is completed, the input sentence and reply sentence of the current round are added to the corpus as a dialogue pair, so as to expand the semantic key index.
Herein, relational terms such as first and second are used merely to distinguish one entity or operation from another, without necessarily requiring or implying any actual relationship or order between these entities or operations. Moreover, the terms "comprise" and "include" cover not only the listed elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Unless further limited, an element defined by the phrase "comprising a..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of this embodiment. Those of ordinary skill in the art can understand and implement them without creative work.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly also by hardware. Based on this understanding, the above technical solution, or the part of it that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in each embodiment or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are merely illustrative of the technical solutions of the present invention, rather than limiting them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for creating a semantic key index, comprising:
in a dialogue corpus, analyzing the part of speech of each word in the input sentence and the reply sentence of each sentence pair, and extracting each semantic key in each sentence that meets a preset part-of-speech requirement, wherein the preset part of speech comprises at least one of a noun, an adjective, and a verb;
clustering all sentences in the dialogue corpus according to the semantic keys, wherein the sentences of each category correspond to the same semantic key;
training a sequence-to-sequence model with the sentences of all categories in the dialogue corpus to obtain a coding network capable of mapping a sentence into a real-valued vector;
encoding the sentences belonging to the same category with the coding network to obtain a real-valued vector set corresponding to each semantic key;
selecting multiple real-valued vectors from the real-valued vector set corresponding to each semantic key to form a memory matrix, and establishing the semantic key index over each semantic key and each memory matrix.
2. The method according to claim 1, wherein selecting multiple real-valued vectors from the real-valued vector set corresponding to each semantic key to form a memory matrix comprises:
determining the center point of the real-valued vector set by a mean algorithm;
determining the distance between each real-valued vector in the set and the center point;
selecting a preset number of real-valued vectors in ascending order of distance to form the memory matrix.
3. A dialogue reply sentence generation method, comprising:
extracting at least one semantic key from an input sentence;
expanding the at least one semantic key into a set of multiple semantic keys by means of a preset semantic mapping relation;
indexing, for each semantic key in the set of multiple semantic keys, the memory matrix matching that semantic key;
inputting at least the memory matrix into a coding network for decoding, and selecting one of the decoded sentences as the reply sentence.
4. The method according to claim 3, wherein inputting at least the memory matrix into the coding network for decoding comprises:
inputting the memory matrix together with the sentence vector of the input sentence into the coding network for decoding, wherein the sentence vector of the input sentence is pre-generated by importing the input sentence into the coding network.
5. The method according to claim 3, further comprising:
after a round of dialogue is completed, adding the input sentence and the reply sentence of the current round to the corpus as a dialogue pair, so as to expand the semantic key index.
6. A system for creating a semantic key index, comprising:
a semantic key extraction program module, configured to analyze, in a dialogue corpus, the part of speech of each word in the input sentence and the reply sentence of each sentence pair, and to extract each semantic key in each sentence that meets a preset part-of-speech requirement, wherein the preset part of speech comprises at least one of a noun, an adjective, and a verb;
a sentence clustering program module, configured to cluster all sentences in the dialogue corpus according to the semantic keys, wherein the sentences of each category correspond to the same semantic key;
a coding network mapping program module, configured to train a sequence-to-sequence model with the sentences of all categories in the dialogue corpus to obtain a coding network capable of mapping a sentence into a real-valued vector;
a sentence coding program module, configured to encode the sentences belonging to the same category with the coding network to obtain a real-valued vector set corresponding to each semantic key;
a semantic key index establishment program module, configured to select multiple real-valued vectors from the real-valued vector set corresponding to each semantic key to form a memory matrix, and to establish the semantic key index over each semantic key and each memory matrix.
7. The system according to claim 6, wherein the semantic key index establishment program module is configured to:
determine the center point of the real-valued vector set by a mean algorithm;
determine the distance between each real-valued vector in the set and the center point;
select a preset number of real-valued vectors in ascending order of distance to form the memory matrix.
8. A dialogue reply sentence generation system, comprising:
a semantic key extraction program module, configured to extract at least one semantic key from an input sentence;
a semantic key expansion program module, configured to expand the at least one semantic key into a set of multiple semantic keys by means of a preset semantic mapping relation;
a memory matrix index program module, configured to index, for each semantic key in the set of multiple semantic keys, the memory matrix matching that semantic key;
a sentence reply program module, configured to input at least the memory matrix into a coding network for decoding, and to select one of the decoded sentences as the reply sentence.
9. The system according to claim 8, wherein the sentence reply program module is configured to:
input the memory matrix together with the sentence vector of the input sentence into the coding network for decoding, wherein the sentence vector of the input sentence is pre-generated by importing the input sentence into the coding network.
10. The system according to claim 8, wherein the system is further configured to:
after a round of dialogue is completed, add the input sentence and the reply sentence of the current round to the corpus as a dialogue pair, so as to expand the semantic key index.
CN201711320645.3A 2017-12-12 2017-12-12 Method and system for creating semantic key index Active CN107944027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711320645.3A CN107944027B (en) 2017-12-12 2017-12-12 Method and system for creating semantic key index


Publications (2)

Publication Number Publication Date
CN107944027A true CN107944027A (en) 2018-04-20
CN107944027B CN107944027B (en) 2020-03-31

Family

ID=61942816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711320645.3A Active CN107944027B (en) 2017-12-12 2017-12-12 Method and system for creating semantic key index

Country Status (1)

Country Link
CN (1) CN107944027B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763462A (en) * 2018-05-28 2018-11-06 深圳前海微众银行股份有限公司 Method, device and computer-readable storage medium for updating a parallel sentence library
CN108776677A (en) * 2018-05-28 2018-11-09 深圳前海微众银行股份有限公司 Method, device and computer-readable storage medium for creating a parallel sentence library
CN109165388A (en) * 2018-09-28 2019-01-08 郭派 Method and module for constructing paraphrase semantic trees of English polysemous words
CN109271496A (en) * 2018-08-30 2019-01-25 广东工业大学 Natural question answering method based on text, knowledge base and sequence-to-sequence
CN109637527A (en) * 2018-12-13 2019-04-16 苏州思必驰信息科技有限公司 Semantic parsing method and system for dialogue sentences
CN109918646A (en) * 2019-01-30 2019-06-21 中国科学院自动化研究所 Discourse causality judgment method, system and device
CN110033091A (en) * 2018-12-13 2019-07-19 阿里巴巴集团控股有限公司 Model-based prediction method and device
CN110197279A (en) * 2019-06-10 2019-09-03 北京百度网讯科技有限公司 Transformation model training method, device, equipment and storage medium
CN110334338A (en) * 2019-04-29 2019-10-15 北京小米移动软件有限公司 Word segmentation method, device and equipment
CN110489531A (en) * 2018-05-11 2019-11-22 阿里巴巴集团控股有限公司 Method and device for determining high-frequency questions
CN111553173A (en) * 2020-04-23 2020-08-18 苏州思必驰信息科技有限公司 Natural language generation training method and device
CN111831660A (en) * 2020-07-16 2020-10-27 深圳大学 Method and device for evaluating metric space division mode, computer equipment and storage medium
TWI712949B (en) * 2019-05-28 2020-12-11 雲義科技股份有限公司 Method for calculating a semantic similarity
CN112101037A (en) * 2019-05-28 2020-12-18 云义科技股份有限公司 Semantic similarity calculation method
CN115146650A (en) * 2022-06-27 2022-10-04 西安羚控电子科技有限公司 Test process creating method and system based on semantic recognition

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060179051A1 (en) * 2005-02-09 2006-08-10 Battelle Memorial Institute Methods and apparatus for steering the analyses of collections of documents
CN101630312A (en) * 2009-08-19 2010-01-20 腾讯科技(深圳)有限公司 Clustering method for question sentences in question-and-answer platform and system thereof
CN103064969A (en) * 2012-12-31 2013-04-24 武汉传神信息技术有限公司 Method for automatically creating keyword index table
CN106980683A (en) * 2017-03-30 2017-07-25 中国科学技术大学苏州研究院 Blog text snippet generation method based on deep learning
CN107085581A (en) * 2016-02-16 2017-08-22 腾讯科技(深圳)有限公司 Short text classification method and device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Feng Kai: "Research and Application of Automatic Question Answering Technology Based on Question Pattern Classification", China Masters' Theses Full-text Database *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489531A (en) * 2018-05-11 2019-11-22 阿里巴巴集团控股有限公司 Method and device for determining high-frequency questions
CN110489531B (en) * 2018-05-11 2023-05-30 阿里巴巴集团控股有限公司 Method and device for determining high-frequency questions
CN108776677A (en) * 2018-05-28 2018-11-09 深圳前海微众银行股份有限公司 Method, device and computer-readable storage medium for creating a parallel sentence library
CN108763462A (en) * 2018-05-28 2018-11-06 深圳前海微众银行股份有限公司 Method, device and computer-readable storage medium for updating a parallel sentence library
CN108763462B (en) * 2018-05-28 2021-11-12 深圳前海微众银行股份有限公司 Method and device for updating parallel sentence library and computer readable storage medium
CN109271496A (en) * 2018-08-30 2019-01-25 广东工业大学 Natural question answering method based on text, knowledge base and sequence-to-sequence
CN109165388A (en) * 2018-09-28 2019-01-08 郭派 Method and module for constructing paraphrase semantic trees of English polysemous words
CN109165388B (en) * 2018-09-28 2022-06-21 郭派 Method and system for constructing paraphrase semantic tree of English polysemous words
CN110033091A (en) * 2018-12-13 2019-07-19 阿里巴巴集团控股有限公司 Model-based prediction method and device
US11803752B2 (en) 2018-12-13 2023-10-31 Advanced New Technologies Co., Ltd. Separate deployment of machine learning model and associated embedding
CN109637527A (en) * 2018-12-13 2019-04-16 苏州思必驰信息科技有限公司 Semantic parsing method and system for dialogue sentences
CN109637527B (en) * 2018-12-13 2021-08-31 思必驰科技股份有限公司 Semantic analysis method and system for dialogue statement
CN109918646A (en) * 2019-01-30 2019-06-21 中国科学院自动化研究所 Discourse causality judgment method, system and device
CN110334338A (en) * 2019-04-29 2019-10-15 北京小米移动软件有限公司 Word segmentation method, device and equipment
CN110334338B (en) * 2019-04-29 2023-09-19 北京小米移动软件有限公司 Word segmentation method, device and equipment
TWI712949B (en) * 2019-05-28 2020-12-11 雲義科技股份有限公司 Method for calculating a semantic similarity
CN112101037A (en) * 2019-05-28 2020-12-18 云义科技股份有限公司 Semantic similarity calculation method
CN110197279A (en) * 2019-06-10 2019-09-03 北京百度网讯科技有限公司 Transformation model training method, device, equipment and storage medium
CN110197279B (en) * 2019-06-10 2021-01-29 北京百度网讯科技有限公司 Transformation model training method, device, equipment and storage medium
CN111553173B (en) * 2020-04-23 2023-09-15 思必驰科技股份有限公司 Natural language generation training method and device
CN111553173A (en) * 2020-04-23 2020-08-18 苏州思必驰信息科技有限公司 Natural language generation training method and device
CN111831660B (en) * 2020-07-16 2021-03-30 深圳大学 Method and device for evaluating metric space division mode, computer equipment and storage medium
CN111831660A (en) * 2020-07-16 2020-10-27 深圳大学 Method and device for evaluating metric space division mode, computer equipment and storage medium
CN115146650A (en) * 2022-06-27 2022-10-04 西安羚控电子科技有限公司 Test process creating method and system based on semantic recognition
CN115146650B (en) * 2022-06-27 2023-06-30 西安羚控电子科技有限公司 Test process creation method and system based on semantic recognition

Also Published As

Publication number Publication date
CN107944027B (en) 2020-03-31

Similar Documents

Publication Publication Date Title
CN107944027A (en) Create the method and system of semantic key index
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN110674410A (en) User portrait construction and content recommendation method, device and equipment
Collins Theories of memory
Mirkovic et al. Where does gender come from? Evidence from a complex inflectional system
CN110297887A (en) Service robot personalization conversational system and method based on cloud platform
CN107679225A (en) A kind of reply generation method based on keyword
CN108388944B (en) A kind of automatic chatting method and robot based on deep neural network
CN108491515A (en) A kind of sentence pair matching degree prediction technique for campus psychological consultation
CN114911932A (en) Heterogeneous graph structure multi-conversation person emotion analysis method based on theme semantic enhancement
Rácz et al. Morphological emergence
Acquaviva et al. Models of lexical meaning
CN112818106A (en) Evaluation method of generating type question and answer
Czachesz Network analysis of biblical texts
Majewska et al. Semantic data set construction from human clustering and spatial arrangement
CN112905776A (en) Emotional dialogue model construction method, emotional dialogue system and method
Maslennikova ELMo Word Representations For News Protection.
Gong et al. Computational modeling on language emergence: A coevolution model of lexicon, syntax and social structure
Liew et al. A sonification of cross-cultural differences in happiness-related tweets
CN112115722A (en) Human brain-simulated Chinese analysis method and intelligent interaction system
Shawar et al. Chatbots: Can they serve as natural language interfaces to QA corpus?
Mallios Virtual doctor: an intelligent human-computer dialogue system for quick response to people in need
Alahmed et al. “How Does ChatGPT Work” Examining Functionality To The Creative AI CHATGPT on X's (Twitter) Platform
Woolf et al. An active learning approach to transcript coding for education research
Tanwar et al. AI Based Chatbot for Healthcare using Machine Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200630

Address after: Building 14, Tengfei science and Technology Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu Province, 215000

Co-patentee after: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd.

Patentee after: AI SPEECH Co.,Ltd.

Address before: Suzhou City, Jiangsu Province, Suzhou Industrial Park 215123 Xinghu Street No. 328 Creative Industry Park 9-703

Co-patentee before: SHANGHAI JIAO TONG University

Patentee before: AI SPEECH Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20201019

Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee after: AI SPEECH Co.,Ltd.

Address before: Building 14, Tengfei science and Technology Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu Province, 215000

Patentee before: AI SPEECH Co.,Ltd.

Patentee before: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd.

CP01 Change in the name or title of a patent holder

Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee after: Sipic Technology Co.,Ltd.

Address before: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee before: AI SPEECH Co.,Ltd.

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Method and System for Creating Semantic Key Index

Effective date of registration: 20230726

Granted publication date: 20200331

Pledgee: CITIC Bank Limited by Share Ltd. Suzhou branch

Pledgor: Sipic Technology Co.,Ltd.

Registration number: Y2023980049433