CN112307754A

CN112307754A - Statement acquisition method and device

Info

Publication number: CN112307754A
Application number: CN202010287542.7A
Authority: CN
Inventors: 陈龙; 李宥壑
Original assignee: Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2020-04-13
Filing date: 2020-04-13
Publication date: 2021-02-02
Anticipated expiration: 2040-04-13
Also published as: CN112307754B

Abstract

The application provides a statement acquisition method and device. The method comprises: obtaining a body word; determining, according to the body word, at least one metaphor sentence corresponding to the body word from a database, wherein the database comprises a plurality of body words and at least one metaphor sentence corresponding to each body word, the metaphor sentence corresponding to each body word is generated according to at least one triple and a preset metaphor sentence template, the at least one triple is determined according to a body word set, a metaphor word set, a modifier set and the correlation distance of the triple, the correlation distance is determined according to a first vector cosine distance between the body word and the modifier, a second vector cosine distance between the metaphor word and the modifier, and the difference between the first vector cosine distance and the second vector cosine distance, and each vector cosine distance is calculated from the embedding vectors of the two words; and sending the at least one metaphor sentence to a terminal device. The number of metaphor sentences is thereby greatly increased, and their diversity is improved.

Description

Statement acquisition method and device

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a statement acquisition method and apparatus.

Background

With the continuous expansion of computer application fields, natural language processing has received wide attention. Metaphor is a common rhetorical device in which a metaphor object that resembles the body is used to describe or explain the body. Using metaphors in writing and conversation reflects a higher level of language ability, and metaphor is one of the difficulties of natural language processing. With the development of intelligent technology in recent years, chat robots and creation robots are evolving from "accurate" toward "open" and "human-like". Generally, using metaphors in conversations or texts can greatly increase the user's enjoyment and encourage the user to keep talking or reading.

A conventional chat robot collects the metaphor sentences in a corpus comprising chat logs or comment logs, identifies the body words and metaphor words in the collected metaphor sentences, and stores them in a database, so that the metaphor sentence corresponding to a body word can be used directly when that body word is received from a user.

However, since the corpus from which these metaphor sentences are collected is limited, the number of metaphor sentences stored in the database is small and the same metaphor sentences may be used repeatedly, which results in poor user experience.

Disclosure of Invention

The application provides a sentence acquisition method and a sentence acquisition device, which can obtain available metaphors of all common words of Chinese and greatly increase the number of metaphors.

In a first aspect, the present application provides a statement acquisition method, including:

obtaining a body word;

determining, according to the body word, at least one metaphor sentence corresponding to the body word from a database, wherein the database comprises a plurality of body words and at least one metaphor sentence corresponding to each body word, the at least one metaphor sentence corresponding to each body word is generated according to at least one triple and a preset metaphor sentence template, the at least one triple is determined according to a body word set, a metaphor word set, a modifier set and the correlation distance of the triple, the triple comprises a body word, a metaphor word and a modifier, the correlation distance is determined according to a first vector cosine distance between the body word and the modifier, a second vector cosine distance between the metaphor word and the modifier, and the difference between the first vector cosine distance and the second vector cosine distance, and each vector cosine distance is calculated from the embedding vectors of the two words;

and sending the at least one metaphor sentence to the terminal equipment.

Optionally, the method further includes:

performing word segmentation processing on the received corpus to obtain a word sample set, and determining a word embedded vector index corresponding to the word sample set according to a preset training model, wherein the word embedded vector index is used for storing an embedded vector of each word;

determining a metaphor word embedded vector index according to the word embedded vector index and the metaphor word set, and determining a modifier embedded vector index according to the word embedded vector index and the modifier set;

for each body word in the body word set, determining M triples according to the metaphor word embedded vector index and the modifier embedded vector index, wherein M is a preset positive integer;

and generating at least one metaphor sentence according to the correlation distance of each triple in the M triples and a preset metaphor sentence template.

Optionally, generating at least one metaphor sentence according to the correlation distance of each triple in the M triples and a preset metaphor sentence template includes:

respectively calculating the correlation distance of each triple in the M triples;

determining P triples with the minimum correlation distance from the M triples, wherein P is a preset positive integer;

and generating the at least one metaphor sentence according to the P triples and the preset metaphor sentence template.

Optionally, determining M triples according to the metaphor word embedding vector index and the modifier embedding vector index includes:

determining, through the modifier embedding vector index, N modifiers with the smallest vector cosine distance to the body word, wherein N is a preset positive integer;

for each modifier in the N modifiers, determining, through the metaphor word embedding vector index, Q metaphor words with the smallest vector cosine distance to the modifier, and determining, through the metaphor word embedding vector index, S metaphor words whose vector cosine distance to the modifier is smaller than or equal to a first distance, wherein the first distance is the vector cosine distance between the body word and the modifier;

and obtaining the M triples according to the Q metaphor words, the S metaphor words and the N modifiers, wherein M = Q + N + S.

Optionally, the correlation distance is calculated by the following formula:

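$$\mathrm{Dist}_{\alpha,\beta,\gamma} = \mathrm{dist}_{\alpha,\gamma} + \mathrm{dist}_{\beta,\gamma} + \log\left(\left|\mathrm{dist}_{\alpha,\gamma} - \mathrm{dist}_{\beta,\gamma}\right| + \xi\right)$$

wherein $\mathrm{Dist}_{\alpha,\beta,\gamma}$ is the correlation distance, $\mathrm{dist}_{\alpha,\gamma}$ is the first vector cosine distance, $\mathrm{dist}_{\beta,\gamma}$ is the second vector cosine distance, $\left|\mathrm{dist}_{\alpha,\gamma} - \mathrm{dist}_{\beta,\gamma}\right|$ is the difference between the first vector cosine distance and the second vector cosine distance, and $\xi$ is an integer.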

Optionally, determining the metaphor word embedding vector index according to the word embedding vector index and the metaphor word set comprises:

finding the word embedding vectors of all metaphor words in the metaphor word set from the word embedding vector index, and obtaining the metaphor word embedding vector index from those word embedding vectors;

and determining the modifier embedding vector index according to the word embedding vector index and the modifier set comprises:

finding the word embedding vectors of all modifiers in the modifier set from the word embedding vector index, and obtaining the modifier embedding vector index from those word embedding vectors.

Optionally, the method further includes:

obtaining a body word corpus, a metaphor word corpus and a modifier corpus;

extracting the body word set, the metaphor word set and the modifier set from the body word corpus, the metaphor word corpus and the modifier corpus, respectively.

In a second aspect, the present application provides a sentence acquisition apparatus, including:

the acquisition module is used for acquiring the body words;

a processing module, configured to determine, according to the body word, at least one metaphor sentence corresponding to the body word from a database, wherein the database comprises a plurality of body words and at least one metaphor sentence corresponding to each body word, the at least one metaphor sentence corresponding to each body word is generated according to at least one triple and a preset metaphor sentence template, the at least one triple is determined according to a body word set, a metaphor word set, a modifier set and the correlation distance of the triple, the triple comprises a body word, a metaphor word and a modifier, the correlation distance is determined according to a first vector cosine distance between the body word and the modifier, a second vector cosine distance between the metaphor word and the modifier, and the difference between the first vector cosine distance and the second vector cosine distance, and each vector cosine distance is calculated from the embedding vectors of the two words;

and the sending module is used for sending the at least one metaphor sentence to the terminal equipment.

Optionally, the apparatus further comprises:

the word segmentation module is used for carrying out word segmentation on the received corpus to obtain a word sample set, and determining a word embedded vector index corresponding to the word sample set according to a preset training model, wherein the word embedded vector index is used for storing an embedded vector of each word;

a first determining module, configured to determine the metaphor word embedded vector index according to the word embedded vector index and the metaphor word set, and determine the modifier embedded vector index according to the word embedded vector index and the modifier set;

a second determining module, configured to determine, for each body word in the body word set, M triples according to the metaphor word embedded vector index and the modifier embedded vector index, where M is a preset positive integer;

and the generating module is used for generating at least one metaphor sentence according to the correlation distance of each triple in the M triples and a preset metaphor sentence template.

Optionally, the generating module is configured to:

Optionally, the second determining module is configured to:

Optionally, the correlation distance is calculated by the following formula:

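$$\mathrm{Dist}_{\alpha,\beta,\gamma} = \mathrm{dist}_{\alpha,\gamma} + \mathrm{dist}_{\beta,\gamma} + \log\left(\left|\mathrm{dist}_{\alpha,\gamma} - \mathrm{dist}_{\beta,\gamma}\right| + \xi\right)$$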

Optionally, the first determining module is configured to:

Optionally, the obtaining module is further configured to:

the processing module is further configured to: extract the body word set, the metaphor word set and the modifier set from the body word corpus, the metaphor word corpus and the modifier corpus, respectively.

According to the sentence acquisition method and device provided by the application, after a body word is acquired, at least one metaphor sentence corresponding to the body word is determined from the database according to the body word, and the at least one metaphor sentence is then sent to the terminal device. The body words in the database are all commonly used Chinese words, the at least one metaphor sentence corresponding to each body word in the database is generated according to at least one triple and a preset metaphor sentence template, and the at least one triple is determined according to the body word set, the metaphor word set, the modifier set and the correlation distance of the triple. The database therefore stores available metaphor sentences for all commonly used Chinese words, which greatly increases the number of metaphor sentences, avoids repeated use, improves the diversity of the metaphor sentences, expands the range of application scenarios and improves the user experience.

Drawings

In order to illustrate the technical solutions of the present application or the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic view of an application scenario of the present application;

FIG. 2 is a flowchart of an embodiment of a sentence acquisition method provided by the present application;

FIG. 3 is a flowchart of an embodiment of a sentence acquisition method provided by the present application;

FIG. 4 is a schematic diagram of a process for training a Word2vec model after a corpus is received;

FIG. 5 is a schematic diagram of a word embedding vector index;

FIG. 6 is a schematic diagram of a process for deriving a metaphor word embedding vector index and a modifier embedding vector index from the word embedding vector index;

FIG. 7 is a schematic diagram of the relevance of body words, metaphor words and modifiers;

FIG. 8 is a schematic diagram of a process for determining the M triples corresponding to each body word in the body word set provided by the present application;

FIG. 9 is a schematic structural diagram of a sentence acquisition apparatus provided in the present application;

FIG. 10 is a schematic structural diagram of a sentence acquisition apparatus provided in the present application;

FIG. 11 is a schematic diagram of a hardware structure of an electronic device provided in the present application.

Detailed Description

To make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the present application will be clearly and completely described below with reference to the drawings in the present application, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

First, some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.

1. Word2vec is a group of related models used to generate word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. The network takes a word as input and predicts the words in adjacent positions; under the bag-of-words assumption in Word2vec, the order of the words is unimportant. After training is complete, the Word2vec model can be used to map each word to a vector that represents word-to-word relationships; this vector is the hidden layer of the neural network.

2. TensorFlow is a symbolic mathematical system based on dataflow programming, and is widely used to implement various machine learning algorithms.

3. The word embedding vector index stores the embedding vector of each word. A Word2vec model is trained with TensorFlow on the word sample set, and once training is complete, the hidden layer parameter W of the neural network holds the embedding vectors of all words in the word sample set. The word embedding vector index in this application can be implemented with any vector storage engine, such as Faiss or Milvus. Given the embedding vector v of an arbitrary word as input, the word embedding vector index can find the several words closest to v, together with their embedding vectors, within milliseconds.

4. The vector cosine distance is the cosine distance between the embedding vectors of two words. The word embedding vector index stores the embedding vector of each word, and the vector cosine distance between two words is calculated from their embedding vectors as follows:

$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

wherein A and B are the two word embedding vectors, and cos θ is the vector cosine distance, i.e. the dot product of A and B divided by the product of the lengths of A and B.

5. A triplet refers to a combination that includes a body word, a metaphor word, and a modifier word.
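The cosine formula in item 4 can be sketched in a few lines of Python. This is a minimal illustration rather than the application's implementation: the function name `cosine` and the 3-dimensional toy vectors are assumptions made here (the embodiment's vectors have length 300).

```python
import math

def cosine(a, b):
    """Return cos(theta) = (a . b) / (|a| * |b|) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy embedding vectors for two words.
life = [1.0, 0.0, 1.0]
honey = [1.0, 1.0, 0.0]
print(cosine(life, honey))
```

Orthogonal vectors give 0 and parallel vectors give 1, so a pipeline that ranks by "smallest distance" would need to decide whether to rank on cos θ itself or on a derived quantity such as 1 − cos θ; the text does not settle that point.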

In the conventional sentence acquisition method, the metaphor sentences in the database come from existing metaphor sentences collected from a corpus: the body words and metaphor words of the collected metaphor sentences are recognized, and the body words and their corresponding metaphor sentences are stored in the database. Because existing metaphor sentences are limited, the number of metaphor sentences stored in the database is small, metaphor sentences are used repeatedly during human-computer interaction, and the user experience is poor. To solve this problem, the present application provides a sentence acquisition method and apparatus. In the present application, the body words in the database are all commonly used Chinese words, the at least one metaphor sentence corresponding to each body word in the database is generated according to at least one triple and a preset metaphor sentence template, and the at least one triple is determined according to the body word set, the metaphor word set, the modifier set and the correlation distance of the triple. The database therefore stores available metaphor sentences for all commonly used Chinese words, which greatly increases the number of metaphor sentences, avoids repeated use, improves the diversity of the metaphor sentences, expands the range of application scenarios and improves the user experience. The following describes in detail a specific implementation process of the statement acquisition method according to the embodiment of the present application, with reference to the accompanying drawings.

The sentence acquisition method and device can be applied to scenarios such as machine-generated text and chat robots (for example, customer service robots). Taking a customer service robot as an example, fig. 1 is a schematic view of an application scenario of the present application. As shown in fig. 1, the application scenario involves a terminal device and a server. The terminal device is an electronic device such as a mobile phone or a personal computer, and the interface of the terminal device shown in fig. 1 is a chat interface between a user and the customer service robot. For example, the user inputs the sentence "what is a crescent moon" on the chat interface. The server obtains the body word "crescent moon" from the sentence input by the user and queries the database for that body word. If the body word exists, the server sends at least one metaphor sentence corresponding to the body word to the terminal device, for example "the crescent moon is just like a hook" or "the crescent moon is just like a sickle". If the body word does not exist, the server sends an invalid-body-word message to the terminal device. The terminal device receives the at least one metaphor sentence and displays it, so that the user can see the displayed content.

Fig. 2 is a flowchart of an embodiment of a statement obtaining method provided in the present application, where an execution subject of the embodiment may be the server shown in fig. 1, and as shown in fig. 2, the method of the embodiment may include:

S101, acquiring a body word.

Specifically, a user inputs a sentence, or directly inputs a body word, through a terminal device. The body word is a noun. The sentence or word input by the user is analyzed to obtain the body word; for example, "life" input by the user is a body word, and "month" input by the user is a body word.

S102, determining, according to the body word, at least one metaphor sentence corresponding to the body word from a database, wherein the database comprises a plurality of body words and at least one metaphor sentence corresponding to each body word, the at least one metaphor sentence corresponding to each body word is generated according to at least one triple and a preset metaphor sentence template, the at least one triple is determined according to the body word set, the metaphor word set, the modifier set and the correlation distance of the triple, the triple comprises a body word, a metaphor word and a modifier, and the correlation distance is determined according to the first vector cosine distance between the body word and the modifier, the second vector cosine distance between the metaphor word and the modifier, and the difference between the first vector cosine distance and the second vector cosine distance.

Specifically, the database includes a plurality of body words, each of which corresponds to at least one metaphor sentence. The server searches the database for the body word obtained in S101. If the body word exists, the server sends at least one metaphor sentence corresponding to the body word to the terminal device, for example "the crescent moon is just like a sickle" or "the crescent moon is like a fishhook"; if the body word does not exist, the server sends an invalid-body-word message to the terminal device. The terminal device displays the received content, thereby realizing interaction with the user. In this embodiment, the at least one metaphor sentence corresponding to each body word may be generated by the server or by a metaphor sentence generating apparatus: a plurality of triples are first determined according to the body word set, the metaphor word set, the modifier set and the correlation distances of the triples, and the metaphor sentences are then generated from the plurality of triples and the preset metaphor sentence template. The number of triples can be preset according to the computing capability and actual requirements of the server. Each triple includes a body word, a metaphor word and a modifier, and its correlation distance is determined according to the first vector cosine distance between the body word and the modifier, the second vector cosine distance between the metaphor word and the modifier, and the difference between the two. The correlation distance of a triple represents the possibility that the triple can be used as a metaphor sentence: the smaller the correlation distance, the greater that possibility.

In addition, at least one metaphor sentence corresponding to each body word may be generated by the server, or may be generated by another metaphor sentence generating means, and if the metaphor sentence is generated by the server, the server stores a plurality of body words and at least one metaphor sentence corresponding to each body word in the database after the generation; if the metaphorical sentence generating means generates the metaphorical sentence, the server stores the plurality of body words generated by the metaphorical sentence generating means and at least one metaphorical sentence corresponding to each body word in the database.

The body word set, the metaphor word set and the modifier set are three different word sets; the body words and the metaphor words are nouns, and the modifiers are verbs and adjectives. In one implementable manner, the body word set, the metaphor word set and the modifier set can be pre-stored. The body words of metaphors are generally abstract matters that are harder to understand, such as "life" and "love", while the metaphor words are generally common, concrete matters, such as "honey" and "marine buildings". Therefore, in this embodiment, the body words in the body word set can be words from modern poetry collections, modern prose collections and literary magazines, the metaphor words in the metaphor word set can be words from chat logs or comment logs, and the modifiers in the modifier set can be words from adjective dictionaries, verb dictionaries or noun dictionaries. In another implementable manner, the body word set, the metaphor word set and the modifier set may be obtained online, in which case the method of this embodiment may further include, before S101: obtaining a body word corpus, a metaphor word corpus and a modifier corpus, and extracting the body word set, the metaphor word set and the modifier set from the body word corpus, the metaphor word corpus and the modifier corpus, respectively. For example, the body word corpus can be modern poetry collections, modern prose collections and literary magazines, and all corpora in the body word corpus are segmented by a word segmentation system to obtain the body word set; the metaphor word corpus can be a chat log or a comment log, and all corpora in the metaphor word corpus are segmented by a word segmentation system to obtain the metaphor word set; the modifier corpus can be an adjective dictionary, a verb dictionary or a noun dictionary, and the words in the dictionary can be used directly as modifiers to obtain the modifier set.

In this embodiment, the embedding vector of each word may be obtained by training a Word2vec model with TensorFlow on a word sample set; once training is complete, the hidden layer parameter W of the neural network holds the embedding vectors of all words in the word sample set. The word embedding vector index in this application can be implemented with any vector storage engine, such as Faiss or Milvus. Given the embedding vector v of any word as input, the word embedding vector index can find the several words closest to v, together with their embedding vectors, within milliseconds.
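To make the role of the word embedding vector index concrete, the following is a minimal brute-force sketch of the lookup it provides. It is not Faiss or Milvus and does not use their APIs; the class name `BruteForceIndex`, the method `nearest`, the toy vectors, and the choice to rank by 1 − cos θ are all assumptions made for illustration. A real vector storage engine would replace the linear scan with an optimized search.

```python
import math

def cosine(a, b):
    """cos(theta) between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class BruteForceIndex:
    """Hypothetical stand-in for a vector storage engine: word -> embedding vector."""

    def __init__(self, vectors):
        self.vectors = dict(vectors)

    def nearest(self, v, k):
        """Return the k (word, vector) pairs closest to v.

        Assumption: 'closest' means smallest 1 - cos(theta).
        """
        ranked = sorted(self.vectors.items(),
                        key=lambda item: 1.0 - cosine(v, item[1]))
        return ranked[:k]

# Toy 3-dimensional vectors; the embodiment uses vectors of length 300.
index = BruteForceIndex({
    "life": [0.9, 0.1, 0.3],
    "honey": [0.8, 0.2, 0.1],
    "sickle": [0.0, 1.0, 0.2],
})
print([word for word, _ in index.nearest([1.0, 0.0, 0.2], k=2)])
```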

S103, sending the at least one metaphor sentence to the terminal device.

Specifically, after receiving at least one metaphor sentence, the terminal device displays the metaphor sentence according to the received content, thereby realizing interaction with the user.

According to the sentence acquisition method provided by the embodiment, after the body words are acquired, at least one metaphor corresponding to the body words is determined from the database according to the body words, and then the at least one metaphor is sent to the terminal device.
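The request flow of S101 to S103 can be sketched as a simple lookup. This is an illustrative simplification under stated assumptions: the database is modeled as an in-memory dictionary, body word extraction is reduced to substring matching against the stored body words, and the names `METAPHOR_DB`, `INVALID_BODY_WORD`, `extract_body_word` and `handle_request` are hypothetical.

```python
# Hypothetical in-memory stand-in for the database: body word -> metaphor sentences.
METAPHOR_DB = {
    "crescent moon": [
        "The crescent moon is just like a hook.",
        "The crescent moon is just like a sickle.",
    ],
}

INVALID_BODY_WORD = "invalid body word"  # hypothetical reply text

def extract_body_word(user_input, known_body_words):
    """S101 (simplified): return the first known body word found in the input, else None."""
    for word in known_body_words:
        if word in user_input:
            return word
    return None

def handle_request(user_input, db=METAPHOR_DB):
    """S102 + S103: look the body word up and return what would be sent to the terminal."""
    body_word = extract_body_word(user_input, db.keys())
    if body_word is None:
        return [INVALID_BODY_WORD]
    return db[body_word]

print(handle_request("what is a crescent moon"))
print(handle_request("what is a teapot"))
```

A production server would analyze the sentence with a proper parser before the lookup; the substring match here only keeps the sketch self-contained.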

In the embodiment shown in fig. 2, the plurality of body words and the at least one metaphor sentence corresponding to each body word stored in the database may be generated in advance and then stored, and may later be updated by the same generation method; they may also be generated online, with the generated body words and the at least one metaphor sentence corresponding to each body word stored directly in the database. Before S101, the method may therefore also include a process of generating the at least one metaphor sentence corresponding to each body word, which is described in detail below with reference to fig. 3.

Fig. 3 is a flowchart of an embodiment of a statement obtaining method provided in the present application, where an execution subject of the embodiment may be the server shown in fig. 1, and as shown in fig. 3, the method of the embodiment may include:

S201, performing word segmentation processing on the received corpus to obtain a word sample set, and determining a word embedding vector index corresponding to the word sample set according to a preset training model, wherein the word embedding vector index is used for storing an embedding vector of each word.

The received corpus can be commodity comments, chat logs and user-authored documents. It can be understood that the received corpus comprises a plurality of sentences, and each sentence can be segmented by a word segmentation system to obtain a word sample set, i.e. the plurality of words obtained after segmentation. For example, the sentence "today's weather is clear" is segmented into "today's", "weather" and "clear", and the words in the word sample set are stored in the natural order of the sentences. Then, a word embedding vector index corresponding to the word sample set is determined according to a preset training model, which may be a Word2vec model. In that case, the Word2vec model is trained with TensorFlow on the word sample set. Fig. 4 is a schematic diagram of the process of training the Word2vec model after the corpus is received. As shown in fig. 4, the Word2vec model is trained on the word sample set; after training is complete, the input layer, hidden layer and output layer of the neural network are obtained, and the hidden layer parameter W (of size D × 300, where D is the total number of words in the word sample set) holds the embedding vectors of all words in the word sample set. In this embodiment, for example, the width of the hidden layer may be set to 300 (which is also the length of the word embedding vectors obtained by training), and the window width may be set to 8; D is the total number of distinct words in the sample set and is generally tens of thousands to hundreds of thousands. After the embedding vectors of all words in the word sample set are obtained, they are stored in a vector storage engine to obtain the word embedding vector index. Fig. 5 is a schematic diagram of a word embedding vector index. As shown in fig. 5, the word embedding vector index stores each word together with its embedding vector; for example, the embedding vector corresponding to the word "life" is "0.23576 0.81324 0.33255 -0.27385 ……".
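As a rough illustration of how a segmented word sample set feeds Word2vec training, the sketch below counts the vocabulary size D and draws (center, context) pairs of the kind a skip-gram style model consumes. It does not train a model and does not use TensorFlow. The function name `skipgram_pairs`, the skip-gram choice, and the reading of the window width as "words on each side of the center word" are assumptions; the text only states a window width of 8.

```python
def skipgram_pairs(sentences, window=8):
    """Return (center, context) word pairs from already-segmented sentences.

    Assumption: `window` counts words on each side of the center word.
    """
    pairs = []
    for words in sentences:
        for i, center in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, words[j]))
    return pairs

# Word sample set for the example sentence, kept in natural sentence order.
word_sample_set = [["today's", "weather", "clear"]]

# D is the number of distinct words; the hidden layer parameter W would be D x 300.
vocabulary = sorted({w for sentence in word_sample_set for w in sentence})
print(len(vocabulary))
print(skipgram_pairs(word_sample_set, window=1))
```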

S202, determining a metaphor word embedding vector index according to the word embedding vector index and the metaphor word set, and determining a modifier embedding vector index according to the word embedding vector index and the modifier set.

Specifically, as one implementable manner, fig. 6 is a schematic diagram of the process of obtaining the metaphor word embedding vector index and the modifier embedding vector index from the word embedding vector index. After the word embedding vector index is obtained in S201, the metaphor word set can be intersected with the word embedding vector index to obtain the metaphor word embedding vector index, and the modifier set can be intersected with the word embedding vector index to obtain the modifier embedding vector index.
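The intersection step of S202 can be sketched as a dictionary filter, assuming the word embedding vector index is modeled as a mapping from word to vector. The function name `sub_index` and the 2-dimensional toy vectors are hypothetical.

```python
def sub_index(word_index, word_set):
    """Keep only the entries of word_index whose word appears in word_set."""
    return {word: vec for word, vec in word_index.items() if word in word_set}

# Hypothetical word embedding vector index with toy 2-dimensional vectors.
word_index = {
    "life": [0.2, 0.8],
    "honey": [0.7, 0.1],
    "sweet": [0.6, 0.3],
    "today": [0.1, 0.1],
}
metaphor_word_set = {"honey", "sickle"}   # "sickle" has no trained vector
modifier_set = {"sweet", "brave"}         # "brave" has no trained vector

metaphor_index = sub_index(word_index, metaphor_word_set)
modifier_index = sub_index(word_index, modifier_set)
print(metaphor_index)
print(modifier_index)
```

Words that appear in a set but were never seen in the training corpus simply drop out, which is what taking an intersection implies.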

S203, for each body word in the body word set, determining M triples according to the metaphor word embedding vector index and the modifier embedding vector index, where M is a preset positive integer.

Specifically, M triples, i.e. M triples that can constitute metaphor sentences, are determined based on the metaphor word embedding vector index and the modifier embedding vector index. The body word and the metaphor word of a metaphor sentence are generally not strongly related (e.g., "life" and "honey"), but they must share a commonality ("sweet"), which can be expressed as a modifier. Fig. 7 is a schematic diagram of the relevance of body words, metaphor words and modifiers. As shown in fig. 7, for the body word "life", the four modifiers with the smallest vector cosine distance to "life", such as "red", "sweet", "brave" and "distressed", may be found through the modifier embedding vector index; the number between every two words is their vector cosine distance (e.g., the vector cosine distance between "life" and "red" is 0.816). The metaphor words with the smallest vector cosine distance to each determined modifier, such as the three metaphor words "dense", "fresh blood" and "ground prison", may then be determined through the metaphor word embedding vector index. Multiple triples may be obtained by combining the body word with the metaphor words and modifiers.

As an implementation manner, S203 may specifically be:

S2031, determining, through the modifier embedding vector index, N modifiers with the smallest vector cosine distance from the body word, where N is a preset positive integer.

S2032, for each modifier in the N modifiers, determining, through the metaphor word embedding vector index, Q metaphor words with the smallest vector cosine distance from the modifier, and determining, through the metaphor word embedding vector index, S metaphor words whose vector cosine distance from the modifier is smaller than or equal to a first distance, where the first distance is the vector cosine distance between the body word and the modifier.

S2033, obtaining M triples according to the Q metaphor words, the S metaphor words, and the N modifiers, where M = Q × N + S × N.
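Steps S2031 to S2033 can be sketched as follows. This is an illustration under stated assumptions rather than the applicant's implementation: the vector cosine distance is assumed to be 1 minus the cosine similarity, the indexes are plain dictionaries searched exhaustively, and, because the text does not say which S words are kept when more than S qualify, the sketch keeps the first S in index order. All function names and toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Triple = Tuple[str, str, str]  # (body word, metaphor word, modifier)


def cosine_distance(u: Vector, v: Vector) -> float:
    """Assumed definition: 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def nearest(query: Vector, index: Dict[str, Vector], k: int) -> List[str]:
    """Return the k words of index with the smallest cosine distance to query."""
    return sorted(index, key=lambda w: cosine_distance(query, index[w]))[:k]


def candidate_triples(body_word: str, body_vec: Vector,
                      metaphor_index: Dict[str, Vector],
                      modifier_index: Dict[str, Vector],
                      n: int, q: int, s: int) -> List[Triple]:
    triples: List[Triple] = []
    for modifier in nearest(body_vec, modifier_index, n):        # S2031
        mod_vec = modifier_index[modifier]
        first_distance = cosine_distance(body_vec, mod_vec)
        top_q = nearest(mod_vec, metaphor_index, q)               # S2032, Q nearest
        within = [w for w in metaphor_index                       # S2032, S within first distance
                  if cosine_distance(mod_vec, metaphor_index[w]) <= first_distance][:s]
        for metaphor in top_q + within:                           # S2033
            triples.append((body_word, metaphor, modifier))
    return triples


# Toy indexes with made-up vectors.
modifier_index = {"sweet": [1.0, 1.0], "bitter": [0.0, 1.0], "cold": [-1.0, 0.0]}
metaphor_index = {"sugar": [1.0, 0.5], "honey": [1.0, 1.0], "hell": [-1.0, -1.0]}
triples = candidate_triples("life", [1.0, 0.0], metaphor_index, modifier_index, n=1, q=1, s=1)
```

With large indexes each modifier contributes Q + S candidates, giving Q × N + S × N triples per body word. The sketch does not remove duplicates between the two groups of metaphor words.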

S204, generating at least one metaphor sentence according to the correlation distance of each triple in the M triples and a preset metaphor sentence template.

Specifically, S204 may include:

S2041, calculating the correlation distance of each triple in the M triples.

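Taking the body word α, the metaphor word β, and the modifier γ as an example, the correlation distance can be calculated by the following formula:

$$\mathrm{Dist}_{\alpha,\beta,\gamma} = \mathrm{dist}_{\alpha,\gamma} + \mathrm{dist}_{\beta,\gamma} + \log\left(\left|\mathrm{dist}_{\alpha,\gamma} - \mathrm{dist}_{\beta,\gamma}\right| + \xi\right)$$

where $\mathrm{Dist}_{\alpha,\beta,\gamma}$ is the correlation distance, $\mathrm{dist}_{\alpha,\gamma}$ is the first vector cosine distance, $\mathrm{dist}_{\beta,\gamma}$ is the second vector cosine distance, $\left|\mathrm{dist}_{\alpha,\gamma} - \mathrm{dist}_{\beta,\gamma}\right|$ is the difference between the first vector cosine distance and the second vector cosine distance, and $\xi$ is an integer, such as 1. The smaller $\mathrm{Dist}_{\alpha,\beta,\gamma}$ is, the more likely it is that the triple (α, β, γ) can be used as a metaphor.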
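The correlation distance can be sketched as a small function. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here, and ξ defaults to 1 as in the example value given above. The function and parameter names are hypothetical.

```python
import math


def correlation_distance(dist_body_mod: float, dist_metaphor_mod: float, xi: float = 1.0) -> float:
    """Sum of the two cosine distances plus a log penalty on their difference.

    Assumes the natural logarithm; the source text does not name the base.
    """
    gap = abs(dist_body_mod - dist_metaphor_mod)
    return dist_body_mod + dist_metaphor_mod + math.log(gap + xi)


balanced = correlation_distance(0.3, 0.3)    # both words equally close to the modifier
unbalanced = correlation_distance(0.1, 0.5)  # same sum, but unequal distances
```

With ξ = 1 the logarithmic term is zero when the two distances are equal and grows as they diverge, so of two triples with the same summed distance, the one whose body word and metaphor word are equally close to the modifier scores lower.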

S2042, determining P triples with the minimum correlation distance from the M triples, wherein P is a preset positive integer.

S2043, generating at least one metaphor sentence according to the P triples and a preset metaphor sentence template.

Specifically, there may be one or more preset metaphor sentence templates. Table 1 below is an example of three metaphor sentence templates and the corresponding metaphor sentences:

Specifically, for each of the P triples, the body word, the metaphor word, and the modifier of the triple replace the corresponding wildcards "[ ]" in the template, so that a complete metaphor sentence is generated. For example, substituting the triple ("life", "honey", "sweet") into the template "[body word] is just as [modifier] as [metaphor word]" generates the metaphor sentence "life is just as sweet as honey".
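Steps S2042 and S2043 can be sketched together, assuming the correlation distance of each triple has already been computed. The template string is a hypothetical English rendering with named placeholders standing in for the wildcards, and the distance values are made up for illustration.

```python
import heapq
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (body word, metaphor word, modifier)

# Hypothetical template; the placeholder names are illustrative.
TEMPLATE = "{body} is just as {modifier} as {metaphor}"


def best_triples(scored: List[Tuple[Triple, float]], p: int) -> List[Triple]:
    """S2042: keep the P triples with the smallest correlation distance."""
    return [triple for triple, _ in heapq.nsmallest(p, scored, key=lambda item: item[1])]


def fill_template(triple: Triple, template: str = TEMPLATE) -> str:
    """S2043: substitute the three words of a triple into the template."""
    body, metaphor, modifier = triple
    return template.format(body=body, metaphor=metaphor, modifier=modifier)


scored = [
    (("life", "honey", "sweet"), 0.62),      # made-up correlation distances
    (("life", "hell", "distressed"), 1.40),
    (("life", "blood", "red"), 0.95),
]
sentences = [fill_template(t) for t in best_triples(scored, p=2)]
```

Supporting several templates would only require calling `fill_template` once per template for each selected triple.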

In the sentence acquisition method provided in this embodiment, word segmentation is performed on the received corpus to obtain a word sample set, and the word embedding vector index corresponding to the word sample set is determined according to a preset training model. The metaphor word embedding vector index and the modifier embedding vector index are then determined according to the word embedding vector index, the metaphor word set, and the modifier set. Next, for each body word in the body word set, M triples are determined according to the metaphor word embedding vector index and the modifier embedding vector index. Finally, at least one metaphor sentence is generated according to the correlation distance of each of the M triples and a preset metaphor sentence template. In this way, usable metaphors can be generated for all commonly used Chinese words from the received corpus, potential metaphorical relationships of any word can be discovered automatically, and many metaphors that a person would not think of can be found. The method expands the boundary of figurative expression in text, generalizes well, can generate metaphor sentences similar to those written by humans, and is suitable for scenarios such as machine-generated text and chat robots, where it can improve the user experience.

Fig. 8 is a schematic diagram of a process for determining the M triples corresponding to each body word in the body word set provided in the present application. In this embodiment, taking N = 100, Q = 10, and S = 10 as an example, M = Q × N + S × N = 10 × 100 + 10 × 100 = 2000. As shown in Fig. 8, the process for determining the M triples may include:

S301, determining whether the body word set contains a body word that has not yet been taken.

If yes, go to step S302; otherwise, terminate.

S302, taking one body word from the body word set and denoting it as W.

S303, determining, through the modifier embedding vector index, the 100 modifiers with the smallest vector cosine distance from the body word W.

S304, for each modifier in the 100 modifiers, determining, through the metaphor word embedding vector index, the 10 metaphor words with the smallest vector cosine distance from the modifier.

S305, for each modifier in the 100 modifiers, determining, through the metaphor word embedding vector index, 10 metaphor words whose vector cosine distance from the modifier is smaller than or equal to a first distance, where the first distance is the vector cosine distance between the body word W and the modifier.

S306, obtaining 2000 triples according to the body word W, the 100 modifiers, and the 10 metaphor words determined for each modifier in each of S304 and S305.

Fig. 9 is a schematic structural diagram of a sentence acquisition apparatus provided in the present application, and as shown in fig. 9, the apparatus of this embodiment may include: an acquisition module 11, a processing module 12 and a sending module 13, wherein,

the obtaining module 11 is used for obtaining the body word;

the processing module 12 is configured to determine, according to the body word, at least one metaphor sentence corresponding to the body word from a database, where the database includes a plurality of body words and at least one metaphor sentence corresponding to each body word; the at least one metaphor sentence corresponding to each body word is generated according to at least one triple and a preset metaphor sentence template; the at least one triple is determined according to the body word set, the metaphor word set, the modifier set, and the correlation distances of the triples; each triple includes a body word, a metaphor word, and a modifier; the correlation distance is determined according to a first vector cosine distance between the body word and the modifier, a second vector cosine distance between the metaphor word and the modifier, and the difference between the first vector cosine distance and the second vector cosine distance; and each vector cosine distance is calculated from the embedding vectors of the two words;

the sending module 13 is configured to send at least one metaphor sentence to the terminal device.
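The cooperation of the three modules can be pictured with a minimal sketch, assuming the database is a simple in-memory mapping from body word to previously generated metaphor sentences and that sending is abstracted as a callback. The names `METAPHOR_DB`, `acquire_sentences`, and `handle_request` are hypothetical, and sending nothing when no sentence is stored is an assumption of the sketch.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for the database of generated metaphor sentences.
METAPHOR_DB: Dict[str, List[str]] = {
    "life": ["life is just as sweet as honey"],
}


def acquire_sentences(body_word: str, database: Dict[str, List[str]]) -> List[str]:
    """Processing step: look up the metaphor sentences stored for body_word."""
    return database.get(body_word, [])


def handle_request(body_word: str,
                   send: Callable[[List[str]], None],
                   database: Dict[str, List[str]] = METAPHOR_DB) -> None:
    """Obtain a body word, determine its sentences, and send them if any exist."""
    sentences = acquire_sentences(body_word, database)
    if sentences:
        send(sentences)


sent: List[List[str]] = []            # records what would go to the terminal device
handle_request("life", sent.append)
handle_request("unknown", sent.append)
```

Because the sentences are generated ahead of time, serving a request reduces to a single lookup.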

The apparatus provided in the embodiment of the present application may implement the method embodiment shown in fig. 2, and for details of the implementation principle and technical effect, reference may be made to the method embodiment, which is not described herein again.

Fig. 10 is a schematic structural diagram of a sentence acquisition apparatus provided in the present application, and as shown in fig. 10, the apparatus of the present embodiment may further include, on the basis of the apparatus shown in fig. 9: a segmentation module 14, a first determination module 15, a second determination module 16 and a generation module 17, wherein,

the segmentation module 14 is configured to perform word segmentation on the received corpus to obtain a word sample set, and determine, according to a preset training model, a word embedding vector index corresponding to the word sample set, where the word embedding vector index is used to store the embedding vector of each word;

the first determining module 15 is configured to determine a metaphor word embedding vector index according to the word embedding vector index and the metaphor word set, and determine a modifier embedding vector index according to the word embedding vector index and the modifier set;

the second determining module 16 is configured to determine, for each body word in the body word set, M triples according to the metaphor word embedding vector index and the modifier embedding vector index, where M is a preset positive integer;

the generating module 17 is configured to generate at least one metaphor sentence according to the correlation distance of each triple in the M triples and a preset metaphor sentence template.

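Further, the generating module 17 is configured to:

calculate the correlation distance of each triple in the M triples; determine, from the M triples, P triples with the smallest correlation distance, where P is a preset positive integer; and generate at least one metaphor sentence according to the P triples and a preset metaphor sentence template.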

Further, the second determining module 16 is configured to:

determine, through the modifier embedding vector index, N modifiers with the smallest vector cosine distance from the body word, where N is a preset positive integer; for each modifier in the N modifiers, determine, through the metaphor word embedding vector index, Q metaphor words with the smallest vector cosine distance from the modifier, and determine, through the metaphor word embedding vector index, S metaphor words whose vector cosine distance from the modifier is smaller than or equal to a first distance, where the first distance is the vector cosine distance between the body word and the modifier; and obtain M triples according to the Q metaphor words, the S metaphor words, and the N modifiers, where M = Q × N + S × N.

Optionally, the correlation distance is calculated by the following formula:

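$$\mathrm{Dist}_{\alpha,\beta,\gamma} = \mathrm{dist}_{\alpha,\gamma} + \mathrm{dist}_{\beta,\gamma} + \log\left(\left|\mathrm{dist}_{\alpha,\gamma} - \mathrm{dist}_{\beta,\gamma}\right| + \xi\right)$$

where $\mathrm{Dist}_{\alpha,\beta,\gamma}$ is the correlation distance, $\mathrm{dist}_{\alpha,\gamma}$ is the first vector cosine distance, $\mathrm{dist}_{\beta,\gamma}$ is the second vector cosine distance, $\left|\mathrm{dist}_{\alpha,\gamma} - \mathrm{dist}_{\beta,\gamma}\right|$ is the difference between the first vector cosine distance and the second vector cosine distance, and $\xi$ is an integer.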

Further, the first determining module 15 is configured to:

find the word embedding vectors of all modifiers in the modifier set in the word embedding vector index, and obtain the modifier embedding vector index from the word embedding vectors of all the modifiers.

Further, the obtaining module 11 is further configured to:

the processing module is further configured to: extract the body word set, the metaphor word set, and the modifier set from the body word corpus, the metaphor word corpus, and the modifier corpus, respectively.

The apparatus provided in the embodiment of the present application may implement the method embodiment shown in fig. 3, and for details of the implementation principle and technical effect, reference may be made to the method embodiment, which is not described herein again.

Fig. 11 is a schematic diagram of a hardware structure of an electronic device provided in the present application. As shown in fig. 11, the electronic device 20 of the present embodiment may include: a memory 21 and a processor 22;

a memory 21 for storing a computer program;

a processor 22 for executing the computer program stored in the memory to implement the sentence acquisition method in the above-described embodiments. For details, reference may be made to the description of the method embodiments above.

Alternatively, the memory 21 may be separate or integrated with the processor 22.

When the memory 21 is a device separate from the processor 22, the electronic device 20 may further include:

a bus 23 for connecting the memory 21 and the processor 22.

Optionally, this embodiment further includes: a communication interface 24, the communication interface 24 being connectable to the processor 22 via the bus 23. The processor 22 may control the communication interface 24 to implement the above-described receiving and transmitting functions of the electronic device 20.

The electronic device provided by this embodiment can be used to execute the above method, and its implementation manner and technical effect are similar, and this embodiment is not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division into modules is only one logical division, and other divisions are possible in practice; a plurality of modules may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or modules, and may be in an electrical, mechanical, or other form.

Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, the functional modules in the embodiments of the present application may be integrated into one processing unit, or each module may exist alone physically, or two or more modules may be integrated into one unit. The unit formed by the modules can be implemented in the form of hardware, or in the form of hardware plus software functional units.

The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods according to the embodiments of the present application.

It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.

The memory may comprise a high-speed RAM, and may further comprise a non-volatile memory (NVM), such as at least one disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, an optical disk, or the like.

The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.

The computer-readable storage medium may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or magnetic or optical disks. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.

Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A sentence acquisition method, comprising:

obtaining a body word;

and sending the at least one metaphor sentence to the terminal equipment.

2. The method of claim 1, further comprising:

3. The method of claim 2, wherein generating at least one metaphor from the relevance distance of each of the plurality of triples and a preset metaphor template comprises:

4. The method of claim 2, wherein determining M triples from the metaphor word embedding vector index and the modifier embedding vector index comprises:

5. The method according to any one of claims 1-4, wherein the correlation distance is calculated by the following formula:

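$$\mathrm{Dist}_{\alpha,\beta,\gamma} = \mathrm{dist}_{\alpha,\gamma} + \mathrm{dist}_{\beta,\gamma} + \log\left(\left|\mathrm{dist}_{\alpha,\gamma} - \mathrm{dist}_{\beta,\gamma}\right| + \xi\right)$$

wherein $\mathrm{Dist}_{\alpha,\beta,\gamma}$ is the correlation distance, $\mathrm{dist}_{\alpha,\gamma}$ is the first vector cosine distance, $\mathrm{dist}_{\beta,\gamma}$ is the second vector cosine distance, $\left|\mathrm{dist}_{\alpha,\gamma} - \mathrm{dist}_{\beta,\gamma}\right|$ is the difference between the first vector cosine distance and the second vector cosine distance, and $\xi$ is an integer.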

6. The method of any one of claims 2-4, wherein said determining a metaphor word embedding vector index from the word embedding vector index and the metaphor word set comprises:

7. The method according to any one of claims 1-4, further comprising:

8. A sentence acquisition apparatus, comprising:

the acquisition module is used for acquiring the body words;

9. The apparatus of claim 8, further comprising:

10. The apparatus of claim 9, wherein the generating module is configured to:

11. The apparatus of claim 9, wherein the second determining module is configured to:

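12. The apparatus according to any one of claims 8-11, wherein the correlation distance is calculated by the following formula:

$$\mathrm{Dist}_{\alpha,\beta,\gamma} = \mathrm{dist}_{\alpha,\gamma} + \mathrm{dist}_{\beta,\gamma} + \log\left(\left|\mathrm{dist}_{\alpha,\gamma} - \mathrm{dist}_{\beta,\gamma}\right| + \xi\right)$$

wherein $\mathrm{Dist}_{\alpha,\beta,\gamma}$ is the correlation distance, $\mathrm{dist}_{\alpha,\gamma}$ is the first vector cosine distance, $\mathrm{dist}_{\beta,\gamma}$ is the second vector cosine distance, $\left|\mathrm{dist}_{\alpha,\gamma} - \mathrm{dist}_{\beta,\gamma}\right|$ is the difference between the first vector cosine distance and the second vector cosine distance, and $\xi$ is an integer.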

13. The apparatus of any one of claims 9-11, wherein the first determining module is configured to:

14. The apparatus of any one of claims 8-11, wherein the obtaining module is further configured to:

15. A computer-readable storage medium on which a computer program is stored, the computer program, when being executed by a processor, implementing the sentence acquisition method of any one of claims 1-7.

16. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the statement acquisition method of any of claims 1-7 via execution of the executable instructions.