CN110096572B - Sample generation method, device and computer readable medium - Google Patents


Info

Publication number
CN110096572B
CN110096572B
Authority
CN
China
Prior art keywords: answer, keyword, sample, scoring, word
Prior art date
Legal status
Active
Application number
CN201910297962.0A
Other languages
Chinese (zh)
Other versions
CN110096572A (en)
Inventor
宫雪 (Gong Xue)
Current Assignee
Meiman Technology (Chengdu) Group Co.,Ltd.
Original Assignee
Chengdu Meiman Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Chengdu Meiman Technology Co., Ltd.
Priority to CN201910297962.0A
Publication of CN110096572A
Application granted
Publication of CN110096572B


Classifications

    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G06F 16/3344 Query execution using natural language analysis
    • G06F 18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/30 Semantic analysis
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of this application disclose a sample generation method, a sample generation device and a computer-readable medium, relating to keyword extraction and text generation. The method comprises the following steps: extracting keyword phrases from answer sentences in historical answers, and determining a keyword scoring weight for each keyword phrase according to the scoring criteria of the historical answers; obtaining answer sample parameters, wherein the answer sample parameters comprise a scoring range for the answer samples; determining, according to the keyword scoring weights, keyword phrase combination schemes whose scores fall within that scoring range; and generating an answer template from each combination scheme, then generating answer samples semantically similar to the template. The embodiments thus extract keywords from existing historical answer sentences, build sample templates from the sample parameters input by the user and the keyword phrases of the historical answers, and finally generate a large number of answer samples semantically similar to each sample template automatically.

Description

Sample generation method, device and computer readable medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a sample generating method, apparatus, and computer readable medium.
Background
A large number of training samples and test samples are required when training an artificial intelligence model. At present, such samples are generally obtained by collecting sample data through the internet and then labeling that data. Taking an intelligent answer scoring model as an example: before training the model, training samples are usually obtained by collecting answers to a question from the network, or by answering the question manually and recording the answers; after the answers are collected, they are scored (either manually or by a mature scoring model, i.e. the sample data is labeled) to form the training samples.
However, obtaining training samples in this traditional way requires manual involvement, so the labor cost is high and the efficiency is low; in addition, it is difficult for this method of acquiring training samples to achieve full coverage.
Disclosure of Invention
The embodiment of the application provides a sample generation method which can automatically generate a large number of answer samples based on a small number of historical answers.
In a first aspect, an embodiment of the present application provides a sample generation method, including:
Extracting keyword phrases of answer sentences in the historical answers, and determining keyword scoring weights of the keyword phrases according to scoring standards of the historical answers;
obtaining answer sample parameters, wherein the answer sample parameters comprise a scoring range of an answer sample;
determining a keyword phrase combination scheme with the score of the keyword phrase combination scheme within the scoring range of the answer sample according to the keyword scoring weight;
and generating an answer template according to the combination scheme, and generating an answer sample semantically similar to the answer template.
As an optional implementation manner, the extracting the keyword phrase of the answer sentence in the historical answer includes:
performing word segmentation processing on answer sentences in the historical answers to obtain segmented words and segmented word vectors corresponding to the segmented words;
determining classification labels of the segmented words corresponding to the segmented word vectors according to a preset segmented word vector label classification method; the classification label is used to indicate whether a segmented word is a keyword or a preset part of a keyword;
and determining the word segmentation with the classification label of the word segmentation being a preset label as the keyword phrase.
As an optional implementation manner, the determining, according to a preset word segmentation vector label classification method, a classification label of a word segmentation corresponding to the word segmentation vector includes:
inputting the word segmentation vectors into a trained bidirectional long short-term memory (BLSTM) network model for classification, and outputting a tag probability vector for the segmented word corresponding to each word segmentation vector;
and decoding the segmented words with a conditional random field (CRF) according to the tag probability vectors to obtain the classification labels corresponding to the segmented words.
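As a rough illustration of the decoding step only, the following pure-Python sketch runs Viterbi decoding over per-token tag probability vectors (standing in for the BLSTM outputs) with hand-set CRF transition scores. The BIO-style tag set, transition penalties and all probabilities below are invented for the example; they are not taken from the patent.

```python
import math

def viterbi_decode(emissions, transitions, tags):
    """Most likely tag sequence given per-token tag probabilities
    (the BLSTM outputs) and pairwise CRF transition scores."""
    # log-probability scores for the first token
    scores = {t: math.log(emissions[0][t]) for t in tags}
    backpointers = []
    for emission in emissions[1:]:
        new_scores, pointers = {}, {}
        for t in tags:
            # best previous tag to transition into t from
            prev = max(tags, key=lambda p: scores[p] + transitions[(p, t)])
            pointers[t] = prev
            new_scores[t] = (scores[prev] + transitions[(prev, t)]
                             + math.log(emission[t]))
        scores = new_scores
        backpointers.append(pointers)
    # trace the best final tag back to the start
    best = max(tags, key=lambda t: scores[t])
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Toy BIO tagging: B = keyword start, I = keyword inside, O = other.
tags = ["B", "I", "O"]
transitions = {(p, t): 0.0 for p in tags for t in tags}
transitions[("O", "I")] = -10.0  # penalize I directly after O
emissions = [
    {"B": 0.1, "I": 0.1, "O": 0.8},
    {"B": 0.7, "I": 0.1, "O": 0.2},
    {"B": 0.2, "I": 0.6, "O": 0.2},
]
print(viterbi_decode(emissions, transitions, tags))  # ['O', 'B', 'I']
```

The transition penalty is what distinguishes CRF decoding from taking the per-token argmax: the middle token pulls the last token toward the I tag even though its own probabilities are close.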
As an optional implementation manner, the determining the keyword phrase scoring weight of the keyword phrase according to the scoring criteria of the historical answers includes:
obtaining scoring standards of the historical answers;
and determining statement scoring weights of answer statements of the historical answers according to the scoring criteria, and determining the statement scoring weights as keyword scoring weights of the keyword phrases.
As an optional implementation manner, the generating an answer template according to the combination scheme includes:
obtaining answer sentences corresponding to the keyword phrases in the combination scheme;
and generating the answer template based on the answer sentence.
As an optional implementation manner, the generating an answer sample semantically similar to the answer template includes:
performing word segmentation processing on the answer template to obtain word segmentation vectors;
generating a similar word set similar to each word segmentation semantic in the answer template according to the word segmentation vector;
and replacing the corresponding words in the answer template by using the words in the similar word set to obtain the answer sample.
As an alternative embodiment, the method further comprises:
determining sentence vectors of answer sentences corresponding to the keyword phrases;
based on the corresponding sentence vectors of the keyword phrases, calculating cosine distances among the keyword phrases in the combination scheme;
and executing the step of generating an answer template according to the combination scheme under the condition that the combination scheme is determined to be an effective combination scheme based on cosine distances among keyword phrases in the combination scheme.
In a second aspect, an embodiment of the present application provides a sample generating device, including:
the extraction unit is used for extracting keyword phrases of answer sentences;
the first determining unit is used for determining the keyword scoring weight of the keyword phrase according to the scoring standard of the historical answer;
the system comprises an acquisition unit, a judgment unit and a storage unit, wherein the acquisition unit is used for acquiring answer sample parameters, and the answer sample parameters comprise a scoring range of an answer sample;
The second determining unit is configured to determine, according to the keyword scoring weight, a keyword phrase combination scheme in which a score of the keyword phrase combination scheme is within a scoring range of the answer sample;
the template generating unit is used for generating an answer template according to the combination scheme;
and the sample generation unit is used for generating an answer sample semantically similar to the answer template.
As an alternative embodiment, the extraction unit comprises:
the word segmentation unit is used for carrying out word segmentation processing on answer sentences in the historical answers to obtain segmented words and segmented word vectors corresponding to the segmented words;
the classifying unit is used for determining the classification labels of the segmented words corresponding to the segmented word vectors according to a preset segmented word vector label classification method; the classification label is used to indicate whether a segmented word is a keyword or a preset part of a keyword;
and the first determining subunit is used for determining the word segmentation with the classification label of the word segmentation being a preset label as the keyword phrase.
As an optional implementation manner, the classifying unit is configured to input the word segmentation vectors into a trained bidirectional long short-term memory (BLSTM) network model for classification and output a tag probability vector for the segmented word corresponding to each word segmentation vector, and to decode the segmented words with a conditional random field (CRF) according to the tag probability vectors to obtain the classification labels corresponding to the segmented words.
As an optional implementation manner, the first determining unit is configured to obtain a scoring criterion of the historical answer; and determining statement scoring weights of answer statements of the historical answers according to the scoring criteria, and determining the statement scoring weights as keyword scoring weights of the keyword phrases.
As an optional implementation manner, the template generating unit is configured to obtain an answer sentence corresponding to a keyword phrase in the combination scheme; and generating the answer template based on the answer sentence.
As an alternative embodiment, the sample generation unit includes:
the word segmentation subunit is used for carrying out word segmentation processing on the answer templates to obtain word segmentation vectors;
the generation subunit is used for generating a similar word set similar to each word segmentation semantic in the answer template according to the word segmentation vector;
and the replacing unit is used for replacing the corresponding word in the answer template by using the word in the similar word set to obtain the answer sample.
As an alternative embodiment, the apparatus further comprises:
a third determining unit, configured to determine a sentence vector of an answer sentence corresponding to the keyword phrase;
The calculating unit is used for calculating cosine distances among the keyword phrases in the combination scheme based on the corresponding sentence vectors of the keyword phrases;
the template generating unit is configured to execute the step of generating an answer template according to the combination scheme when the combination scheme is determined to be an effective combination scheme based on cosine distances between keyword phrases in the combination scheme.
In a third aspect, an embodiment of the present application provides another apparatus, including a processor, a memory, and a communication module, where the memory is configured to store program code, and the processor is configured to invoke the program code to perform the method of the first aspect and any of its alternatives.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of the first aspect described above.
The embodiment of the application extracts the keyword phrases of the answer sentences and determines the keyword scoring weight of each keyword phrase. Answer sample parameters are then obtained, which comprise a scoring range for the answer sample. Next, a combination scheme of keyword phrases whose score falls within the scoring range is determined according to the keyword scoring weights. Finally, an answer template is generated according to the combination scheme, and answer samples semantically similar to the answer template are generated. In this way, keywords are extracted from existing historical answer sentences, a sample template is obtained from the sample parameters input by the user and the keyword phrases of the historical answers, and a large number of answer samples semantically similar to the sample template are finally generated automatically from the template.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described.
FIG. 1 is a schematic flow chart of a sample generation method provided by an embodiment of the present application;
FIG. 2 is a schematic flow chart of another sample generation method provided by an embodiment of the present application;
FIG. 3 is a schematic flow chart of yet another sample generation method provided by an embodiment of the present application;
FIG. 4 is a schematic block diagram of an apparatus provided by an embodiment of the present application;
fig. 5 is a schematic structural diagram of a sample generating device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, rather than all, of the embodiments of the application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the application without creative effort shall fall within the protection scope of the application.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted as "when", "once", "in response to a determination" or "in response to detection", depending on the context. Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Referring to fig. 1, fig. 1 is a schematic flow chart of a sample generation method according to an embodiment of the present application, where the method may include:
101: and extracting keyword phrases of answer sentences in the historical answers, and determining the keyword scoring weights of the keyword phrases according to the scoring standards of the historical answers.
Wherein, the keyword phrase may include at least one word, i.e. one or more words capable of characterizing the semantics of the answer sentence.
The embodiment of the application is mainly applied to the automatic generation of answer samples for questions. The scoring of an answer to a question generally follows a scoring criterion or rule, which typically specifies whether the answer contains key information associated with the question; there may be several such criteria, and the more key information an answer contains, the higher its score. This key information is ultimately embodied in the keywords of the answer. Thus, the contribution of a historical answer sentence to the score can be characterized by extracting the keywords in that sentence.
In the embodiment of the application, after the historical answers to a question are obtained, they are preprocessed, and a historical answer corpus is formed from the preprocessed historical answers.
Wherein, preprocessing the historical answers comprises: and obtaining the scoring standard of the questions and answers, and calculating the scoring weight of each sentence in each historical answer according to the scoring standard.
After the historical answer corpus is obtained, extracting keywords from answer sentences in the historical answer corpus to obtain keyword phrases of the answer sentences; and determining the scoring weight of the answer sentence corresponding to the keyword phrase as the scoring weight of the keyword phrase.
As an optional implementation manner, the keyword extraction for the answer sentence in the historical answer corpus may specifically include: performing word segmentation processing on the answer sentences to obtain segmented words and segmented word vectors corresponding to the segmented words; determining classification labels of the segmented words according to the segmented word vectors; and determining the word segmentation corresponding to the preset label as the keyword phrase.
102: and obtaining answer sample parameters, wherein the answer sample parameters comprise a scoring range of the answer sample.
In the embodiment of the application, before generating the answer sample, the requirement information of the user needs to be acquired so as to generate the answer sample required by the user according to the requirement of the user.
Specifically, the requirement information input by the user is acquired, then answer sample parameters are generated according to the requirement information, and finally an answer sample meeting the requirement of the user is generated according to the answer sample parameters.
The answer sample parameters may include information such as score ranges of the answer samples, the number of answer samples, and the like.
103: and determining the keyword phrase combination scheme with the score of the keyword phrase combination scheme within the scoring range of the answer sample according to the keyword scoring weight.
In the embodiment of the application, after the scoring range included in the sample parameters is obtained (that is, the score of the finally generated answer sample must fall within this range), keyword phrases can be selected from the stored keyword phrases according to their scoring weights, so that the score corresponding to the selected combination of keyword phrases falls within the scoring range.
Specifically, after the scoring range is obtained, the scoring weight of each keyword phrase may be converted into a score according to the full score of the question, and the keyword phrases may then be combined according to their scores to obtain the keyword phrase combination schemes whose totals lie within the scoring range.
For example, assume that the full score of a question is 100 points, the scoring range is 60-70 points, the scoring weights take values in (0, 1), and there are five keyword phrases with scoring weights of 0.1, 0.15, 0.65, 0.5 and 0.75 respectively. First, the score of each keyword phrase is calculated from its scoring weight, giving scores of 10, 15, 65, 50 and 75. The combinations whose scores fall within the 60-70 range are then: 10+50, 15+50, and 65 on its own. Thus three keyword phrase combination schemes whose scores lie within 60-70 points can be obtained from the five keyword phrases.
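The enumeration in this worked example can be sketched as a brute-force subset search over the phrase scores, which is sufficient for small keyword sets (a real implementation might prune or deduplicate; this sketch does neither):

```python
from itertools import combinations

def score_combinations(scores, low, high):
    """Enumerate every keyword-phrase score combination whose
    total falls inside the target scoring range [low, high]."""
    schemes = []
    for r in range(1, len(scores) + 1):
        for combo in combinations(scores, r):
            if low <= sum(combo) <= high:
                schemes.append(combo)
    return schemes

# Phrase scores from the example above: full score 100, range 60-70.
schemes = score_combinations([10, 15, 65, 50, 75], 60, 70)
print(schemes)  # [(65,), (10, 50), (15, 50)]
```

The search confirms the three combination schemes named in the text: 65 alone, 10+50, and 15+50.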
As an alternative implementation manner, since the historical answer corpus includes at least one answer sentence, multiple keyword phrases are obtained after keyword extraction, and their scoring weights generally differ. The keyword phrases can therefore be partitioned according to scoring weight ranges, so that when determining the combination schemes within the scoring range, keyword phrases can be drawn directly from the partitions of the relevant weight ranges.
For example, if the scoring weights take values in (0, 1), they may be partitioned into the weight ranges (0.8, 1), (0.6, 0.8), (0.4, 0.6), (0.2, 0.4) and (0, 0.2), and labels added to the classified keywords, with the labels a, b, c, d and e corresponding to these ranges respectively.
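A minimal sketch of this weight-range bucketing, using the thresholds and labels from the example above (the phrase names are invented placeholders):

```python
def bucket_label(weight):
    """Map a scoring weight in (0, 1) to a partition label a-e,
    following the weight ranges in the example above."""
    for lower, label in [(0.8, "a"), (0.6, "b"), (0.4, "c"),
                         (0.2, "d"), (0.0, "e")]:
        if weight > lower:
            return label

# Hypothetical keyword phrases with their scoring weights.
weights = {"phrase1": 0.75, "phrase2": 0.15, "phrase3": 0.5}
print({p: bucket_label(w) for p, w in weights.items()})
# {'phrase1': 'b', 'phrase2': 'e', 'phrase3': 'c'}
```

With the phrases pre-labeled this way, a combination scheme targeting a given score range can draw candidates from the matching buckets instead of scanning every stored phrase.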
104: generating an answer template according to the combination scheme, and generating an answer sample with semantically similar answer to the answer template.
The answer template is an answer composed of answer sentences corresponding to the keyword phrases in the combination scheme.
For example, the above-mentioned combination scheme is composed of two keyword phrases, namely a keyword phrase 1 and a keyword phrase 2, wherein the keyword phrase 1 corresponds to the answer sentence 1 in the above-mentioned historical answer corpus, and the keyword phrase 2 corresponds to the answer sentence 2 in the above-mentioned historical answer corpus; the answer template is: answer sentence 1 plus answer sentence 2.
In the embodiment of the application, after the combination scheme of the keyword phrase within the scoring range is determined, an answer template is generated according to the combination scheme of the keyword phrase, and then an answer sample similar to the semantics of the answer template is generated according to the answer template.
Specifically, after the combination scheme of keyword phrases within the scoring range is determined, the answer sentences corresponding to the keyword phrases in the scheme are obtained from the historical answer corpus. The obtained answer sentences are then combined in the manner corresponding to the combination scheme to obtain the answer template. Finally, the answer template is rewritten using a synonym lexicon to generate answer samples semantically similar to the answer template.
As an alternative embodiment, before answer templates are produced according to the combination schemes, index information linking the keywords and the historical answer sentences can be established. The index information may comprise: the mapping relation between the keyword phrase and the historical answer, the mapping relation between the keyword and the sentence in the historical answer, and the scoring weight of the keyword.
In the embodiment of the application, because the keyword phrases correspond one-to-one with the sentences of the historical answers, the scoring weight of a sentence in a historical answer can also serve as the scoring weight of its keyword. The keyword index information can therefore be generated from the correspondence between the keywords and the sentences in the historical answers.
For example, i is a mapping relationship between the keyword a and the historical answer, j is a mapping relationship between the keyword a and the sentence in the historical answer, and k is a scoring weight of the keyword a, then the index of the keyword a may be (a, i, j, k). In the step of determining the combination scheme of the keyword phrases in the scoring range according to the keyword scoring weight, the combination scheme of the keyword phrases in the scoring range can be determined only according to k in the index information of the keyword phrases. Then in the step of generating the answer template according to the combination scheme, the answer sentence corresponding to the keyword phrase can be obtained according to i and j in the index information of the keyword phrase, so that the answer template is further generated.
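A minimal in-memory sketch of this (a, i, j, k)-style index and the two steps that consult it; the corpus layout, field names and weights below are illustrative, not from the patent:

```python
def build_index(entries):
    """Build a keyword index of (answer id i, sentence id j, weight k),
    mirroring the (a, i, j, k) tuples described above."""
    return {kw: {"answer": a, "sentence": s, "weight": w}
            for kw, a, s, w in entries}

# Hypothetical corpus keyed by (answer id, sentence id).
corpus = {
    ("A1", "S1"): "answer sentence 1",
    ("A1", "S2"): "answer sentence 2",
}
index = build_index([("kw1", "A1", "S1", 0.25), ("kw2", "A1", "S2", 0.5)])

# Combination step: only the weights (k) are consulted.
chosen = [kw for kw in index if index[kw]["weight"] <= 0.5]
# Template step: i and j locate the answer sentences to splice together.
template = " ".join(
    corpus[(index[kw]["answer"], index[kw]["sentence"])] for kw in chosen
)
print(template)  # answer sentence 1 answer sentence 2
```

The point of the split is that scheme selection never touches the sentence text; only once a scheme is fixed are the (i, j) entries dereferenced to assemble the template.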
As an optional implementation manner, rewriting the answer template using the synonym lexicon to generate answer samples semantically similar to it may specifically include: performing word segmentation processing on the answer template to obtain word segmentation vectors; generating, according to the word segmentation vectors, a set of similar words semantically close to each segmented word in the answer template; and replacing the corresponding words in the answer template with words from the similar word sets to obtain the answer samples.
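A toy sketch of the replacement step, assuming word vectors are already available. The 2-dimensional vectors and the similarity threshold below are invented for illustration; a real system would use trained word2vec vectors and a curated synonym lexicon:

```python
def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def similar_words(word, vectors, threshold=0.9):
    """Words whose vectors lie close to `word`'s vector, a stand-in
    for looking up neighbours in a trained word2vec space."""
    target = vectors[word]
    return [w for w, v in vectors.items()
            if w != word and cosine(target, v) >= threshold]

def expand_template(tokens, vectors):
    """Produce answer samples by replacing one word at a time
    with a semantically similar word."""
    samples = []
    for i, tok in enumerate(tokens):
        for syn in (similar_words(tok, vectors) if tok in vectors else []):
            samples.append(tokens[:i] + [syn] + tokens[i + 1:])
    return samples

# Toy 2-d vectors: "like" and "enjoy" point in almost the same direction.
vectors = {"like": [1.0, 0.0], "enjoy": [0.99, 0.05], "ball": [0.0, 1.0]}
print(expand_template(["i", "like", "ball"], vectors))
# [['i', 'enjoy', 'ball']]
```

Each single-word swap yields one new sample; applying swaps in combination would multiply the number of generated samples, which is how a small template set can fan out into a large sample set.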
It can be seen that the embodiment of the application extracts the keyword phrases of the answer sentences and determines their keyword scoring weights; then obtains answer sample parameters, which comprise a scoring range for the answer sample; then determines, according to the keyword scoring weights, combination schemes of keyword phrases whose scores fall within the scoring range; and finally generates an answer template according to each combination scheme and generates answer samples semantically similar to it. Keywords are thus extracted from existing historical answer sentences, a sample template is obtained from the sample parameters input by the user and the keyword phrases of the historical answers, and a large number of answer samples semantically similar to the sample template are generated automatically from the template.
Referring to fig. 2, fig. 2 is a schematic flowchart of another sample generation method according to an embodiment of the present application, where the method may include:
201: preprocessing the historical answers to obtain scoring weights of the answer sentences to form the historical answer corpus.
In the embodiment of the application, after the obtained historical answers of the questions are obtained, the obtained historical answers of the questions are preprocessed, and a historical answer corpus is formed according to the preprocessed historical answers.
Wherein, preprocessing the historical answers comprises: and obtaining the scoring standard of the questions and answers, and calculating the scoring weight of each sentence in each historical answer according to the scoring standard.
202: and performing word segmentation processing on answer sentences in the historical answer corpus to obtain segmented words and segmented word vectors corresponding to the segmented words.
The historical answer may be a sentence containing one or more words or a paragraph containing multiple sentences. Wherein each sentence of the historical answers can be considered as a word sequence consisting of consecutive words.
Word segmentation is the process of recombining a continuous word sequence into a word sequence according to a certain specification. The aim of word segmentation of historical answers is to: and combining the historical answers into word sequences according to a certain specification, and extracting keyword phrases of answer sentences from the word sequences. Wherein, the keyword phrase may include at least one word, i.e. one or more words capable of characterizing the semantics of the answer sentence.
In one implementation, a method based on string matching may be used to segment the historical answers, which is also referred to as a mechanical word segmentation method, where the word sequence of each sentence of the historical answers is matched with entries in a dictionary according to a certain policy, and if a string of a certain character or a certain number of characters of the historical answers is found in the dictionary, the matching is successful, that is, a word is identified.
For example, one sentence in the historical answers is "i like basketball," and after the sentence is segmented by a mechanical word segmentation method, the segmented words corresponding to the sentence can be obtained as follows: i, like, play, basketball. It may be understood that in the above implementation, the word segmentation is performed on the historical answer to obtain all the segmented words, which means that each word in the historical answer is included in a certain segmented word. Of course, the method of word segmentation of the history answer is not limited thereto.
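The string-matching (mechanical) segmentation described above can be sketched as forward maximum matching: at each position, greedily take the longest dictionary entry, falling back to a single character so every character lands in some segment. The toy dictionary, the unspaced example text, and the fallback rule are illustrative assumptions, not details from this application:

```python
# Forward maximum matching, a common form of mechanical word segmentation:
# scan left to right; at each position try the longest candidate first and
# fall back to a single character when nothing in the dictionary matches.
def forward_max_match(text, dictionary, max_len=4):
    words = []
    i = 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + j]
            if j == 1 or cand in dictionary:
                words.append(cand)
                i += j
                break
    return words

vocab = {"i", "like", "play", "basketball"}
print(forward_max_match("ilikeplaybasketball", vocab, max_len=10))
# → ['i', 'like', 'play', 'basketball']
```

With a real Chinese dictionary the same loop runs over characters rather than letters; the greedy longest-first policy is what makes this a "certain strategy" in the sense of the paragraph above.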
As an optional implementation, determining the word vector of each segmented word may specifically include: obtaining the word vector of each segmented word through word2vec training.
word2vec is an efficient tool for representing words as real-valued word vectors. Specifically, word2vec maps each word into a K-dimensional vector through a CBoW network or a Skip-gram network, where K is generally large; for example, K may take the value 400 or another large integer. Both the CBoW model and the Skip-gram model assume that a word is associated with several surrounding words without considering the order of those surrounding words, so the word vectors obtained through word2vec training capture the syntactic and semantic features of the words.
203: determining classification labels of the segmented words corresponding to the segmented word vectors according to a preset segmented word vector label classification method.
The classification tag is used for indicating whether the segmentation word is a keyword or a preset part in the keyword.
In the embodiment of the present application, the determining the classification label of the word according to the word segmentation vector of the answer sentence may be specifically divided into two small steps:
2031: inputting the word segmentation vector into a trained two-way long-short-term memory network model for classification, and outputting a label probability vector of the word segmentation corresponding to the word segmentation vector.
From the characteristics of Chinese it can be seen that a keyword may be a single segmented word or a word formed by combining adjacent segmented words; for example, the keyword "Chinese basketball" is formed by combining the two adjacent segments "Chinese" and "basketball".
In the embodiment of the application, the keyword extraction problem can be regarded as a sequence labeling problem. Specifically, after obtaining a plurality of word segments of the historical answer, a label may be labeled on each word segment, and a category of each word segment may be determined. "category" herein refers to whether a word is a keyword or is part of a keyword.
For the historical answers, a plurality of classification labels can be set, for example 5: the W label marks a whole keyword, the B label the head of a keyword, the I label the middle of a keyword, the E label the tail of a keyword, and the O label a non-keyword. In practical applications, the number and specific types of classification labels may be set as needed; the above is merely an example.
In the embodiment of the application, labeling each word segmentation can be performed in two steps: firstly, determining the probability that each word corresponds to all classification labels; then, a label corresponding to each word is determined.
The BLSTM network is obtained by training an initial BLSTM network on training text together with the labeled keywords of that training text. The tag probability vector of a segmented word is the vector of probabilities that the word corresponds to each of the plurality of classification labels. For example, the tag probability vector of the word "Chinese" over the five labels W, B, I, E, O described above might be [0.6, 0.7, 0.2, 0.1, 0.2]. After the BLSTM network is trained, the tag probability vector of each segmented word of a historical answer can be determined by inputting its word vector into the trained network.
2032: and decoding the segmented words by using a conditional random field (Conditional Random Field, CRF) according to the tag probability vector to obtain the classified tags corresponding to the segmented words.
A CRF is a typical discriminative prediction model: given the conditional random field P(Y|X) and an input sequence (i.e., observation sequence) X, it outputs the sequence Y with the highest conditional probability, thereby labeling the observation sequence X. The prediction algorithm of the CRF is the Viterbi algorithm, a dynamic programming algorithm: from the known observation sequence and the state transition probabilities, it computes the state transition path with the highest probability as the optimal path, and the states along that path for each element of the observation sequence X form the output sequence Y.
In the embodiment of the application, according to the tag probability vector of each word in each sentence of the historical answers, CRF decoding of the sentence determines the labels from its first word to its last. For example, after CRF decoding of the sentence "I like playing table tennis", the labels corresponding to the words "I", "like", "play", "table tennis" are determined to be O, O, B, E respectively.
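As an illustration of this decoding step, the following is a minimal Viterbi decoder over per-word tag probability vectors, using the five tags from the labelling scheme above. The emission probabilities, transition scores, and default transition penalty are invented for the example; a trained CRF learns its transition weights from data:

```python
import math

TAGS = ["W", "B", "I", "E", "O"]

def viterbi(emissions, transitions, default=-5.0):
    """emissions: per-word dicts {tag: prob}; transitions: {(prev, cur): score}."""
    best = {t: math.log(emissions[0][t]) for t in TAGS}
    back = []
    for emit_probs in emissions[1:]:
        ptr, nxt = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: best[p] + transitions.get((p, cur), default))
            ptr[cur] = prev
            nxt[cur] = (best[prev] + transitions.get((prev, cur), default)
                        + math.log(emit_probs[cur]))
        back.append(ptr)
        best = nxt
    tag = max(TAGS, key=best.get)        # best final tag
    path = [tag]
    for ptr in reversed(back):           # follow back-pointers
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]

def emit(**p):                           # dense tag-probability vector
    return {t: p.get(t, 0.05) for t in TAGS}

# Four words with emissions favouring O, O, B, E, as in the example above
emissions = [emit(O=0.8), emit(O=0.8), emit(B=0.7), emit(E=0.7)]
transitions = {("O", "O"): 0.0, ("O", "B"): 0.0, ("B", "I"): 0.0,
               ("I", "E"): 0.0, ("B", "E"): 0.0, ("E", "O"): 0.0}
print(viterbi(emissions, transitions))   # → ['O', 'O', 'B', 'E']
```

The disallowed-transition penalty (here a flat -5.0) is what prevents decodings such as an E tag with no preceding B, which per-word argmax over the probability vectors alone cannot rule out.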
204: and determining the word segmentation of which the classification label is a preset label as the keyword phrase, and determining the scoring weight of the keyword phrase according to the sentence scoring weight of the answer sentence corresponding to the keyword phrase.
Specifically, determining the word segmentation of each sentence, which is classified by the classification label as a keyword, as the keyword of the sentence; determining a word formed by combining two adjacent word segmentation groups with a keyword head and a keyword tail in sequence in each sentence as a keyword of the sentence; and determining the word formed by combining adjacent three word segments of which the classification labels are a keyword head part, a keyword middle part and a keyword tail part in sequence in each sentence as the keyword of the sentence.
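The merging rules above can be sketched as a single pass over (segment, label) pairs: W marks a whole keyword, a B(+I…)+E run combines into one keyword, and O breaks any pending combination. Segments are concatenated directly, as adjacent Chinese segments would be; the example inputs are illustrative:

```python
# Assemble keyword phrases from per-segment classification labels.
def extract_keywords(segments, labels):
    keywords, buf = [], []
    for seg, lab in zip(segments, labels):
        if lab == "W":
            keywords.append(seg)
            buf = []
        elif lab == "B":
            buf = [seg]                      # start a new combined keyword
        elif lab in ("I", "E") and buf:
            buf.append(seg)
            if lab == "E":                   # tail closes the combination
                keywords.append("".join(buf))
                buf = []
        else:                                # "O", or an I/E with no pending B
            buf = []
    return keywords

print(extract_keywords(["China", "basketball"], ["B", "E"]))
# → ['Chinabasketball']  (one combined keyword)
```

A sentence whose labels are all O yields an empty list, matching the remark below that such a sentence has no corresponding keywords.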
It will be appreciated that when no segmented word in a sentence carries a preset classification label, the sentence has no corresponding keywords.
After the keyword phrase is obtained, the sentence scoring weight of the answer sentence corresponding to the keyword phrase is obtained, and then the sentence scoring weight of the answer sentence corresponding to the keyword phrase is determined to be the scoring weight of the keyword phrase.
205: and obtaining answer sample parameters, wherein the answer sample parameters comprise a scoring range of the answer sample.
In the embodiment of the application, before generating the answer sample, the requirement information of the user needs to be acquired so as to generate the answer sample required by the user according to the requirement of the user.
Specifically, the requirement information input by the user is acquired, then answer sample parameters are generated according to the requirement information, and finally an answer sample meeting the requirement of the user is generated according to the answer sample parameters.
The answer sample parameters may include information such as score ranges of the answer samples, the number of answer samples, and the like.
206: and determining the keyword phrase combination scheme with the score of the keyword phrase combination scheme within the scoring range of the answer sample according to the keyword scoring weight.
In the embodiment of the application, after the scoring range included in the sample parameters is obtained (that is, the score of the finally generated answer sample must fall within this range), keyword phrases can be extracted from the stored keyword phrases according to their scoring weights, such that the score corresponding to the combination of extracted keyword phrases falls within the scoring range.
Specifically, after the scoring range is obtained, the scoring weight of the keyword phrase may be converted into a score according to the full score of the question, and then the score of the keyword phrase may be combined according to the score of the keyword phrase, so as to obtain a plurality of keyword phrase combination schemes in the scoring range.
For example, assume the full score of a question is 100 points, the scoring range is 60-70 points, scoring weights take values in (0, 1), and there are five keyword phrases with scoring weights 0.1, 0.15, 0.65, 0.5 and 0.75 respectively. First, the score of each keyword phrase is calculated from its scoring weight, giving scores of 10, 15, 65, 50 and 75. The combinations whose total score falls within the 60-70 range are: 10+50, 15+50, and 65 alone. Thus three combination schemes of keyword phrases whose scores fall within the 60-70 range can be obtained from the five keyword phrases.
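A brute-force sketch of this selection: enumerate all subsets of the phrase scores and keep those whose sum lies in the target range. The particular scores and the 60-70 range are illustrative; phrase sets per question are small enough that exhaustive enumeration is practical.

```python
from itertools import combinations

# Find all keyword-phrase score combinations whose total lies in [low, high].
def combos_in_range(scores, low, high):
    schemes = []
    for r in range(1, len(scores) + 1):
        for combo in combinations(scores, r):
            if low <= sum(combo) <= high:
                schemes.append(combo)
    return schemes

scores = [10, 15, 65, 50, 75]            # scoring weights x full score of 100
print(combos_in_range(scores, 60, 70))   # → [(65,), (10, 50), (15, 50)]
```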
As an optional implementation, since the historical answer corpus includes at least one answer sentence, a plurality of keyword phrases are obtained after keyword extraction from the answer sentences in the historical answer corpus, and the scoring weights of these keyword phrases generally differ. The keyword phrases can therefore be classified into a plurality of partitions according to scoring-weight range, so that when determining a combination scheme within the scoring range, keyword phrases can be drawn directly from the partitions of the relevant weight ranges.
For example, if the scoring weights take values in (0, 1), they may be classified according to the weight ranges (0.8, 1), (0.6, 0.8), (0.4, 0.6), (0.2, 0.4) and (0, 0.2), and a tag added to each classified keyword; the tags corresponding to these weight ranges, from highest to lowest, are a, b, c, d and e.
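A minimal sketch of assigning these partition tags by scoring weight. Which side of each boundary (0.2, 0.4, …) a weight falls on is an assumption here; the application does not fix the interval endpoints:

```python
import bisect

# Map a scoring weight in (0, 1) to a partition tag; tags a..e run from
# the highest weight range down to the lowest.
def weight_bucket(w):
    labels = ["e", "d", "c", "b", "a"]
    return labels[bisect.bisect_left([0.2, 0.4, 0.6, 0.8], w)]

print(weight_bucket(0.75))  # → b  (weight in the 0.6-0.8 range)
```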
207: and generating an answer template according to the combination scheme.
The answer template is an answer composed of answer sentences corresponding to the keyword phrases in the combination scheme.
For example, the above-mentioned combination scheme is composed of two keyword phrases, namely a keyword phrase 1 and a keyword phrase 2, wherein the keyword phrase 1 corresponds to the answer sentence 1 in the above-mentioned historical answer corpus, and the keyword phrase 2 corresponds to the answer sentence 2 in the above-mentioned historical answer corpus; the answer template is: answer sentence 1 plus answer sentence 2.
In the embodiment of the application, after the combination scheme of the keyword phrase within the scoring range is determined, an answer template is generated according to the combination scheme of the keyword phrase, and then an answer sample similar to the semantics of the answer template is generated according to the answer template.
Specifically, after determining the combination scheme of the keyword phrases within the scoring range, obtaining answer sentences corresponding to the keyword phrases from the historical answer corpus according to the keyword phrases in the combination scheme. And then combining the obtained answer sentences according to a combination mode corresponding to the combination scheme to obtain the answer template.
208: and performing word segmentation processing on the answer templates to obtain word segmentation vectors.
In the embodiment of the present application, unlike the word segmentation performed when extracting keywords, a word vector model for finding semantically similar words and a language model for judging the plausibility of the generated semantically similar sentence samples need to be trained in advance.
In this embodiment, the word vector model may be built with a tool that represents words as real-valued vectors, for example word2vec, which uses ideas from deep learning to reduce, through training, the processing of text content to vector operations in a K-dimensional vector space, where similarity in the vector space represents similarity in text semantics. That is, a neural network is used to model a language model and obtain a representation of each word in the vector space; processing words through their word vectors then yields, for any word, its similar words according to inter-word similarity.
Specifically, in this embodiment, the training samples for training to form the word vector model may be a large amount of text data, and these text data may be derived from text data on different crawled forums, and may need to be subjected to word segmentation before being input.
After passing through the word vector model, the output should be a low-dimensional real-valued vector representing each word, and each word in the training corpus should correspond to one such vector.
The real-valued vectors described above can generally be expressed in a form such as [0.792, -0.177, -0.107, 0.109, -0.542, ...]. The distance between word vectors can be measured by the conventional Euclidean distance or by the cosine distance.
Accordingly, in the present embodiment, the language model described above may be a model for calculating the probability that a sentence is well formed, for example expressed as P(w1, w2, ..., wk). With the language model it can be determined which word sequence is more likely to be a sentence, or, given several words, the most likely next word can be predicted. Briefly, a language model judges whether a word sequence composed of several words conforms to the way people actually speak, i.e., the likelihood that the word sequence is a sentence. In a preferred embodiment of the present invention, the language model may be implemented using an n-gram model.
Specifically, in the process of training the language model, each text sentence subjected to word segmentation is input into the model, and the output probability of word collocation combination in each text sentence can be output.
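The training and scoring just described can be sketched with a toy bigram (n = 2) model: the probability of a sentence is the product of conditional bigram probabilities estimated from a segmented corpus. The tiny corpus and the add-one smoothing are illustrative assumptions:

```python
from collections import Counter

# A toy bigram language model: P(sentence) is the product of
# add-one-smoothed conditional probabilities P(w_i | w_{i-1}).
class BigramLM:
    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for words in sentences:
            padded = ["<s>"] + words             # sentence-start marker
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab = len(self.unigrams)

    def prob(self, words):
        p, prev = 1.0, "<s>"
        for w in words:
            p *= (self.bigrams[(prev, w)] + 1) / (self.unigrams[prev] + self.vocab)
            prev = w
        return p

lm = BigramLM([["i", "like", "basketball"], ["i", "like", "tennis"]])
# A fluent word order scores higher than a shuffled one:
print(lm.prob(["i", "like", "basketball"]) > lm.prob(["basketball", "like", "i"]))  # → True
```

This is exactly the judgement used later to rank generated samples: word sequences whose collocations appeared in the training text receive higher sentence-forming probability.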
209: and generating a similar word set similar to each word segmentation semantic in the answer template according to the word segmentation vector.
In the embodiment of the application, the answer template is subjected to word segmentation to obtain word-segmentation vectors, and the word vector model is then applied to each word-segmentation vector to obtain the set of similar words for each segmented word.

Specifically, similar words are words whose semantics are close to those of the original word. After the answer template is segmented and the word-segmentation vectors are obtained, the set of similar words can be computed and output by the word vector model.
210: and replacing the corresponding words in the answer templates by using the words in the similar word set to obtain the answer sample.
In the embodiment of the application, after a set of similar words is generated for each segmented word in the answer template, the words in the similar word sets are used to replace the corresponding words in the answer template, so as to obtain the answer samples. For example, if one sentence sample consists of a words arranged in sequence, each word has a similar word set, and each set contains the b words closest in meaning to the original word, then one sentence sample may correspond to b^a semantically similar sentence samples, i.e. one set of similar-semantics samples per sentence sample. Multiple sentence samples yield multiple such sets, so that automatic generation of a large number of semantically similar sentence samples can be achieved.
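The substitution step amounts to a Cartesian product over per-word candidate sets, which directly exhibits the b^a count: with a segments and b candidates per segment, product() yields b^a variants. The synonym sets below are invented for illustration:

```python
from itertools import product

# Expand an answer template into semantically similar samples by
# substituting each word with the candidates from its similar-word set.
def expand(template_words, similar):
    choices = [similar.get(w, [w]) for w in template_words]  # keep word if no set
    return [" ".join(c) for c in product(*choices)]

similar = {"like": ["like", "enjoy"], "basketball": ["basketball", "hoops"]}
samples = expand(["i", "like", "basketball"], similar)
print(len(samples))  # → 4  (1 x 2 x 2 variants)
```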
It can be seen that this embodiment of the application extracts the keyword phrases of the answer sentences and determines the keyword scoring weights of those keyword phrases; then obtains answer sample parameters, where the answer sample parameters include the scoring range of the answer sample; then determines, according to the keyword scoring weights, a combination scheme of keyword phrases whose score falls within the scoring range; and finally generates an answer template according to the combination scheme and generates answer samples semantically similar to the answer template. That is, keywords are extracted from existing historical answer sentences, a sample template is obtained from the sample parameters input by the user together with the keyword phrases of the historical answers, and a large number of answer samples semantically similar to the sample template are then generated automatically from that template.
Referring to fig. 3, fig. 3 is a schematic flowchart of yet another sample generation method according to an embodiment of the present application, where the method may include:
301: preprocessing the historical answers to obtain scoring weights of the answer sentences to form the historical answer corpus.
302: and performing word segmentation processing on answer sentences in the historical answer corpus to obtain segmented words and segmented word vectors corresponding to the segmented words.
303: and obtaining sentence vectors of the answer sentences according to the word segmentation vectors of the answer sentences.
In the embodiment of the application, when the word segmentation vector of the answer sentence is obtained, the word segmentation vector of the answer sentence is input into a sentence vector generation model to obtain the sentence vector of the answer sentence.
The sentence vector generation model may be a trained recurrent neural network.
304: determining classification labels of the segmented words corresponding to the segmented word vectors according to a preset segmented word vector label classification method.
305: and determining the word segmentation of which the classification label is a preset label as the keyword phrase, and determining the sentence scoring weight of the answer sentence corresponding to the keyword phrase as the scoring weight of the keyword phrase.
306: and obtaining answer sample parameters, wherein the answer sample parameters comprise a scoring range of the answer sample.
307: and determining the keyword phrase combination scheme with the score of the keyword phrase combination scheme within the scoring range of the answer sample according to the keyword scoring weight.
308: and acquiring sentence vectors of answer sentences corresponding to the keywords in the combination scheme, and determining the combination scheme as an effective combination scheme based on cosine distances among the keywords in the combination scheme.
In the embodiment of the application, after the combination scheme is determined, the sentence vectors of the answer sentences corresponding to the keyword phrases in the combination scheme are obtained, and the cosine distances between those sentence vectors are calculated. When the cosine distance between the sentence vectors corresponding to any two keyword phrases in the combination scheme exceeds a threshold value, i.e., when the semantics of the answer sentences corresponding to the two keyword phrases are not similar, the combination scheme is determined to be an effective combination scheme.
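A sketch of this validity check: compute pairwise cosine similarity between the sentence vectors and keep the scheme only when no pair of sentences is a near-duplicate (i.e., all pairs are sufficiently dissimilar). The 0.9 threshold and the toy vectors are illustrative assumptions:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# A combination scheme is effective only if no two of its answer
# sentences are semantically near-identical.
def is_valid_scheme(sentence_vectors, max_similarity=0.9):
    n = len(sentence_vectors)
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(sentence_vectors[i], sentence_vectors[j]) > max_similarity:
                return False
    return True

print(is_valid_scheme([[1.0, 0.0], [0.0, 1.0]]))   # → True  (dissimilar sentences)
print(is_valid_scheme([[1.0, 0.1], [1.0, 0.05]]))  # → False (near-duplicates)
```

Filtering out near-duplicate sentences keeps the generated template from stating the same scoring point twice while counting its weight twice.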
309: and generating an answer template according to the effective combination scheme.
310: and performing word segmentation processing on the answer templates to obtain word segmentation vectors.
311: and generating a similar word set similar to each word segmentation semantic in the answer template according to the word segmentation vector.
312: and replacing the corresponding words in the answer templates by using the words in the similar word set to obtain the answer sample.
It can be seen that this embodiment of the application extracts the keyword phrases of the answer sentences and determines the keyword scoring weights of those keyword phrases; then obtains answer sample parameters, where the answer sample parameters include the scoring range of the answer sample; then determines, according to the keyword scoring weights, a combination scheme of keyword phrases whose score falls within the scoring range; and finally generates an answer template according to the combination scheme and generates answer samples semantically similar to the answer template. That is, keywords are extracted from existing historical answer sentences, a sample template is obtained from the sample parameters input by the user together with the keyword phrases of the historical answers, and a large number of answer samples semantically similar to the sample template are then generated automatically from that template.
The embodiment of the application also provides a device comprising units for executing the method of any of the foregoing embodiments. In particular, referring to fig. 4, fig. 4 is a schematic block diagram of an apparatus according to an embodiment of the present application. The device of this embodiment comprises: the extraction unit 410, the first determination unit 420, the acquisition unit 430, the second determination unit 440, the template generation unit 450, and the sample generation unit 460.
The extracting unit 410 is configured to extract a keyword phrase of an answer sentence;
the first determining unit 420 is configured to determine a keyword scoring weight of the keyword phrase according to a scoring criterion of the historical answer;
the obtaining unit 430 is configured to obtain answer sample parameters, where the answer sample parameters include a scoring range of an answer sample;
the second determining unit 440 is configured to determine a keyword phrase combination scheme within a scoring range of the answer sample according to the keyword scoring weight;
a template generating unit 450 for generating an answer template according to the combination scheme;
the sample generating unit 460 is configured to generate an answer sample that is semantically similar to the answer template.
As an alternative embodiment, the extracting unit 410 includes:
The word segmentation unit is used for carrying out word segmentation processing on answer sentences in the historical answers to obtain segmented words and segmented word vectors corresponding to the segmented words;
the classifying unit is used for determining classifying labels of the segmented words corresponding to the segmented word vectors according to a preset classifying method of the segmented word vector labels; the classification tag is used for indicating whether the segmentation word is a keyword or a preset part in the keyword;
and the first determining subunit is used for determining the word segmentation with the classification label of the word segmentation being a preset label as the keyword phrase.
As an optional implementation, the classifying unit is configured to input the word-segmentation vector into a trained bidirectional Long Short-Term Memory (BLSTM) network model for classification and output the tag probability vector of the word corresponding to the word-segmentation vector; and to decode the segmented words with a conditional random field (CRF) according to the tag probability vector, obtaining the classification label corresponding to each segmented word.
As an alternative embodiment, the first determining unit 420 is configured to obtain a scoring criterion of the historical answer; and determining the sentence scoring weight of the answer sentences of the historical answers according to the scoring standard, and determining the keyword scoring weight of the keyword phrase according to the sentence scoring weight.
As an optional implementation manner, the template generating unit 450 is configured to obtain an answer sentence corresponding to a keyword phrase in the combination scheme; and generating the answer template based on the answer sentence.
As an alternative embodiment, the sample generation unit 460 includes:
the word segmentation subunit is used for carrying out word segmentation processing on the answer templates to obtain word segmentation vectors;
the generation subunit is used for generating a similar word set similar to each word segmentation semantic in the answer template according to the word segmentation vector;
and the replacing unit is used for replacing the corresponding word in the answer template by using the word in the similar word set to obtain the answer sample.
As an alternative embodiment, the apparatus further comprises:
a third determining unit, configured to determine a sentence vector of an answer sentence corresponding to the keyword phrase;
the calculating unit is used for calculating cosine distances among the keyword phrases in the combination scheme based on the corresponding sentence vectors of the keyword phrases;
the template generating unit 450 is configured to execute the step of generating an answer template according to the combination scheme when the combination scheme is determined to be an effective combination scheme based on cosine distances between keyword phrases in the combination scheme.
It can be seen that this embodiment of the application extracts the keyword phrases of the answer sentences and determines the keyword scoring weights of those keyword phrases; then obtains answer sample parameters, where the answer sample parameters include the scoring range of the answer sample; then determines, according to the keyword scoring weights, a combination scheme of keyword phrases whose score falls within the scoring range; and finally generates an answer template according to the combination scheme and generates answer samples semantically similar to the answer template. That is, keywords are extracted from existing historical answer sentences, a sample template is obtained from the sample parameters input by the user together with the keyword phrases of the historical answers, and a large number of answer samples semantically similar to the sample template are then generated automatically from that template.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a sample generating device 500 according to an embodiment of the present application, where, as shown in fig. 5, the sample generating device 500 includes a processor, a memory, a communication interface, and one or more programs, where the one or more programs are different from the one or more application programs, and the one or more programs are stored in the memory and configured to be executed by the processor. The program includes instructions for performing the steps of: extracting keyword phrases of answer sentences in the historical answers, and determining keyword scoring weights of the keyword phrases according to scoring standards of the historical answers; obtaining answer sample parameters, wherein the answer sample parameters comprise a scoring range of an answer sample; determining a keyword phrase combination scheme with the score of the keyword phrase combination scheme within the scoring range of the answer sample according to the keyword scoring weight; and generating an answer template according to the combination scheme, and generating an answer sample semantically similar to the answer template.
It should be appreciated that in embodiments of the present application, the processor may be a central processing unit (Central Processing Unit, CPU), which may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSPs), application specific integrated circuits (Application Specific Integrated Circuit, ASICs), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In another embodiment of the present application, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements: extracting keyword phrases of answer sentences in the historical answers, and determining keyword scoring weights of the keyword phrases according to scoring standards of the historical answers; obtaining answer sample parameters, wherein the answer sample parameters comprise a scoring range of an answer sample; determining a keyword phrase combination scheme with the score of the keyword phrase combination scheme within the scoring range of the answer sample according to the keyword scoring weight; and generating an answer template according to the combination scheme, and generating an answer sample semantically similar to the answer template.
The computer readable storage medium may be an internal storage unit of the terminal according to any of the foregoing embodiments, for example, a hard disk or a memory of the terminal. The computer readable storage medium may be an external storage device of the terminal, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like. Further, the computer-readable storage medium may further include both an internal storage unit and an external storage device of the terminal. The computer-readable storage medium is used for storing the computer program and other programs and data required by the terminal. The above-described computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
In the several embodiments provided in the present application, it should be understood that the disclosed system, server, and method may be implemented in other manners. For example, the above-described sample generating device embodiments are merely illustrative, e.g., the division of the above-described units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, sample generating devices or units, or may be an electrical, mechanical, or other form of connection.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present application.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units described above are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, or any other medium capable of storing program code.
While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (8)

1. A method of generating a sample, comprising:
extracting keyword phrases of answer sentences in the historical answers, and determining keyword scoring weights of the keyword phrases according to scoring standards of the historical answers;
obtaining answer sample parameters, wherein the answer sample parameters comprise a scoring range of an answer sample;
determining, according to the keyword scoring weights, a keyword phrase combination scheme whose score falls within the scoring range of the answer sample;
generating an answer template according to the combination scheme and generating an answer sample semantically similar to the answer template, which comprises: obtaining the answer sentences corresponding to the keyword phrases in the combination scheme, generating the answer template based on those answer sentences, performing word segmentation on the answer template to obtain word segmentation vectors, generating, according to the word segmentation vectors, a similar-word set semantically similar to each segmented word in the answer template, and replacing the corresponding words in the answer template with words from the similar-word set to obtain the answer sample.
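The pipeline of claim 1 can be illustrated with a short sketch: enumerate keyword combinations whose summed score falls in the target range, join the corresponding answer sentences into a template, and swap in similar words. All data here is hypothetical stand-in material; in particular, the patent builds similar-word sets from word vectors, whereas a simple lookup table stands in below.

```python
from itertools import combinations

# Hypothetical keyword scoring weights derived from historical answers,
# the answer sentence behind each keyword phrase, and a toy similar-word
# table (a stand-in for the word-vector-based similar-word sets).
keyword_weights = {"photosynthesis": 3.0, "chlorophyll": 2.0, "sunlight": 1.0}
answer_sentences = {
    "photosynthesis": "photosynthesis converts light energy",
    "chlorophyll": "chlorophyll absorbs light",
    "sunlight": "sunlight drives the reaction",
}
similar_words = {"light": "solar", "absorbs": "captures"}

def combination_schemes(weights, low, high):
    """Keyword-phrase combinations whose summed score lies in [low, high]."""
    keys = list(weights)
    return [
        (combo, sum(weights[k] for k in combo))
        for r in range(1, len(keys) + 1)
        for combo in combinations(keys, r)
        if low <= sum(weights[k] for k in combo) <= high
    ]

def answer_sample(scheme):
    """Join the scheme's answer sentences into a template, then replace
    words with semantically similar ones to produce a new answer sample."""
    template = ". ".join(answer_sentences[k] for k in scheme)
    words = template.split()                     # crude word segmentation
    return " ".join(similar_words.get(w, w) for w in words)

schemes = combination_schemes(keyword_weights, 3.0, 4.0)
sample = answer_sample(schemes[0][0])   # "photosynthesis converts solar energy"
```

With a scoring range of [3.0, 4.0], the single keyword "photosynthesis" qualifies on its own, as do the pairs that sum into the range; each qualifying scheme then yields its own answer sample.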
2. The method of claim 1, wherein extracting the keyword phrases of the answer sentences in the historical answers comprises:
performing word segmentation processing on answer sentences in the historical answers to obtain segmented words and segmented word vectors corresponding to the segmented words;
determining classification labels of the segmented words corresponding to the segmented word vectors according to a preset word segmentation vector label classification method, wherein the classification label indicates whether a segmented word is a keyword or a preset part of a keyword;
and determining the segmented words whose classification label is the preset label as the keyword phrases.
3. The method according to claim 2, wherein determining the classification labels of the segmented words corresponding to the word segmentation vectors according to the preset word segmentation vector label classification method comprises:
inputting the word segmentation vectors into a trained bidirectional long short-term memory network model for classification, and outputting label probability vectors of the segmented words corresponding to the word segmentation vectors;
and performing conditional random field decoding on the segmented words according to the label probability vectors to obtain the classification labels of the segmented words.
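The decoding step in claim 3 (emission probabilities from a BiLSTM followed by conditional random field decoding) amounts to a Viterbi search over tag sequences. The sketch below omits the BiLSTM itself; the toy emission scores stand in for its output, and the BIO tag set and transition matrix are invented for illustration.

```python
import numpy as np

# Toy emission scores (label probability vectors) that a trained BiLSTM
# would output for each segmented word; tags follow a BIO scheme where
# B/I mark the beginning/inside of a keyword and O marks other words.
TAGS = ["B", "I", "O"]
emissions = np.array([
    [0.7, 0.1, 0.2],   # word 1: likely begins a keyword
    [0.2, 0.6, 0.2],   # word 2: likely continues it
    [0.1, 0.1, 0.8],   # word 3: likely not a keyword
])
# Hypothetical CRF transition scores; e.g. "I" scores highest after "B".
transitions = np.array([
    [0.1, 0.8, 0.1],   # from B
    [0.1, 0.6, 0.3],   # from I
    [0.5, 0.0, 0.5],   # from O
])

def viterbi(emissions, transitions):
    """Decode the most likely tag sequence (CRF/Viterbi decoding)."""
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        # cand[i, j]: score of reaching tag j at step t from tag i at t-1
        cand = score[:, None] + transitions + emissions[t]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        best.append(int(back[t][best[-1]]))
    return [TAGS[i] for i in reversed(best)]

path = viterbi(emissions, transitions)   # ['B', 'I', 'O']
```

The words whose decoded tags are B/I would then be collected as the keyword phrase, per claim 2.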
4. The method of claim 1, wherein determining the keyword scoring weights of the keyword phrases according to the scoring standards of the historical answers comprises:
obtaining the scoring standards of the historical answers;
and determining sentence scoring weights of the answer sentences of the historical answers according to the scoring standards, and determining the sentence scoring weights as the keyword scoring weights of the keyword phrases.
5. The method according to any one of claims 1-4, further comprising:
determining sentence vectors of answer sentences corresponding to the keyword phrases;
calculating cosine distances between the keyword phrases in the combination scheme based on the sentence vectors corresponding to the keyword phrases;
and triggering the step of generating an answer template according to the combination scheme when the combination scheme is determined to be an effective combination scheme based on the cosine distances between the keyword phrases in the combination scheme.
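The validity check in claim 5 rests on pairwise cosine distances between sentence vectors. The claim does not fix the decision rule, so the sketch below adopts one plausible reading: reject schemes whose phrases are near-duplicates (distance below a threshold). The vectors and threshold are invented for illustration.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity between two sentence vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical sentence vectors for the answer sentences behind two
# keyword phrases in a candidate combination scheme.
v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([0.0, 1.0, 1.0])

# Assumed rule: the scheme is effective only if its phrases are
# sufficiently distinct, i.e. their distance exceeds a small threshold.
DIST_THRESHOLD = 0.05
is_effective = cosine_distance(v1, v2) > DIST_THRESHOLD
```

Here the two vectors share one component, giving a cosine similarity of 0.5 and thus a distance of 0.5, so the scheme passes the check and template generation would be triggered.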
6. A sample generating device comprising means for performing the method of any of claims 1-5.
7. A sample generating apparatus comprising a processor, an input device, an output device and a memory, the processor, the input device, the output device and the memory being interconnected, wherein the memory is adapted to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1-5.
8. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-5.
CN201910297962.0A 2019-04-12 2019-04-12 Sample generation method, device and computer readable medium Active CN110096572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910297962.0A CN110096572B (en) 2019-04-12 2019-04-12 Sample generation method, device and computer readable medium


Publications (2)

Publication Number Publication Date
CN110096572A CN110096572A (en) 2019-08-06
CN110096572B true CN110096572B (en) 2023-09-15

Family

ID=67444964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910297962.0A Active CN110096572B (en) 2019-04-12 2019-04-12 Sample generation method, device and computer readable medium

Country Status (1)

Country Link
CN (1) CN110096572B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738048B (en) * 2019-09-30 2023-08-04 平安直通咨询有限公司上海分公司 Keyword extraction method and device and terminal equipment
CN110866100B (en) * 2019-11-07 2022-08-23 北京声智科技有限公司 Phonetics generalization method and device and electronic equipment
CN112818652A (en) * 2021-01-26 2021-05-18 深圳市房多多网络科技有限公司 Method, device and equipment for generating house source description text and computer storage medium
CN112860889A (en) * 2021-01-29 2021-05-28 太原理工大学 BERT-based multi-label classification method
CN113486169B (en) * 2021-07-27 2024-04-16 平安国际智慧城市科技股份有限公司 Synonymous statement generation method, device, equipment and storage medium based on BERT model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108121700A (en) * 2017-12-21 2018-06-05 北京奇艺世纪科技有限公司 A kind of keyword extracting method, device and electronic equipment
CN108304437A (en) * 2017-09-25 2018-07-20 腾讯科技(深圳)有限公司 A kind of automatic question-answering method, device and storage medium
CN108363687A (en) * 2018-01-16 2018-08-03 深圳市脑洞科技有限公司 Subjective item scores and its construction method, electronic equipment and the storage medium of model
WO2018153215A1 (en) * 2017-02-27 2018-08-30 芋头科技(杭州)有限公司 Method for automatically generating sentence sample with similar semantics
CN109213999A (en) * 2018-08-20 2019-01-15 成都佳发安泰教育科技股份有限公司 A kind of subjective item methods of marking
CN109377278A (en) * 2018-10-24 2019-02-22 深圳市万屏时代科技有限公司 A kind of advertisement placement method, system and the computer storage medium of phrase-based scoring

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170364519A1 (en) * 2016-06-15 2017-12-21 International Business Machines Corporation Automated Answer Scoring Based on Combination of Informativity and Specificity Metrics


Also Published As

Publication number Publication date
CN110096572A (en) 2019-08-06

Similar Documents

Publication Publication Date Title
CN110096572B (en) Sample generation method, device and computer readable medium
CN110851596B (en) Text classification method, apparatus and computer readable storage medium
CN108304372B (en) Entity extraction method and device, computer equipment and storage medium
KR102256240B1 (en) Non-factoid question-and-answer system and method
CN109871538A (en) A kind of Chinese electronic health record name entity recognition method
CN113011533A (en) Text classification method and device, computer equipment and storage medium
WO2020133960A1 (en) Text quality inspection method, electronic apparatus, computer device and storage medium
CN112084381A (en) Event extraction method, system, storage medium and equipment
CN109271524B (en) Entity linking method in knowledge base question-answering system
CN111159363A (en) Knowledge base-based question answer determination method and device
CN108038099B (en) Low-frequency keyword identification method based on word clustering
CN107577663B (en) Key phrase extraction method and device
CN112711948A (en) Named entity recognition method and device for Chinese sentences
CN113128203A (en) Attention mechanism-based relationship extraction method, system, equipment and storage medium
CN113590810B (en) Abstract generation model training method, abstract generation device and electronic equipment
CN112256845A (en) Intention recognition method, device, electronic equipment and computer readable storage medium
CN109299233A (en) Text data processing method, device, computer equipment and storage medium
CN111444704B (en) Network safety keyword extraction method based on deep neural network
CN111159377B (en) Attribute recall model training method, attribute recall model training device, electronic equipment and storage medium
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN111708870A (en) Deep neural network-based question answering method and device and storage medium
CN109325237B (en) Complete sentence recognition method and system for machine translation
CN110941713B (en) Self-optimizing financial information block classification method based on topic model
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
CN113468311B (en) Knowledge graph-based complex question and answer method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230818

Address after: 3/F, Building 2, No. 20 Jitai Road, Chengdu High tech Zone, China (Sichuan) Pilot Free Trade Zone, Chengdu City, Sichuan Province, 610000

Applicant after: Chengdu Meiman Technology Co.,Ltd.

Address before: 518000 Room 202, block B, aerospace micromotor building, No.7, Langshan No.2 Road, Xili street, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: Shenzhen LIAN intellectual property service center

Effective date of registration: 20230818

Address after: 518000 Room 202, block B, aerospace micromotor building, No.7, Langshan No.2 Road, Xili street, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: Shenzhen LIAN intellectual property service center

Address before: 518000 Room 201, A building, 1 front Bay Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretarial Co., Ltd.)

Applicant before: PING AN PUHUI ENTERPRISE MANAGEMENT Co.,Ltd.

GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 3/F, Building 2, No. 20 Jitai Road, Chengdu High tech Zone, China (Sichuan) Pilot Free Trade Zone, Chengdu City, Sichuan Province, 610000

Patentee after: Meiman Technology (Chengdu) Group Co.,Ltd.

Address before: 3/F, Building 2, No. 20 Jitai Road, Chengdu High tech Zone, China (Sichuan) Pilot Free Trade Zone, Chengdu City, Sichuan Province, 610000

Patentee before: Chengdu Meiman Technology Co.,Ltd.
