CN110969005B - Method and device for determining similarity between entity corpora - Google Patents


Info

Publication number
CN110969005B
Authority
CN
China
Prior art keywords
training
entity
test
corpus
entity corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811151935.4A
Other languages
Chinese (zh)
Other versions
CN110969005A
Inventor
王芳
林文辉
王志刚
孙科武
杨硕
赖新明
王亚平
Current Assignee
Aisino Corp
Original Assignee
Aisino Corp
Priority date
Filing date
Publication date
Application filed by Aisino Corp filed Critical Aisino Corp
Priority to CN201811151935.4A
Publication of CN110969005A
Application granted
Publication of CN110969005B
Legal status: Active

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for determining similarity between entity corpora. A training device randomly extracts a training set from a preset entity corpus, pairs the entity corpora in the training set to obtain training entity corpus relation pairs, obtains the matrix vector corresponding to each relation pair, and processes the matrix vectors with a convolutional neural network to obtain the training classification probability of each relation pair, thereby completing the training of the convolutional neural network. An intelligent customer service system that uses the trained convolutional neural network together with the preset entity corpus can then provide users with an accurate answer-search function. This solves the prior-art technical problem that, when the information input by a user is inaccurate, an intelligent customer service system cannot find the correct answer in its own knowledge base, which degrades the user experience.

Description

Method and device for determining similarity between entity corpora
Technical Field
The invention relates to the technical field of deep learning, in particular to a method and a device for determining similarity between entity corpora.
Background
With the rapid development of artificial intelligence technology, applying the relations extracted between entity corpora to text search has become common; in the tax domain, for example, the relation between tax entity corpora is the similarity between them. Methods for extracting relations between entity corpora fall into three categories. The first is supervised learning, which treats relation extraction as a classification problem: effective features are designed from the training data to learn classification models, and a trained classifier then predicts the relations. Its disadvantage is that a large amount of training entity corpus must be labeled manually, and corpus labeling is usually very time-consuming and labor-intensive. The second is semi-supervised learning, which mainly performs relation extraction by bootstrapping: for each relation to be extracted, a few seed examples are set manually, and relation templates and further examples of the relation are then extracted from the data iteratively. The third is unsupervised learning, which assumes that entity pairs with the same semantic relationship share similar context information; the semantic relationship of each entity corpus relation pair can therefore be represented by the context information of the entity pair, and the semantic relationships of all entity pairs are clustered.
Existing supervised relation extraction methods perform well, but they depend heavily on natural language processing annotations such as part-of-speech tags and syntactic parses to provide classification features. These annotations usually contain many errors, which propagate and amplify through the relation extraction system and ultimately degrade the extraction results.
Consider existing intelligent customer service systems: tax services have entered the intelligent era of "Internet + tax". Intelligent customer service offers taxpayers convenient, intelligent and ubiquitous service; for example, with intelligent customer service systems such as WeChat public accounts in some cities, a taxpayer can enter a question by voice or text at the consultation entrance, and the system uses artificial intelligence technologies such as speech recognition and natural language understanding to find a matching answer in a tax knowledge base and returns it in the form of text, text with images, or web links. However, because taxpayers are distributed across the country, dialects are mixed with Mandarin during tax consultation, and spoken expressions for tax entities differ from place to place or are not strictly standard. The intelligent customer service system cannot accurately match such nonstandard spoken content against standard answers, so answers cannot be found quickly and satisfaction with the intelligent question-answering system is low. For example, the "tax disc" in the spoken language of taxpayers in some regions refers to the same thing as "Jin Shuipan" in the standard knowledge base; the two are synonyms with different wording, but the intelligent customer service system cannot treat the spoken expression and the standard knowledge base answer as an exact match, so the accurate answer search fails and satisfaction with the system suffers.
Therefore, the prior art has at least the following technical problem:
because the information input by the user is inaccurate, an intelligent customer service system cannot find the correct answer in its own knowledge base, which degrades the user experience.
Disclosure of Invention
The embodiments of the invention provide a method and a device for determining the similarity between entity corpora, which solve the prior-art technical problem that an intelligent customer service system cannot find the correct answer in its own knowledge base when the information input by the user is inaccurate, which degrades the user experience.
In a first aspect, an embodiment of the present invention provides a method for determining similarity between entity corpora, including:
randomly extracting a training set from a preset entity corpus, wherein the training set is composed of a plurality of entity corpora;
pairing any entity corpus in the training set with each entity corpus except the entity corpus until all entity corpora in the training set are paired, thereby obtaining a plurality of training entity corpus relation pairs;
acquiring each training sentence matrix vector corresponding to each training entity corpus relation pair;
Processing the matrix vectors of each training sentence by using a convolutional neural network to obtain training classification probabilities of the corpus relation pairs of each training entity;
and determining the similarity between the training entity corpora in the training entity corpus relation pair based on the training classification probability.
Optionally, the obtaining each training sentence matrix vector corresponding to each training entity corpus relation pair specifically includes:
acquiring a first set of word vectors corresponding to all words constituting the training set, wherein each entity corpus in the training set is respectively composed of a plurality of words;
based on the first set, training sentence matrix vectors of the training entity corpus relation pairs are obtained, wherein the training sentence matrix vectors are composed of a plurality of word vectors.
Optionally, the processing the matrix vectors of each training sentence by using a convolutional neural network to obtain training classification probabilities of the corpus relation pairs of each training entity specifically includes:
performing convolution operation on the training sentence matrix vectors to obtain training feature information corresponding to each training entity corpus relation pair;
Sampling the training feature information to obtain a plurality of training optimal features of the training entity corpus pairs;
combining the training optimal features to obtain training local optimal features of the training entity corpus pairs;
and processing each training local optimal feature by using a Softmax model to obtain training classification probability of each training entity corpus pair.
Optionally, the randomly extracting the training set from the preset entity corpus specifically includes:
extracting a training set and a testing set from a preset entity corpus by using a random extraction algorithm; the union set of the training set and the testing set is the preset entity corpus, and the training set and the testing set have no intersection set.
After the training sentence matrix vectors are processed by using the convolutional neural network to obtain the training classification probabilities of the training entity corpus relation pairs, the method further comprises:
pairing any entity corpus in the test set with each entity corpus except the entity corpus until all entity corpora in the test set are paired, thereby obtaining a plurality of test entity corpus relation pairs, wherein the test set is composed of a plurality of entity corpora;
Acquiring each test sentence matrix vector corresponding to each test entity corpus relation pair;
processing the matrix vectors of each test statement by using the convolutional neural network to obtain the classification probability of each test entity corpus relation pair;
and outputting the classification probability of each test entity corpus relation pair, so that a user can judge whether the convolutional neural network needs to be trained again or not based on the classification probability.
Optionally, the obtaining each test sentence matrix vector corresponding to each test entity corpus relation pair specifically includes:
acquiring a second set of word vectors corresponding to all words constituting the test set, wherein each entity corpus in the test set is composed of a plurality of words;
and based on the second set, obtaining test sentence matrix vectors of the test entity corpus relation pairs, wherein the test sentence matrix vectors are composed of a plurality of word vectors.
Optionally, the processing the matrix vectors of each test sentence by using a convolutional neural network to obtain test classification probabilities of corpus relation pairs of each test entity specifically includes:
Performing convolution operation on the test statement matrix vectors to obtain test feature information corresponding to the test entity corpus relation pairs;
sampling the test feature information to obtain a plurality of test optimal features of the test entity corpus pairs;
combining the plurality of test optimal features to obtain test local optimal features of each test entity corpus pair;
and processing each test local optimal feature by using a Softmax model to obtain the test classification probability of each test entity corpus pair.
Optionally, the preset entity corpus is a preset tax entity corpus.
In a second aspect, an embodiment of the present invention provides an apparatus for determining similarity between entity corpora, including:
the extraction unit is used for randomly extracting a training set from a preset entity corpus, wherein the training set is composed of a plurality of entity corpora;
the first pairing unit is used for pairing any entity corpus in the training set with each entity corpus except the entity corpus until all entity corpora in the training set are paired, so that a plurality of training entity corpus relation pairs are obtained;
The first acquisition unit is used for acquiring each training sentence matrix vector corresponding to each training entity corpus relation pair;
the second acquisition unit is used for processing the matrix vectors of the training sentences by using a convolutional neural network to acquire training classification probabilities of the corpus relation pairs of the training entities;
and the determining unit is used for determining the similarity between the training entity corpora in the training entity corpus relation pair based on the training classification probability.
Optionally, the first obtaining unit specifically includes:
a first obtaining subunit, configured to obtain a first set of word vectors corresponding to all words that form the training set, where each entity corpus in the training set is formed by a plurality of words;
and the second acquisition subunit is used for acquiring training sentence matrix vectors of the corpus relation pairs of each training entity based on the first set, wherein the training sentence matrix vectors are composed of a plurality of word vectors.
Optionally, the second obtaining unit specifically includes:
the first operation subunit is used for carrying out convolution operation on the matrix vectors of the training sentences to obtain training characteristic information corresponding to the training entity corpus relation pairs;
The first sampling subunit is used for sampling and processing the training feature information to obtain a plurality of training optimal features of the training entity corpus pairs;
the first merging subunit is used for merging the plurality of training optimal features to obtain training local optimal features of the training entity corpus pairs;
and the first classification subunit is used for processing each training local optimal feature by using the Softmax model to obtain training classification probability of each training entity corpus pair.
Optionally, the apparatus further includes:
the second pairing unit is used for pairing, after the training sentence matrix vectors are processed by the convolutional neural network to obtain the training classification probability of each training entity corpus relation pair, any entity corpus in the test set with each entity corpus except that entity corpus until all entity corpora in the test set are paired, thereby obtaining a plurality of test entity corpus relation pairs, wherein the test set is composed of a plurality of entity corpora;
the third acquisition unit is used for acquiring each test statement matrix vector corresponding to each test entity corpus relation pair;
The fourth obtaining unit is used for processing the matrix vectors of each test sentence by using the convolutional neural network to obtain the classification probability of each test entity corpus relation pair;
and the output unit is used for outputting the classification probability of each test entity corpus relation pair, so that a user can judge whether the convolutional neural network needs to be trained again or not based on the classification probability.
Optionally, the third obtaining unit specifically includes:
a third obtaining subunit, configured to obtain a second set of word vectors corresponding to all words that form the test set, where each entity corpus in the test set is formed by a plurality of words;
and the fourth acquisition subunit is used for acquiring the test statement matrix vectors of the corpus relation pairs of each test entity based on the second set, wherein the test statement matrix vectors are composed of a plurality of word vectors.
Optionally, the fourth obtaining unit specifically includes:
the second operation subunit is used for carrying out convolution operation on the matrix vectors of each test statement to obtain test feature information corresponding to the test entity corpus relation pair;
the second sampling subunit is used for sampling and processing the test feature information to obtain a plurality of test optimal features of the test entity corpus pairs;
The second merging subunit is used for merging the plurality of test optimal features to obtain test local optimal features of each test entity corpus pair;
and the second classification subunit is used for processing each test local optimal feature by using a Softmax model to obtain the test classification probability of each test entity corpus pair.
Optionally, the preset entity corpus is a preset tax entity corpus.
In a third aspect, an embodiment of the present invention provides an apparatus for determining similarity between entity corpora, including:
at least one processor, and a memory coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, the at least one processor performing the method as described in the first aspect above by executing the instructions stored by the memory.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, including:
the computer-readable storage medium has stored thereon computer instructions which, when executed by at least one processor of the apparatus for determining similarity between entity corpora, implement the method as described in the first aspect above.
One or more technical solutions provided in the embodiments of the present invention at least have the following technical effects or advantages:
in the invention, by executing the method for determining the similarity between entity corpora, the device randomly extracts a training set from a preset entity corpus; pairs any entity corpus in the training set with every other entity corpus until all entity corpora in the training set are paired, obtaining a plurality of training entity corpus relation pairs; obtains each training sentence matrix vector corresponding to each training entity corpus relation pair; and processes the training sentence matrix vectors with the convolutional neural network to obtain the training classification probability of each training entity corpus relation pair. This completes the learning of the preset entity corpus by the convolutional neural network, so that an intelligent customer service system using the convolutional neural network and the preset entity corpus can provide users with an accurate answer-search function. The prior-art technical problem that an intelligent customer service system cannot find the correct answer in its knowledge base when the information input by the user is inaccurate, which reduces user experience, is thereby solved, achieving the technical effect of improved user experience.
Drawings
Fig. 1 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for determining similarity between entity corpora according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for determining training classification probabilities for training entity corpus relationship pairs using convolutional neural networks, provided by an embodiment of the present invention;
FIG. 4 is a flowchart of a method for determining whether a convolutional neural network needs to be retrained according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an intelligent customer service system according to an embodiment of the present invention employing a method for determining similarity between entity corpora;
fig. 6 is a schematic structural diagram of an apparatus for determining similarity between entity corpora according to an embodiment of the present invention;
fig. 7 is a schematic physical structure diagram of an apparatus for determining similarity between entity corpora according to an embodiment of the present invention.
Detailed Description
In order to solve the technical problems, the general idea of the technical scheme in the embodiment of the invention is as follows:
a method and device for determining similarity between entity corpora specifically comprises the following steps:
randomly extracting a training set from a preset entity corpus, wherein the training set is composed of a plurality of entity corpora;
Pairing any entity corpus in the training set with each entity corpus except the entity corpus until all entity corpora in the training set are paired, thereby obtaining a plurality of training entity corpus relation pairs;
acquiring each training sentence matrix vector corresponding to each training entity corpus relation pair;
processing the matrix vectors of each training sentence by using a convolutional neural network to obtain training classification probabilities of the corpus relation pairs of each training entity;
and determining the similarity between the training entity corpora in the training entity corpus relation pair based on the training classification probability.
For a better understanding of the above technical solutions, they are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific features in the embodiments and examples of the present invention illustrate, rather than limit, the technical solutions of the present invention, and the technical features in the embodiments and examples may be combined with each other where there is no conflict.
In the embodiment of the invention, the convolutional neural network comprises a convolutional layer, a pooling layer and a fully connected layer. The convolutional neural network (CNN, Convolutional Neural Network) is derived from the neural mechanisms of vision: Hubel et al. found that a network structure within the visual neural mechanism can reduce the complexity of the network and is invariant to changes such as scaling and translation, which gave rise to the convolutional neural network. Referring to fig. 1, the basic structure of a CNN is a hierarchical recursive network structure consisting mainly of two layers, the convolutional layer and the sampling layer, plus a fully connected layer; the input of the convolutional neural network is given in the form of matrix vectors. The convolutional layer is also called the feature extraction layer, and the sampling layer is called the feature mapping layer or pooling layer. This two-layer structure can be understood as reducing the feature dimension and optimizing parameters, which is the advantage of convolutional networks over other neural networks. Sharing local weights reduces the number of parameters in the network, which works well for speech recognition and image processing. Based on these advantages, the convolutional neural network also has great advantages for text processing.
The training device can be any terminal equipment capable of running computer programs, such as a mobile phone, a tablet computer, a desktop computer and the like;
the entity corpus relation pair may be composed of two or more entity corpora; for example, an entity corpus relation pair may be expressed as <e_1, e_2>, where e_1, e_2 ∈ E, e_1 and e_2 are entity corpora, and E is the preset entity corpus;
the sentence matrix vector corresponding to the entity-corpus relation pair can be expressed as X, wherein X is a two-dimensional matrix of n X k, n is the length of the word of the entity-corpus relation pair, and k is the word vector X of the i-th word forming the entity-corpus relation pair i Is determined by the total number of words.
The above list is illustrative only and is not intended to be a specific limitation on the embodiments of the present invention.
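The n × k sentence matrix X described above can be assembled by stacking the word vectors of the words that make up a relation pair. A minimal sketch with toy 3-dimensional word vectors (all names and values are illustrative, not taken from the patent):

```python
def sentence_matrix(relation_pair_words, word_vectors):
    """Stack the word vector of each of the n words in an entity corpus
    relation pair into an n x k matrix (k = word-vector dimension)."""
    return [word_vectors[w] for w in relation_pair_words]

# hypothetical word vectors with k = 3
word_vectors = {"tax": [1, 0, 0], "disc": [0, 1, 0], "golden": [0, 0, 1]}
X = sentence_matrix(["tax", "disc", "golden"], word_vectors)
print(len(X), len(X[0]))  # n = 3 rows, k = 3 columns
```

In a real system the word vectors would come from a trained embedding model rather than this toy dictionary.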
Referring to fig. 2, a first embodiment of the present invention provides a method for determining similarity between entity corpora, including the following steps:
step S101, randomly extracting a training set from a preset entity corpus, wherein the training set is composed of a plurality of entity corpora.
Step S102, any entity corpus in the training set is paired with each entity corpus except the entity corpus until all entity corpora in the training set are paired, so that a plurality of training entity corpus relation pairs are obtained.
Step S103, each training sentence matrix vector corresponding to each training entity corpus relation pair is obtained.
And step S104, processing the matrix vectors of each training sentence by using a convolutional neural network to obtain training classification probabilities of the corpus relation pairs of each training entity.
Step S105, determining the similarity between the training entity corpora in each training entity corpus relation pair based on the training classification probability.
Step S101 is executed first, and a training set is randomly extracted from a preset entity corpus, wherein the training set is composed of a plurality of entity corpora.
Specifically, the algorithm used in extracting the training set may be a random extraction algorithm, or may be another algorithm capable of implementing a random extraction function, which is not limited herein.
Further, randomly extracting the training set from the preset entity corpus further comprises randomly extracting a test set from the preset entity corpus; the union set of the training set and the testing set is the preset entity corpus, and the training set and the testing set have no intersection set.
Specifically, one method for extracting the test set is to extract it at the same time as the training set is randomly extracted from the preset entity corpus, for example by using a random extraction algorithm to draw both sets simultaneously. Another method is to extract the training set first and then take the remainder of the preset entity corpus as the test set. Yet another method is to randomly extract the test set from the preset entity corpus with an algorithm and then use the remainder of the corpus as the training set.
In addition, the ratio of the training set to the test set can be set freely; for example, with a ratio of 1:2, the number of entity corpora in the training set is one third of the total number of entity corpora in the preset entity corpus.
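A random 1:2 split with an empty intersection and a union equal to the whole corpus can be sketched as follows (a minimal illustration under the assumption that corpora are plain strings; the patent does not fix a specific extraction algorithm):

```python
import random

def split_corpus(corpus, train_ratio=1/3, seed=42):
    """Randomly draw the training set; the remainder is the test set,
    so train | test == corpus and train & test is empty."""
    rng = random.Random(seed)
    shuffled = corpus[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]

corpus = [f"corpus_{i}" for i in range(30)]  # hypothetical 30-item corpus
train, test = split_corpus(corpus)
print(len(train), len(test))  # 10 20, i.e. a 1:2 ratio
```

The fixed seed only makes the sketch reproducible; any source of randomness satisfying the no-intersection/full-union property would do.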
After the training set extraction is completed, step S102 is executed, where any entity corpus in the training set is paired with each entity corpus except the entity corpus until all entity corpora in the training set are paired, so as to obtain a plurality of training entity corpus relation pairs.
Specifically, suppose the training set contains 10 entity corpora numbered 1, 2, …, 10. To obtain the training entity corpus pairs, corpus 1 is paired with corpora 2–10, giving 9 training entity corpus relation pairs; corpus 2 is paired with corpora 3–10, giving 8 more pairs; and so on, until corpus 9 is paired with corpus 10, for a total of 45 training entity corpus relation pairs. The order and manner of pairing the entity corpora in the training set are not limited, as long as every entity corpus in the training set ends up paired and there are no repeated entity corpus relation pairs.
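The exhaustive pairing described above is exactly the set of 2-element combinations of the training set. A minimal sketch (corpora represented simply as strings; the names are hypothetical):

```python
from itertools import combinations

def build_relation_pairs(training_set):
    """Pair every entity corpus with every other one exactly once and with
    no repeats -- C(n, 2) pairs for n corpora."""
    return list(combinations(training_set, 2))

corpora = [f"corpus_{i}" for i in range(1, 11)]  # 10 entity corpora
pairs = build_relation_pairs(corpora)
print(len(pairs))  # 45 pairs, matching 9 + 8 + ... + 1
```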
After step S102 is completed, step S103 is executed to obtain each training sentence matrix vector corresponding to each training entity corpus relation pair.
Further, the obtaining each training sentence matrix vector corresponding to each training entity corpus relation pair specifically includes:
acquiring a first set of word vectors corresponding to all words constituting the training set, wherein each entity corpus in the training set is respectively composed of a plurality of words;
based on the first set, training sentence matrix vectors of the training entity corpus relation pairs are obtained, wherein the training sentence matrix vectors are composed of a plurality of word vectors.
Specifically, to obtain the first set of word vectors corresponding to all words that form the training set, a Word2Vec model can be used to obtain the word vector of each word; Word2Vec converts natural language into a vector form that a computer can recognize. For example, if an entity corpus in the training set is "I love Beijing", it contains the 3 words "I", "love" and "Beijing"; converting them could yield, say, the one-hot-style vectors [1, 0, 0], [0, 1, 0] and [0, 0, 1], where the length of each word vector is determined by the number of distinct words in the preset entity corpus. Each word corresponds to one word vector, and the first set contains the word vectors of all distinct words that make up the training set.
After the first set is obtained, the training sentence matrix vector of each training entity corpus relation pair in the training set can be obtained from it. For example, each training entity corpus relation pair is composed of two training entity corpora, that is, of a number of words; based on the first set, the training sentence matrix vector corresponding to the pair is obtained, composed of the word vectors of the words that make up the pair.
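Building the "first set" can be sketched with one-hot vectors, mirroring the example above (a stand-in for Word2Vec, whose real output would be dense learned embeddings; the vocabulary here is hypothetical):

```python
def build_word_vector_set(corpora):
    """Map every distinct word in the training set to a one-hot vector
    whose length equals the number of distinct words."""
    vocab = sorted({w for corpus in corpora for w in corpus.split()})
    dim = len(vocab)
    return {w: [1 if j == i else 0 for j in range(dim)]
            for i, w in enumerate(vocab)}

first_set = build_word_vector_set(["I love Beijing", "I love tax law"])
print(len(first_set))  # one vector per distinct word
```

A trained embedding model (e.g. gensim's Word2Vec) would replace `build_word_vector_set` in practice.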
After step S103 is completed, step S104 is executed: the matrix vectors of each training sentence are processed with the convolutional neural network to obtain the training classification probability of each training entity corpus relation pair.
Further, referring to fig. 3, the training sentence matrix vectors are processed by using a convolutional neural network to obtain training classification probabilities of the training entity corpus relation pairs, which specifically includes the following steps:
step S104a, performing convolution operation on the training sentence matrix vectors to obtain training feature information corresponding to the training entity corpus relation pair.
Step S104b, each piece of training feature information is sampled to obtain a plurality of training optimal features of each training entity corpus pair.
And step S104c, combining the training optimal features to obtain training local optimal features of the training entity corpus pairs.
And step S104d, processing each training local optimal feature by using a Softmax model to obtain training classification probability of each training entity corpus pair.
In step S104, step S104a is first executed: a convolution operation is performed on each training sentence matrix vector to obtain the training feature information corresponding to each training entity corpus relation pair.
Specifically, after a training sentence matrix vector is input into the convolutional neural network, the convolution layer of the convolutional neural network performs a convolution operation on it to obtain the training feature information corresponding to the training entity corpus relation pair. For example, the convolution layer presets the size of a filter window through a filter and uses the filter, with a bias, to perform the convolution operation on the input matrix vector. If the filter window covers h words, the convolved feature can be expressed as:

c_i = f(w · x_{i:i+h-1} + b)

where c_i is the i-th feature value produced by the convolution operation, f(·) is the activation function of the layer, w ∈ R^{h×k} is the weight matrix of the filter (h × k being the size of the selected filter window, with k the word-vector dimension), b ∈ R is the bias term, and x_{i:i+h-1} denotes the word vectors from the i-th word to the (i+h-1)-th word of the sentence. In addition, the convolution layer may use a plurality of filters, each of which may have its own filter window size.
After the convolution layer of the convolutional neural network performs the convolution operation on a training sentence matrix vector of n words, the training feature information of the corresponding training entity corpus pair can be expressed as a feature matrix c:

c = [c_1, c_2, …, c_{n-h+1}]

where c ∈ R^{n-h+1}.
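The convolution step described above can be sketched with NumPy. This is a minimal illustration, assuming a single filter and a ReLU activation for f(·) (the patent does not fix the activation choice); the names are illustrative.

```python
import numpy as np

def conv_feature_map(X, W, b, f=lambda z: np.maximum(z, 0.0)):
    """Slide a window of h rows over the n x k sentence matrix X.
    W has shape (h, k); each step computes c_i = f(sum(W * window) + b),
    giving a feature map of length n - h + 1, as in the feature matrix c.
    ReLU is used as the activation f purely for illustration."""
    n = X.shape[0]
    h = W.shape[0]
    return np.array([f(np.sum(W * X[i:i + h]) + b) for i in range(n - h + 1)])

X = np.eye(4)          # toy sentence: 4 words, 4-dim one-hot word vectors
W = np.ones((2, 4))    # one filter with a window of h = 2 words
c = conv_feature_map(X, W, b=0.0)
# len(c) == 4 - 2 + 1 == 3
```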
After the feature information is obtained, step S104b is executed: each piece of training feature information is sampled to obtain a plurality of training optimal features of each training entity corpus pair.
Specifically, after the convolution layer of the convolutional neural network performs the convolution operation on the training sentence matrix vectors, a plurality of convolution results (such as the feature matrix c) are obtained, and the pooling layer of the convolutional neural network can sample these convolution results with a Max-pooling method, taking the maximum value of each result, to obtain the training optimal features of the training entity corpus pair.
After step S104b is performed, step S104c is executed: the plurality of training optimal features are merged to obtain the training local optimal feature of each training entity corpus pair.
Specifically, the convolution results of the plurality of training optimal features are merged, so that the plurality of training optimal features are combined into one training local optimal feature; this aggregates the statistics of the training optimal features and reduces the dimensionality of the feature representation.
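Steps S104b and S104c together can be sketched as follows: Max-pooling keeps one value per filter, and the maxima are then merged into a single fixed-length vector. The function name and toy values are illustrative.

```python
import numpy as np

def pool_and_merge(feature_maps):
    """Max-over-time pooling: keep only the maximum value of each filter's
    feature map (the training optimal feature), then concatenate the maxima
    into one fixed-length vector (the training local optimal feature),
    regardless of the original sentence length."""
    return np.array([np.max(c) for c in feature_maps])

maps = [np.array([0.2, 1.5, 0.7]),   # feature map of filter 1 (window h = 2)
        np.array([0.9, 0.1])]        # feature map of filter 2 (window h = 3)
z = pool_and_merge(maps)
# z == [1.5, 0.9]
```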
After step S104c is performed, step S104d is executed: each training local optimal feature is processed with the Softmax model to obtain the training classification probability of each training entity corpus pair.
Specifically, after receiving a training local optimal feature, the fully connected layer of the convolutional neural network performs relation classification on the local optimal feature with the Softmax model to obtain the classification probability.
After step S104 (i.e., step S104d) is performed, step S105 is executed: based on the training classification probability, the similarity between the training entity corpora in each training entity corpus relation pair is determined.
Specifically, the similarity between the training entity corpora in a training entity corpus relation pair may be expressed as similar (Y) or dissimilar (N), and is determined based on the training classification probability: if the training classification probability is greater than a preset threshold, the training entity corpora in the relation pair are determined to be similar; if the training classification probability is less than the preset threshold, they are determined to be dissimilar.
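The Softmax classification of step S104d and the threshold decision of step S105 can be sketched together. The two-class layout (index 1 = "similar") and the 0.5 threshold are illustrative assumptions, not fixed by the patent.

```python
import numpy as np

def softmax(logits):
    """Softmax over the fully connected layer's output logits."""
    e = np.exp(logits - np.max(logits))   # subtract max for numerical stability
    return e / e.sum()

def decide(logits, threshold=0.5):
    """Return 'Y' (similar) if the probability of the 'similar' class
    exceeds the preset threshold, otherwise 'N' (dissimilar).
    Class index 1 is taken to be 'similar' by convention here."""
    p_similar = softmax(logits)[1]
    return "Y" if p_similar > threshold else "N"

result = decide(np.array([0.3, 2.1]))
# result == "Y": the 'similar' probability (about 0.86) exceeds the threshold
```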
Further, referring to fig. 4, after step S105 is performed, the training method further includes the following steps:
step S201, pairing any entity corpus in the test set with each entity corpus except the entity corpus until all entity corpora in the test set are paired, thereby obtaining a plurality of test entity corpus relation pairs, wherein the test set is composed of a plurality of entity corpora.
Step S202, each test sentence matrix vector corresponding to each test entity corpus relation pair is obtained.
And step S203, processing the matrix vectors of each test statement by using the convolutional neural network to obtain the classification probability of each test entity corpus relation pair.
Step S204, the classification probability of each test entity corpus relation pair is output, so that a user can judge whether the convolutional neural network needs to be trained again or not based on the classification probability.
After step S104d is performed, steps S201 to S204 are performed in sequence. The specific methods of performing steps S201, S202 and S203 on the test set are the same as those of performing steps S102, S103 and S104 on the training set, respectively, and are not described again here.
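The pairing of steps S102/S201 — pairing every entity corpus with every other one until all are paired — can be sketched with `itertools.combinations`, on the assumption that each unordered pair is formed exactly once. The sample corpora are invented for illustration.

```python
from itertools import combinations

def make_relation_pairs(corpora):
    """Pair every entity corpus with every other entity corpus exactly
    once, yielding the entity corpus relation pairs described in
    steps S102 and S201 (n corpora give n*(n-1)/2 pairs)."""
    return list(combinations(corpora, 2))

test_set = ["how to file VAT", "VAT filing steps", "invoice reissue"]
pairs = make_relation_pairs(test_set)
# 3 corpora -> 3 relation pairs
```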
Executing step S203 specifically includes:
performing convolution operation on the test statement matrix vectors to obtain test feature information corresponding to the test entity corpus relation pairs;
sampling the test feature information to obtain a plurality of test optimal features of the test entity corpus pairs;
combining the plurality of test optimal features to obtain test local optimal features of each test entity corpus pair;
and processing each test local optimal feature by using a Softmax model to obtain the test classification probability of each test entity corpus pair.
The specific method of executing step S203 on the test set is the same as that of executing steps S104a to S104d on the training set, and is not described again here.
For step S204, specifically, the classification probability of each test entity corpus pair is output, where the classification probability includes a set of classification probability values, each being the relative probability of the classification corresponding to the value of a local optimal feature. The convolutional neural network outputs the classification probabilities and the words corresponding to them, and the output result is evaluated with the following formulas (reconstructed here as the standard per-class precision, recall and F_1 definitions, consistent with the quantities defined below):

precision_i = r_i / t_i,  recall_i = r_i / a_i,  F_1 = 2 · precision_i · recall_i / (precision_i + recall_i)

where r_i is the number of test entity corpus relation pairs of the i-th class that are correctly classified, t_i is the total number of relation pairs determined to belong to the i-th class, a_i is the total number of test entity corpus relation pairs of the i-th class in the test set, and F_1 is the resulting evaluation score.
Further, the convolutional neural network outputs the precision, recall and F_1 to the user, and the user can determine, based on the precision, recall and F_1, whether the convolutional neural network needs to be retrained. For example, if the precision is 60%, the user may judge that the convolutional neural network needs to be trained again.
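The per-class evaluation can be sketched from the counts r_i, t_i and a_i defined above. The standard precision/recall/F_1 definitions are assumed here, since the patent's formula image is not reproduced in the text; the counts in the example are invented.

```python
def evaluate_class(r_i, t_i, a_i):
    """Per-class evaluation metrics:
    r_i = relation pairs of the class that were correctly classified,
    t_i = relation pairs determined (predicted) to belong to the class,
    a_i = relation pairs actually belonging to the class in the test set."""
    precision = r_i / t_i
    recall = r_i / a_i
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = evaluate_class(r_i=60, t_i=100, a_i=80)
# p == 0.6, r == 0.75, f1 == 2/3
```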
Further, when the convolutional neural network outputs the precision, recall and F_1 to the user, the precision, recall and F_1 can be output in text form through the display interface of the training device.
Further, the preset entity corpus may be a preset tax entity corpus; for example, it may be the knowledge base of an intelligent tax customer service (a knowledge base containing 7000 pieces of tax knowledge and 11000 extended questions).
For example, referring to fig. 5, the method for determining similarity between entity corpora is applied to an intelligent customer service system, where the preset entity corpus is a preset tax entity corpus, and the intelligent customer service system performs the following steps:
Randomly extract a training set or a test set from the preset tax entity corpus, and pair the entity corpora in the training set or the test set to obtain a plurality of training entity corpus relation pairs or test entity corpus relation pairs. If training entity corpus relation pairs are input into the convolutional neural network, the similarity between the training entity corpora in the training set is output, where the similarity can be similar or dissimilar; if test entity corpus relation pairs are input, the output results are the precision, recall, F_1 and the like, which the user uses to judge whether retraining is required.
After the intelligent customer service system executes the above method, when a user uses it, even if the information input by the user is inaccurate, the system can search the preset tax entity corpus for similar entity corpora based on that information, and thereby find the answer for the user.
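The answer lookup described above can be sketched as follows: the user's query is paired with every knowledge-base entry, each pair is scored, and the highest-scoring entry is returned. A toy word-overlap scorer stands in for the trained convolutional network here; the function names and knowledge-base entries are all illustrative.

```python
def find_best_match(user_query, knowledge_base, score_pair):
    """Pair the user's (possibly inexact) query with every entry in the
    knowledge base, score each pair, and return the entry with the
    highest similarity score. `score_pair` stands in for the trained
    convolutional neural network's 'similar' probability."""
    return max(knowledge_base, key=lambda entry: score_pair(user_query, entry))

def overlap_score(a, b):
    """Toy stand-in scorer: Jaccard word-overlap ratio of two sentences."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

kb = ["how to reissue a VAT invoice", "annual income tax filing deadline"]
best = find_best_match("reissue VAT invoice", kb, overlap_score)
# best == "how to reissue a VAT invoice"
```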
Referring to fig. 6, based on the same inventive concept, a second embodiment of the present invention provides a device for determining similarity between entity corpora, including:
the extraction unit 601 is configured to randomly extract a training set from a preset entity corpus, where the training set is composed of a plurality of entity corpora;
A first pairing unit 602, configured to pair any entity corpus in the training set with each entity corpus except the entity corpus until all entity corpora in the training set are paired, thereby obtaining a plurality of training entity corpus relation pairs;
a first obtaining unit 603, configured to obtain each training sentence matrix vector corresponding to each training entity corpus relation pair;
a second obtaining unit 604, configured to process the matrix vectors of the training sentences by using a convolutional neural network, to obtain training classification probabilities of the corpus relation pairs of the training entities;
a determining unit 605 is configured to determine, based on the training classification probability, a similarity between the training entity corpora in the training entity corpus relation pair.
Optionally, the first obtaining unit specifically includes:
a first obtaining subunit, configured to obtain a first set of word vectors corresponding to all words that form the training set, where each entity corpus in the training set is formed by a plurality of words;
and the second acquisition subunit is used for acquiring training sentence matrix vectors of the corpus relation pairs of each training entity based on the first set, wherein the training sentence matrix vectors are composed of a plurality of word vectors.
Optionally, the second obtaining unit specifically includes:
the first operation subunit is used for carrying out convolution operation on the matrix vectors of the training sentences to obtain training characteristic information corresponding to the training entity corpus relation pairs;
the first sampling subunit is used for sampling and processing the training feature information to obtain a plurality of training optimal features of the training entity corpus pairs;
the first merging subunit is used for merging the plurality of training optimal features to obtain training local optimal features of the training entity corpus pairs;
and the first classification subunit is used for processing each training local optimal feature by using the Softmax model to obtain training classification probability of each training entity corpus pair.
Optionally, the extracting unit is further configured to:
extracting a test set from a preset entity corpus by using a random extraction algorithm; the union set of the training set and the testing set is the preset entity corpus, and the training set and the testing set have no intersection set.
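The random extraction of disjoint training and test sets whose union is the preset entity corpus can be sketched as below. The 80/20 split ratio and the fixed seed are illustrative choices, not specified by the patent.

```python
import random

def split_corpus(corpus, train_ratio=0.8, seed=42):
    """Randomly split the preset entity corpus into a training set and a
    test set with no intersection, whose union is the whole corpus."""
    items = list(corpus)
    random.Random(seed).shuffle(items)   # seeded for reproducibility
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

corpus = [f"entity corpus {i}" for i in range(10)]
train, test = split_corpus(corpus)
# 8 training corpora, 2 test corpora, disjoint, union = corpus
```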
Optionally, the apparatus further includes:
the second pairing unit is used for pairing any entity corpus in the test set with each entity corpus except the entity corpus after the training sentence matrix vectors are processed by using the convolutional neural network to obtain the test classification probability of each training entity corpus relation pair, so that all entity corpuses in the test set are paired, and a plurality of test entity corpus relation pairs are obtained, wherein the test set consists of a plurality of entity corpuses;
The third acquisition unit is used for acquiring each test statement matrix vector corresponding to each test entity corpus relation pair;
the fourth obtaining unit is used for processing the matrix vectors of each test sentence by using the convolutional neural network to obtain the classification probability of each test entity corpus relation pair;
and the output unit is used for outputting the classification probability of each test entity corpus relation pair, so that a user can judge whether the convolutional neural network needs to be trained again or not based on the classification probability.
Optionally, the third obtaining unit specifically includes:
a third obtaining subunit, configured to obtain a second set of word vectors corresponding to all words that form the test set, where each entity corpus in the test set is formed by a plurality of words;
and the fourth acquisition subunit is used for acquiring the test statement matrix vectors of the corpus relation pairs of each test entity based on the second set, wherein the test statement matrix vectors are composed of a plurality of word vectors.
Optionally, the fourth obtaining unit specifically includes:
the second operation subunit is used for carrying out convolution operation on the matrix vectors of each test statement to obtain test feature information corresponding to the test entity corpus relation pair;
The second sampling subunit is used for sampling and processing the test feature information to obtain a plurality of test optimal features of the test entity corpus pairs;
the second merging subunit is used for merging the plurality of test optimal features to obtain test local optimal features of each test entity corpus pair;
and the second classification subunit is used for processing each test local optimal feature by using a Softmax model to obtain the test classification probability of each test entity corpus pair.
Optionally, the preset entity corpus is a preset tax entity corpus.
Referring to fig. 7, based on the same inventive concept, a third embodiment of the present invention provides an apparatus for determining similarity between entity corpora, including:
at least one processor 701, and a memory 702 coupled to the at least one processor;
wherein the memory 702 stores instructions executable by the at least one processor 701, the at least one processor 701 performing the steps of the method as described in the method embodiments above by executing the instructions stored by the memory 702.
Optionally, the processor 701 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits for controlling program execution, a hardware circuit developed on the basis of a field-programmable gate array (FPGA), or a baseband processor.
Optionally, the processor 701 may include at least one processing core.
Optionally, the apparatus further includes the memory 702, where the memory 702 may include a read-only memory (ROM), a random access memory (RAM) and a disk memory. The memory 702 is used to store data required by the processor 701 during operation.
Based on the same inventive concept, a fourth embodiment of the present invention provides a computer readable storage medium, comprising:
the computer readable storage medium has stored thereon computer instructions which, when executed by at least one processor of the training apparatus, implement a method as described in the method embodiments above.
The technical scheme provided by the embodiment of the invention at least has the following technical effects or advantages:
in the present invention, by executing the method for determining similarity between entity corpora, the device randomly extracts a training set from the preset entity corpus and pairs any entity corpus in the training set with every entity corpus other than itself until all entity corpora in the training set are paired, thereby obtaining a plurality of training entity corpus relation pairs. It then obtains the training sentence matrix vector corresponding to each training entity corpus relation pair and processes the training sentence matrix vectors with the convolutional neural network to obtain the training classification probability of each training entity corpus relation pair, thereby completing the learning of the preset entity corpus by the convolutional neural network. An intelligent customer service that uses the convolutional neural network and the preset entity corpus can therefore provide users with an accurate answer-search function, which solves the technical problem in the prior art that, when the information input by a user is inaccurate, an intelligent customer service system cannot find the correct answer in its own knowledge base, and achieves the technical effect of improving the user experience.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (18)

1. A method for determining similarity between corpora, comprising:
randomly extracting a training set from a preset entity corpus, wherein the training set is composed of a plurality of entity corpora;
pairing any entity corpus in the training set with each entity corpus except the entity corpus until all entity corpora in the training set are paired, thereby obtaining a plurality of training entity corpus relation pairs;
acquiring each training sentence matrix vector corresponding to each training entity corpus relation pair;
processing the matrix vectors of each training sentence by using a convolutional neural network to obtain training classification probabilities of the corpus relation pairs of each training entity;
based on the training classification probability, determining similarity between each training entity corpus in the training entity corpus relation pair;
wherein determining similarity between each training entity corpus in the training entity corpus relation pair based on the training classification probability comprises:
if the training classification probability is larger than a preset threshold, determining that the training entity corpora in the training entity corpus relation pair are similar, and if the training classification probability is smaller than the preset threshold, determining that the training entity corpora in the training entity corpus relation pair are non-similar.
2. The method of claim 1, wherein the obtaining each training sentence matrix vector corresponding to each training entity corpus relation pair specifically comprises:
acquiring a first set of word vectors corresponding to all words constituting the training set, wherein each entity corpus in the training set is respectively composed of a plurality of words;
based on the first set, training sentence matrix vectors of the training entity corpus relation pairs are obtained, wherein the training sentence matrix vectors are composed of a plurality of word vectors.
3. The method according to claim 1 or 2, wherein the processing the matrix vectors of each training sentence by using a convolutional neural network to obtain training classification probabilities of each training entity corpus relation pair specifically includes:
performing convolution operation on the training sentence matrix vectors to obtain training feature information corresponding to the training entity corpus relation pair;
sampling the training feature information to obtain a plurality of training optimal features of the training entity corpus relation pairs;
combining the training optimal features to obtain training local optimal features of the training entity corpus relation pairs;
And processing each training local optimal feature by using a Softmax model to obtain training classification probabilities of each training entity corpus relation pair.
4. The method according to claim 1 or 2, wherein the randomly extracting the training set from the preset entity corpus further comprises:
extracting a test set from a preset entity corpus by using a random extraction algorithm; the union set of the training set and the testing set is the preset entity corpus, and the training set and the testing set have no intersection set.
5. The method of claim 4, wherein after said processing of said respective training sentence matrix vectors with a convolutional neural network to obtain test classification probabilities for said respective training entity corpus relationship pairs, said method further comprises:
pairing any entity corpus in the test set with each entity corpus except the entity corpus until all entity corpora in the test set are paired, thereby obtaining a plurality of test entity corpus relation pairs, wherein the test set is composed of a plurality of entity corpora;
acquiring each test sentence matrix vector corresponding to each test entity corpus relation pair;
Processing the matrix vectors of each test statement by using the convolutional neural network to obtain the classification probability of each test entity corpus relation pair;
and outputting the classification probability of each test entity corpus relation pair, so that a user can judge whether the convolutional neural network needs to be trained again or not based on the classification probability.
6. The method of claim 5, wherein the obtaining each test sentence matrix vector corresponding to each test entity corpus relation pair specifically comprises:
acquiring a second set of word vectors corresponding to all words constituting the test set, wherein each entity corpus in the test set is respectively composed of a plurality of words;
and based on the second set, obtaining test sentence matrix vectors of the test entity corpus relation pairs, wherein the test sentence matrix vectors are composed of a plurality of word vectors.
7. The method of claim 5 or 6, wherein the processing the matrix vectors of each test sentence by using a convolutional neural network to obtain test classification probabilities of each test entity corpus relation pair specifically comprises:
Performing convolution operation on the test statement matrix vectors to obtain test feature information corresponding to the test entity corpus relation pairs;
sampling the test feature information to obtain a plurality of test optimal features of the corpus relation pairs of each test entity;
combining the plurality of test optimal features to obtain test local optimal features of each test entity corpus relation pair;
and processing each test local optimal feature by using a Softmax model to obtain the test classification probability of each test entity corpus relation pair.
8. The method of claim 1, 2, 5 or 6, wherein the pre-set entity corpus is a pre-set tax entity corpus.
9. An apparatus for determining similarity between corpora, comprising:
the extraction unit is used for randomly extracting a training set from a preset entity corpus, wherein the training set is composed of a plurality of entity corpora;
the first pairing unit is used for pairing any entity corpus in the training set with each entity corpus except the entity corpus until all entity corpora in the training set are paired, so that a plurality of training entity corpus relation pairs are obtained;
The first acquisition unit is used for acquiring each training sentence matrix vector corresponding to each training entity corpus relation pair;
the second acquisition unit is used for processing the matrix vectors of the training sentences by using a convolutional neural network to acquire training classification probabilities of the corpus relation pairs of the training entities;
the determining unit is used for determining the similarity between the training entity corpora in the training entity corpus relation pair based on the training classification probability;
wherein, the determining unit is specifically configured to: if the training classification probability is larger than a preset threshold, determining that the training entity corpora in the training entity corpus relation pair are similar, and if the training classification probability is smaller than the preset threshold, determining that the training entity corpora in the training entity corpus relation pair are non-similar.
10. The apparatus of claim 9, wherein the first acquisition unit specifically comprises:
a first obtaining subunit, configured to obtain a first set of word vectors corresponding to all words that form the training set, where each entity corpus in the training set is formed by a plurality of words;
And the second acquisition subunit is used for acquiring training sentence matrix vectors of the corpus relation pairs of each training entity based on the first set, wherein the training sentence matrix vectors are composed of a plurality of word vectors.
11. The apparatus according to claim 9 or 10, wherein the second acquisition unit specifically comprises:
the first operation subunit is used for carrying out convolution operation on the matrix vectors of the training sentences to obtain training characteristic information corresponding to the training entity corpus relation pairs;
the first sampling subunit is used for sampling and processing the training feature information to obtain a plurality of training optimal features of the training entity corpus relation pairs;
the first merging subunit is used for merging the plurality of training optimal features to obtain training local optimal features of the training entity corpus relation pairs;
and the first classification subunit is used for processing each training local optimal feature by using the Softmax model to obtain training classification probability of each training entity corpus relation pair.
12. The apparatus of claim 9 or 10, wherein the extraction unit is further configured to:
Extracting a test set from a preset entity corpus by using a random extraction algorithm; the union set of the training set and the testing set is the preset entity corpus, and the training set and the testing set have no intersection set.
13. The apparatus of claim 12, wherein the apparatus further comprises:
the second pairing unit is used for pairing any entity corpus in the test set with each entity corpus except the entity corpus after the training sentence matrix vectors are processed by using the convolutional neural network to obtain the test classification probability of each training entity corpus relation pair, so that all entity corpuses in the test set are paired, and a plurality of test entity corpus relation pairs are obtained, wherein the test set consists of a plurality of entity corpuses;
the third acquisition unit is used for acquiring each test statement matrix vector corresponding to each test entity corpus relation pair;
the fourth obtaining unit is used for processing the matrix vectors of each test sentence by using the convolutional neural network to obtain the classification probability of each test entity corpus relation pair;
And the output unit is used for outputting the classification probability of each test entity corpus relation pair, so that a user can judge whether the convolutional neural network needs to be trained again or not based on the classification probability.
14. The apparatus of claim 13, wherein the third acquisition unit specifically comprises:
the third acquisition subunit is used for acquiring a second set of word vectors corresponding to all words that form the test set, wherein each entity corpus in the test set is formed by a plurality of words;
and the fourth acquisition subunit is used for acquiring the test sentence matrix vector of each test entity corpus relation pair based on the second set, wherein each test sentence matrix vector is composed of a plurality of word vectors.
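Building a test sentence matrix vector from the second set can be sketched as stacking word vectors row by row. The lookup-dictionary form of the second set, the whitespace tokenization, and the toy 2-dimensional vectors are all assumptions for illustration:

```python
import numpy as np

def sentence_matrix(pair, word_vectors):
    """Stack the word vectors of every word in a test entity corpus
    relation pair into one sentence matrix vector (one row per word)."""
    words = [w for corpus in pair for w in corpus.split()]
    return np.stack([word_vectors[w] for w in words])

# Toy second set: word -> word vector (illustrative values).
vecs = {"pay": np.array([1.0, 0.0]),
        "tax": np.array([0.0, 1.0]),
        "vat": np.array([0.5, 0.5])}
m = sentence_matrix(("pay tax", "pay vat"), vecs)
```

The resulting matrix has one row per word across both corpora of the pair, ready for the convolution step.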
15. The apparatus according to claim 13 or 14, wherein the fourth acquisition unit specifically comprises:
the second operation subunit is used for performing a convolution operation on each test sentence matrix vector to obtain test feature information corresponding to each test entity corpus relation pair;
the second sampling subunit is used for sampling the test feature information to obtain a plurality of test optimal features of each test entity corpus relation pair;
the second merging subunit is used for merging the plurality of test optimal features to obtain test local optimal features of each test entity corpus relation pair;
and the second classification subunit is used for processing each test local optimal feature by using a Softmax model to obtain the test classification probability of each test entity corpus relation pair.
16. The apparatus of claim 9, 10, 13 or 14, wherein the preset entity corpus is a preset tax entity corpus.
17. An apparatus for determining similarity between corpora, comprising:
at least one processor, and a memory coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, the at least one processor performing the method of any of claims 1-8 by executing the instructions stored by the memory.
18. A computer-readable storage medium, comprising:
the computer-readable storage medium having stored thereon computer instructions which, when executed by at least one processor of the apparatus for determining similarity between entity corpora, implement the method of any of claims 1-8.
CN201811151935.4A 2018-09-29 2018-09-29 Method and device for determining similarity between entity corpora Active CN110969005B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811151935.4A CN110969005B (en) 2018-09-29 2018-09-29 Method and device for determining similarity between entity corpora

Publications (2)

Publication Number Publication Date
CN110969005A CN110969005A (en) 2020-04-07
CN110969005B true CN110969005B (en) 2023-10-31

Family

ID=70027498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811151935.4A Active CN110969005B (en) 2018-09-29 2018-09-29 Method and device for determining similarity between entity corpora

Country Status (1)

Country Link
CN (1) CN110969005B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101041B (en) * 2020-09-08 2022-02-15 平安科技(深圳)有限公司 Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN112487201B (en) * 2020-11-26 2022-05-10 西北工业大学 Knowledge graph representation method using shared parameter convolutional neural network
CN113051900B (en) * 2021-04-30 2023-08-22 中国平安人寿保险股份有限公司 Synonym recognition method, synonym recognition device, computer equipment and storage medium

Citations (4)

Publication number Priority date Publication date Assignee Title
CN106484675A (en) * 2016-09-29 2017-03-08 Beijing Institute of Technology Character relation extraction method fusing distributed semantics and sentence-meaning features
CN107038480A (en) * 2017-05-12 2017-08-11 Donghua University Text sentiment classification method based on convolutional neural networks
CN107220237A (en) * 2017-05-24 2017-09-29 Nanjing University Method for enterprise entity relation extraction based on convolutional neural networks
CN108021555A (en) * 2017-11-21 2018-05-11 Inspur Financial Information Technology Co., Ltd. Question sentence parsing measurement method based on deep convolutional neural networks

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US9043360B2 (en) * 2010-12-17 2015-05-26 Yahoo! Inc. Display entity relationship
US9792549B2 (en) * 2014-11-21 2017-10-17 International Business Machines Corporation Extraction of semantic relations using distributional relation detection
US10628490B2 (en) * 2015-11-05 2020-04-21 Microsoft Technology Licensing, Llc Techniques for digital entity correlation
US10503833B2 (en) * 2016-12-06 2019-12-10 Siemens Aktiengesellschaft Device and method for natural language processing

Non-Patent Citations (3)

Title
Capturing Semantic Similarity for Entity Linking with Convolutional Neural Networks; Matthew Francis-Landau et al.; arXiv; 1-7 *
Text classification method combining associative semantics with convolutional neural networks; Wei Yong; Control Engineering of China (Issue 02); 187-190 *
Weakly supervised relation extraction for Chinese medical texts based on convolutional neural networks; Liu Kai; Fu Haidong; Zou Yuwei; Gu Jinguang; Computer Science (Issue 10); 254-258 *

Also Published As

Publication number Publication date
CN110969005A (en) 2020-04-07

Similar Documents

Publication Publication Date Title
US11403680B2 (en) Method, apparatus for evaluating review, device and storage medium
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN108363743B (en) Intelligent problem generation method and device and computer readable storage medium
CN110019843B (en) Knowledge graph processing method and device
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
CN111325029B (en) Text similarity calculation method based on deep learning integrated model
CN111159485B (en) Tail entity linking method, device, server and storage medium
KR20200127020A (en) Computer-readable storage medium storing method, apparatus and instructions for matching semantic text data with tags
CN111783394A (en) Training method of event extraction model, event extraction method, system and equipment
CN111967264B (en) Named entity identification method
CN111539197A (en) Text matching method and device, computer system and readable storage medium
CN110969005B (en) Method and device for determining similarity between entity corpora
CN109086265A (en) Semantic training method and multi-sense word disambiguation method for short texts
CN112101042A (en) Text emotion recognition method and device, terminal device and storage medium
CN115795030A (en) Text classification method and device, computer equipment and storage medium
CN110852071B (en) Knowledge point detection method, device, equipment and readable storage medium
CN107783958B (en) Target statement identification method and device
CN114840642A (en) Event extraction method, device, equipment and storage medium
CN113342944B (en) Corpus generalization method, apparatus, device and storage medium
CN116842168B (en) Cross-domain problem processing method and device, electronic equipment and storage medium
CN110334204B (en) Exercise similarity calculation recommendation method based on user records
CN114896382A (en) Artificial intelligent question-answering model generation method, question-answering method, device and storage medium
CN113761874A (en) Event reality prediction method and device, electronic equipment and storage medium
CN113886521A (en) Text relation automatic labeling method based on similar vocabulary
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant