CN112800777B - Semantic determination method - Google Patents

Semantic determination method

Info

Publication number
CN112800777B
CN112800777B (application CN202110398762.1A)
Authority
CN
China
Prior art keywords
sentence
vector
branch
sentences
network model
Prior art date
Legal status
Active
Application number
CN202110398762.1A
Other languages
Chinese (zh)
Other versions
CN112800777A (en)
Inventor
王光勇
姜巍
李乘风
于游
赵永强
廖望梅
张姗姗
Current Assignee
Beijing Yuxueyuan Health Management Center Co ltd
Original Assignee
Beijing Yuxueyuan Health Management Center Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yuxueyuan Health Management Center Co ltd
Priority to CN202110398762.1A
Publication of CN112800777A
Application granted
Publication of CN112800777B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a semantic determination method. A sentence whose semantics are to be determined is obtained; a multitask twin network model is trained on a preprocessed corpus, and all sentence vectors corresponding to the corpus are obtained through the trained model. The sentence vector corresponding to the input sentence is extracted from the trained model, it is calculated whether the vector difference between each stored sentence vector and this sentence vector is within a preset difference range, the sentence vectors meeting the preset condition are ranked by similarity, the sentence vector with the highest similarity is taken as the matched semantics of the sentence, and the sentence is thereby matched to a specific standard sentence.

Description

Semantic determination method
Technical Field
The invention belongs to the field of intelligent analysis for maternal and infant care, and particularly relates to a semantic determination method.
Background
In recent years, with the rapid development of artificial intelligence, especially deep learning, natural language processing has played an increasingly important role in human learning, work and life. The emergence of intelligent question answering has given rise to a large number of applications and services, such as the voice assistants of Baidu, Alibaba, Xiaomi, Apple (Siri), Microsoft and other device systems.
In the prior art, the technical route of a question-answering robot is to map sentences into sentence vectors, compute the cosine similarity between sentence vectors to obtain the sentence with the highest score, and return the corresponding answer to the user. However, because the syntactic structure of Chinese sentences is complex and changeable and the semantic context is heterogeneous, the online evaluation of such a model is poor, which gives users a bad experience; in particular, if the complexity and diversity of Chinese grammatical structure cannot be fully understood, a question-answering robot used in the maternal and infant field cannot output accurate calculation results, so the online evaluation of the model is poor and the user experience suffers greatly.
Disclosure of Invention
In order to solve the technical problem in the prior art that, because the grammatical structure of Chinese sentences is complex and changeable and the semantic context is heterogeneous, the online evaluation of the model is poor and users have a bad experience, a semantic determination method is provided.
In a first aspect, the present invention provides a semantic determination method, including:
obtaining a statement C of a to-be-determined semantic;
inputting the statement into a preset twin network model to obtain a feature vector of the statement C, wherein the twin network model is used for vectorizing the statement;
determining a similar vector with the highest similarity with the feature vector of the statement C from a vector library;
and determining the standard sentence corresponding to the similar vector as the semantic corresponding to the sentence.
Further, the twin network model comprises a first branch of the network structure, a second branch of the network structure, a main classification task, a first branch auxiliary task and a second branch auxiliary task, and the training process of the twin network model comprises the following steps:
obtaining sentences A and B from a preset training corpus;
inputting the sentence A and the sentence B into the twin network model, so that the first branch of the network structure determines a feature vector of the sentence A, the second branch of the network structure determines a feature vector of the sentence B, and the twin network model performs feature fusion on the feature vector of the sentence A and the feature vector of the sentence B to obtain a fusion vector, and then determining a main classification task of the twin network model for vectorizing the sentence to be processed based on the feature vector of the sentence A, the feature vector of the sentence B and the fusion vector.
Further, the loss functions of the first branch of the network structure, the second branch of the network structure, the main classification task, the first branch auxiliary task and the second branch auxiliary task are a custom loss, given by the following formula:

[Custom loss formula (image in the original publication)]

where z_i is the output of the twin network model (i = 1 is the first branch auxiliary task output, i = 2 is the main classification task output, i = 3 is the second branch auxiliary task output),

[auxiliary definition (image in the original publication)]

and σ is a hyperparameter of the twin network model, w is the weight matrix of the twin network model, and ε = 0.1.
Further, the determining a similarity vector with the highest similarity to the feature vector of the statement C from the vector library includes:
calculating the similarity of each sentence vector and the sentence vector through a cosine similarity formula, wherein the sentence vectors are stored in the vector library;
if the similarity between the sentence vector and the feature vector of the sentence C is greater than a preset threshold value, determining that the vector difference value between the sentence vector and the feature vector of the sentence C is within a preset difference value range;
and determining a similar vector with the highest similarity to the sentence vector from the sentence vectors in the preset difference range.
Further, the semantic determination method further includes: and if the similarity between the sentence vector and the feature vector of the sentence C is smaller than a preset threshold value, determining that the vector difference value between the sentence vector and the feature vector of the sentence C is not in a preset difference value range.
Further, the training process of the twin network model further comprises:
inputting the obtained sentences and the labels corresponding to the sentences into a twin network model for training so as to train hidden labels and sentence vectors to which the sentences belong, wherein the hidden labels are used for machine classification of the sentences, and the sentence vectors can comprise a plurality of dimensions.
Further, before calculating whether the vector difference value of the vector and the statement vector is within a preset difference value range, the method further includes:
extracting an implicit label corresponding to the statement from a pre-trained first branch of the network structure;
and screening out the sentences belonging to the range of the hidden labels from the second branch of the network structure corresponding to the first branch of the network structure.
Further, the weight of the second branch of the network structure is the same as the weight of the first branch of the network structure.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages: a sentence whose semantics are to be determined is obtained; the sentence vector corresponding to the sentence is extracted from the pre-trained first branch of the network structure; it is calculated whether the vector difference between each sentence vector in the second branch of the network structure corresponding to the first branch and the sentence vector of the sentence is within a preset difference range; the similarities meeting the preset condition are ranked, and the machine sentence with the highest similarity is taken as the semantics corresponding to the sentence. This solves the problem that using a single model causes over-fitting of the sentences and thus low accuracy of the output results.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart of the main operation steps of a semantic determination method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a specific model implementation flow of a semantic determination method according to an embodiment of the present application;
FIG. 3 is a flow chart illustrating a specific implementation of a twin network model involved in the semantic determination method according to an embodiment of the present disclosure;
FIG. 4 is a flowchart illustrating specific operation steps of determining a similarity vector with the highest similarity to a feature vector of a sentence C from a vector library in the semantic determination method according to the embodiment of the present application;
FIG. 5 is a schematic overall implementation flow diagram of a twin network model involved in the semantic determination method according to the embodiment of the present application;
FIG. 6 is a flow chart of a Tensorflow_ranking model implementation of an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
In the prior art, the technical route of a question-answering robot is to map sentences into sentence vectors, obtain the sentence with the highest score by computing the cosine similarity between sentence vectors, and return the corresponding answer to the user. However, because the syntactic structure of Chinese sentences is complex and changeable and the semantic context is heterogeneous, the online evaluation of the model is poor, which gives users a bad experience.
In view of the above, experts and scholars have proposed a large number of solutions, which are mainly divided into: prior art 1: combining with a Bert Chinese pre-training language model, and converting sentences into space vectors; prior art 2: combining with an Albert Chinese pre-training language model, and converting sentences into space vectors.
Although current methods have improved to some extent through continuous research, certain problems remain. Prior art 1, fine-tuning a Bert Chinese pre-training language model, is basically the strongest sentence-vector extraction method at present: it abstracts a sentence into a sentence vector of a certain dimension that integrates rich semantic features, and its evaluation results are better than Word2Vec. However, it does not meet the requirements for going online: Bert has a large number of parameters and the search time after online deployment is long, and for these two reasons Bert cannot meet the requirements of industrial production. Prior art 2, based on the Albert Chinese pre-training language model, has far fewer parameters than Bert but little change in effect; using Albert only reduces the video memory needed to train the model, while the evaluation results and the search response time are not improved compared with Bert.
In the current technical route, the strongest model is the Chinese pre-training model fine-tuned on the basis of Albert, but its online evaluation results and search response time are poor, giving users a poor experience.
The innovation of the embodiment of the invention is to provide, on the basis of Albert, a multitask twin network model comprising a first branch of the network structure, a second branch of the network structure, a main classification task, a first branch auxiliary task, a second branch auxiliary task, and the like. A twin scheme in which the first branch and the second branch of the network structure share weights is adopted, and different layers are added for method research and verification of the model effect. Under a specific knowledge system (such as the applicant's Yuxueyuan APP), the technical method of the embodiment of the invention is greatly improved compared with the existing single-Albert model; in evaluation on open data the test accuracy is improved by nearly 20%, which is a significant effect.
As shown in fig. 1, an embodiment of the present invention provides a semantic determination method, where the method includes:
step S101, obtaining a statement C of a to-be-determined semantic;
in the embodiment of the present invention, the sentence C refers to an input sentence, for example, the actual application scenario is a speech for recognizing a user, and the speech of the user is translated (speech is converted into text) to obtain the sentence C for a subsequent recognition process.
Step S102, inputting the statement into a preset twin network model to obtain a feature vector of the statement C, wherein the twin network model is used for vectorizing the statement;
step S103, determining a similar vector with the highest similarity with the feature vector of the statement C from a vector library;
and step S104, determining the standard sentence corresponding to the similar vector as the semantic corresponding to the sentence.
In the embodiment of the present invention, the twin network model includes a network structure first branch, a network structure second branch, a main classification task, a first branch auxiliary task, a second branch auxiliary task, and the like. In order to avoid over-fitting of the sentences in a single model, which is one of the technical drawbacks the embodiment of the present invention aims to overcome, the sentences are trained by means of twin network branches (i.e., the first branch and the second branch of the network structure are formed within the same twin network model).
In practical application, the embodiment of the invention provides a multitask twin network model to train sentence vectors on the basis of the Tiny Albert; as shown in fig. 2, a twin network model (i.e., twin Albert model) is built based on a Tiny Albert chinese pre-training model, and the training can better capture the relationship between sentences to generate a better sentence vector. The twin network can measure the similarity of two inputs, the two inputs are respectively sent into two neural networks to obtain the representation of the two inputs in a new space, and the similarity between the two inputs is finally calculated according to the Loss function.
Then a pooling layer is attached: GlobalAveragePooling1D() is used to further extract and compress the sub-vectors output by the twin network, yielding feature vectors u and v with a feature dimension of 312. Based on u and v, the model is further trained in a multitask manner. In the main classification task, because u and v are the feature vectors of sentence 1 and sentence 2 respectively and the two sentences are independent of each other, sentence 1 and sentence 2 need to be feature-fused in order to train the model more fully; the embodiment of the invention performs feature-vector fusion using the absolute value of the difference between u and v (for example, the two sentence vectors may be combined by addition, subtraction, multiplication or division). Meanwhile, in the auxiliary tasks (namely the first branch auxiliary task and the second branch auxiliary task), u and v are the respective inputs of each auxiliary task (namely, sentence vector u is input into the first branch auxiliary task and sentence vector v is input into the second branch auxiliary task).
the main classification task and the auxiliary task are respectively described below; in the main classification task, the feature vectors of the sentences 1 and 2 after feature fusion become [ u, v, | u-v | ], namely the three fused feature vectors are used as the input of a full connection layer, the dimension of the feature vectors is 936 dimensions, each feature vector is 312 dimensions, and by analogy, the three feature vectors are 936 dimensions; in order to extract more sufficient feature vector information later, a full connection layer with units of 512 dimensions is accessed, and the activation function uses a SELU activation function, where the calculation formula of the SELU activation function is:
$$\mathrm{SELU}(x) = \lambda \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases}$$
wherein x is the output value of the previous layer, λ is a constant greater than 1, and α is a constant.
After the SELU activation function, the sample distribution is automatically normalized to mean 0 and variance 1 (this self-normalization ensures that gradients neither explode nor vanish during training). It can be seen from the SELU activation function that, on the positive half axis, the variance is increased when it is too small, which prevents vanishing gradients, and on the negative half axis, the variance is decreased when it is too large, which prevents exploding gradients; the SELU activation function therefore has a fixed point. Experience shows that these conditions are met when λ = 1.0507700987, so that even deep in the network the output of each layer has mean 0 and variance 1.
In addition, in order to prevent over-fitting of the training model in practical application, an AlphaDropout layer is attached; for example, with a ratio of 0.3, 30% of the neurons are randomly discarded in each training pass (the 0.3 in AlphaDropout is called the ratio parameter). AlphaDropout has no fixed position, and its placement can be determined according to the actual situation. AlphaDropout is a Dropout layer that keeps the input mean and variance unchanged: through scaling and translation it preserves the self-normalization of the data during dropout. Dropout itself is a regularization means used in deep learning; it works as follows: in each training iteration, some units in a layer are randomly selected and temporarily hidden, and training and optimization of the neural network proceed for that iteration; in the next iteration other neurons are hidden, and so on until training ends. During training each neural unit is retained with probability p; at test time every neural unit is present and its weight parameter w is multiplied by p, becoming pw. Next, a fully connected layer with 256 units and the SELU activation function is attached, followed by an AlphaDropout layer with ratio 0.3; then a fully connected layer with 128 units and the selu activation function, followed by AlphaDropout with ratio 0.3; then a fully connected layer with 64 units and selu, followed by AlphaDropout with ratio 0.3; then a fully connected layer with 32 units and selu, followed by AlphaDropout with ratio 0.3; then a fully connected layer with 16 units and selu, followed by AlphaDropout with ratio 0.3. These layers ensure that each layer can stably and fully extract the features of the previous layer's vector. Finally, a fully connected layer with 2 units is attached as the final output, with the softmax activation function, which is characterized by output values summing to 1; its formula is:
$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{c=1}^{C} e^{z_c}}$$
where i indexes the output nodes and C is the total number of outputs.
Similarly, in the auxiliary tasks, the outputs u and v are 312-dimensional, i.e., the dimension of each sentence vector is 312; after global pooling, a fully connected layer is attached, the activation function is softmax, and the final classification output has 2662 classes.
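To make the architecture just described easier to follow, the following is a minimal Keras sketch of such a multitask twin model: a shared encoder (assumed here to be a Tiny ALBERT-style layer passed in as `encoder`, which is not defined in the original), GlobalAveragePooling1D pooling to the 312-dimensional vectors u and v, fusion as [u, v, |u-v|], the 512-256-128-64-32-16 SELU/AlphaDropout stack for the main 2-way classification task, and two 2662-way auxiliary heads. It is a sketch under these assumptions, not the exact model of the embodiment.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_twin_multitask_model(encoder, max_len=64, num_labels=2662):
    """encoder: a shared, pretrained sentence encoder (e.g. Tiny ALBERT) that maps
    token ids of shape (batch, max_len) to hidden states (batch, max_len, 312)."""
    ids_a = layers.Input(shape=(max_len,), dtype="int32", name="sentence_a")
    ids_b = layers.Input(shape=(max_len,), dtype="int32", name="sentence_b")

    pool = layers.GlobalAveragePooling1D()
    u = pool(encoder(ids_a))   # 312-dim sentence vector of sentence 1
    v = pool(encoder(ids_b))   # 312-dim sentence vector of sentence 2 (shared weights)

    # Feature fusion for the main classification task: [u, v, |u - v|] -> 936 dims
    abs_diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([u, v])
    x = layers.Concatenate()([u, v, abs_diff])

    # SELU + AlphaDropout stack, as described in the text
    for units in (512, 256, 128, 64, 32, 16):
        x = layers.Dense(units, activation="selu")(x)
        x = layers.AlphaDropout(0.3)(x)
    main_out = layers.Dense(2, activation="softmax", name="main_task")(x)

    # Auxiliary tasks: classify each sentence into one of the 2662 three-level labels
    aux_a = layers.Dense(num_labels, activation="softmax", name="aux_task_a")(u)
    aux_b = layers.Dense(num_labels, activation="softmax", name="aux_task_b")(v)

    return Model(inputs=[ids_a, ids_b], outputs=[aux_a, main_out, aux_b])
```

Because the same encoder object is applied to both inputs, the two branches share weights, which is what makes the structure a twin network.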
In practical application, the main classification task is responsible for classifying whether two sentences are similar. If there were only the main classification task, a small number of training rounds would quickly reach a high evaluation index on the test set, for example an F1 close to 1 for the input sentences, but the model would in fact not be trained well, and increasing the number of training rounds would not help, because the evaluation index is already close to 1 and, from the model's point of view, training is finished. In the actual online test, however, because the sentence vector of each sentence has not been trained sufficiently, the semantics represented by the sentence vectors can only distinguish whether two sentences are similar under these under-trained conditions, and the final generalization ability is very poor. Therefore, the technical scheme adopted by the embodiment of the invention adds auxiliary tasks on the network structure branches, which effectively avoids insufficient training of each sentence's sentence vector: while the main classification task runs, the auxiliary tasks must also identify the classification of each sentence, so the three tasks are trained jointly through the custom loss design. Through the back-propagation of gradients, the rich semantics of each sentence can be effectively learned, so that the sentence-vector representation of each sentence is more reasonable and the generalization ability is stronger; the effect of the model is therefore better, and the generalization ability of the technical scheme adopted by the embodiment of the invention is stronger.
The twin network model also involves a most important step, namely the design of the loss function: the loss-function layer customized by the embodiment of the invention is attached as the last layer (see fig. 2) of the three tasks, namely the main classification task (referred to as the main task), the first branch auxiliary task and the second branch auxiliary task.
Previously, i.e., in the prior art, cross entropy was typically chosen as the loss function for classification. However, this prior-art approach has several disadvantages, one of which is that the classification becomes too absolute: even when the input is noisy, the classification result is pushed to almost exactly 1 or 0, which usually creates a risk of over-fitting and also makes it hard, in practical application, to determine confidence intervals and set thresholds. In many cases we do not want the classification to fit so hard, and modifying the loss is one of the means to achieve this. If the loss is not modified, cross entropy is used to fit a one-hot distribution. The formula of the cross entropy is:

$$\mathrm{loss} = -\sum_{c} y_c \log p_c$$

For example, if the output is $[z_1, z_2, z_3]$ and the target is $[1, 0, 0]$, the loss is

$$-\log \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}$$

As long as $z_1$ is the largest of $[z_1, z_2, z_3]$, the model can always drive the loss down by increasing the training parameters so that $[z_1, z_2, z_3]$ is scaled up by a sufficiently large factor (equivalently, by increasing the module length of the vector $[z_1, z_2, z_3]$), which makes

$$\frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}$$

close enough to 1 (equivalently, the loss close enough to 0). This is the reason why the softmax function generally over-fits; as shown above, the loss can be reduced simply by blindly increasing the module length. Therefore, in order to make the classification less self-confident, one scheme is to fit the one-hot distribution only partially and to fit a uniform distribution at a certain ratio. A large number of experiments by the inventors of the embodiment of the present invention show that the loss can then be changed into the following (i.e., the loss functions of the first branch of the network structure, the second branch of the network structure, the main classification task, the first branch auxiliary task and the second branch auxiliary task are designed as a custom loss), with the formula:

[Custom loss formula (image in the original publication)]

where $z_i$ is the output of the twin network model ($i = 1$ is the first branch auxiliary task output, $i = 2$ is the main classification task output, $i = 3$ is the second branch auxiliary task output),

[auxiliary definition (image in the original publication)]

σ is a hyperparameter of the twin network model, w is the weight matrix of the twin network model, and ε = 0.1. Under this loss, blindly increasing the module length so that

$$\frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3}}$$

approaches 1 is no longer the optimal solution, so the over-confidence of softmax is relieved and the accuracy of the model is satisfactory in the overall test. In the specific technical scheme of the embodiment of the invention, the loss function of the adopted model (namely the first branch of the network structure and the second branch of the network structure) is optimized into this custom loss, and this loss-function design ensures the accuracy of the first branch and the second branch of the network structure (collectively called the twin network).
It should be noted that the loss functions of the main classification task (referred to as the main task), the first branch auxiliary task and the second branch auxiliary task have the same form; the three losses are differentiated by σ and are summed into the total loss.
The loss of the twin network (i.e., the loss function in fig. 2) is well designed, but the weights of the different tasks have a great influence on the result, and a poor weight setting may make the result worse. Therefore, in practical application, the covariance uncertainty is used as the multitask uncertainty: the model output is compressed by the softmax function and samples are drawn from the resulting probability vector. The calculation formula is:

$$p(y \mid z, \sigma) = \mathrm{Softmax}\!\left(\frac{1}{\sigma^{2}}\, z\right)$$

The maximum likelihood function for the multitask is constructed as:

$$p(y_1, y_2, y_3 \mid z_1, z_2, z_3) = \prod_{i=1}^{3} \mathrm{Softmax}\!\left(\frac{1}{\sigma_i^{2}}\, z_i\right)$$

where the coefficient $\sigma^{2}$ determines how flat (dispersed) the distribution is and is learned automatically by the model. $\sigma^{2}$ is related to the uncertainty of the distribution, and its log-likelihood function can be written as:

$$\log p(y = c \mid z, \sigma) = \frac{1}{\sigma^{2}}\, z_c - \log \sum_{c'} \exp\!\left(\frac{1}{\sigma^{2}}\, z_{c'}\right)$$

From the above derivation:

[intermediate derivation steps (formula images in the original publication)]

the log-likelihood function is organized into the custom loss given above, where $z_i$ is the output of the twin network model ($i = 1$ is the first branch auxiliary task output, $i = 2$ is the main classification task output, $i = 3$ is the second branch auxiliary task output), σ is a hyperparameter of the twin network model, w is the weight matrix of the twin network model, and ε = 0.1. This is the whole calculation process of the twin network. On the training set and the test set, the evaluation indices of the model, accuracy and sentence F1, are 0.99, 0.99 and 1, 0.99 respectively.
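The custom loss itself appears only as a formula image in the original, so the following should be read as one plausible sketch of how the two ingredients that the text does describe could be combined: label smoothing with ε = 0.1 (mixing the one-hot target with a uniform distribution) and a learned per-task σ that weighs the three task losses as in the uncertainty derivation above. All names, and the exact composition, are assumptions rather than the patented formula.

```python
import tensorflow as tf

NUM_TASKS = 3          # i = 1: first branch aux task, i = 2: main task, i = 3: second branch aux task
EPSILON = 0.1          # label-smoothing ratio from the description

# One learnable log(sigma^2) per task, in the spirit of uncertainty-based task weighting.
log_sigma2 = tf.Variable(tf.zeros([NUM_TASKS]), trainable=True, name="log_sigma2")

def smoothed_cross_entropy(y_true, logits, num_classes):
    """Cross entropy against a one-hot target mixed with the uniform distribution."""
    y_smooth = (1.0 - EPSILON) * y_true + EPSILON / num_classes
    log_p = tf.nn.log_softmax(logits, axis=-1)
    return -tf.reduce_sum(y_smooth * log_p, axis=-1)

def multitask_loss(y_trues, logits_list, num_classes_list):
    """Sum of per-task smoothed cross entropies, each scaled by 1/sigma_i^2,
    plus a log(sigma_i) term so sigma cannot grow without bound."""
    total = 0.0
    for i in range(NUM_TASKS):
        ce = tf.reduce_mean(
            smoothed_cross_entropy(y_trues[i], logits_list[i], num_classes_list[i]))
        precision = tf.exp(-log_sigma2[i])          # 1 / sigma_i^2
        total += precision * ce + 0.5 * log_sigma2[i]
    return total
```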
The sentence vectors are trained through the twin network model provided by the embodiment of the invention, the sentence vectors are extracted through the GlobalAveragePooling1D() layer, and the extracted sentence vectors are inserted into a Milvus library (a vector library; an open-source similarity search engine for massive feature vectors).
The operation steps are as follows:
step 100: firstly, standard questions and similar questions in the knowledge points are output and processed, and the standard questions and the similar questions are stored in a database after duplication is removed.
Step 200: the twin network model is then invoked, and the sentences in the knowledge base are predicted using the GlobalAveragePooling1D() layer, where each sentence vector has 312 dimensions.
Finally, each extracted sentence vector is inserted into the Milvus library. The Milvus library is a vector search engine; a search over hundreds of millions of entries can be completed within a few milliseconds. The statement for inserting into the Milvus library is milvus.insert(collection_name, records, ids), where collection_name is the name of the Milvus collection to insert into, records are the vectors, and ids are the numbers corresponding to the vectors. This step prepares for the subsequent search of the Milvus library: computing the similarity of sentence vectors and returning the knowledge-base sentence vector most similar to the user's question.
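The two preparation steps above might be wired together roughly as follows. The sketch assumes a pymilvus 1.x style client whose insert call matches the statement quoted above (milvus.insert(collection_name, records, ids)) and a helper extract_sentence_vectors that runs the trained twin model up to the GlobalAveragePooling1D layer; names other than the insert call itself are illustrative.

```python
from milvus import Milvus  # pymilvus 1.x style client, as assumed above

def index_knowledge_base(milvus: Milvus, collection_name: str,
                         sentences: list, extract_sentence_vectors):
    """Step 100/200: de-duplicate standard and similar questions, predict their
    312-dim sentence vectors, and insert them into the Milvus library."""
    unique_sentences = list(dict.fromkeys(sentences))          # de-duplicate, keep order
    records = [extract_sentence_vectors(s).tolist() for s in unique_sentences]
    ids = list(range(len(records)))                            # id-to-sentence mapping kept outside Milvus
    status, inserted_ids = milvus.insert(collection_name, records, ids)
    return dict(zip(inserted_ids, unique_sentences)), status
```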
This completes the establishment of the twin network model; the preset condition includes the specific condition that the vector difference is within a preset difference range. As shown in fig. 4, based on step S102, determining the similar vector with the highest similarity to the feature vector of sentence C from the vector library includes:
step S201, calculating the similarity between each sentence vector and the sentence vector through a cosine similarity formula, wherein the sentence vectors are stored in the vector library;
step S202, if the similarity between the sentence vector and the feature vector of the sentence C is greater than a preset threshold, determining that the vector difference between the sentence vector and the feature vector of the sentence C is within a preset difference range;
in step S203, if the similarity between the sentence vector and the sentence vector is smaller than a preset threshold, it is determined that the vector difference between the sentence vector and the sentence vector is not within a preset difference range.
Step S204, determining a similarity vector with the highest similarity to the sentence vector from the sentence vectors within the preset difference range.
That is, it is calculated whether the vector difference between each sentence vector in the second twin network branch corresponding to the first twin network branch and the sentence vector of the sentence is within the preset difference range.
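Read together, steps S201 to S204 amount to a threshold filter on cosine similarity followed by an arg-max. A minimal sketch (the preset threshold value of 0.8 and the helper names are assumptions for illustration):

```python
import numpy as np

def most_similar_vector(query_vec: np.ndarray,
                        library_vecs: np.ndarray,
                        preset_threshold: float = 0.8):
    """S201: cosine similarity against every stored sentence vector;
    S202/S203: keep only vectors whose similarity exceeds the preset threshold
    (i.e. whose vector difference is within the preset difference range);
    S204: return the index and score of the most similar surviving vector."""
    norms = np.linalg.norm(library_vecs, axis=1) * np.linalg.norm(query_vec)
    sims = library_vecs @ query_vec / norms
    candidates = np.where(sims > preset_threshold)[0]
    if candidates.size == 0:
        return None, None                      # nothing inside the preset difference range
    best = candidates[np.argmax(sims[candidates])]
    return int(best), float(sims[best])
```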
In the process of actually identifying the question of the user and responding, the implementation flow of the twin network model is as shown in fig. 5:
For a user's question, vector extraction is performed on the sentence through the trained twin network using the GlobalAveragePooling1D() layer (the sentence vector at the GlobalAveragePooling1D layer, referred to as u for short); then the cosine similarity between u and each vector in the Milvus library is calculated, with the formula:

$$\mathrm{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$

This is computed in turn against the vectors in the Milvus library, and the top10 with the highest scores are returned. Verification shows that, after de-duplication of the 100,000+ knowledge entries in the question-answering robot adopted in the embodiment of the invention, vector search with the Milvus library takes only milliseconds and needs only one statement, namely milvus.search(collection_name=collection_name, query_records=query_records, top_k=10, params=search_param), where search_param is a configuration parameter of the Milvus library, query_records are the query vectors, top_k=10 means the 10 highest-scoring sentences are returned, and collection_name is the name of the Milvus collection. (After the sentence vector is extracted, similarity is calculated one by one against the knowledge in the Milvus library; the similarity measure adopted in the embodiment of the invention is cosine similarity, and the top10 most similar to the user's question are returned. See fig. 3: sentence 1 is the sentence vector at the GlobalAveragePooling1D level, referred to as u.) The cosine similarity is calculated using u and each vector in the Milvus library with the formula above. As shown in fig. 3, Cosine-sim(u, v) denotes the cosine similarity calculation.
In addition, in practical application, the training process of the first branch of the network structure mainly includes inputting the obtained multiple sentences and the labels corresponding to each sentence into the first branch of the network structure for training, so as to train the hidden labels and sentence vectors to which each sentence belongs, wherein the hidden labels are used for the machine to classify the sentences, and the sentence vectors can include multiple dimensions. In step S102, before calculating whether a vector difference between each sentence vector in the second branch of the network structure corresponding to the first branch of the network structure and the sentence vector of the sentence is within a preset difference range, the method further includes: extracting an implicit label corresponding to the sentence from a pre-trained first branch of the network structure; and screening out machine sentences belonging to the range of the hidden labels from the second branch of the network structure corresponding to the first branch of the network structure.
In general, the label of sentence 1 (S1_label) and the label of sentence 2 (S2_label) cannot be expressed well on their own. Therefore, from sentence 1 (S1) and the sentence 1 label (S1_label), the hidden label S1_label_h1 to which the sentence belongs is trained, and from sentence 2 (S2) and the sentence 2 label (S2_label), the hidden label S2_label_h2 to which the sentence belongs is trained. The purpose of doing so is to let the corresponding auxiliary tasks participate in the convergence of the model through the comparability of the hidden labels, thereby improving the accuracy of the model (obviously, the hidden labels adopted by the embodiment of the invention, and the design of further technical operations using them, differ greatly from the prior art). Verification shows that the technical scheme adopted by the embodiment of the invention, which extracts the hidden label corresponding to the sentence and screens out, from the second branch of the network structure corresponding to the first branch, the machine sentences within the range of that hidden label, is a new design compared with the conventional prior art.
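A rough sketch of how such a hidden label could be used to narrow the candidate set before the vector comparison is given below; the predict_hidden_label helper and the hidden_label field are assumptions, with the hidden label taken as the arg-max of the 2662-way auxiliary head.

```python
def filter_by_hidden_label(sentence_c, candidates, predict_hidden_label):
    """Extract the hidden label of the input sentence from the first branch and keep
    only stored sentences whose hidden label matches, before computing vector differences."""
    hidden = predict_hidden_label(sentence_c)      # assumed: argmax over the auxiliary classification head
    return [c for c in candidates if c["hidden_label"] == hidden]
```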
Specifically, earlier tests showed that, for a user question, using the top1 returned by the Milvus library directly as the result gives an accuracy about 10 percentage points lower than using the top10; therefore, the embodiment of the invention retrieves the top10 from the Milvus library and then fine-ranks the top10. As shown in fig. 6, a fine-ranking method is used, namely a Tensorflow_ranking model (provisionally called the fine-ranking model).
Tensorflow_ranking is a newly released architecture from Google and works well as a fine-ranking model. The embodiment of the invention constructs a data set for Tensorflow_ranking, with considerable effort devoted to data enhancement. The data set is constructed as follows: after the knowledge points (standard questions) of the question-answering robot are de-duplicated, a top100 is retrieved for each knowledge point through the twin network model. Within the top100, sentences that are similar questions of the knowledge point are labeled 2; from the similar questions in the knowledge base, those already labeled 2 are removed and the remaining similar questions are labeled 1 (for example, if the similar questions are A, B, C, D, E, F and A, B are the similar questions labeled 2, then A, B are removed and the remaining C, D, E, F are labeled 1); finally, 8 sentences selected from the top100 are labeled 0. These data are then used to train the Tensorflow_ranking model, which re-ranks the returned top10; the top1 after re-ranking is matched against the question-answer library and the answer corresponding to that top1 is returned to the user.
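The labelling scheme just described (2 for the knowledge point's similar questions found in the top100, 1 for its remaining similar questions from the knowledge base, 0 for unrelated top100 entries) can be sketched as a small dataset-construction helper. The sample size of 8 follows the description; everything else, including the function name, is an assumption.

```python
import random

def build_ranking_samples(standard_question, similar_questions, top100):
    """Produce (query, candidate, relevance) triples for the fine-ranking model:
    label 2: similar questions that also appear in the retrieved top100;
    label 1: the remaining similar questions from the knowledge base;
    label 0: 8 sentences sampled from the top100 that are not similar questions."""
    similar_in_top = [q for q in similar_questions if q in top100]
    remaining_similar = [q for q in similar_questions if q not in similar_in_top]
    unrelated = [q for q in top100 if q not in similar_questions]

    samples = [(standard_question, q, 2) for q in similar_in_top]
    samples += [(standard_question, q, 1) for q in remaining_similar]
    samples += [(standard_question, q, 0)
                for q in random.sample(unrelated, min(8, len(unrelated)))]
    return samples
```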
While the necessity of de-duplication in the above example is generally reflected in the data augmentation of the training text, the embodiment of the present invention also provides a data-enhancement method, as follows. The question-answering robot of the embodiment of the invention (namely the Yuxueyuan App) has about 19,000 common knowledge-point standard questions and more than 80,000 similar questions corresponding to them, more than 100,000 items in total. For example, for the standard question "my baby looks really nice", the similar questions are "the baby looks nice | my baby is really beautiful | my baby is really lovely"; this is the data in the database. Such a data volume is far from sufficient for the twin Albert multitask model. Therefore, without changing the sample distribution, the embodiment of the invention uses a Cartesian product to generate 2,000,000 samples from the 100,000+ items. The generated sample format is: sentence 1, sentence 2, the label (or combined label) of sentence 1 and sentence 2, the label of sentence 1, and the label of sentence 2. The combined label of sentence 1 and sentence 2 is 0 or 1, indicating whether the two sentences are similar (a combined label of 1 means similar, 0 means dissimilar). The labels of sentence 1 and sentence 2 are the three-level labels of the knowledge; illustratively, "child care/growth/premature infant growth" is such a label, where "child care" is the first-level label, "growth" the second-level label and "premature infant growth" the third-level label, and there are 2662 such labels. These data constitute the data set for training the model; meanwhile, in order to eliminate redundant data, similar data in the data set need to be de-duplicated. In addition, within the three-level labels, the first two levels of different third-level labels may be the same; as another possibility, in order to ensure matching accuracy, two sentences may be determined to be similar only when their three-level labels are completely identical.
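The Cartesian-product augmentation described above, producing samples of the form (sentence 1, sentence 2, combined label, label of sentence 1, label of sentence 2), might be sketched as follows. The rule used here for the combined label (1 only when the two three-level labels are identical, otherwise 0) is one of the possibilities mentioned in the text and is an assumption.

```python
from itertools import product

def cartesian_pairs(sentences):
    """sentences: list of (text, three_level_label) tuples, e.g.
    ("my baby is really lovely", "child care/growth/premature infant growth").
    Returns (s1, s2, combined_label, label1, label2) rows, de-duplicated."""
    rows = set()
    for (s1, l1), (s2, l2) in product(sentences, repeat=2):
        if s1 == s2:
            continue
        combined = 1 if l1 == l2 else 0        # similar only when the three-level labels match
        rows.add((s1, s2, combined, l1, l2))
    return list(rows)
```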
Specifically, in practical application, for the question "how long should the baby's afternoon nap be", the top100 returned by Milvus is [ "how long is it suitable for the baby to nap in the afternoon", "how long does the baby need to sleep at noon", ……, "the baby likes to eat sugar", "the baby always loves to cry" ], and the similar questions are [ "how long is it suitable for the baby to nap in the afternoon", "how long does the baby need to sleep at noon", "how long an afternoon nap is most reasonable for the baby's body" ]. Thus label 2 is [ "how long is it suitable for the baby to nap in the afternoon", "how long does the baby need to sleep at noon" ], label 1 is [ "how long an afternoon nap is most reasonable for the baby's body" ], and label 0 is [ "the baby likes to eat sugar", "the baby always loves to cry" ]. Among the results returned by Milvus, the larger the rank n, the lower the relevance of the returned question to the user's question.
In practical application, if the data entries expanded into the training set become broader later, label designs with more than three levels may be involved; the specific setting can be determined according to the actual situation, and the embodiment of the present invention does not specifically limit this.
In conclusion, the inventors found that the conventional method uses an Albert model, inputs the sentences into the model and then performs feature extraction, so that the embedding calculated for each sentence simultaneously contains information about the sentence pair and cannot distinguish the two sentences well. With the twin network, the two sentences of a pair are input into separate branches, which guarantees the independence of each sentence, so that the embeddings extracted from the model are mutually independent; a model-fusion method is then used to fuse the information of each sentence. This has the advantages of guaranteeing the independence of each sentence while still containing the fused information of the sentence pair, and the calculation output of the model is better.
Modeling in a multi-task manner (main classification task + auxiliary tasks) effectively increases the difficulty of learning for the model: a longer time and a more complex network are needed to learn the embedding information of each sentence, but once the model is trained, the information learned for each sentence generalizes better, which enhances the robustness of the model. For example, if there were only a single similarity task, the learned embeddings would only capture whether the two sentences s1 and s2 are similar or dissimilar; the model only needs to ensure that this is learned, and which labels s1 and s2 each belong to is not its concern. In that case, even with insufficient training of each sentence, the goal of judging the sentence pair dissimilar can still be reached while the embedding of each sentence is in fact not well trained, which leads to poor generalization. The multitask method of the embodiment of the invention, in contrast, learns not only whether two sentences are similar but also the label category to which each sentence belongs, so the embedding of each sentence contains richer sentence information and the generalization ability of the model is stronger.
One of the core technologies of the technical scheme of the embodiment of the invention is as follows: the embedding calculated by each sentence is ensured to be independent, and meanwhile richer sentence information is ensured to be contained.
In another embodiment of the present invention, there is also provided a semantic determination apparatus including:
the acquisition module is used for acquiring a statement C of a to-be-determined semantic;
the extraction module is used for inputting the statement into a preset twin network model to obtain a feature vector of the statement C, wherein the twin network model is used for vectorizing the statement;
the calculation module is used for determining a similar vector with the highest similarity with the feature vector of the statement C from the vector library;
and the sequencing module is used for determining the standard sentences corresponding to the similar vectors as the semantemes corresponding to the sentences.
In still another embodiment of the present invention, there is also provided an electronic apparatus including: the system comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the semantic determination method in the embodiment of the method when executing the program stored in the memory.
In the electronic device provided by the embodiment of the invention, the processor obtains the sentence whose semantics are to be determined; a multitask twin network model is trained on the preprocessed corpus, and all sentence vectors corresponding to the corpus are obtained through the trained model. The sentence vector corresponding to the sentence is extracted from the trained model, it is calculated whether the vector difference between each stored sentence vector and this sentence vector is within a preset difference range, the sentence vectors meeting the preset condition are ranked by similarity, the sentence vector with the highest similarity is taken as the matched semantics of the sentence, and the sentence is thereby matched to a specific standard sentence.
The communication bus 1140 mentioned in the above electronic device may be a Serial Peripheral Interface (SPI) bus, an integrated circuit (ICC) bus, or the like. The communication bus 1140 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 7, but this is not intended to represent only one bus or type of bus.
The communication interface 1120 is used for communication between the electronic device and other devices.
The memory 1130 may include a Random Access Memory (RAM), and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor 1110 may be a general-purpose processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the integrated circuit may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A method for semantic determination, the method comprising:
obtaining a statement C of a to-be-determined semantic;
inputting the statement into a preset twin network model to obtain a feature vector of the statement C, wherein the twin network model is used for vectorizing the statement;
determining a similar vector with the highest similarity with the feature vector of the statement C from a vector library;
determining a standard sentence corresponding to the similar vector as a semantic corresponding to the sentence;
the twin network model comprises a network structure first branch, a network structure second branch, a main classification task, a first branch auxiliary task and a second branch auxiliary task, and the training process of the twin network model comprises the following steps:
obtaining sentences A and B from a preset training corpus;
inputting the sentence A and the sentence B into the twin network model, so that the first branch of the network structure determines a feature vector of the sentence A, the second branch of the network structure determines a feature vector of the sentence B, and the twin network model performs feature fusion on the feature vector of the sentence A and the feature vector of the sentence B to obtain a fusion vector, and then determining a main classification task of the twin network model for vectorizing the sentence to be processed based on the feature vector of the sentence A, the feature vector of the sentence B and the fusion vector;
the loss functions of the first branch of the network structure, the second branch of the network structure, the main classification task, the first branch auxiliary task and the second branch auxiliary task are custom losses, wherein the formula of the loss functions is as follows:
[Custom loss formula (image in the original publication)]

wherein z_i is the output of the twin network model (i = 1 is the first branch auxiliary task output, i = 2 is the main classification task output, i = 3 is the second branch auxiliary task output),

[auxiliary definition (image in the original publication)]

and σ is a hyperparameter of the twin network model, w is the weight matrix of the twin network model, and ε = 0.1.
2. The semantic determination method according to claim 1, wherein the determining a similarity vector with the highest similarity to the feature vector of statement C from the vector library comprises:
calculating the similarity of each sentence vector and the sentence vector through a cosine similarity formula, wherein the sentence vectors are stored in the vector library;
if the similarity between the sentence vector and the feature vector of the sentence C is greater than a preset threshold value, determining that the vector difference value between the sentence vector and the feature vector of the sentence C is within a preset difference value range;
and determining a similar vector with the highest similarity to the sentence vector from the sentence vectors in the preset difference range.
3. The semantic determination method according to claim 2, further comprising: and if the similarity between the sentence vector and the feature vector of the sentence C is smaller than a preset threshold value, determining that the vector difference value between the sentence vector and the feature vector of the sentence C is not in a preset difference value range.
4. The semantic determination method according to claim 2, wherein the training process of the twin network model further comprises:
inputting the obtained sentences and the labels corresponding to the sentences into a twin network model for training so as to train hidden labels and sentence vectors to which the sentences belong, wherein the hidden labels are used for machine classification of the sentences, and the sentence vectors can comprise a plurality of dimensions.
5. The semantic determination method according to claim 4, further comprising, before calculating whether the vector difference value of the vector and the sentence vector is within a preset difference value range:
extracting an implicit label corresponding to the statement from a pre-trained first branch of the network structure;
and screening out the sentences belonging to the range of the hidden labels from the second branch of the network structure corresponding to the first branch of the network structure.
6. The semantic determination method according to claim 2, characterized in that the network structure second branch has the same weight as the network structure first branch.
CN202110398762.1A 2021-04-14 2021-04-14 Semantic determination method Active CN112800777B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110398762.1A CN112800777B (en) 2021-04-14 2021-04-14 Semantic determination method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110398762.1A CN112800777B (en) 2021-04-14 2021-04-14 Semantic determination method

Publications (2)

Publication Number Publication Date
CN112800777A CN112800777A (en) 2021-05-14
CN112800777B true CN112800777B (en) 2021-07-30

Family

ID=75811384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110398762.1A Active CN112800777B (en) 2021-04-14 2021-04-14 Semantic determination method

Country Status (1)

Country Link
CN (1) CN112800777B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435182A (en) * 2021-07-21 2021-09-24 唯品会(广州)软件有限公司 Method, device and equipment for detecting conflict of classification labels in natural language processing
CN114723008A (en) * 2022-04-01 2022-07-08 北京健康之家科技有限公司 Language representation model training method, device, equipment, medium and user response method
CN116629346B (en) * 2023-07-24 2023-10-20 成都云栈科技有限公司 Language model training method and device
CN117827014B (en) * 2024-03-05 2024-06-04 四川物通科技有限公司 Digital twin model multi-person interaction collaboration system based on meta universe

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108519890A (en) * 2018-04-08 2018-09-11 武汉大学 A kind of robustness code abstraction generating method based on from attention mechanism
CN110543551A (en) * 2019-09-04 2019-12-06 北京香侬慧语科技有限责任公司 question and statement processing method and device
CN111178084A (en) * 2019-12-26 2020-05-19 厦门快商通科技股份有限公司 Training method and device for improving semantic similarity
CN111723572A (en) * 2020-06-12 2020-09-29 广西师范大学 Chinese short text correlation measurement method based on CNN convolutional layer and BilSTM
CN111737954A (en) * 2020-06-12 2020-10-02 百度在线网络技术(北京)有限公司 Text similarity determination method, device, equipment and medium
CN111859986A (en) * 2020-07-27 2020-10-30 中国平安人寿保险股份有限公司 Semantic matching method, device, equipment and medium based on multitask twin network
CN111859960A (en) * 2020-07-27 2020-10-30 中国平安人寿保险股份有限公司 Semantic matching method and device based on knowledge distillation, computer equipment and medium
CN111859988A (en) * 2020-07-28 2020-10-30 阳光保险集团股份有限公司 Semantic similarity evaluation method and device and computer-readable storage medium
CN112417894A (en) * 2020-12-10 2021-02-26 上海方立数码科技有限公司 Conversation intention identification method and system based on multi-task learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871451B (en) * 2019-01-25 2021-03-19 中译语通科技股份有限公司 Method and system for extracting relation of dynamic word vectors
CN112287688B (en) * 2020-09-17 2022-02-11 昆明理工大学 English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108519890A (en) * 2018-04-08 2018-09-11 武汉大学 A kind of robustness code abstraction generating method based on from attention mechanism
CN110543551A (en) * 2019-09-04 2019-12-06 北京香侬慧语科技有限责任公司 question and statement processing method and device
CN111178084A (en) * 2019-12-26 2020-05-19 厦门快商通科技股份有限公司 Training method and device for improving semantic similarity
CN111723572A (en) * 2020-06-12 2020-09-29 广西师范大学 Chinese short text correlation measurement method based on CNN convolutional layer and BilSTM
CN111737954A (en) * 2020-06-12 2020-10-02 百度在线网络技术(北京)有限公司 Text similarity determination method, device, equipment and medium
CN111859986A (en) * 2020-07-27 2020-10-30 中国平安人寿保险股份有限公司 Semantic matching method, device, equipment and medium based on multitask twin network
CN111859960A (en) * 2020-07-27 2020-10-30 中国平安人寿保险股份有限公司 Semantic matching method and device based on knowledge distillation, computer equipment and medium
CN111859988A (en) * 2020-07-28 2020-10-30 阳光保险集团股份有限公司 Semantic similarity evaluation method and device and computer-readable storage medium
CN112417894A (en) * 2020-12-10 2021-02-26 上海方立数码科技有限公司 Conversation intention identification method and system based on multi-task learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks;Hyunjin Choi etal.;《2020 25th International Conference on Pattern Recognition (ICPR)》;20210115;第5482-5487页 *
Research on Chinese Semantic Matching Algorithms Based on Siamese Networks; Zhao Yuan; China Master's Theses Full-text Database (Electronic Journal); 2021-01-31; pp. I138-2511 *

Also Published As

Publication number Publication date
CN112800777A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN112800777B (en) Semantic determination method
CN117033608B (en) Knowledge graph generation type question-answering method and system based on large language model
CN110442718B (en) Statement processing method and device, server and storage medium
CN110427463B (en) Search statement response method and device, server and storage medium
CN111159416A (en) Language task model training method and device, electronic equipment and storage medium
CN112650886B (en) Cross-modal video time retrieval method based on cross-modal dynamic convolution network
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN111767385A (en) Intelligent question and answer method and device
CN114428850B (en) Text retrieval matching method and system
CN114691864A (en) Text classification model training method and device and text classification method and device
CN114661872A (en) Beginner-oriented API self-adaptive recommendation method and system
Parvathi et al. Identifying relevant text from text document using deep learning
CN114239599A (en) Method, system, equipment and medium for realizing machine reading understanding
CN112148994A (en) Information push effect evaluation method and device, electronic equipment and storage medium
CN111859955A (en) Public opinion data analysis model based on deep learning
CN117216617A (en) Text classification model training method, device, computer equipment and storage medium
CN116595170A (en) Medical text classification method based on soft prompt
CN114496231B (en) Knowledge graph-based constitution identification method, device, equipment and storage medium
CN115600595A (en) Entity relationship extraction method, system, equipment and readable storage medium
CN114036289A (en) Intention identification method, device, equipment and medium
Su et al. Automatic ontology population using deep learning for triple extraction
CN113821610A (en) Information matching method, device, equipment and storage medium
CN111858885A (en) Keyword separation user question intention identification method
CN110569331A (en) Context-based relevance prediction method and device and storage equipment
CN117556275B (en) Correlation model data processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant