CN112269856B - Text similarity calculation method and device, electronic equipment and storage medium
- Publication number
- CN112269856B CN112269856B CN202011010599.9A CN202011010599A CN112269856B CN 112269856 B CN112269856 B CN 112269856B CN 202011010599 A CN202011010599 A CN 202011010599A CN 112269856 B CN112269856 B CN 112269856B
- Authority
- CN
- China
- Prior art keywords
- word
- vector
- text
- attention
- fusion
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
Embodiments of the invention provide a text similarity calculation method and device, an electronic device, and a storage medium. The method comprises the following steps: obtaining word embedding vectors of a target text from the target text; inputting the word embedding vectors of the target text into a pre-trained hotness fusion Transformer model to obtain feature vectors of the words in the target text; and calculating the similarity of the target text according to the feature vectors of the words in the target text. The method, device, electronic device, and storage medium improve the existing Transformer model by incorporating access hotness when the Attention is calculated, so that the text similarity calculation result is more accurate.
Description
Technical Field
The present invention relates to the field of intelligent recognition technologies, and in particular, to a text similarity calculation method and apparatus, an electronic device, and a storage medium.
Background
Calculating text similarity is a hot topic in the field of artificial intelligence. Prior-art text similarity calculation methods mainly use deep learning to perform supervised or unsupervised training on a corpus, extract feature information of the texts, and finally calculate the cosine distance between text features to obtain the similarity between the texts.
There are various ways to extract text feature information with deep learning, mainly including: feature extraction based on CNN (Convolutional Neural Network), feature extraction based on RNN (Recurrent Neural Network), and feature extraction based on the Transformer model.
The Transformer model is an NLP model proposed by Google, based on an auto-encoder and the Attention mechanism. The Transformer model encodes and decodes words through the Attention mechanism, computes for each word its relevance to the other words in the text sequence, and uses the resulting similarity as the text feature. Compared with RNN and CNN, the Transformer model does not require labeled data (it is unsupervised), can consider the relationships among all words of the whole text sequence, and at the same time the auto-encoder mechanism it uses lends itself to parallel computation, giving higher performance.
The Transformer model has been widely applied with good results, but problems remain in certain application scenarios. Its disadvantages include:
the text features extracted by the existing Transformer model are based on the corpus itself and ignore the access hotness information generated while the corpus is in use, so the calculated text features are incomplete and the similarity calculation is inaccurate.
Disclosure of Invention
Aiming at the problems existing in the prior art, the embodiment of the invention provides a text similarity calculation method, a text similarity calculation device, electronic equipment and a storage medium.
An embodiment of a first aspect of the present invention provides a text similarity calculation method, including:
obtaining word embedding vectors of the target text according to the target text;
inputting the word embedding vectors of the target text into a pre-trained hotness fusion Transformer model to obtain feature vectors of the words in the target text; the feature vectors of the words reflect both the text similarity between words and the hotness difference between words;
calculating the similarity of the target text according to the feature vector of the word in the target text; wherein,
the hotness fusion Transformer model is obtained by training based on word embedding vectors of sample texts and word trend vectors of the sample texts; the hotness fusion Transformer model is obtained by replacing the self-attention layer in the Transformer model with a fusion attention layer and arranging a convolution layer between the fusion attention layers;
the fusion attention layer is used for calculating the attention of the word according to the self-attention and the heat attention of the word; the word trend vector is a vector for describing the degree of association between words, which is obtained from the text similarity between words and the heat difference between words.
In an alternative embodiment, before the step of calculating the similarity of the target text from the feature vectors of the words in the target text, the method further comprises:
calculating an estimated value of the hotword probability according to the hotness probability of the words in the target text;
taking the estimated value of the hotword probability as a threshold value, and dividing words in the target text into hotwords and non-hotwords according to the threshold value;
and mapping the feature vector of the non-hotword to a preset value.
In an alternative embodiment, obtaining a word embedding vector of the target text according to the target text includes:
obtaining a text vector, a word position vector and a word heat vector of the target text according to the target text;
obtaining a first word embedded vector of the target text according to the text vector of the target text;
inputting the first word embedding vector, the word position vector and the word heat vector of the target text into a pre-trained word fusion model to obtain a second word embedding vector that simultaneously fuses word position information and word heat information, and taking the second word embedding vector as the word embedding vector of the target text; wherein,
the word fusion model is obtained by training a first word embedding vector, a word position vector, a word heat vector and a word trend vector based on the sample text; the first word embedding vector is a vector for reflecting semantic relevance of words.
In an alternative embodiment, the fused attention layer comprises: a weight proportion setting layer and an attention calculating layer; wherein,
the weight proportion setting layer is used for setting a weight proportion for the hotness attention of words; the weight proportion is obtained based on a verification ratio, and the verification ratio is determined according to the layer number of the fusion attention layer in the hotness fusion Transformer model; the higher the layer number of the fusion attention layer in the hotness fusion Transformer model, the lower the value of the verification ratio;
the attention calculation layer is used for calculating the attention of a word according to the word's self-attention and its hotness attention with the weight proportion applied.
In an alternative embodiment, the number of convolution kernel steps of the convolution layer is determined according to the layer number of the convolution layer in the hotness fusion Transformer model; the higher the layer number of the convolution layer in the hotness fusion Transformer model, the greater the number of convolution kernel steps.
In an alternative embodiment, the method further comprises:
obtaining a word embedding vector and a word trend vector of the sample text according to the sample text;
and training, in a machine learning manner, with the word embedding vectors of the sample text as training input data and the word trend vectors of the sample text as training labels, to obtain a hotness fusion Transformer model for generating the feature vectors of the words in the target text.
In an alternative embodiment, the method further comprises:
obtaining a text vector, a word position vector, a word heat vector and a word trend vector of the sample text according to the sample text;
obtaining a first word embedded vector of the sample text according to the text vector of the sample text;
and training the first word embedding vector, the word position vector and the word popularity vector of the sample text serving as input data for training, and the word tendency vector of the sample text serving as a label for training by adopting a machine learning mode to obtain a word fusion model for generating the second word embedding vector of the sample text.
An embodiment of a second aspect of the present invention provides a text similarity calculation device, including:
the word embedding vector generation module is used for obtaining a word embedding vector of the target text according to the target text;
the feature vector generation module is used for inputting the word embedding vectors of the target text into a pre-trained hotness fusion Transformer model to obtain feature vectors of the words in the target text; the feature vectors of the words reflect both the text similarity between words and the hotness difference between words;
the similarity calculation module is used for calculating the similarity of the target text according to the feature vector of the word in the target text; wherein,
the hotness fusion Transformer model is obtained by training based on word embedding vectors of sample texts and word trend vectors of the sample texts; the hotness fusion Transformer model is obtained by replacing the self-attention layer in the Transformer model with a fusion attention layer and arranging a convolution layer between the fusion attention layers;
the fusion attention layer is used for calculating the attention of the word according to the self-attention and the heat attention of the word; the word trend vector is a vector for describing the degree of association between words, which is obtained from the text similarity between words and the heat difference between words.
An embodiment of the third aspect of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the text similarity calculation method according to the embodiment of the first aspect of the present invention when the processor executes the program.
An embodiment of a fourth aspect of the present invention provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a text similarity calculation method as an embodiment of the first aspect of the present invention.
In the text similarity calculation method, device, electronic device, and storage medium, the pre-trained hotness fusion Transformer model is obtained by training on the word embedding vectors of a sample text and the word trend vectors of the sample text, where the word trend vector is a vector describing the association between words obtained from the text similarity between words and the hotness difference between words. The feature vectors of the words in the target text produced by the pre-trained hotness fusion Transformer model therefore reflect both the text similarity between words and the hotness difference between words, so access hotness information is taken into account when the similarity of the target text is calculated from these feature vectors, making the calculated text features more comprehensive and the similarity more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a text similarity calculation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the internal structure of the fusion Attention layer of the hotness fusion Transformer model in the text similarity calculation method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the encoder in the hotness fusion Transformer model in the text similarity calculation method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the decoder in the hotness fusion Transformer model in the text similarity calculation method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the overall structure of the hotness fusion Transformer model in the text similarity calculation method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a stacked multiple-input automatic encoder employed in an embodiment of the present invention;
FIG. 7 is a diagram illustrating a process for generating a second word embedding vector with both location information and heat information fused;
fig. 8 is a schematic diagram of a text similarity calculation device according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a text similarity calculation system;
FIG. 10 is a schematic diagram showing steps for implementing the services provided by the text similarity calculation system shown in FIG. 9;
Fig. 11 is a schematic physical structure of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The prior-art text similarity calculation method based on the Transformer model has been widely applied and achieves good results. Problems remain in certain application scenarios, however, and the disadvantages mainly include the following two aspects:
On the one hand, the text features extracted by the existing Transformer model are based on the corpus itself; the access hotness information generated while the corpus is in use is ignored, so the calculated text features are incomplete and the similarity calculation is inaccurate. For example, suppose the hot word "C Ro" appears in a unified search system. Different users describe similar things in different ways: "C Ro is the best player today" and "The number-one player in the world must be C Ro" actually express similar meanings, but because the words they use are mostly different, the text features calculated directly with the Transformer model will judge the two sentences to differ greatly in meaning, and a false similarity is calculated.
On the other hand, many existing methods fuse the access hotness into the word embedding matrix at word-embedding time and then directly use the Transformer model to calculate text features. This may cause the Transformer's Attention mechanism to pay excessive attention to hot words, so that two sentences containing the same keywords but with different semantics are calculated as semantically similar sentences.
In order to overcome these defects of the prior-art text similarity calculation methods, the embodiment of the invention provides a text similarity calculation method based on an improved Transformer model.
Fig. 1 is a flowchart of a text similarity calculation method provided by an embodiment of the present invention, where, as shown in fig. 1, the text similarity calculation method provided by the embodiment of the present invention includes:
and 101, obtaining a word embedding vector of the target text according to the target text.
In the embodiment of the invention, the target text refers to a text needing text similarity calculation. Since the text similarity reflects the similarity between different texts, the target text includes at least two texts.
Word embedding refers to embedding a high-dimensional space, whose dimension equals the total number of words, into a continuous vector space of much lower dimension. A word embedding vector is the vector to which a word or phrase is mapped on the real number domain.
The word embedding vector for obtaining the target text according to the target text specifically comprises the following steps:
and carrying out operations such as text formatting, word segmentation, stop word removal and the like on the target file, and obtaining a set of words contained in the target text. The text formatting, word segmentation, stop word removal and other operations can be implemented through the prior art, such as a genesim toolkit.
After the collection of words contained in the target text is obtained, the text vector, the word position vector and the word heat vector of the target text are extracted from the collection.
Since the text vector, the word position vector and the word popularity vector obtained from the target text are usually multiple, and the matrix is a common expression form of the set, in the embodiment of the present invention, the text matrix, the word position matrix and the word popularity matrix are used to represent the text vector, the word position vector and the word popularity vector obtained from the target text respectively.
The Text matrix is used to describe the words contained in the texts. One row of the text matrix represents one text, and one column holds the number, in a preset dictionary, of a word contained in that text; if a word is not included in a text, the corresponding position is filled with 0.
The word position matrix is used to describe the position of each word in the text to which it belongs, and is generally denoted Location. One row of the word position matrix represents the words belonging to the same text, and a value in the matrix gives the position of a word within its text. For example, a value of 608 in the second row of the matrix indicates that the word comes from the second text of the target text and is the 608th word in that text.
The word popularity matrix is used to describe how hot (how frequently accessed) each word is, and is generally denoted Hot. One row of the word popularity matrix represents the words belonging to the same text, and a value in the matrix gives the accessed hotness of a word.
The extraction of text matrices, word position matrices, and word popularity matrices from target text can be accomplished by prior art techniques. For example, the text matrix can be obtained by comparing the words in the target text with a preset dictionary. A word position matrix may be obtained in conjunction with the position of the word in the target text. The word heat matrix may be obtained in combination with the accessed heat of the word over a period of time.
Note that although the target text contains two or more texts, the text matrix, the word position matrix, and the word popularity matrix are extracted not from a single text but from all the texts included in the target text.
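A sketch of how the Text, Location, and Hot matrices might be assembled with NumPy; the dictionary and hotness lookups here are illustrative assumptions, not the patent's exact procedure:

```python
import numpy as np

def build_matrices(tokenized_texts, dictionary, access_counts, max_len):
    """tokenized_texts: list of word lists; dictionary: word -> number in a
    preset dictionary; access_counts: word -> accessed hotness over a period."""
    n = len(tokenized_texts)
    text_m = np.zeros((n, max_len), dtype=int)   # Text: word numbers, 0 = absent
    loc_m = np.zeros((n, max_len), dtype=int)    # Location: position in its text
    hot_m = np.zeros((n, max_len))               # Hot: accessed hotness per word
    for i, words in enumerate(tokenized_texts):
        for j, w in enumerate(words[:max_len]):
            text_m[i, j] = dictionary.get(w, 0)
            loc_m[i, j] = j + 1                  # 1-based position in the text
            hot_m[i, j] = access_counts.get(w, 0.0)
    return text_m, loc_m, hot_m
```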
After obtaining the text vector, the word position vector and the word heat vector of the target text, the word embedding vector of the target text can be further obtained.
The word embedding vector has various forms, and can be a word embedding vector fused with position information or a word embedding vector fused with position information and heat information. In the embodiment of the present invention, the specific form of the word embedding vector is not limited.
Since there are usually multiple word embedding vectors for the target text, and a matrix is a common way to express a set, the embodiment of the present invention uses a word embedding matrix to represent the word embedding vectors obtained from the target text.
If a word embedding vector fused with position information is to be generated, a first word embedding vector is obtained from the text matrix of the target text using the prior-art word2vec method. The first word embedding vector reflects preliminary semantic association features of the words, i.e., semantic relatedness. However, word2vec is a relatively simple method, and the first word embedding vector it produces does not describe semantic relatedness very accurately, so other information needs to be fused in. Fusing the first word embedding vector with the word position vector of the target text yields the word embedding vector fused with position information. This implementation is common general knowledge to a person skilled in the art, and its details are therefore not further described in the embodiments of the present invention.
In other embodiments of the present invention, a detailed description will be given of how to generate a word embedding vector in which position information and heat information are fused together.
Step 102, inputting the word embedding vectors of the target text into a pre-trained hotness fusion Transformer model to obtain the feature vectors of the words in the target text.
In the embodiment of the invention, the hotness fusion Transformer model is obtained by training based on the word embedding vectors of a sample text and the word trend vectors of the sample text; the hotness fusion Transformer model is obtained by replacing the self-attention layer in the Transformer model with a fusion attention layer and arranging a convolution layer between the fusion attention layers. The fusion attention layer is used for calculating the attention of a word according to the self-attention and the hotness attention of the word; the word trend vector is a vector describing the degree of association between words, obtained from the text similarity between words and the hotness difference between words.
Specifically, the hotness fusion Transformer model includes an encoder and a decoder.
The encoder comprises a plurality of layers; each layer comprises a fusion Attention layer and a convolution layer, except for the last layer, which comprises only the fusion Attention layer.
The decoder comprises a plurality of layers, of which the last is an output layer; each layer comprises a fusion Attention layer and a convolution layer, except for the last and penultimate layers, the penultimate layer comprising only the fusion Attention layer.
The fusion Attention layer according to the embodiment of the invention is an improvement on the Self-Attention layer of the existing Transformer model. For ease of understanding, its principle is described in detail below.
The core of the prior-art Transformer model is the Self-Attention mechanism. Its core idea is to calculate, through an Attention function, the similarity between one word and the other words in the training text, and to adjust the bias weight of each word through a multi-layer network, thereby obtaining the encoded text feature vector.
In the embodiment of the invention, an Attention mechanism fusing word access hotness is designed on the basis of the Self-Attention mechanism and a gating mechanism, and the layer corresponding to this mechanism is called the fusion Attention layer.
FIG. 2 is a schematic diagram of the internal structure of the fusion Attention layer. Referring to FIG. 2, in one embodiment, assume there are only two words in the target text, X1 and X2. Each of X1 and X2 is one row of the second feature data (e.g., the fused word embedding matrix) calculated in the previous step, i.e., X1 and X2 are vectors; each vector is input into one hidden layer node, so two hidden layer nodes appear in the figure. The hidden layer nodes are designed as follows:
a. Unlike the Self-Attention structure of the prior-art Transformer model, this embodiment adds a weight vector W_h to the conventional trainable weight matrices W, W_q, W_k, and W_v of the Transformer model; the weight vector W_h is used to represent the hotness Attention of a word. The following is first calculated for the hidden layer node:
A1 = W·X1;
Q1 = W_q·A1;
K1 = W_k·A1;
V1 = W_v·A1;
H1 = W_h·A1.
Through the above operations, the query vector Q1, the key vector K1, the value vector V1, and the heat vector H1 are obtained. Similar calculations on the other hidden layer node yield Q2, K2, V2, and H2.
b. The Attention function of the prior-art Self-Attention mechanism considers only the query vector Q1 and the key vector K1. In the fusion Attention layer, the heat vector H1 is also added when calculating the Attention, and the calculation takes the following form (taking B1 as an example):
B1 = (Q1·K1^T)/√d + β·(Q1·H1^T)/√d
where d represents the dimension of the vectors. In this formula, (Q1·K1^T)/√d is the calculation of the self-attention mechanism, and β·(Q1·H1^T)/√d is the part that incorporates the access hotness.
The idea of the Attention mechanism is to calculate similarity. In the embodiment of the invention, the query vector is multiplied by the heat vector so as to link words and hotness within the Attention mechanism, which allows the mechanism to take access hotness into account and identify the hot words that truly deserve attention. The parameter vector β is the weight proportion of the hotness Attention value, and its magnitude is calculated by a gating mechanism. The purpose of this parameter is to regulate the hotness Attention value reasonably, so that the Transformer model does not pay excessive attention to the hotness of non-hot words.
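To make the hidden-node computation concrete, a minimal NumPy sketch of one node, assuming square weight matrices and a scalar stand-in for the parameter vector β (which the gating mechanism described below actually produces as a vector):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # vector dimension
x1 = rng.normal(size=d)                  # word embedding vector X1

# Trainable weights W, W_q, W_k, W_v plus the added hotness weight W_h.
W, W_q, W_k, W_v, W_h = (rng.normal(size=(d, d)) for _ in range(5))

a1 = W @ x1
q1, k1, v1, h1 = W_q @ a1, W_k @ a1, W_v @ a1, W_h @ a1

beta = 0.5                               # weight proportion from the gating mechanism (assumed scalar)
# Self-attention part plus the access-hotness part.
b1 = (q1 @ k1) / np.sqrt(d) + beta * (q1 @ h1) / np.sqrt(d)
print(b1)
```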
As mentioned above, the parameter vector β is the weight proportion of the hotness value, and its magnitude is calculated by a gating mechanism. In the embodiment of the invention, the gating mechanism refers to using a gate to control whether, or to what degree, a hidden layer node considers access hotness when calculating the Attention.
Referring to FIG. 2, a gating vector G is set for each hidden layer node, with all values initialized to 0. A verification ratio is calculated according to the layer number of the fusion Attention layer in the model; values at random positions in G are then changed to 1 according to the verification ratio, and G is point-multiplied with A1 to obtain the vector β, whose inner product is then computed. If the inner product of β is greater than or equal to a preset threshold, β is used directly as the parameter vector of the Attention calculation; if it is smaller than the threshold, β is set to the 0 vector, i.e., the Attention is computed without considering hotness.
The verification ratio is calculated as:
P_c = (ALL_num − L_num) / (2 × ALL_num)
where P_c is the verification ratio, L_num is the layer number of the current layer, and ALL_num is the total number of layers of the Transformer encoder.
The verification ratio can be used to reject the hotness Attention of words that are insensitive to access hotness (i.e., whose access hotness is not high). When judging a word's sensitivity to access hotness, values in the word vector are selected at random through the vector β for the calculation; if the calculated value exceeds the preset threshold in probability, the word is considered sensitive to access hotness and its hotness Attention value is calculated.
From its formula, the verification ratio decreases as the layer number increases. Lower layers filter more coarsely, with a higher verification probability, to keep information complete and avoid losses; at higher layers the accuracy of the Attention values increases as the computation deepens, so filtering can be more precise and the verification probability lower. This layered scheme allows the hotness fusion Transformer model of the embodiment of the invention to calculate hotness Attention values more accurately and prevents excessive attention to word access hotness.
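A sketch of the gating computation as just described; the threshold value and the use of a uniform random draw to set positions of G to 1 are assumptions:

```python
import numpy as np

def verification_ratio(layer_num, total_layers):
    # P_c = (ALL_num - L_num) / (2 * ALL_num): decreases as the layer number grows.
    return (total_layers - layer_num) / (2 * total_layers)

def gated_beta(a1, layer_num, total_layers, threshold, rng):
    """Build the gating vector G, point-multiply with A1, and keep or zero beta."""
    p_c = verification_ratio(layer_num, total_layers)
    g = (rng.random(a1.shape) < p_c).astype(float)   # random positions of G set to 1
    beta = g * a1                                    # point multiplication with A1
    if beta @ beta < threshold:                      # inner product below threshold:
        beta = np.zeros_like(beta)                   # ignore hotness in the Attention
    return beta

rng = np.random.default_rng(0)
print(gated_beta(rng.normal(size=8), layer_num=1, total_layers=6,
                 threshold=1.0, rng=rng))
```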
Continuing with the example of FIG. 2, after the Attention values (B1 and B2 in FIG. 2) are calculated, the fusion Attention layer uses a fully-connected layer to apply a Soft-max mapping to the results B1 and B2, and then calculates the final result C1 from the value vectors V1 and V2, the heat vectors H1 and H2, and the results B1 and B2:
C1 = Σ_i softmax(B)_i · (V_i + sigmoid(H_i))
where i indexes the hidden layer nodes. The formula shows only one hidden layer node's output, namely the final encoding C1 of the input X1; C2 is calculated by a similar method, finally yielding the matrix C. The heat vectors correct the weights in the above formula, and their values are normalized with the sigmoid function, which prevents the heat vectors from excessively affecting the final weights.
The above is a schematic description of the fusion Attention layer. Based on the functional description, the fusion Attention layer can be further divided into a weight proportion setting layer and an Attention calculation layer; wherein,
the weight proportion setting layer is used for setting a weight proportion for the hotness attention of words; the weight proportion is obtained based on a verification ratio, and the verification ratio is determined according to the layer number of the fusion attention layer in the hotness fusion Transformer model; the higher the layer number of the fusion attention layer in the hotness fusion Transformer model, the lower the value of the verification ratio;
the attention calculation layer is used for calculating the attention of a word according to the word's self-attention and its hotness attention with the weight proportion applied.
The fusion Attention layer applies the adjustments described above to the structure of the Self-Attention layer in the existing Transformer model. In the embodiment of the invention, the fusion Attention layer replaces the Self-Attention layer in the encoder and decoder of the hotness fusion Transformer model, so that the hotness fusion Transformer model can attend to the correct word access hotness information during encoding.
The encoder and decoder of the hotness fusion Transformer model further include convolution layers. These are one-dimensional convolution layers located between the encoding layers (decoding layers, in the decoder) of the prior-art Transformer model. The one-dimensional convolution layer further extracts encoding features to reduce the influence of noise, and enables the encoded output to express the Attention values more accurately, screening the Attention values of hot words.
Each layer of the encoder or decoder contains a convolution layer, and the convolution layers of different layers differ in their number of convolution kernel steps. The number of convolution kernel steps is calculated as:
CL = N × L_num
where CL is the number of convolution kernel steps; N is the basic length of the convolution kernel, which can be set to 2, 4, 8, etc., as determined by the word length; and L_num is the layer number of the encoding layer (decoding layer, in the decoder) below the convolution layer.
This formula shows that the length of the convolution kernel increases with the layer number: as encoding accuracy increases layer by layer, the tendency to pay excessive attention to word hotness diminishes, and fine-grained feature extraction is no longer needed.
FIG. 3 is a schematic diagram of the encoder in the hotness fusion Transformer model. The encoder includes n layers; each layer includes a fusion Attention layer and a convolution layer, except for the last (nth) layer, which includes only the fusion Attention layer.
FIG. 4 is a schematic diagram of the decoder in the hotness fusion Transformer model. The decoder includes n+1 layers: each of the first n−1 layers includes a fusion Attention layer and a convolution layer, the nth layer includes only the fusion Attention layer, and the (n+1)th layer is the output layer. The decoder is generally used when training the hotness fusion Transformer model. During training, the first n layers of the decoder process the encoder's result and restore its pre-encoding form; the final output layer uses the word trend matrix as the training label and adjusts the network parameters through the word trend, so that the network learns information fusing access hotness. The training process of the hotness fusion Transformer model and the word trend matrix are further described in other embodiments of the present invention.
Combining the encoder and the decoder of the hotness fusion Transformer model gives the overall structure of the model. FIG. 5 is a schematic diagram of the overall structure of the hotness fusion Transformer model.
The specific process of inputting the word embedding vectors of the target text into the pre-trained hotness fusion Transformer model to obtain the feature vectors of the words in the target text comprises the following steps:
Step S1, inputting the word embedding matrix into the pre-trained hotness fusion Transformer model.
In this step, each row of the word embedding matrix corresponds to the embedding vector of one word, and the embedding vector serves as the input of each hidden layer node of the hotness fusion Transformer model.
The parameters of the pre-trained hotness fusion Transformer model (including the number of layers of the model, the verification ratio of each layer, the convolution kernel size of each convolution layer, and the parameter matrices of the fusion Attention layers) are all determined values.
Step S2, performing fusion Attention calculation on the input word embedding vectors, generating encoding vectors, and performing residual connection and normalization.
Step S3, extracting features from the encoding vectors generated by the fusion Attention layer according to each layer's convolution kernel function; upon completion, the result serves as the input of the next Transformer encoding layer.
Step S4, after all encoding layers of the multi-stage Transformer model have executed, taking the final encoded result as the word feature vectors.
Since the text is essentially a set of words, text features of the target text can be obtained by obtaining corresponding word feature vectors for words included in the target text, and forming the word feature vectors into a set.
The text features of the target text may be represented in the form of a matrix, with a column in the matrix representing a feature vector of a word.
Step 103, calculating the similarity of the target text according to the feature vectors of the words in the target text.
Based on the feature vectors of the words in the target text, the similarity value between target texts is calculated using the cosine distance. The closer the similarity values of two texts, the greater their similarity.
The calculation of the target text similarity from the feature vector of the word is common knowledge to the person skilled in the art and is therefore not repeated here.
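For completeness, a minimal cosine-similarity sketch in Python (the mean-pooling of word feature vectors into a single text feature is an assumption, not prescribed by the embodiment):

```python
import numpy as np

def text_similarity(feat_a, feat_b):
    """feat_a, feat_b: matrices whose columns are word feature vectors."""
    va, vb = feat_a.mean(axis=1), feat_b.mean(axis=1)   # assumed mean pooling
    cos = (va @ vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
    return cos   # closer to 1 means more similar texts
```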
In the text similarity calculation method provided by the embodiment of the invention, the pre-trained hotness fusion Transformer model is obtained by training on the word embedding vectors of a sample text and the word trend vectors of the sample text, where the word trend vector is a vector describing the association between words obtained from the text similarity between words and the hotness difference between words. The feature vectors of the words in the target text produced by the pre-trained hotness fusion Transformer model therefore reflect both the text similarity between words and the hotness difference between words, so access hotness information is taken into account when the similarity of the target text is calculated from these feature vectors, making the calculated text features more comprehensive and the similarity more accurate.
Based on any of the above embodiments, in an embodiment of the present invention, between step 102 and step 103, the method further includes:
calculating an estimated value of the hotword probability according to the hotness probability of the words in the target text;
taking the estimated value of the hotword probability as a threshold value, and dividing words in the target text into hotwords and non-hotwords according to the threshold value;
and mapping the feature vector of the non-hotword to a preset value.
In the foregoing embodiments of the invention, the text features extracted by the hotness fusion Transformer model reflect well the semantic relatedness of words and their access hotness characteristics. However, because the model is trained for semantic relatedness on static text, it does not account for changes of word hotness in a real scenario. Once the hotness of a word changes, the text features must be recalculated and re-extracted, which increases the overhead and maintenance difficulty of the system.
In view of this problem, in the embodiment of the invention the hotword probability in the next time period can be estimated from the hotness probabilities in the current time period, so that hot words and non-hot words are distinguished according to the estimated hotword probability and the influence of non-hot words in the similarity calculation is reduced.
Specifically, from the word hotness matrix of the target text, dividing the access hotness of a single word by the total access hotness of all words in the text gives the hotness probability of each word in the target text within a period of time, denoted x_i.
The maximum likelihood estimate of the hotword probability is obtained from the likelihood
L(θ) = ∏_{i=1}^{n} θ^{x_i} (1 − θ)^{1 − x_i}
where θ̂ denotes the maximum likelihood estimate of the hotword probability; x_i is the hotness probability of the ith word; θ is the hotword probability, which may be obtained by statistics, e.g., dividing the total number of hot words by the total number of words; and n is the total number of words in the target text.
Based on the above formula, the maximum likelihood estimate of the hotword probability can be calculated: differentiating with respect to θ and setting the derivative to 0 yields the maximum likelihood estimate θ̂ (which works out to θ̂ = (1/n) ∑ x_i). Hot words and non-hot words are then distinguished according to θ̂, and non-hot words are mapped to a preset value through the sigmoid function, reducing their effect in the similarity calculation. The preset value is smaller than the original value of the non-hot word, which is why mapping non-hot words to it reduces their effect in the similarity calculation.
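A sketch of this filtering step, assuming θ̂ is the mean of the hotness probabilities and using an assumed scale factor as the "preset value" mapping:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def filter_non_hotwords(hot_probs, features):
    """hot_probs: x_i per word; features: one feature vector per word (rows)."""
    theta_hat = hot_probs.mean()                 # maximum likelihood estimate of θ
    filtered = features.astype(float).copy()
    for i, x_i in enumerate(hot_probs):
        if x_i < theta_hat:                      # non-hot word
            # Map to a smaller preset value so it matters less in similarity.
            filtered[i] = sigmoid(features[i]) * 0.1   # 0.1 is an assumed scale
    return filtered
```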
Through the operation, the text characteristics are filtered. The obtained filtering result can be used for subsequent similarity calculation.
In the text similarity calculation method provided by this embodiment of the invention, estimating the hotword probability of the next time period and dividing the words of the target text into hot words and non-hot words reduces the influence of non-hot words on the text similarity calculation, ensuring the accuracy of the similarity calculation in the dynamic environment in which the words are used.
Based on any one of the foregoing embodiments, in an embodiment of the present invention, the obtaining a word embedding vector of a target text according to the target text includes:
obtaining a text vector, a word position vector and a word heat vector of the target text according to the target text;
obtaining a first word embedded vector of the target text according to the text vector of the target text;
inputting a first word embedding vector, a word position vector and a word heat vector of the target text into a word fusion model trained in advance to obtain a second word embedding vector fused with word position information and word heat information at the same time, and taking the second word embedding vector as the word embedding vector of the target text; wherein,
the word fusion model is trained based on a first word embedding vector, a word position vector, a word heat vector and a word trend vector of the sample text.
In the embodiment of the invention, the word2vec method in the prior art is adopted to process the text vector of the target text, so that the first word embedded vector of the target text can be obtained.
In an embodiment of the invention, the word fusion model is a stacked multi-input auto-encoder, CL AutoEncoder. The encoder includes a plurality of sub-encoders, each able to receive a different input; the sub-encoders are followed by a one-dimensional convolution layer that fuses the results of the multiple sub-encoders together; the one-dimensional convolution layer is followed by a multi-layer decoder, and the multi-layer decoder by an output layer.
FIG. 6 is a schematic diagram of the stacked multi-input auto-encoder used in the embodiment of the invention. As shown in FIG. 6, it includes 3 independent multi-layer sub-encoders (Encoders) that respectively receive the first word embedding vector of the target text output by word2vec, the word position vector of the target text, and the word heat vector of the target text. Three kernel functions are set in the one-dimensional convolution layer (the Conv1 layer in the figure) to map the 3 independent multi-layer sub-encoders respectively, and the results of the 3 kernel functions are then added to obtain the encoding result.
The encoding result is calculated as:
Z = η(WC_T·δ(W_1X_1) + WC_L·δ(W_2X_2) + WC_H·δ(W_3X_3))
where Z is the encoding result; δ and η are convolution kernel functions; W_i are the weights of the sub-encoders, whose sizes can be determined by random initialization; X_i is the output of each sub-encoder; and WC_T, WC_L, and WC_H are one-dimensional convolution weight parameters used to adjust the fusion proportions of the encoding results of the three multi-layer sub-encoders, learned automatically when the word fusion model is trained.
The encoding result obtained from the one-dimensional convolution layer is the desired second word embedding vector fusing position information and heat information.
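A sketch of the fusion computed by the Conv1 layer, with tanh and ReLU standing in for the unspecified kernel functions δ and η and scalar fusion weights WC_T, WC_L, WC_H:

```python
import numpy as np

def fuse(X_text, X_loc, X_hot, W, WC):
    """W = (W1, W2, W3): sub-encoder weights; WC = (WC_T, WC_L, WC_H):
    one-dimensional convolution weight parameters learned during training."""
    delta = np.tanh                              # assumed kernel function δ
    eta = lambda z: np.maximum(z, 0.0)           # assumed kernel function η (ReLU)
    W1, W2, W3 = W
    WC_T, WC_L, WC_H = WC
    Z = eta(WC_T * delta(W1 @ X_text)
            + WC_L * delta(W2 @ X_loc)
            + WC_H * delta(W3 @ X_hot))
    return Z                                     # second word embedding vector
```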
Fig. 7 is a schematic diagram of a process of generating a second word embedding vector in which position information and heat information are simultaneously fused. As shown in fig. 7, the text vector of the target text generates a first word embedded vector of the target text by a word2vec method; the first word embedded vector of the target text, the word position vector of the target text and the word popularity vector of the target text are simultaneously input into a pre-trained stacked multi-input automatic encoder to obtain a second word embedded vector which is simultaneously fused with position information and popularity information.
The second word embedding vector generated by the word fusion model, fusing position information and heat information, can accurately describe the associations between words, between words and positions, and between words and hotness.
In the text similarity calculation method provided by the embodiment of the invention, fusing the word position information and word heat information of the target text with the original word embedding vector yields a word embedding vector that accurately describes the associations between words, between words and positions, and between words and hotness, which aids the subsequent text feature recognition.
Based on any of the foregoing embodiments, in an embodiment of the present invention, the method further includes:
obtaining a word embedding vector and a word trend vector of the sample text according to the sample text;
and training the word embedded vector of the sample text serving as input data for training, and the word trend vector of the sample text serving as a label for training by adopting a machine learning mode to obtain a hotness fusion transducer model for generating the feature vector of the word in the target text.
In an embodiment of the invention, the sample text is the text used to train the text feature recognition model. The amount of sample text should be of a certain scale; in theory, the more the better. For example, more than 1000 texts may be selected as sample texts.
In the previous embodiment of the present invention, the implementation process of how to obtain the word embedding vector from the text has been described in detail, and thus, a description is not repeated in the embodiment of the present invention.
The word trend vector is a vector for describing the degree of association between words, which is obtained from the text similarity between words and the heat difference between words. The word trend matrix is a collection of word trend vectors.
The word trend is calculated as follows:
EHSim(W_i, W_j) = wait(W_i, W_j) + (1/α)·tanh(|H_i − H_j|)
where EHSim represents the word trend; W_i and W_j are two different words; the wait() function is an edit-distance calculation function for computing the similarity of the two words on the text; H_i and H_j are the access hotness of W_i and W_j; tanh() is a normalization function that maps the absolute value of H_i − H_j into [0, 1], avoiding huge differences in the calculated value caused by overly large gaps in word hotness; and α is the number of times W_i and W_j occur in the same text, with 1/α used as a parameter for adjusting the weight in this application. When two words co-occur in the same text, their access hotness is likely to be close and the semantics described in the text are likely to be close, so a large α adjusts the weight of the access-hotness part so that the trend degree gives the proper weight to access hotness, accurately representing the access-hotness relationship between words.
From the word trend formula it can be seen that the word trend jointly considers the text similarity between words and the difference in their access hotness, so the relationship between words can be described more accurately, avoiding result deviations caused by either the text or the access hotness exerting too large an influence during text feature extraction.
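A sketch of the word-trend computation under the reconstruction above; difflib's SequenceMatcher.ratio() stands in for the wait() edit-distance similarity:

```python
import difflib
import math

def ehsim(w_i, w_j, h_i, h_j, alpha):
    """Word trend between w_i and w_j given hotness h_i, h_j and co-occurrence count alpha."""
    text_sim = difflib.SequenceMatcher(None, w_i, w_j).ratio()  # stands in for wait()
    hot_term = math.tanh(abs(h_i - h_j))      # hotness difference mapped into [0, 1]
    return text_sim + (1.0 / alpha) * hot_term  # 1/alpha adjusts the hotness weight

print(ehsim("player", "players", h_i=120.0, h_j=95.0, alpha=3))
```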
Based on the word trend, a word trend matrix Tend can be further obtained. The row numbers in the word trend matrix Tend represent the word numbers in the target text, the column numbers represent the word numbers in a preset dictionary, and the matrix value EHSim_{m,n} represents the trend of the mth word in the target text toward the nth word in the dictionary.
according to the word trend and the definition of the word trend matrix, a corresponding word trend matrix can be generated for the words in the target text.
The word embedding matrix of the sample text can be a word embedding matrix fused with position information, and can also be a word embedding matrix fused with position information and heat information at the same time. In the embodiment of the present invention, the specific form of the word embedding matrix is not limited.
The hotmelt transducer model includes an encoder and a decoder. Wherein the encoder comprises a plurality of layers, each layer comprising a fusion layer and a convolution layer except for the last layer, the last layer comprising only the fusion layer. The decoder comprises a plurality of layers, wherein the last layer is an output layer, and each layer comprises a fusion Attention layer and a convolution layer except the last layer and the penultimate layer, and the penultimate layer only comprises the fusion Attention layer.
The decoder in the hotmelt transducer model is mainly used for the training process of the model.
Unlike the prior art Transformer model, the decoder in the hotfusion Transformer model will not use Mask masking mechanism to calculate word relevance by predicting words one by one; but directly uses the result output by the encoder for the decoding operation. In addition, the output layer is changed into a word trend degree matrix Tend, and a cross entropy loss function is set for optimizing network parameters. By doing so, the coded result can be ensured to be fused with text relativity, position and trend degree and also fused with Attention value, and more accurate text characteristic representation is obtained. Since the decoder in the hotmelt transducer model does not need individual predicted words, parallel operation can be realized, and the performance is superior to that of the transducer model in the prior art.
As in the prior-art Transformer model, the decoder in the hotness fusion Transformer model also receives the encoder's three parameter matrices W_q, W_k and W_h as inputs. A rough sketch of one coding layer follows.
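As a rough illustration of this structure, the following PyTorch sketch implements one coding layer: a fusion Attention sub-layer followed by residual connection, normalization and a convolution layer. The blending rule between self-attention and heat attention, the use of W_h in the usual value-projection role, and all sizes are assumptions for illustration, not the patented implementation itself.

```python
import torch
import torch.nn as nn

class FusionAttention(nn.Module):
    def __init__(self, d_model: int, ratio: float):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)  # W_q
        self.w_k = nn.Linear(d_model, d_model)  # W_k
        self.w_h = nn.Linear(d_model, d_model)  # W_h (assumed to play the value role)
        self.ratio = ratio                      # weight proportion from the verification ratio

    def forward(self, x, heat):
        # x: (batch, seq, d_model); heat: (batch, seq) per-word access heat
        q, k, h = self.w_q(x), self.w_k(x), self.w_h(x)
        self_attn = torch.softmax(q @ k.transpose(-2, -1) / x.size(-1) ** 0.5, dim=-1)
        heat_attn = torch.softmax(heat, dim=-1).unsqueeze(1)   # broadcast over queries
        fused = (1 - self.ratio) * self_attn + self.ratio * heat_attn
        return fused @ h

class CodingLayer(nn.Module):
    def __init__(self, d_model=64, ratio=0.5, kernel=3):
        super().__init__()
        self.attn = FusionAttention(d_model, ratio)
        self.norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2)

    def forward(self, x, heat):
        x = self.norm(x + self.attn(x, heat))    # residual connection + normalization
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

x = torch.randn(2, 10, 64)               # toy batch of word embedding vectors
heat = torch.rand(2, 10)                 # toy access-heat values
print(CodingLayer()(x, heat).shape)      # torch.Size([2, 10, 64])
```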
The specific steps of training the hotness fusion Transformer model with the word embedding vectors of the sample text are as follows:
Step S11: input the word embedding matrix of the sample text into the hotness fusion Transformer model to be trained.
In this step, each row of the word embedding matrix corresponds to the embedding vector of one word, and these embedding vectors serve as the inputs of the hidden-layer nodes of the hotness fusion Transformer model.
Step S12: initialize the hotness fusion Transformer model to be trained, which includes initializing each parameter matrix of the fusion Attention layers, setting the number of layers of the model, and computing each layer's verification ratio and each convolution layer's kernel size from its layer depth.
Step S13: perform fusion Attention calculation on the input word embedding vectors to generate coding vectors, then apply residual connection and normalization.
Step S14: extract features from the coding vectors generated by the fusion Attention layer using each layer's convolution kernel; when finished, pass the result to the next Transformer coding layer as its input.
Step S15: after all coding layers of the multi-stage Transformer model have run, input the final coding result together with the last fusion Attention layer's three matrices W_q, W_k and W_h into the decoder. The decoder's parameter matrices are not randomly initialized; instead, the decoder reuses these three existing matrices output by the fusion Attention layer.
Step S16: the decoder performs fusion Attention calculation on the encoder's output, generates decoding vectors, and applies residual connection and normalization.
Step S17: extract features from the decoding vectors generated by the fusion Attention layer using each layer's convolution kernel; when finished, pass the result to the next Transformer decoding layer as its input.
Step S18: pass the decoding-layer output through a fully connected layer, compute the loss against the vectors of the word trend matrix, and then optimize the parameters of the whole network.
Step S19: repeat the above flow until the loss converges.
Each parameter of the hotness fusion Transformer model is determined through the above training process, yielding the trained hotness fusion Transformer model. A condensed sketch of this training loop is given below.
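The following sketch compresses steps S11-S19 into a small PyTorch loop. The stand-in model, the row-normalized Tend targets, the learning rate and the convergence test are illustrative assumptions; the real model would be the stacked coding and decoding layers described above.

```python
import torch
import torch.nn as nn

class TinyHotnessModel(nn.Module):
    """Stand-in for the stacked coding/decoding layers (see sketch above)."""
    def __init__(self, d_model=64, vocab=100):
        super().__init__()
        self.body = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, vocab)    # output layer scored against Tend

    def forward(self, x, heat):
        # Real layers would fuse heat into attention; the stand-in just gates by it.
        return self.out(torch.relu(self.body(x)) * heat.unsqueeze(-1))

def train_until_converged(model, embed, heat, tend, lr=1e-3, tol=1e-4, max_epochs=500):
    """S11-S19: optimize cross entropy between the decoder output and the
    row-normalized Tend targets until the loss stops changing (S19).
    Soft-label cross entropy needs PyTorch 1.10+."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    prev = float("inf")
    for _ in range(max_epochs):
        logits = model(embed, heat)                        # S13-S17: encode + decode
        loss = nn.functional.cross_entropy(                # S18: loss vs Tend targets
            logits.reshape(-1, logits.size(-1)), tend.reshape(-1, tend.size(-1)))
        opt.zero_grad()
        loss.backward()
        opt.step()
        if abs(prev - loss.item()) < tol:
            break
        prev = loss.item()
    return model

embed = torch.randn(2, 10, 64)                             # toy word embedding matrix
heat = torch.rand(2, 10)
tend = torch.softmax(torch.randn(2, 10, 100), dim=-1)      # toy Tend targets
train_until_converged(TinyHotnessModel(), embed, heat, tend)
```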
In the text similarity calculation method provided by the embodiment of the invention, the pre-trained hotness fusion Transformer model is trained on the word embedding vectors and the word trend vectors of sample text, where the word trend vector describes the relevance between words in terms of both their textual similarity and their difference in access heat. Through this model, the feature vector of a word in the target text reflects textual similarity and heat difference between words at the same time, so access heat information is taken into account when the similarity of the target text is computed from these feature vectors, making the extracted text features more comprehensive and the similarity more accurate.
Based on any of the foregoing embodiments, in an embodiment of the present invention, the method further includes:
obtaining a text vector, a word position vector, a word heat vector and a word trend vector of the sample text according to the sample text;
obtaining a first word embedded vector of the sample text according to the text vector of the sample text;
taking the first word embedding vector, the word position vector and the word heat vector of the sample text as input data for training and the word trend vector of the sample text as a label for training, and training in a machine learning mode to obtain a word fusion model for generating the second word embedding vector of the sample text; the second word embedding vector fuses word position information and word heat information at the same time.
In the embodiment of the invention, the word fusion model starts untrained, so it must be trained with sample data.
A previous embodiment of the present invention has described in detail how the text vector, the word position vector, the word heat vector and the word trend vector are obtained from the sample text, so the description is not repeated here.
The prior-art word2vec method can be used to obtain the first word embedding vector of the sample text from the text vector of the sample text.
During training, the first word embedding vector, the word position vector and the word heat vector of the sample text are input into three independent multi-layer sub-encoders, and the three coding results are added in a one-dimensional convolution layer behind the sub-encoders to obtain the coding result. This coding result is fed into a multi-layer decoder, which decodes the output of the one-dimensional convolution layer; the output layer computes a loss function from the decoding result and the word trend vector of the sample text, and the parameters of the whole encoder are adjusted accordingly. Multi-class cross entropy may be employed when calculating the loss function.
In the embodiment of the invention, the output layer does not use the original input data (that is, the sub-encoders' inputs) as a general autoencoder would; it uses the word trend matrix Tend instead. Because the Tend matrix fuses text and heat, encoding and decoding against it lets the autoencoder fuse word relevance, position and heat together. Compared with direct addition, this establishes the relationships among word relevance, position and access heat, is more accurate, and prevents that information from being lost in the subsequent training of the hotness fusion Transformer model. A minimal sketch of this structure follows.
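The sketch below mirrors that description: three independent sub-encoders, addition inside a one-dimensional convolution layer, and a decoder scored against the Tend matrix with multi-class cross entropy. The single Linear layers standing in for the multi-layer sub-encoders and decoder, and all sizes, are assumptions for illustration.

```python
import torch
import torch.nn as nn

class WordFusionModel(nn.Module):
    def __init__(self, d_model=64, vocab=1000):
        super().__init__()
        self.enc_text = nn.Linear(d_model, d_model)  # sub-encoder: first word embedding
        self.enc_pos = nn.Linear(d_model, d_model)   # sub-encoder: word position vector
        self.enc_heat = nn.Linear(d_model, d_model)  # sub-encoder: word heat vector
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.decoder = nn.Linear(d_model, vocab)     # decodes toward the Tend targets

    def forward(self, text, pos, heat):
        fused = self.enc_text(text) + self.enc_pos(pos) + self.enc_heat(heat)
        code = self.conv(fused.transpose(1, 2)).transpose(1, 2)  # fused word embedding
        return code, self.decoder(code)

model = WordFusionModel()
text = pos = heat = torch.randn(2, 10, 64)               # toy input vectors
code, logits = model(text, pos, heat)
tend = torch.softmax(torch.randn(2, 10, 1000), dim=-1)   # toy row-normalized Tend
loss = nn.functional.cross_entropy(                      # multi-class cross entropy
    logits.reshape(-1, 1000), tend.reshape(-1, 1000))    # (soft targets: PyTorch 1.10+)
loss.backward()   # adjusts parameters in the whole encoder, as described above
print(code.shape, float(loss))
```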
In the text similarity calculation method provided by the embodiment of the invention, the word fusion model is trained with feature data extracted from the sample text. The trained word fusion model can fuse word position information and word heat information with the original word embedding matrix, yielding a word embedding matrix that accurately describes the associations between words, between words and positions, and between words and heat, which helps the subsequent text feature recognition.
Based on any one of the above embodiments, FIG. 8 is a schematic diagram of a text similarity calculation device according to an embodiment of the present invention. As shown in FIG. 8, the device includes:
a word embedding vector generation module 801, configured to obtain a word embedding vector of a target text according to the target text;
a feature vector generation module 802, configured to input the word embedding vector of the target text into a pre-trained hotness fusion Transformer model to obtain the feature vectors of the words in the target text; the feature vector of a word reflects textual similarity between words and heat difference between words at the same time;
a similarity calculation module 803, configured to calculate a similarity of a target text according to feature vectors of words in the target text; wherein,
the hotness fusion Transformer model is trained on the word embedding vectors of sample text and the word trend vectors of the sample text; the hotness fusion Transformer model is obtained by replacing the self-attention layer in a Transformer model with a fusion attention layer and arranging a convolution layer between the fusion attention layers;
the fusion attention layer is used for calculating the attention of a word from its self-attention and heat attention; the word trend vector is obtained from the textual similarity between words and the heat difference between words, and describes the degree of association between words.
In the text similarity calculation device provided by the embodiment of the invention, the pre-trained hotness fusion Transformer model is trained on the word embedding vectors and word trend vectors of sample text, where the word trend vector describes the relevance between words in terms of their textual similarity and their difference in access heat. Through this model, the feature vector of a word in the target text reflects both at the same time, so the similarity calculation module can take access heat information into account when computing the similarity of the target text, making the calculated text features more comprehensive and the similarity more accurate.
The text similarity calculation method and device provided by the embodiment of the invention have broad application prospects: for example, they can be adopted when searching titles in a media asset library, or when an intelligent customer service system looks for answers with a high degree of semantic matching.
FIG. 9 is a schematic diagram of a text similarity calculation system that encapsulates the hotness fusion Transformer model as a service, exposes an Http-based call interface, and provides the text similarity calculation function to external callers such as a search engine or intelligent customer service. The system comprises: a model access portal, a model instance module, a model management module and a model library module.
The model access portal converts the text requested by an application accessing the service into feature vectors through the interface. The model access entry carries an identification field characterizing the topic model corresponding to the request.
The model instance module packages instances of the hotness fusion Transformer model as Docker instances to provide the service externally. In general, each model service loads the model of a different topic, so that requests on different topics are accepted and processed by the corresponding topic model.
The model management module mainly manages model loading and the switching of model instances. In this scheme, the access volume of each topic can be monitored through the model management module; topic model instances with low access volume are released as appropriate and converted into instances of topics with high access volume, improving overall performance.
The model library module stores the trained hotness fusion Transformer models of the different topics.
FIG. 10 is a schematic diagram of the steps with which the text similarity calculation system shown in FIG. 9 provides its service, comprising the following steps:
Step 1001: receive requests from different applications, where a request message includes two parts, {text, code}. text carries the message content, i.e. the content to be encoded; code is the code of the requested topic. Callers issue these requests through the Http interface.
Step 1002: parse the different request messages and determine the model to use from the topic field code in the message.
The information of the model topics is stored in the model management module as Key-Value pairs with the structure {topic ID: topic information}. The topic ID is the code of the model topic; the topic information includes the service address, i.e. the IP and port number, of the model service corresponding to the topic model.
When the system receives a request from a caller, it looks up the corresponding topic information and the access address of the topic model instance by the code field, then forwards the request content text to the corresponding model instance at that address. The sketch below illustrates this lookup and forwarding.
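A minimal Python sketch of the Key-Value lookup and routing described above; the registry contents, URL path and topic codes are hypothetical.

```python
topic_registry = {
    # {topic ID: topic information}, as held by the model management module
    "news":   {"ip": "10.0.0.11", "port": 8001},   # hypothetical entries
    "sports": {"ip": "10.0.0.12", "port": 8002},
}

def route_request(message: dict) -> str:
    """message = {"text": ..., "code": ...}; returns the instance address
    that the request content text should be forwarded to."""
    info = topic_registry.get(message["code"])
    if info is None:
        raise KeyError(f"unknown topic code: {message['code']}")
    return f'http://{info["ip"]}:{info["port"]}/encode'

print(route_request({"text": "query title", "code": "news"}))
# -> http://10.0.0.11:8001/encode  (the text body would be POSTed here)
```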
Step 1003: when the request content text reaches the topic model instance, the hotness fusion Transformer model instance encodes the received text data, which specifically includes:
calculating the three matrices Text, Location and Hot of the text, and calculating its trend matrix Tend from these three matrices;
inputting the three matrices into the pre-trained CL Autoencoder network model to obtain the fused word embedding, establishing the association between the text and the position;
feeding the fused word embedding into the multi-stage Transformer network structure with the fusion Attention mechanism of access heat, and obtaining a preliminary code through the calculation of the multi-layer Transformers;
projecting the word codes through a sigmoid function, using a maximum-likelihood estimate over historical data, so as to reduce the influence of non-hot words on the text similarity calculation. At this point the encoding by the hotness fusion Transformer model is complete, and the encoding result is put into the return result set. A sketch of this filtering step follows.
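A hedged sketch of this last step, assuming the mean of the per-word hotness probabilities as the maximum-likelihood threshold, a sigmoid gate on the heat margin, and non-hot feature vectors mapped to a preset value; the exact projection used in the scheme is not reproduced here.

```python
import torch

def suppress_non_hot_words(codes, heat_prob, preset=0.0):
    """codes: (seq, d_model) word codes; heat_prob: (seq,) hotness probability
    estimated from historical access data."""
    threshold = heat_prob.mean()                 # assumed ML estimate of hot-word probability
    gate = torch.sigmoid(heat_prob - threshold)  # sigmoid projection of the heat margin
    out = codes * gate.unsqueeze(-1)             # damp each word code by its gate
    out[heat_prob < threshold] = preset          # map non-hot feature vectors to the preset value
    return out

codes = torch.randn(5, 8)                                  # toy word codes
heat_prob = torch.tensor([0.90, 0.10, 0.60, 0.05, 0.80])
print(suppress_non_hot_words(codes, heat_prob))
```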
Step 1004: return the results received from the different model instances to the requester.
Step 1005: the model management module monitors every model instance service in real time while the service runs. At system initialization, several model instances are started for each topic model by default, in equal numbers. During operation, the model management module tracks each topic model's access volume, divides the model instances into hot-spot and non-hot-spot models accordingly, recovers part of the non-hot-spot instances and adds hot-spot instances, so as to better satisfy callers' access demands. The toy sketch below illustrates such a rebalancing policy.
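A toy sketch of the rebalancing policy in step 1005: count accesses per topic, label topics hot or non-hot against the mean, and move instances from cold topics to the hottest one. The mean threshold and all counts are illustrative assumptions.

```python
from collections import Counter

def rebalance(instances: dict, access: Counter) -> dict:
    """instances: topic -> running instance count (equal at initialization);
    access: topic -> observed access volume."""
    mean = sum(access.values()) / max(len(access), 1)
    plan = dict(instances)
    for topic, hits in access.items():
        if hits < mean and plan[topic] > 1:
            plan[topic] -= 1                      # recover a non-hot-spot instance
            hottest = max(access, key=access.get)
            plan[hottest] += 1                    # grow the hot-spot model
    return plan

print(rebalance({"news": 2, "sports": 2}, Counter({"news": 90, "sports": 10})))
# -> {'news': 3, 'sports': 1}
```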
Fig. 11 is a schematic diagram of the entity structure of an electronic device according to an embodiment of the present invention. As shown in fig. 11, the electronic device may include: a processor 1110, a communication interface (Communications Interface) 1120, a memory 1130 and a communication bus 1140, where the processor 1110, the communication interface 1120 and the memory 1130 communicate with each other via the communication bus 1140. The processor 1110 may call logic instructions in the memory 1130 to perform the following method: obtain the word embedding vector of a target text from the target text; input the word embedding vector of the target text into the pre-trained hotness fusion Transformer model to obtain the feature vectors of the words in the target text, where the feature vector of a word reflects textual similarity between words and heat difference between words at the same time; and calculate the similarity of the target text from the feature vectors of the words in the target text.
It should be noted that the electronic device in this embodiment may in practice be a server, a PC or another device, as long as its structure includes the processor 1110, the communication interface 1120, the memory 1130 and the communication bus 1140 shown in fig. 11, the three components communicate with each other through the communication bus 1140, and the processor 1110 can call the logic instructions in the memory 1130 to execute the above method. This embodiment does not limit the specific implementation form of the electronic device.
Furthermore, the logic instructions in the memory 1130 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored on a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Further, embodiments of the present invention disclose a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above method embodiments, for example: obtain the word embedding vector of a target text from the target text; input the word embedding vector of the target text into the pre-trained hotness fusion Transformer model to obtain the feature vectors of the words in the target text, where the feature vector of a word reflects textual similarity between words and heat difference between words at the same time; and calculate the similarity of the target text from the feature vectors of the words in the target text.
In another aspect, embodiments of the present invention also provide a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, performs the method provided by the above embodiments, for example: obtain the word embedding vector of a target text from the target text; input the word embedding vector of the target text into the pre-trained hotness fusion Transformer model to obtain the feature vectors of the words in the target text, where the feature vector of a word reflects textual similarity between words and heat difference between words at the same time; and calculate the similarity of the target text from the feature vectors of the words in the target text.
The apparatus embodiments described above are merely illustrative; units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, being located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (9)
1. A text similarity calculation method, comprising:
obtaining a word embedding vector of a target text according to the target text;
inputting word embedding vectors of the target text into a pre-trained hotness fusion Transformer model to obtain feature vectors of words in the target text; the feature vectors of the words can reflect text similarity between the words and heat difference between the words at the same time;
calculating the similarity of the target text according to the feature vector of the word in the target text; wherein,
the hotness fusion Transformer model is obtained by training based on word embedding vectors of sample texts and word trend vectors of the sample texts; the hotness fusion Transformer model is a model obtained by replacing a self-attention layer in the Transformer model with a fusion attention layer and arranging a convolution layer between the fusion attention layers;
The fusion attention layer is used for calculating the attention of the word according to the self-attention and the heat attention of the word;
the word trend degree vector is a vector for describing the association degree between words according to the text similarity between words and the heat difference between words;
the fused attention layer comprises: a weight proportion setting layer and an attention calculating layer; wherein,
the weight proportion setting layer is used for setting a weight proportion for the heat attention of the words; the weight proportion is obtained based on a verification ratio, and the verification ratio is determined according to the layer number of the fusion attention layer in the hotness fusion Transformer model; the higher the number of layers of the fusion attention layer in the hotness fusion Transformer model, the lower the value of the verification ratio;
the attention calculating layer is used for calculating the attention of the word according to the self-attention of the word and the heat attention provided with the weight proportion.
2. The text similarity calculation method according to claim 1, wherein before the step of calculating the similarity of the target text from the feature vectors of the words in the target text, the method further comprises:
Calculating an estimated value of the hotword probability according to the hotness probability of the words in the target text;
taking the estimated value of the hotword probability as a threshold value, and dividing words in the target text into hotwords and non-hotwords according to the threshold value;
and mapping the feature vector of the non-hotword to a preset value.
3. The text similarity calculation method according to claim 1 or 2, wherein the obtaining the word embedding vector of the target text from the target text includes:
obtaining a text vector, a word position vector and a word heat vector of the target text according to the target text;
obtaining a first word embedded vector of the target text according to the text vector of the target text;
inputting a first word embedding vector, a word position vector and a word heat vector of the target text into a word fusion model trained in advance to obtain a second word embedding vector fused with word position information and word heat information at the same time, and taking the second word embedding vector as the word embedding vector of the target text; wherein,
the word fusion model is obtained by training a first word embedding vector, a word position vector, a word heat vector and a word trend vector based on a sample text; the first word embedding vector is a vector for reflecting semantic relevance of words.
4. The text similarity calculation method according to claim 1, wherein the number of convolution kernel steps of the convolution layer is determined according to the number of layers of the convolution layer in the hotness fusion Transformer model; the higher the number of layers of the convolution layer in the hotness fusion Transformer model, the larger the number of convolution kernel steps.
5. The text similarity calculation method according to claim 1, characterized in that the method further comprises:
obtaining a word embedding vector and a word trend vector of the sample text according to the sample text;
and training the word embedded vector of the sample text serving as input data for training, and the word trend vector of the sample text serving as a label for training by adopting a machine learning mode to obtain a hotness fusion transducer model for generating the feature vector of the word in the target text.
6. The text similarity calculation method according to claim 3, wherein the method further comprises:
obtaining a text vector, a word position vector, a word heat vector and a word trend vector of the sample text according to the sample text;
obtaining a first word embedded vector of the sample text according to the text vector of the sample text;
and training, by adopting a machine learning mode, with the first word embedding vector, the word position vector and the word heat vector of the sample text serving as input data for training and the word trend vector of the sample text serving as a label for training, to obtain the word fusion model for generating the second word embedding vector of the sample text.
7. A text similarity calculation device, comprising:
the word embedding vector generation module is used for obtaining a word embedding vector of the target text according to the target text;
the feature vector generation module is used for inputting word embedding vectors of the target text into a pre-trained hotness fusion Transformer model to obtain feature vectors of words in the target text; the feature vectors of the words can reflect text similarity between the words and heat difference between the words at the same time;
the similarity calculation module is used for calculating the similarity of the target text according to the feature vector of the word in the target text; wherein,
the hotness fusion Transformer model is obtained by training based on word embedding vectors of sample texts and word trend vectors of the sample texts; the hotness fusion Transformer model is a model obtained by replacing a self-attention layer in the Transformer model with a fusion attention layer and arranging a convolution layer between the fusion attention layers;
The fusion attention layer is used for calculating the attention of the word according to the self-attention and the heat attention of the word; the word trend degree vector is a vector for describing the association degree between words according to the text similarity between words and the heat difference between words;
the fused attention layer comprises: a weight proportion setting layer and an attention calculating layer; wherein,
the weight proportion setting layer is used for setting a weight proportion for the heat attention of the words; the weight proportion is obtained based on a verification ratio, and the verification ratio is determined according to the layer number of the fusion attention layer in the hotness fusion Transformer model; the higher the number of layers of the fusion attention layer in the hotness fusion Transformer model, the lower the value of the verification ratio;
the attention calculating layer is used for calculating the attention of the word according to the self-attention of the word and the heat attention provided with the weight proportion.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the text similarity calculation method according to any one of claims 1 to 6 when the program is executed by the processor.
9. A non-transitory computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the text similarity calculation method according to any of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011010599.9A CN112269856B (en) | 2020-09-23 | 2020-09-23 | Text similarity calculation method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112269856A CN112269856A (en) | 2021-01-26 |
CN112269856B (en) | 2023-11-10
Family
ID=74349212
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011010599.9A Active CN112269856B (en) | 2020-09-23 | 2020-09-23 | Text similarity calculation method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112269856B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113851125A (en) * | 2021-09-09 | 2021-12-28 | 广州大学 | Electric vehicle speed regulation method, system, device and medium based on voice semantic recognition |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020048195A1 (en) * | 2018-09-05 | 2020-03-12 | 腾讯科技(深圳)有限公司 | Text translation method and apparatus, storage medium and computer device |
CN110516216A (en) * | 2019-05-15 | 2019-11-29 | 北京信息科技大学 | A kind of automatic writing template base construction method of sports news |
CN110569361A (en) * | 2019-09-06 | 2019-12-13 | 腾讯科技(深圳)有限公司 | Text recognition method and equipment |
CN111428044A (en) * | 2020-03-06 | 2020-07-17 | 中国平安人寿保险股份有限公司 | Method, device, equipment and storage medium for obtaining supervision identification result in multiple modes |
Non-Patent Citations (1)
Title |
---|
Research on Text Sentiment Classification Based on Deep Learning; Tang Xue; China Master's Theses Full-text Database (《中国优秀硕士学位论文全文数据库》), No. 12; I138-2120 *
Also Published As
Publication number | Publication date |
---|---|
CN112269856A (en) | 2021-01-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11816442B2 (en) | Multi-turn dialogue response generation with autoregressive transformer models | |
WO2021068352A1 (en) | Automatic construction method and apparatus for faq question-answer pair, and computer device and storage medium | |
US20220147715A1 (en) | Text processing method, model training method, and apparatus | |
US11468239B2 (en) | Joint intent and entity recognition using transformer models | |
JP2021128774A (en) | Multimodality-based theme classification method, device, apparatus, and storage medium | |
AU2022204669B2 (en) | Disfluency removal using machine learning | |
WO2024015323A1 (en) | Methods and systems for improved document processing and information retrieval | |
CN114118100A (en) | Method, apparatus, device, medium and program product for generating dialogue statements | |
CN112269856B (en) | Text similarity calculation method and device, electronic equipment and storage medium | |
Fei et al. | Deep Learning Structure for Cross‐Domain Sentiment Classification Based on Improved Cross Entropy and Weight | |
CN114330366A (en) | Event extraction method and related device, electronic equipment and storage medium | |
CN113178189A (en) | Information classification method and device and information classification model training method and device | |
CN112668343A (en) | Text rewriting method, electronic device and storage device | |
CN110162558B (en) | Structured data processing method and device | |
US11755671B2 (en) | Projecting queries into a content item embedding space | |
CN114817501A (en) | Data processing method, data processing device, electronic equipment and storage medium | |
CN115495579A (en) | Method and device for classifying text of 5G communication assistant, electronic equipment and storage medium | |
Anantha et al. | Learning to rank intents in voice assistants | |
CN111625579B (en) | Information processing method, device and system | |
CN114548083B (en) | Title generation method, device, equipment and medium | |
US20230161808A1 (en) | Performing image search based on user input using neural networks | |
CN117633234A (en) | Classification method and system for similar intents in optimized semantic analysis | |
CN117493555A (en) | Training method of text classification model, text classification method and related equipment | |
WO2024189326A1 (en) | Text information extraction | |
CN118797048A (en) | Text matching method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||