WO2021159613A1 - Text semantic similarity analysis method and apparatus, and computer device - Google Patents

Text semantic similarity analysis method and apparatus, and computer device Download PDF

Info

Publication number
WO2021159613A1
Authority
WO
WIPO (PCT)
Prior art keywords
similarity
semantic similarity
recognition model
text
data set
Prior art date
Application number
PCT/CN2020/087554
Other languages
French (fr)
Chinese (zh)
Inventor
李小娟
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2021159613A1 publication Critical patent/WO2021159613A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • This application relates to the field of natural language processing technology, and in particular to a method, device and computer equipment for analyzing text semantic similarity.
  • Semantic similarity calculation can also be called text matching.
  • Text matching is a common problem in many natural language processing applications.
  • Short text similarity refers to similarity calculation over texts whose length falls within a certain range. Compared with long text, short text carries less information, which makes similarity calculation more challenging.
  • the current short text similarity calculation mainly adopts deep learning methods, which first require manually labeling a large amount of data and then using the labeled data to compute similarity.
  • this application provides a text semantic similarity analysis method, apparatus, and computer device, mainly to address the problems that, when performing similarity analysis on short texts in a target field, short text similarity data is difficult to obtain and label, and the effect of short text similarity algorithms is easily affected by data annotation quality, leading to unstable analysis results.
  • a method for analyzing text semantic similarity includes:
  • the semantic similarity recognition result is determined based on the semantic similarity.
  • a text semantic similarity analysis device which includes:
  • the acquisition module is used to acquire general data sets and target domain data sets
  • the input module is used to input the target short text to be recognized for semantic similarity into the adjusted semantic similarity recognition model to obtain the semantic similarity;
  • the determining module is used to determine the semantic similarity recognition result based on the semantic similarity.
  • a non-volatile readable storage medium on which a computer program is stored, and the program is executed by a processor to realize the above-mentioned text semantic similarity analysis method.
  • a computer device, including a non-volatile readable storage medium, a processor, and a computer program stored on the non-volatile readable storage medium and runnable on the processor, wherein the processor, when executing the program, implements the above text semantic similarity analysis method.
  • by means of the above technical solutions, the application improves the analysis effect within the target field, and thereby also solves the problem of obtaining a large amount of training data in the target field.
  • FIG. 1 shows a schematic flowchart of a method for analyzing text semantic similarity provided by an embodiment of the present application
  • FIG. 2 shows a schematic flowchart of another method for analyzing text semantic similarity provided by an embodiment of the present application
  • FIG. 3 shows a schematic structural diagram of a text semantic similarity analysis device provided by an embodiment of the present application
  • FIG. 4 shows a schematic structural diagram of another apparatus for analyzing text semantic similarity provided by an embodiment of the present application.
  • the embodiment of the present application provides a method for analyzing text semantic similarity. As shown in FIG. 1, the method includes:
  • the general data set can be, for example, roughly 400,000 short text similarity pairs obtained from sources such as the ATEC2018 Ant Financial short text semantic similarity competition, the CCKS2018 WeBank intelligent customer service question matching competition, and the LCQMC data set compiled by Harbin Institute of Technology.
  • the target field data set can be historical data records in the target field, data accumulated by search engines, and the like.
  • algorithms can be developed to make maximum use of knowledge from labeled domains to assist knowledge acquisition and learning in the target domain.
  • the core is to find the similarities between the source domain and the target domain and make rational use of them. Such similarity is very common.
  • for example, a model used to recognize cars can be used to improve the ability to recognize karts; transfer learning can store and reuse prior knowledge from different but related problems.
  • the similarity recognition model can be applied to the short text similarity detection in the target field, and the corresponding similarity is output according to the input short text pair.
  • the similarity recognition result corresponding to the semantic similarity can be determined by setting the similarity threshold.
  • the idea of transfer learning can be used to learn a general-field short text similarity analysis method from a large number of existing public data sets; then only a moderate amount of data in the target field needs to be labeled, and this labeled data is used for refined learning to realize short text similarity analysis in the target field.
  • this method can not only learn the semantic information of short text similarity from general data, but also apply this prior knowledge in a targeted manner to short text similarity analysis in the target field, improving the analysis effect within the field and also solving the problem of obtaining a large amount of training data in the target field.
  • the method includes:
  • a general data set can be used instead during pre-training, and the acquired target field data set can then be used for further corrective training. Therefore, in this application, a large number of general data sets need to be obtained in advance, and a predetermined number of target field data sets that meet the correction standard need to be collected as far as possible.
  • Two short texts are arbitrarily selected from the general data set to form a text pair to be tested.
  • short texts can be randomly selected from the general data set to form text pairs to be tested, which are used for repeated and comprehensive training of the semantic similarity recognition model.
  • the text pair to be tested is preprocessed and input to the Embedding layer in the semantic similarity recognition model to obtain the first sequence and the second sequence.
  • the first sequence corresponds to the mapping result of one of the short texts in the text pair to be tested.
  • the second sequence corresponds to the mapping result of another short text in the text pair to be tested.
  • BiLSTM can learn the word in a sentence and its context to obtain a new Embedding vector.
  • the first vector and the second vector can be calculated by formulas given in the original application (the formula images are not reproduced in this text).
  • based on step 204 of the embodiment, the first vector and the second vector can be obtained, and the difference between the first vector and the second vector is calculated, where an attention model can be applied.
  • the attention weights are calculated by a formula given in the original application (the formula image is not reproduced in this text).
  • the third sequence and the fourth sequence are respectively subtracted from and multiplied with the first sequence and the second sequence obtained above, and the results are spliced together; the obtained values are then sent into BiLSTM again, where this BiLSTM mainly captures the local inference information m_a and m_b together with context information.
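The attention computation referenced above appears only as a formula image in the original and does not survive in this text. Assuming the standard ESIM-style soft alignment over the BiLSTM outputs (an assumption consistent with the surrounding steps, not a formula taken from the source), it can be written as:

```latex
e_{ij} = \bar{a}_i^{\top}\,\bar{b}_j, \qquad
\tilde{a}_i = \sum_{j=1}^{\ell_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_b}\exp(e_{ik})}\,\bar{b}_j, \qquad
\tilde{b}_j = \sum_{i=1}^{\ell_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_a}\exp(e_{kj})}\,\bar{a}_i
```

Here $\bar{a}$ and $\bar{b}$ are the BiLSTM outputs for the two short texts, and $\tilde{a}$ and $\tilde{b}$ would correspond to the weighted third and fourth sequences that are then combined with $\bar{a}$ and $\bar{b}$ by subtraction, multiplication, and splicing.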
  • v_a and v_b are input into the pooling layer in turn.
  • a softmax output layer can be used with 2 output categories; the output value is a number between 0 and 1, namely the similarity value.
  • the first similarity recognition result is further determined according to the similarity value, where the closer the similarity value is to 1, the more similar the two input sentences are; otherwise, the less similar they are.
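The stages above (soft-alignment attention, subtract/multiply/splice, pooling, two-category softmax output) can be sketched in NumPy. This is a minimal illustration under assumed toy dimensions, not the application's actual implementation: the two BiLSTM stages are omitted (their outputs are taken as given inputs), and `w_out` is a hypothetical output-layer weight matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def similarity_head(a_bar, b_bar, w_out):
    """Sketch of the stages after the first BiLSTM.

    a_bar, b_bar: BiLSTM outputs for the two short texts, shape (len, d).
    w_out: hypothetical softmax output-layer weights, shape (16 * d, 2).
    The second BiLSTM is skipped here for brevity.
    """
    # soft-alignment attention weights between the two sequences
    e = a_bar @ b_bar.T                          # shape (len_a, len_b)
    a_tilde = softmax(e, axis=1) @ b_bar         # weighted "third sequence"
    b_tilde = softmax(e, axis=0).T @ a_bar       # weighted "fourth sequence"
    # subtract and multiply, then splice with the original sequences
    m_a = np.concatenate([a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde], axis=1)
    m_b = np.concatenate([b_bar, b_tilde, b_bar - b_tilde, b_bar * b_tilde], axis=1)
    # pooling layer: average- and max-pool over time, then splice
    v = np.concatenate([m_a.mean(axis=0), m_a.max(axis=0),
                        m_b.mean(axis=0), m_b.max(axis=0)])
    # softmax output layer with 2 categories; element 1 is the similarity value
    probs = softmax(v @ w_out)
    return float(probs[1])
```

With any finite inputs the returned value lies strictly between 0 and 1, matching the similarity value described in the text.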
  • the first target recognition result can be obtained in advance from the labels in the text pairs to be tested. After the first similarity recognition result is obtained, it can be matched against the first target recognition result, and the first accuracy loss is then determined according to the similarity between the two.
  • the loss function of the training process is softmaxwithloss
  • the learning rate can be initially 1e-3
  • the learning rate is set to decay dynamically during training. After training converges, the similarity recognition model is saved.
  • step 210 of the embodiment may specifically include: if it is determined that the data amount of the target domain data set is less than or equal to the first preset threshold and the text similarity is greater than the second preset threshold, modifying the output category of the softmax layer in the semantic similarity recognition model; if it is determined that the data amount of the target domain data set is less than or equal to the first preset threshold and the text similarity is less than or equal to the second preset threshold, freezing the initial layers of the semantic similarity recognition model and retraining the remaining layers; if it is determined that the data amount of the target domain data set is greater than the first preset threshold and the text similarity is less than or equal to the second preset threshold, retraining the semantic similarity recognition model using the target domain data set; if it is determined that the data amount of the target domain data set is greater than the first preset threshold and the text similarity is greater than the second preset threshold, retaining the architecture and initial weights of the semantic similarity recognition model and further training it on the target domain data set.
  • this application is applicable to situations where the amount of data is small but the data similarity is high, in which case the softmax output layer remains the same.
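The four-way decision in step 210 can be condensed into a small helper. This is an illustrative sketch: the threshold values are left as parameters because the application only calls them the first and second preset thresholds, and the fourth branch, whose sentence is cut off in the source text, is completed here with the conventional choice of keeping the pretrained architecture and weights and fine-tuning the whole model.

```python
def choose_adaptation_strategy(target_data_amount, text_similarity,
                               first_threshold, second_threshold):
    """Map target-domain data amount and text similarity to a
    transfer-learning strategy (threshold values are assumptions)."""
    small = target_data_amount <= first_threshold
    similar = text_similarity > second_threshold
    if small and similar:
        return "modify the output category of the softmax layer"
    if small and not similar:
        return "freeze the initial layers and retrain the remaining layers"
    if not small and not similar:
        return "retrain the model using the target domain data set"
    # large data set, high similarity: the source sentence is truncated;
    # the usual choice is to keep architecture/initial weights and fine-tune
    return "keep the architecture and initial weights and fine-tune"
```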
  • the positive training samples can be labeled by user clicks and other behaviors.
  • different query commands can be treated as similar questions.
  • step 212 of the embodiment may specifically include: randomly selecting two short text sentences from the target domain data set to construct sample sentence pairs, and performing similarity calculation on the sample sentence pairs based on the Jaccard similarity measure to obtain a similarity calculation result; if the similarity calculation result is greater than the third preset threshold, the corresponding sample sentence pair is determined as a negative training sample.
  • J(A, B) is the similarity calculation result
  • A is a short text sentence in the sample sentence pair
  • B is another short text sentence in the sample sentence pair.
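The Jaccard-based negative-sample screening in step 212 can be sketched as follows. Treating each sentence as its character set is an assumption for illustration, since the source does not state how A and B are turned into sets.

```python
def jaccard_similarity(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B|, here over the character sets of
    two short text sentences (set construction is an assumption)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

def screen_negative_samples(sample_pairs, third_threshold):
    """Keep sample sentence pairs whose Jaccard similarity exceeds the
    third preset threshold as negative training samples."""
    return [pair for pair in sample_pairs
            if jaccard_similarity(*pair) > third_threshold]
```

For example, `jaccard_similarity("abc", "abd")` is 2/4 = 0.5, so with a third preset threshold of 0.4 that pair would be kept as a negative sample.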
  • the positive training samples and negative training samples can be input into the adjusted semantic similarity recognition model, and the semantic similarity recognition model can be further trained and revised to obtain the corresponding second similarity recognition result.
  • the second target recognition result can be obtained in advance from the labels in the positive and negative training samples. After the second similarity recognition result is obtained, it can be matched against the second target recognition result, and the second accuracy loss is then determined according to the similarity between the two.
  • the loss function of the training process is softmaxwithloss
  • the learning rate can be initially 1e-4
  • the learning rate is set to decay dynamically during training; after training converges and the recognition accuracy is greater than or equal to the recognition accuracy set in the preset standard, the semantic similarity recognition model is saved.
  • the two target short texts to be recognized for semantic similarity can be input into the semantic similarity recognition model to obtain the similarity between the two target short texts.
  • step 217 of the embodiment may specifically include: comparing the similarity value with the fourth preset threshold and the fifth preset threshold; if it is determined that the similarity value is less than the fourth preset threshold, determining that the semantic similarity recognition result is dissimilar; if it is determined that the similarity value is greater than or equal to the fourth preset threshold and less than the fifth preset threshold, determining that the semantic similarity recognition result is moderately similar; if it is determined that the similarity value is greater than or equal to the fifth preset threshold, determining that the semantic similarity recognition result is highly similar; and outputting the similarity recognition result.
  • the way of determining the semantic similarity recognition result from the similarity value is not limited to the above case and can be implemented in multiple ways. For example, only one preset threshold may be set: when the similarity value is greater than the preset threshold, the semantic similarity recognition result is judged as similar; otherwise it is judged as dissimilar.
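The two-threshold decision of step 217, and the single-threshold variant just mentioned, can both be sketched as below; the numeric defaults are illustrative assumptions, since the application only speaks of preset thresholds.

```python
def recognition_result(similarity, fourth_threshold=0.5, fifth_threshold=0.8):
    """Map a similarity value in [0, 1] to a recognition result using
    the fourth and fifth preset thresholds (defaults are assumptions)."""
    if similarity < fourth_threshold:
        return "dissimilar"
    if similarity < fifth_threshold:
        return "moderately similar"
    return "highly similar"

def recognition_result_single(similarity, threshold=0.5):
    """Single-threshold variant: similar vs. dissimilar."""
    return "similar" if similarity > threshold else "dissimilar"
```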
  • data from labeled fields can be used to the maximum extent to train the semantic similarity recognition model, and the model is then applied to the target field based on the idea of transfer learning, requiring only a moderate amount of labeling in the target field.
  • this method can not only learn the semantic information of short text similarity from general data, but can also apply this prior knowledge in a targeted manner to short text similarity calculation in the target field, improving the calculation effect within the field, solving the problem of obtaining a large amount of training data in the target field, and improving the accuracy and efficiency of semantic similarity calculation.
  • an embodiment of the present application provides a text semantic similarity analysis device.
  • the device includes: an acquisition module 31, a training module 32, an adjustment module 33, an input module 34, and a determination module 35.
  • the obtaining module 31 can be used to obtain a general data set and a target field data set
  • the training module 32 can be used to train a semantic similarity recognition model using a general data set as a training sample
  • the adjustment module 33 can be used to adjust the semantic similarity recognition model by using the target domain data set as the transfer learning sample;
  • the input module 34 can be used to input the target short text to be recognized for semantic similarity into the adjusted semantic similarity recognition model to obtain the semantic similarity;
  • the determining module 35 can be used to determine the semantic similarity recognition result based on the semantic similarity.
  • the training module 32 can be specifically used to arbitrarily select two short texts from the general data set to form a text pair to be tested;
  • the text pair to be tested is preprocessed and input to the Embedding layer in the semantic similarity recognition model to obtain the first sequence and the second sequence, where the first sequence corresponds to the mapping result of one of the short texts in the text pair to be tested;
  • the second sequence corresponds to the mapping result of the other short text in the text pair to be tested;
  • the first sequence and the second sequence are input into the bidirectional long short-term memory network BiLSTM to obtain the corresponding first vector and second vector; the difference between the first vector and the second vector is calculated to obtain the weighted third sequence corresponding to the first vector and the weighted fourth sequence corresponding to the second vector;
  • the feature vector is calculated according to the first sequence, the second sequence, the third sequence, and the fourth sequence, and the first similarity recognition result is output;
  • the adjustment module 33 can be specifically used to adjust the semantic similarity recognition model according to the data amount of the target domain data set and the degree of text similarity; construct positive training samples based on historical data records in the target field data set; screen negative training samples based on the Jaccard similarity measure; input the positive training samples and the negative training samples into the adjusted semantic similarity recognition model to obtain the second similarity recognition result; determine the second accuracy loss of the second similarity recognition result relative to the second target recognition result; and determine the second loss function based on the second accuracy loss, using the second loss function to optimize the adjusted semantic similarity recognition model so that the recognition accuracy of the semantic similarity recognition model meets a preset standard.
  • the adjustment module 33 can be specifically used to: if it is determined that the data amount of the target field data set is less than or equal to the first preset threshold and the text similarity is greater than the second preset threshold, modify the output category of the softmax layer in the semantic similarity recognition model; if it is determined that the data amount of the target domain data set is less than or equal to the first preset threshold and the text similarity is less than or equal to the second preset threshold, freeze the initial layers in the semantic similarity recognition model and retrain the remaining layers; if it is determined that the data amount of the target domain data set is greater than the first preset threshold and the text similarity is less than or equal to the second preset threshold, retrain the semantic similarity recognition model using the target domain data set; if it is determined that the data amount of the target domain data set is greater than the first preset threshold and the text similarity is greater than the second preset threshold, retain the architecture and initial weights of the semantic similarity recognition model and further train it on the target domain data set.
  • the adjustment module 33 can be specifically used to randomly select two short text sentences from the target field data set to construct sample sentence pairs, perform similarity calculation on the sample sentence pairs based on the Jaccard similarity measure to obtain a similarity calculation result, and, if the similarity calculation result is greater than the third preset threshold, determine the corresponding sample sentence pair as a negative training sample.
  • J(A, B) is the similarity calculation result
  • A is a short text sentence in the sample sentence pair
  • B is another short text sentence in the sample sentence pair.
  • the determining module 35 may be specifically configured to compare the similarity value with the fourth preset threshold and the fifth preset threshold; if it is determined that the similarity value is less than the fourth preset threshold, determine that the semantic similarity recognition result is dissimilar; if it is determined that the similarity value is greater than or equal to the fourth preset threshold and less than the fifth preset threshold, determine that the semantic similarity recognition result is moderately similar; if it is determined that the similarity value is greater than or equal to the fifth preset threshold, determine that the semantic similarity recognition result is highly similar;
  • in order to display the semantic similarity recognition result on a display page, as shown in FIG. 4, the device further includes: an output module 36.
  • the output module 36 is used to output the similarity recognition result.
  • an embodiment of the present application also provides a storage medium on which a computer program is stored.
  • the storage medium may be non-volatile or volatile.
  • the technical solution of the present application can be embodied in the form of a software product.
  • the software product can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) and includes several instructions used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute the methods in each implementation scenario of the present application.
  • the embodiments of the present application also provide a computer device, which may be a personal computer, a server, a network device, etc.
  • the physical device includes a storage medium and a processor; the storage medium is used to store a computer program; the processor is used to execute the computer program to realize the text semantic similarity analysis method shown in FIG. 1 and FIG. 2.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the technical field of computers. Disclosed in the present application are a text semantic similarity analysis method and apparatus, and a computer device, which can solve the problems that when similarity analysis is carried out on short text in a target domain, short text similarity data is difficult to obtain and label, a short text similarity algorithm effect is easily affected by the data labeling quality, so that a calculation result is unstable. The method comprises: obtaining a universal data set and a target domain data set; training a semantic similarity recognition model by taking the universal data set as a training sample; adjusting the semantic similarity recognition model by using the target domain data set as a transfer learning sample; inputting target short text to be subjected to semantic similarity recognition into the adjusted semantic similarity recognition model to obtain semantic similarity; and determining a semantic similarity recognition result on the basis of the semantic similarity. The present application is suitable for analyzing text semantic similarity in a target domain.

Description

Analysis method, apparatus and computer equipment for text semantic similarity
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 14, 2020, with application number 202010092595.3 and titled "Analysis Method, Apparatus and Computer Equipment for Text Semantic Similarity", the entire content of which is incorporated into this application by reference.
Technical Field
This application relates to the field of natural language processing technology, and in particular to a method, apparatus and computer equipment for analyzing text semantic similarity.
Background
Semantic similarity calculation can also be called text matching. Text matching is a common problem in many natural language processing applications. Short text similarity refers to similarity calculation over texts whose length falls within a certain range. Compared with long text, short text carries less information, which makes similarity calculation more challenging. Current short text similarity calculation mainly adopts deep learning methods, which first require manually labeling a large amount of data and then using the labeled data to compute similarity.
However, the inventor found that with existing domain-specific short text similarity calculation, if public data in the field is scarce, short text similarity data is difficult to obtain and label, and the effect of the short text similarity algorithm is easily affected by data annotation quality, leading to unstable calculation results.
Summary
In view of this, this application provides a text semantic similarity analysis method, apparatus and computer equipment, mainly to address the problems that, when performing similarity analysis on short texts in a target field, short text similarity data is difficult to obtain and label, and the effect of short text similarity algorithms is easily affected by data annotation quality, leading to unstable analysis results.
According to one aspect of the present application, a method for analyzing text semantic similarity is provided. The method includes:
obtaining a general data set and a target field data set;
training a semantic similarity recognition model using the general data set as training samples;
adjusting the semantic similarity recognition model using the target field data set as transfer learning samples;
inputting the target short text to be recognized for semantic similarity into the adjusted semantic similarity recognition model to obtain the semantic similarity;
determining a semantic similarity recognition result based on the semantic similarity.
According to another aspect of the present application, a text semantic similarity analysis apparatus is provided. The apparatus includes:
an acquisition module, used to obtain a general data set and a target field data set;
a training module, used to train a semantic similarity recognition model using the general data set as training samples;
an adjustment module, used to adjust the semantic similarity recognition model using the target field data set as transfer learning samples;
an input module, used to input the target short text to be recognized for semantic similarity into the adjusted semantic similarity recognition model to obtain the semantic similarity;
a determining module, used to determine a semantic similarity recognition result based on the semantic similarity.
According to another aspect of the present application, a non-volatile readable storage medium is provided, on which a computer program is stored, and when the program is executed by a processor, the above text semantic similarity analysis method is realized.
According to yet another aspect of the present application, a computer device is provided, including a non-volatile readable storage medium, a processor, and a computer program stored on the non-volatile readable storage medium and runnable on the processor, wherein the processor, when executing the program, implements the above text semantic similarity analysis method.
By means of the above technical solutions, this application improves the analysis effect within the target field and thereby also solves the problem of obtaining a large amount of training data in the target field.
Description of the Drawings
The drawings described here are used to provide a further understanding of the application and constitute a part of the application. The exemplary embodiments of the application and their descriptions are used to explain the application and do not constitute an improper limitation of the application. In the drawings:
FIG. 1 shows a schematic flowchart of a method for analyzing text semantic similarity provided by an embodiment of the present application;
FIG. 2 shows a schematic flowchart of another method for analyzing text semantic similarity provided by an embodiment of the present application;
FIG. 3 shows a schematic structural diagram of a text semantic similarity analysis apparatus provided by an embodiment of the present application;
FIG. 4 shows a schematic structural diagram of another apparatus for analyzing text semantic similarity provided by an embodiment of the present application.
具体实施方式Detailed ways
本申请实施例提供了一种文本语义相似度的分析方法,如图1所示,该 方法包括:The embodiment of the present application provides a method for analyzing text semantic similarity. As shown in FIG. 1, the method includes:
101、获取通用数据集以及目标领域数据集。101. Obtain general data sets and target domain data sets.
其中,通用数据集可为:由ATEC2018蚂蚁金服短文本语义相似度竞赛,CCKS2018微众银行智能客服问句匹配大赛,哈工大整理的数据集LCQMC等方式获取到的40万短文本相似度数据集;目标领域数据集可为目标领域内的历史数据记录、搜索引擎等积累数据等。Among them, the general data set can be: 400,000 short text similarity data sets obtained by ATEC2018 Ant Financial Short Text Semantic Similarity Competition, CCKS2018 WeBank Intelligent Customer Service Question Matching Competition, Harbin Institute of Technology's data set LCQMC and other methods. ; The target field data set can be historical data records in the target field, search engines and other accumulated data.
102. Train a semantic similarity recognition model using the general data set as training samples.
In a specific application scenario, computing similarity requires annotating whether two sentences are similar, and the amount of data cannot be too small and must have a certain degree of generality, which makes annotation an arduous task. For this reason, short-text similarity computation has long been a topic worth studying. In the present application, a general data set with a large amount of data may be selected as training samples to preliminarily train the semantic similarity recognition model.
103. Adjust the semantic similarity recognition model using the target-field data set as transfer learning samples.
In a specific application scenario, algorithms can be developed to make maximum use of knowledge from an annotated domain to assist knowledge acquisition and learning in the target domain. The core is to find the similarities between the source domain and the target domain and exploit them appropriately. Such similarity is very common: for example, a model trained to recognize cars can be used to improve the ability to recognize karts. Transfer learning can thus store and reuse prior knowledge from different but related problems.
104. Input the target short texts to be subjected to semantic similarity recognition into the adjusted semantic similarity recognition model to obtain the semantic similarity.
In a specific application scenario, after the adjustment of the similarity recognition model is completed, the model can be applied to short-text similarity detection in the target field: given an input short-text pair, it outputs the corresponding similarity.
105. Determine a semantic similarity recognition result based on the semantic similarity.
Correspondingly, the similarity recognition result corresponding to the semantic similarity can be determined by setting a similarity threshold.
Through the text semantic similarity analysis method of this embodiment, the idea of transfer learning can be used to learn a general-field short-text similarity model from a large number of existing public data sets. It is then only necessary to annotate a moderate amount of data in the target field and use this annotated data for fine-grained learning, realizing short-text similarity analysis in the target field. Compared with directly using general data, financial data, or a mixture of the two, this approach both learns the semantic information of short-text similarity from the general data and applies this prior knowledge to short-text similarity analysis in the target field in a targeted manner, improving the analysis performance within the field and also solving the problem of obtaining a large amount of training data in the target field.
Further, as a refinement and extension of the specific implementation of the foregoing embodiment, in order to fully illustrate the specific implementation process of this embodiment, another text semantic similarity analysis method is provided. As shown in FIG. 2, the method includes:
201. Obtain a general data set and a target-field data set.
For this embodiment, in a specific application scenario, deep-learning-based short-text similarity requires a large amount of manually annotated data, but little data is available in the target field, so the analysis performance of short-text similarity within the target field is unsatisfactory. Therefore, a general data set can be used as a substitute in the early training stage, after which the acquired target-field data set is used to further refine the training. Accordingly, in the present application a large general data set needs to be obtained in advance, and a predetermined number of target-field data sets that meet the refinement criteria should be collected as far as possible.
202. Arbitrarily select two short texts from the general data set to form a text pair to be tested.
For this embodiment, in a specific application scenario, in order to ensure the accuracy of training, short texts can be randomly drawn from the general data set to form text pairs to be tested, which are used to train the semantic similarity recognition model repeatedly and comprehensively.
203. Preprocess the text pair to be tested and input it into the Embedding layer of the semantic similarity recognition model to obtain a first sequence and a second sequence, where the first sequence corresponds to the mapping result of one short text in the pair and the second sequence corresponds to the mapping result of the other short text.
For example, given two input sentences A and B, after preprocessing and mapping through the Embedding layer, the first sequence a = (a_1, ..., a_{l_a}) and the second sequence b = (b_1, ..., b_{l_b}) are obtained, where a_i, b_j ∈ R^l are l-dimensional vectors output by the Embedding layer.
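As a minimal illustration (the vocabulary, tokenization, and embedding matrix below are hypothetical placeholders, not part of the application; in practice they come from the model's trained Embedding layer), the mapping of a tokenized short text to its sequence of l-dimensional vectors can be sketched as:

```python
import numpy as np

# Hypothetical vocabulary and embedding matrix; in the application these
# would come from the model's trained Embedding layer.
rng = np.random.default_rng(0)
vocab = {tok: idx for idx, tok in enumerate(["你", "好", "吗", "在", "么"])}
l_dim = 8                                      # embedding dimension l
emb = rng.standard_normal((len(vocab), l_dim))

def embed(tokens):
    """Map a tokenized short text to its sequence of l-dimensional vectors."""
    return emb[[vocab[t] for t in tokens]]

a = embed(["你", "好", "吗"])   # first sequence, shape (l_a, l)
b = embed(["你", "在", "么"])   # second sequence, shape (l_b, l)
```

Each row of the result is one token's embedding vector, so the two sentences of a text pair become the two sequences a and b described above.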
204. Input the first sequence and the second sequence into a bidirectional long short-term memory network (BiLSTM) to obtain a corresponding first vector and second vector.
For example, the first sequence and the second sequence obtained in step 203 of the embodiment are input into the bidirectional long short-term memory network BiLSTM. The BiLSTM can learn each word of a sentence together with its context, yielding new embedding vectors, namely:

ā_i = BiLSTM(a, i), i ∈ [1, l_a]

b̄_j = BiLSTM(b, j), j ∈ [1, l_b]

where ā_i denotes the output of a at the i-th time step of the BiLSTM network, and b̄_j denotes the output of b at the j-th time step of the BiLSTM network. From these formulas, the first vector ā = (ā_1, ..., ā_{l_a}) and the second vector b̄ = (b̄_1, ..., b̄_{l_b}) can be computed.
205. Compute the difference between the first vector and the second vector, and obtain a weighted third sequence corresponding to the first vector and a weighted fourth sequence corresponding to the second vector.
For example, based on step 204 of the embodiment, the first vector ā and the second vector b̄ can be obtained, and the difference between them is computed; here an attention model can be applied. The attention weights are computed as:

e_ij = ā_i^T · b̄_j

Based on the above attention weights, the weighted values of a and b are then computed respectively, namely:

ã_i = Σ_{j=1..l_b} [ exp(e_ij) / Σ_{k=1..l_b} exp(e_ik) ] · b̄_j, i ∈ [1, l_a]

b̃_j = Σ_{i=1..l_a} [ exp(e_ij) / Σ_{k=1..l_a} exp(e_kj) ] · ā_i, j ∈ [1, l_b]

where ã = (ã_1, ..., ã_{l_a}) is the third sequence and b̃ = (b̃_1, ..., b̃_{l_b}) is the fourth sequence.
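The soft-alignment step above can be sketched in NumPy as follows (a minimal sketch: the BiLSTM outputs ā and b̄ are assumed to be given as matrices, and the sequence lengths and dimension are illustrative):

```python
import numpy as np

def soft_align(a_bar, b_bar):
    """Compute the attention weights e = a_bar @ b_bar.T, then re-express
    each a_i as a softmax-weighted sum of the b_j (and vice versa)."""
    e = a_bar @ b_bar.T                      # (l_a, l_b) attention weights

    def softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        p = np.exp(x)
        return p / p.sum(axis=axis, keepdims=True)

    a_tilde = softmax(e, axis=1) @ b_bar     # third sequence, shape (l_a, l)
    b_tilde = softmax(e, axis=0).T @ a_bar   # fourth sequence, shape (l_b, l)
    return a_tilde, b_tilde

rng = np.random.default_rng(1)
a_bar, b_bar = rng.standard_normal((4, 8)), rng.standard_normal((5, 8))
a_tilde, b_tilde = soft_align(a_bar, b_bar)
```

Normalizing e over its columns gives the weights for ã, and normalizing over its rows gives the weights for b̃, matching the two summation formulas above.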
206. Compute a feature vector from the first sequence, the second sequence, the third sequence, and the fourth sequence.
In a specific application scenario, in order to fully capture the difference information and the interaction information between the two sentences, elementwise subtraction and elementwise multiplication are applied between the first vector and the third sequence and between the second vector and the fourth sequence, and the results are concatenated with those sequences, giving:

m_a = [ā; ã; ā − ã; ā ⊙ ã]

m_b = [b̄; b̃; b̄ − b̃; b̄ ⊙ b̃]

The resulting values are then fed into a BiLSTM again; here the BiLSTM mainly captures the local inference information contained in m_a and m_b together with its context, yielding v_a and v_b. Next, v_a and v_b are passed in turn through the pooling layers, which include a max pooling layer and an average pooling layer, after which the pooled results are concatenated once more to obtain the feature vector V = [v_{a,ave}; v_{a,max}; v_{b,ave}; v_{b,max}].
207. Output a first similarity recognition result based on the feature vector.
Correspondingly, after the feature vector is obtained, it can be passed through a softmax output layer with 2 output classes; the output value is a number between 0 and 1, i.e. the similarity value. The first similarity recognition result is then determined from the similarity value: the closer the similarity value is to 1, the more similar the two input sentences are; otherwise, the less similar they are.
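As a small illustration of this output layer (the logit values below are hypothetical; in the application they would come from a fully connected layer applied to the feature vector V):

```python
import numpy as np

def similarity_from_logits(logits):
    """Softmax over the two output classes; the probability of the
    'similar' class (index 1) serves as the similarity value in [0, 1]."""
    z = np.exp(logits - np.max(logits))
    return (z / z.sum())[1]

sim = similarity_from_logits(np.array([0.3, 2.1]))  # hypothetical logits
```

Because the two class probabilities sum to 1, reporting the probability of the "similar" class yields a single similarity value between 0 and 1.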
208. Determine a first accuracy loss of the first similarity recognition result relative to a first target recognition result.
In a specific application scenario, the first target recognition result can be obtained in advance from the labels of the text pair to be tested. After the first similarity recognition result is obtained, it can be matched against the first target recognition result, and the first accuracy loss is further determined from the similarity between the two.
209. Determine a first loss function based on the first accuracy loss, and optimize the semantic similarity recognition model using the first loss function.
For this embodiment, the loss function of the training process is softmax-with-loss; the learning rate can be initialized to 1e-3 and set to decay dynamically as training proceeds. After training converges, the similarity recognition model is saved.
210. Adjust the semantic similarity recognition model according to the data volume of the target-field data set and the magnitude of the text similarity.
For this embodiment, in a specific application scenario, step 210 may specifically include: if it is determined that the data volume of the target-field data set is less than or equal to a first preset threshold and the text similarity is greater than a second preset threshold, modifying the output classes of the softmax layer in the semantic similarity recognition model; if the data volume is less than or equal to the first preset threshold and the text similarity is less than or equal to the second preset threshold, freezing the initial layers of the semantic similarity recognition model and retraining the remaining layers; if the data volume is greater than the first preset threshold and the text similarity is less than or equal to the second preset threshold, retraining the semantic similarity recognition model with the target-field data set; and if the data volume is greater than the first preset threshold and the text similarity is greater than the second preset threshold, retaining the architecture and initial weights of the semantic similarity recognition model and using the initial weights to retrain it.
In a specific application scenario, the present application is applicable to situations where the amount of data is small but the data similarity is high, and the softmax output layer is the same. In the fine-tuning stage, the pre-trained model weights can be used directly, and the network can be further trained with a smaller learning rate (e.g. 1e-4) to obtain the final similarity detection model.
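The four branches of step 210 can be sketched as a simple decision function (the threshold parameters and strategy labels are illustrative names, not part of the application):

```python
def choose_adaptation_strategy(data_volume, text_similarity,
                               first_threshold, second_threshold):
    """Pick a transfer-learning strategy from the target-field data volume
    and the text similarity, mirroring the four cases of step 210."""
    small = data_volume <= first_threshold
    similar = text_similarity > second_threshold
    if small and similar:
        return "modify softmax output classes"
    if small and not similar:
        return "freeze initial layers, retrain remaining layers"
    if not small and not similar:
        return "retrain model on target-field data set"
    return "keep architecture and initial weights, retrain from them"

strategy = choose_adaptation_strategy(5_000, 0.9, 10_000, 0.5)
```

With a small but highly similar target-field data set, as in the call above, the chosen strategy is the softmax-modification branch, which matches the fine-tuning scenario described in the preceding paragraph.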
211. Construct positive training samples using the historical data records in the target-field data set.
For this embodiment, in a specific application scenario, the positive training samples can be annotated with guidance from user behavior such as clicks; for example, different queries that lead to the same search click behavior can be treated as similar questions.
212. Screen negative training samples based on the Jaccard similarity measure.
For this embodiment, in a specific application scenario, step 212 may specifically include: randomly selecting two short text sentences from the target-field data set to construct a sample sentence pair, and computing the similarity of the sample sentence pair based on the Jaccard similarity measure to obtain a similarity calculation result; if the similarity calculation result is greater than a third preset threshold, determining the corresponding sample sentence pair as a negative training sample.
The Jaccard similarity measure is computed as:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)

where J(A, B) is the similarity calculation result, A is one short text sentence of the sample sentence pair, and B is the other short text sentence of the sample sentence pair.
In a specific application scenario, when constructing negative training samples, in order to screen out a large number of completely unrelated sentence pairs as negative training samples, the similarity of the randomly selected pairwise sentence combinations needs to be computed in advance, and data that do not meet the similarity threshold are filtered out. At the same time, some sentence pairs with low similarity are also retained to ensure the diversity of the data. The similarity here only needs to reflect whether the literal wording is close.
For example, take sentence 1: "你是哪个公司的,找我干嘛?" and sentence 2: "你是哪个公司的,我不是你说的那个人。". After removing the punctuation, the two sentences can be converted into the character sets A = {你, 是, 哪, 个, 公, 司, 的, 找, 我, 干, 嘛} and B = {你, 是, 哪, 个, 公, 司, 的, 我, 不, 说, 那, 人}. Their union is A ∪ B = {你, 是, 哪, 个, 公, 司, 的, 找, 我, 干, 嘛, 不, 说, 那, 人} and their intersection is A ∩ B = {你, 是, 哪, 个, 公, 司, 的, 我}, so the Jaccard coefficient is: number of elements in the intersection / number of elements in the union = 8/15 ≈ 0.53, i.e. the Jaccard similarity of sentence 1 and sentence 2 is about 0.53. Afterwards, the similarity of a sentence pair whose Jaccard similarity is greater than or equal to a preset threshold can be set to 1 and otherwise to 0, and the sentence pairs with similarity 1 are further retained as negative training samples.
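The character-level Jaccard computation in the example above can be sketched directly (the punctuation-stripping rule here, dropping every non-word character, is one reasonable choice of preprocessing, not mandated by the application):

```python
import re

def jaccard(s1, s2):
    """Character-level Jaccard similarity of two short texts:
    strip punctuation, build character sets, then |A ∩ B| / |A ∪ B|."""
    a = set(re.sub(r"[^\w]", "", s1))  # \w matches CJK characters in Python 3
    b = set(re.sub(r"[^\w]", "", s2))
    return len(a & b) / len(a | b)

sim = jaccard("你是哪个公司的,找我干嘛?", "你是哪个公司的,我不是你说的那个人。")
```

For the two example sentences this yields 8/15 ≈ 0.53, matching the set computation above.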
213. Input the positive training samples and the negative training samples into the adjusted semantic similarity recognition model to obtain a second similarity recognition result.
In a specific application scenario, the positive and negative training samples can be input into the adjusted semantic similarity recognition model to further train and refine it, obtaining the corresponding second similarity recognition result.
214. Determine a second accuracy loss of the second similarity recognition result relative to a second target recognition result.
In a specific application scenario, the second target recognition result can be obtained in advance from the labels of the positive and negative training samples. After the second similarity recognition result is obtained, it can be matched against the second target recognition result, and the second accuracy loss is further determined from the similarity between the two.
215. Determine a second loss function based on the second accuracy loss, and optimize the adjusted semantic similarity recognition model using the second loss function, so that the recognition accuracy of the semantic similarity recognition model meets a preset standard.
For this embodiment, the loss function of the training process is softmax-with-loss; the learning rate can be initialized to 1e-4 and set to decay dynamically as training proceeds. After training converges and the recognition accuracy is greater than or equal to the accuracy specified in the preset standard, the semantic similarity recognition model is saved.
216. Input the target short texts to be subjected to semantic similarity recognition into the adjusted semantic similarity recognition model to obtain the semantic similarity.
In a specific application scenario, after the adjustment of the semantic similarity recognition model is completed, the two target short texts to be subjected to semantic similarity recognition can be input into the model to obtain the similarity between them.
217. Determine a semantic similarity recognition result based on the semantic similarity.
For this embodiment, in a specific application scenario, step 217 may specifically include: comparing the similarity value with a fourth preset threshold and a fifth preset threshold; if the similarity value is less than the fourth preset threshold, determining that the semantic similarity recognition result is dissimilar; if the similarity value is greater than or equal to the fourth preset threshold and less than the fifth preset threshold, determining that the semantic similarity recognition result is moderately similar; if the similarity value is greater than or equal to the fifth preset threshold, determining that the semantic similarity recognition result is highly similar; and outputting the similarity recognition result.
For this embodiment, it should be noted that the way of determining the semantic similarity recognition result from the similarity value is not limited to the above case and may include multiple implementations; for example, only a single preset threshold may be set, and when the similarity value is greater than that threshold, the semantic similarity recognition result is judged to be similar, and otherwise dissimilar.
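The two-threshold mapping of step 217 can be sketched as follows (the concrete threshold values 0.5 and 0.8 are illustrative defaults; the application only requires that the fourth threshold be below the fifth):

```python
def recognize(similarity, fourth_threshold=0.5, fifth_threshold=0.8):
    """Map a similarity value in [0, 1] to a recognition result using the
    fourth and fifth preset thresholds of step 217."""
    if similarity < fourth_threshold:
        return "dissimilar"
    if similarity < fifth_threshold:
        return "moderately similar"
    return "highly similar"

result = recognize(0.9)
```

The single-threshold variant mentioned above is the special case where the two thresholds coincide, collapsing the middle band to nothing.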
Through the above text semantic similarity analysis method, the data of the annotated domain can be used to the greatest extent to train the semantic similarity recognition model, and the model is then applied to the target field based on the idea of transfer learning. Only a moderate amount of target-field data needs to be annotated; the target-field data are used to adjust the semantic similarity recognition model, and training yields a similarity detection model suitable for the target field, thereby realizing the recognition and judgment of short-text similarity in the target field. Compared with directly using general data, target-field data, or a mixture of the two, this approach both learns the semantic information of short-text similarity from the general data and applies this prior knowledge to short-text similarity computation in the target field in a targeted manner, improving the computation performance within the field, solving the problem of obtaining a large amount of training data in the target field, and improving the accuracy and efficiency of semantic similarity computation.
Further, as a specific embodiment of the methods shown in FIG. 1 and FIG. 2, an embodiment of the present application provides a text semantic similarity analysis apparatus. As shown in FIG. 3, the apparatus includes: an acquisition module 31, a training module 32, an adjustment module 33, an input module 34, and a determination module 35.
The acquisition module 31 can be used to obtain a general data set and a target-field data set;
the training module 32 can be used to train a semantic similarity recognition model using the general data set as training samples;
the adjustment module 33 can be used to adjust the semantic similarity recognition model using the target-field data set as transfer learning samples;
the input module 34 can be used to input the target short texts to be subjected to semantic similarity recognition into the adjusted semantic similarity recognition model to obtain the semantic similarity;
the determination module 35 can be used to determine a semantic similarity recognition result based on the semantic similarity.
In a specific application scenario, in order to train the semantic similarity recognition model with the general data set, the training module 32 can specifically be used to: arbitrarily select two short texts from the general data set to form a text pair to be tested; preprocess the text pair to be tested and input it into the Embedding layer of the semantic similarity recognition model to obtain a first sequence and a second sequence, where the first sequence corresponds to the mapping result of one short text of the pair and the second sequence corresponds to the mapping result of the other; input the first sequence and the second sequence into a bidirectional long short-term memory network (BiLSTM) to obtain a corresponding first vector and second vector; compute the difference between the first vector and the second vector, and obtain a weighted third sequence corresponding to the first vector and a weighted fourth sequence corresponding to the second vector; compute a feature vector from the first sequence, the second sequence, the third sequence, and the fourth sequence; output a first similarity recognition result based on the feature vector; determine a first accuracy loss of the first similarity recognition result relative to a first target recognition result; and determine a first loss function based on the first accuracy loss and optimize the semantic similarity recognition model using the first loss function.
Correspondingly, in order to adjust the model into a semantic similarity recognition model suitable for the target field, the adjustment module 33 can specifically be used to: adjust the semantic similarity recognition model according to the data volume of the target-field data set and the magnitude of the text similarity; construct positive training samples using the historical data records in the target-field data set; screen negative training samples based on the Jaccard similarity measure; input the positive and negative training samples into the adjusted semantic similarity recognition model to obtain a second similarity recognition result; determine a second accuracy loss of the second similarity recognition result relative to a second target recognition result; and determine a second loss function based on the second accuracy loss and optimize the adjusted semantic similarity recognition model using the second loss function, so that the recognition accuracy of the semantic similarity recognition model meets a preset standard.
In a specific application scenario, in order to adjust the similarity recognition model according to the data volume of the target-field data set and the magnitude of the text similarity, the adjustment module 33 can specifically be used to: if it is determined that the data volume of the target-field data set is less than or equal to a first preset threshold and the text similarity is greater than a second preset threshold, modify the output classes of the softmax layer in the semantic similarity recognition model; if the data volume is less than or equal to the first preset threshold and the text similarity is less than or equal to the second preset threshold, freeze the initial layers of the semantic similarity recognition model and retrain the remaining layers; if the data volume is greater than the first preset threshold and the text similarity is less than or equal to the second preset threshold, retrain the semantic similarity recognition model with the target-field data set; and if the data volume is greater than the first preset threshold and the text similarity is greater than the second preset threshold, retain the architecture and initial weights of the semantic similarity recognition model and use the initial weights to retrain it.
Correspondingly, in order to screen out negative training samples based on the Jaccard similarity measure, the adjustment module 33 can specifically be used to: randomly select two short text sentences from the target-field data set to construct a sample sentence pair, and compute the similarity of the sample sentence pair based on the Jaccard similarity measure to obtain a similarity calculation result; and if the similarity calculation result is greater than a third preset threshold, determine the corresponding sample sentence pair as a negative training sample.
The Jaccard similarity measure is computed as:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)

where J(A, B) is the similarity calculation result, A is one short text sentence of the sample sentence pair, and B is the other short text sentence of the sample sentence pair.
In a specific application scenario, in order to determine the semantic similarity recognition result based on the semantic similarity, the determination module 35 can specifically be used to: compare the similarity value with a fourth preset threshold and a fifth preset threshold; if the similarity value is less than the fourth preset threshold, determine that the semantic similarity recognition result is dissimilar; if the similarity value is greater than or equal to the fourth preset threshold and less than the fifth preset threshold, determine that the semantic similarity recognition result is moderately similar; and if the similarity value is greater than or equal to the fifth preset threshold, determine that the semantic similarity recognition result is highly similar.
In a specific application scenario, in order to display the semantic similarity recognition result on a display page, as shown in FIG. 4, the apparatus further includes an output module 36.
The output module 36 is configured to output the similarity recognition result.
It should be noted that, for other corresponding descriptions of the functional units involved in the text semantic similarity analysis apparatus provided in this embodiment, reference may be made to the corresponding descriptions of FIG. 1 and FIG. 2, which are not repeated here.
Based on the methods shown in FIG. 1 and FIG. 2, an embodiment of the present application correspondingly further provides a storage medium on which a computer program is stored; when the program is executed by a processor, the text semantic similarity analysis method shown in FIG. 1 and FIG. 2 is implemented. The storage medium may be non-volatile or volatile. Based on this understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the implementation scenarios of the present application.
Based on the methods shown in FIG. 1 and FIG. 2 and the virtual apparatus embodiments shown in FIG. 3 and FIG. 4, to achieve the above objectives, an embodiment of the present application further provides a computer device, which may specifically be a personal computer, a server, a network device, or the like. The physical device includes a storage medium and a processor; the storage medium stores a computer program, and the processor executes the computer program to implement the text semantic similarity analysis method shown in FIG. 1 and FIG. 2.

Claims (20)

  1. A method for analyzing text semantic similarity, comprising:
    obtaining a general data set and a target domain data set;
    training a semantic similarity recognition model using the general data set as training samples;
    adjusting the semantic similarity recognition model using the target domain data set as transfer learning samples;
    inputting a target short text to be recognized for semantic similarity into the adjusted semantic similarity recognition model to obtain a semantic similarity; and
    determining a semantic similarity recognition result based on the semantic similarity.
  2. The method according to claim 1, wherein training the semantic similarity recognition model using the general data set as training samples specifically comprises:
    arbitrarily selecting two short texts from the general data set to form a text pair to be tested;
    preprocessing the text pair to be tested and inputting it into an Embedding layer of the semantic similarity recognition model to obtain a first sequence and a second sequence, the first sequence corresponding to the mapping result of one short text in the text pair to be tested, and the second sequence corresponding to the mapping result of the other short text in the text pair to be tested;
    inputting the first sequence and the second sequence into a bidirectional long short-term memory network (BiLSTM) to obtain a corresponding first vector and second vector;
    calculating the difference between the first vector and the second vector, and obtaining a weighted third sequence corresponding to the first vector and a weighted fourth sequence corresponding to the second vector;
    calculating a feature vector from the first sequence, the second sequence, the third sequence, and the fourth sequence;
    outputting a first similarity recognition result based on the feature vector;
    determining a first accuracy loss of the first similarity recognition result relative to a first target recognition result; and
    determining a first loss function based on the first accuracy loss, and optimizing the semantic similarity recognition model using the first loss function.
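The pipeline recited in claim 2 (Embedding, BiLSTM, soft-alignment weighting, feature vector, similarity output) can be sketched as follows. This is a highly simplified illustration assuming PyTorch; the vocabulary size, dimensions, mean pooling, and the exact attention weighting are assumptions for illustration and are not specified by the claim:

```python
import torch
import torch.nn as nn

class SimilarityModel(nn.Module):
    """Sketch: Embedding -> BiLSTM -> cross-attention "weighted" sequences
    -> pooled feature vector -> softmax similarity output."""
    def __init__(self, vocab_size=1000, emb_dim=32, hidden=32, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(8 * hidden, num_classes)  # 4 pooled 2*hidden vectors

    def forward(self, tokens_a, tokens_b):
        # First and second sequences: embedded token ids of the two short texts.
        seq_a, seq_b = self.embed(tokens_a), self.embed(tokens_b)
        vec_a, _ = self.bilstm(seq_a)   # first vector  (B, La, 2*hidden)
        vec_b, _ = self.bilstm(seq_b)   # second vector (B, Lb, 2*hidden)
        # Cross-attention: weight each sequence by its similarity to the other,
        # giving the "weighted" third and fourth sequences.
        attn_a = torch.softmax(vec_a @ vec_b.transpose(1, 2), dim=-1)
        attn_b = torch.softmax(vec_b @ vec_a.transpose(1, 2), dim=-1)
        weighted_a = attn_a @ vec_b
        weighted_b = attn_b @ vec_a
        # Pool and combine into a single feature vector, then classify.
        feats = torch.cat([vec_a.mean(1), vec_b.mean(1),
                           weighted_a.mean(1), weighted_b.mean(1)], dim=-1)
        return torch.softmax(self.classifier(feats), dim=-1)
```

In training, the softmax output would be compared against the target label to compute the accuracy loss that drives optimization.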
  3. The method according to claim 2, wherein adjusting the semantic similarity recognition model using the target domain data set as transfer learning samples specifically comprises:
    adjusting the semantic similarity recognition model according to the data volume of the target domain data set and the degree of text similarity;
    constructing positive training samples from historical data records in the target domain data set;
    screening negative training samples based on the Jaccard similarity measure;
    inputting the positive training samples and the negative training samples into the adjusted semantic similarity recognition model to obtain a second similarity recognition result;
    determining a second accuracy loss of the second similarity recognition result relative to a second target recognition result; and
    determining a second loss function based on the second accuracy loss, and optimizing the adjusted semantic similarity recognition model using the second loss function, so that the recognition accuracy of the semantic similarity recognition model meets a preset standard.
  4. The method according to claim 3, wherein adjusting the semantic similarity recognition model according to the data volume of the target domain data set and the degree of text similarity specifically comprises:
    if it is determined that the data volume of the target domain data set is less than or equal to a first preset threshold and the text similarity is greater than a second preset threshold, modifying the output categories of the softmax layer in the semantic similarity recognition model;
    if it is determined that the data volume of the target domain data set is less than or equal to the first preset threshold and the text similarity is less than or equal to the second preset threshold, freezing the initial layers of the semantic similarity recognition model and retraining the remaining layers;
    if it is determined that the data volume of the target domain data set is greater than the first preset threshold and the text similarity is less than or equal to the second preset threshold, retraining the semantic similarity recognition model with the target domain data set; and
    if it is determined that the data volume of the target domain data set is greater than the first preset threshold and the text similarity is greater than the second preset threshold, retaining the architecture and initial weights of the semantic similarity recognition model and retraining the semantic similarity recognition model from the initial weights.
  5. The method according to claim 3, wherein screening negative training samples based on the Jaccard similarity measure specifically comprises:
    randomly selecting two short text sentences from the target domain data set to construct a sample sentence pair, and computing the similarity of the sample sentence pair based on the Jaccard similarity measure to obtain a similarity calculation result; and
    if the similarity calculation result is greater than a third preset threshold, determining the corresponding sample sentence pair as a negative training sample.
  6. The method according to claim 5, wherein the Jaccard similarity measure is computed as:
    J(A, B) = |A ∩ B| / |A ∪ B|
    where J(A, B) is the similarity calculation result, A is one short text sentence in the sample sentence pair, and B is the other short text sentence in the sample sentence pair.
  7. The method according to claim 6, wherein determining the semantic similarity recognition result based on the semantic similarity specifically comprises:
    comparing the similarity value with a fourth preset threshold and a fifth preset threshold;
    if it is determined that the similarity value is less than the fourth preset threshold, determining that the semantic similarity recognition result is dissimilar;
    if it is determined that the similarity value is greater than or equal to the fourth preset threshold and less than the fifth preset threshold, determining that the semantic similarity recognition result is moderately similar; and
    if it is determined that the similarity value is greater than or equal to the fifth preset threshold, determining that the semantic similarity recognition result is highly similar;
    after determining the semantic similarity recognition result based on the semantic similarity, the method specifically further comprises:
    outputting the similarity recognition result.
  8. An apparatus for analyzing text semantic similarity, comprising:
    an acquisition module, configured to obtain a general data set and a target domain data set;
    a training module, configured to train a semantic similarity recognition model using the general data set as training samples;
    an adjustment module, configured to adjust the semantic similarity recognition model using the target domain data set as transfer learning samples;
    an input module, configured to input a target short text to be recognized for semantic similarity into the adjusted semantic similarity recognition model to obtain a semantic similarity; and
    a determining module, configured to determine a semantic similarity recognition result based on the semantic similarity.
  9. A non-volatile readable storage medium on which a computer program is stored, wherein when the program is executed by a processor, the following steps are implemented:
    obtaining a general data set and a target domain data set;
    training a semantic similarity recognition model using the general data set as training samples;
    adjusting the semantic similarity recognition model using the target domain data set as transfer learning samples;
    inputting a target short text to be recognized for semantic similarity into the adjusted semantic similarity recognition model to obtain a semantic similarity; and
    determining a semantic similarity recognition result based on the semantic similarity.
  10. The non-volatile readable storage medium according to claim 9, wherein training the semantic similarity recognition model using the general data set as training samples specifically comprises:
    arbitrarily selecting two short texts from the general data set to form a text pair to be tested;
    preprocessing the text pair to be tested and inputting it into an Embedding layer of the semantic similarity recognition model to obtain a first sequence and a second sequence, the first sequence corresponding to the mapping result of one short text in the text pair to be tested, and the second sequence corresponding to the mapping result of the other short text in the text pair to be tested;
    inputting the first sequence and the second sequence into a bidirectional long short-term memory network (BiLSTM) to obtain a corresponding first vector and second vector;
    calculating the difference between the first vector and the second vector, and obtaining a weighted third sequence corresponding to the first vector and a weighted fourth sequence corresponding to the second vector;
    calculating a feature vector from the first sequence, the second sequence, the third sequence, and the fourth sequence;
    outputting a first similarity recognition result based on the feature vector;
    determining a first accuracy loss of the first similarity recognition result relative to a first target recognition result; and
    determining a first loss function based on the first accuracy loss, and optimizing the semantic similarity recognition model using the first loss function.
  11. The non-volatile readable storage medium according to claim 10, wherein adjusting the semantic similarity recognition model using the target domain data set as transfer learning samples specifically comprises:
    adjusting the semantic similarity recognition model according to the data volume of the target domain data set and the degree of text similarity;
    constructing positive training samples from historical data records in the target domain data set;
    screening negative training samples based on the Jaccard similarity measure;
    inputting the positive training samples and the negative training samples into the adjusted semantic similarity recognition model to obtain a second similarity recognition result;
    determining a second accuracy loss of the second similarity recognition result relative to a second target recognition result; and
    determining a second loss function based on the second accuracy loss, and optimizing the adjusted semantic similarity recognition model using the second loss function, so that the recognition accuracy of the semantic similarity recognition model meets a preset standard.
  12. The non-volatile readable storage medium according to claim 11, wherein adjusting the semantic similarity recognition model according to the data volume of the target domain data set and the degree of text similarity specifically comprises:
    if it is determined that the data volume of the target domain data set is less than or equal to a first preset threshold and the text similarity is greater than a second preset threshold, modifying the output categories of the softmax layer in the semantic similarity recognition model;
    if it is determined that the data volume of the target domain data set is less than or equal to the first preset threshold and the text similarity is less than or equal to the second preset threshold, freezing the initial layers of the semantic similarity recognition model and retraining the remaining layers;
    if it is determined that the data volume of the target domain data set is greater than the first preset threshold and the text similarity is less than or equal to the second preset threshold, retraining the semantic similarity recognition model with the target domain data set; and
    if it is determined that the data volume of the target domain data set is greater than the first preset threshold and the text similarity is greater than the second preset threshold, retaining the architecture and initial weights of the semantic similarity recognition model and retraining the semantic similarity recognition model from the initial weights.
  13. The non-volatile readable storage medium according to claim 11, wherein screening negative training samples based on the Jaccard similarity measure specifically comprises:
    randomly selecting two short text sentences from the target domain data set to construct a sample sentence pair, and computing the similarity of the sample sentence pair based on the Jaccard similarity measure to obtain a similarity calculation result; and
    if the similarity calculation result is greater than a third preset threshold, determining the corresponding sample sentence pair as a negative training sample.
  14. The non-volatile readable storage medium according to claim 13, wherein the Jaccard similarity measure is computed as:
    J(A, B) = |A ∩ B| / |A ∪ B|
    where J(A, B) is the similarity calculation result, A is one short text sentence in the sample sentence pair, and B is the other short text sentence in the sample sentence pair.
  15. A computer device, comprising a non-volatile readable storage medium, a processor, and a computer program stored on the non-volatile readable storage medium and executable on the processor, wherein when the processor executes the program, the following steps are implemented:
    obtaining a general data set and a target domain data set;
    training a semantic similarity recognition model using the general data set as training samples;
    adjusting the semantic similarity recognition model using the target domain data set as transfer learning samples;
    inputting a target short text to be recognized for semantic similarity into the adjusted semantic similarity recognition model to obtain a semantic similarity; and
    determining a semantic similarity recognition result based on the semantic similarity.
  16. The computer device according to claim 15, wherein training the semantic similarity recognition model using the general data set as training samples specifically comprises:
    arbitrarily selecting two short texts from the general data set to form a text pair to be tested;
    preprocessing the text pair to be tested and inputting it into an Embedding layer of the semantic similarity recognition model to obtain a first sequence and a second sequence, the first sequence corresponding to the mapping result of one short text in the text pair to be tested, and the second sequence corresponding to the mapping result of the other short text in the text pair to be tested;
    inputting the first sequence and the second sequence into a bidirectional long short-term memory network (BiLSTM) to obtain a corresponding first vector and second vector;
    calculating the difference between the first vector and the second vector, and obtaining a weighted third sequence corresponding to the first vector and a weighted fourth sequence corresponding to the second vector;
    calculating a feature vector from the first sequence, the second sequence, the third sequence, and the fourth sequence;
    outputting a first similarity recognition result based on the feature vector;
    determining a first accuracy loss of the first similarity recognition result relative to a first target recognition result; and
    determining a first loss function based on the first accuracy loss, and optimizing the semantic similarity recognition model using the first loss function.
  17. The computer device according to claim 16, wherein adjusting the semantic similarity recognition model using the target domain data set as transfer learning samples specifically comprises:
    adjusting the semantic similarity recognition model according to the data volume of the target domain data set and the degree of text similarity;
    constructing positive training samples from historical data records in the target domain data set;
    screening negative training samples based on the Jaccard similarity measure;
    inputting the positive training samples and the negative training samples into the adjusted semantic similarity recognition model to obtain a second similarity recognition result;
    determining a second accuracy loss of the second similarity recognition result relative to a second target recognition result; and
    determining a second loss function based on the second accuracy loss, and optimizing the adjusted semantic similarity recognition model using the second loss function, so that the recognition accuracy of the semantic similarity recognition model meets a preset standard.
  18. The computer device according to claim 17, wherein adjusting the semantic similarity recognition model according to the data volume of the target domain data set and the degree of text similarity specifically comprises:
    if it is determined that the data volume of the target domain data set is less than or equal to a first preset threshold and the text similarity is greater than a second preset threshold, modifying the output categories of the softmax layer in the semantic similarity recognition model;
    if it is determined that the data volume of the target domain data set is less than or equal to the first preset threshold and the text similarity is less than or equal to the second preset threshold, freezing the initial layers of the semantic similarity recognition model and retraining the remaining layers;
    if it is determined that the data volume of the target domain data set is greater than the first preset threshold and the text similarity is less than or equal to the second preset threshold, retraining the semantic similarity recognition model with the target domain data set; and
    if it is determined that the data volume of the target domain data set is greater than the first preset threshold and the text similarity is greater than the second preset threshold, retaining the architecture and initial weights of the semantic similarity recognition model and retraining the semantic similarity recognition model from the initial weights.
  19. The computer device according to claim 17, wherein screening negative training samples based on the Jaccard similarity measure specifically comprises:
    randomly selecting two short text sentences from the target domain data set to construct a sample sentence pair, and computing the similarity of the sample sentence pair based on the Jaccard similarity measure to obtain a similarity calculation result; and
    if the similarity calculation result is greater than a third preset threshold, determining the corresponding sample sentence pair as a negative training sample.
  20. The computer device according to claim 19, wherein the Jaccard similarity measure is computed as:
    J(A, B) = |A ∩ B| / |A ∪ B|
    where J(A, B) is the similarity calculation result, A is one short text sentence in the sample sentence pair, and B is the other short text sentence in the sample sentence pair.
PCT/CN2020/087554 2020-02-14 2020-04-28 Text semantic similarity analysis method and apparatus, and computer device WO2021159613A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010092595.3 2020-02-14
CN202010092595.3A CN111368024A (en) 2020-02-14 2020-02-14 Text semantic similarity analysis method and device and computer equipment

Publications (1)

Publication Number Publication Date
WO2021159613A1 true WO2021159613A1 (en) 2021-08-19

Family

ID=71206129

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/087554 WO2021159613A1 (en) 2020-02-14 2020-04-28 Text semantic similarity analysis method and apparatus, and computer device

Country Status (2)

Country Link
CN (1) CN111368024A (en)
WO (1) WO2021159613A1 (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779994A (en) * 2021-08-25 2021-12-10 上海浦东发展银行股份有限公司 Element extraction method and device, computer equipment and storage medium
CN114202013A (en) * 2021-11-22 2022-03-18 西北工业大学 Semantic similarity calculation method based on self-adaptive semi-supervision
CN114358210A (en) * 2022-01-14 2022-04-15 平安科技(深圳)有限公司 Text similarity calculation method and device, computer equipment and storage medium
CN114445818A (en) * 2022-01-29 2022-05-06 北京百度网讯科技有限公司 Article identification method, article identification device, electronic equipment and computer-readable storage medium
CN114595306A (en) * 2022-01-26 2022-06-07 西北大学 Text similarity calculation system and method based on distance perception self-attention mechanism and multi-angle modeling
CN114648648A (en) * 2022-02-21 2022-06-21 清华大学 Deep introspection amount learning method and device and storage medium
CN114186548B (en) * 2021-12-15 2023-08-15 平安科技(深圳)有限公司 Sentence vector generation method, device, equipment and medium based on artificial intelligence
CN116798417A (en) * 2023-07-31 2023-09-22 成都赛力斯科技有限公司 Voice intention recognition method, device, electronic equipment and storage medium
CN116932702A (en) * 2023-09-19 2023-10-24 湖南正宇软件技术开发有限公司 Method, system, device and storage medium for proposal and proposal
CN117112735A (en) * 2023-10-19 2023-11-24 中汽信息科技(天津)有限公司 Patent database construction method and electronic equipment

Families Citing this family (8)

Publication number Priority date Publication date Assignee Title
CN112069833B (en) * 2020-09-01 2024-04-30 北京声智科技有限公司 Log analysis method, log analysis device and electronic equipment
CN112241626B (en) * 2020-10-14 2023-07-07 网易(杭州)网络有限公司 Semantic matching and semantic similarity model training method and device
CN112579919B (en) * 2020-12-09 2023-04-21 小红书科技有限公司 Data processing method and device and electronic equipment
CN112863490B (en) * 2021-01-07 2024-04-30 广州欢城文化传媒有限公司 Corpus acquisition method and device
CN113807074A (en) * 2021-03-12 2021-12-17 京东科技控股股份有限公司 Similar statement generation method and device based on pre-training language model
CN113051933B (en) * 2021-05-17 2022-09-06 北京有竹居网络技术有限公司 Model training method, text semantic similarity determination method, device and equipment
CN113705244B (en) * 2021-08-31 2023-08-22 平安科技(深圳)有限公司 Method, device and storage medium for generating countermeasure text sample
CN117113977B (en) * 2023-10-09 2024-04-16 北京信诺软通信息技术有限公司 Method, medium and system for identifying text generated by AI contained in test paper

Citations (4)

Publication number Priority date Publication date Assignee Title
CN107329949A (en) * 2017-05-24 2017-11-07 北京捷通华声科技股份有限公司 A kind of semantic matching method and system
CN108363716A (en) * 2017-12-28 2018-08-03 广州索答信息科技有限公司 Realm information method of generating classification model, sorting technique, equipment and storage medium
CN109766540A (en) * 2018-12-10 2019-05-17 平安科技(深圳)有限公司 Generic text information extracting method, device, computer equipment and storage medium
CN110688452A (en) * 2019-08-23 2020-01-14 重庆兆光科技股份有限公司 Text semantic similarity evaluation method, system, medium and device

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN106844346B (en) * 2017-02-09 2020-08-25 北京红马传媒文化发展有限公司 Short text semantic similarity discrimination method and system based on deep learning model Word2Vec
GB2573998A (en) * 2018-05-17 2019-11-27 Babylon Partners Ltd Device and method for natural language processing
CN109657232A (en) * 2018-11-16 2019-04-19 北京九狐时代智能科技有限公司 A kind of intension recognizing method

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779994A (en) * 2021-08-25 2021-12-10 上海浦东发展银行股份有限公司 Element extraction method and device, computer equipment and storage medium
CN113779994B (en) * 2021-08-25 2024-01-23 上海浦东发展银行股份有限公司 Element extraction method, element extraction device, computer equipment and storage medium
CN114202013A (en) * 2021-11-22 2022-03-18 西北工业大学 Semantic similarity calculation method based on adaptive semi-supervision
CN114202013B (en) * 2021-11-22 2024-04-12 西北工业大学 Semantic similarity calculation method based on adaptive semi-supervision
CN114186548B (en) * 2021-12-15 2023-08-15 平安科技(深圳)有限公司 Sentence vector generation method, device, equipment and medium based on artificial intelligence
CN114358210A (en) * 2022-01-14 2022-04-15 平安科技(深圳)有限公司 Text similarity calculation method and device, computer equipment and storage medium
CN114595306A (en) * 2022-01-26 2022-06-07 西北大学 Text similarity calculation system and method based on a distance-aware self-attention mechanism and multi-angle modeling
CN114595306B (en) * 2022-01-26 2024-04-12 西北大学 Text similarity calculation system and method based on a distance-aware self-attention mechanism and multi-angle modeling
CN114445818B (en) * 2022-01-29 2023-08-01 北京百度网讯科技有限公司 Article identification method, apparatus, electronic device, and computer-readable storage medium
CN114445818A (en) * 2022-01-29 2022-05-06 北京百度网讯科技有限公司 Article identification method, article identification device, electronic equipment and computer-readable storage medium
CN114648648A (en) * 2022-02-21 2022-06-21 清华大学 Deep metric learning method and device, and storage medium
CN116798417A (en) * 2023-07-31 2023-09-22 成都赛力斯科技有限公司 Voice intention recognition method, device, electronic equipment and storage medium
CN116798417B (en) * 2023-07-31 2023-11-10 成都赛力斯科技有限公司 Voice intention recognition method, device, electronic equipment and storage medium
CN116932702A (en) * 2023-09-19 2023-10-24 湖南正宇软件技术开发有限公司 Proposal merging method, system, device and storage medium
CN117112735A (en) * 2023-10-19 2023-11-24 中汽信息科技(天津)有限公司 Patent database construction method and electronic equipment
CN117112735B (en) * 2023-10-19 2024-02-13 中汽信息科技(天津)有限公司 Patent database construction method and electronic equipment

Also Published As

Publication number Publication date
CN111368024A (en) 2020-07-03

Similar Documents

Publication Publication Date Title
WO2021159613A1 (en) Text semantic similarity analysis method and apparatus, and computer device
US10586155B2 (en) Clarification of submitted questions in a question and answer system
US11093560B2 (en) Stacked cross-modal matching
CN107526799B (en) Knowledge graph construction method based on deep learning
WO2019119505A1 (en) Face recognition method and device, computer device and storage medium
WO2021093755A1 (en) Matching method and apparatus for questions, and reply method and apparatus for questions
WO2021253904A1 (en) Test case set generation method, apparatus and device, and computer readable storage medium
CN107944559B (en) Method and system for automatically identifying entity relationship
WO2018086470A1 (en) Keyword extraction method and device, and server
JP7153004B2 (en) Community Q&A data verification method, apparatus, computer device, and storage medium
WO2020087774A1 (en) Concept-tree-based intention recognition method and apparatus, and computer device
US20150339290A1 (en) Context Based Synonym Filtering for Natural Language Processing Systems
US9535980B2 (en) NLP duration and duration range comparison methodology using similarity weighting
US20150178321A1 (en) Image-based 3d model search and retrieval
US11544470B2 (en) Efficient determination of user intent for natural language expressions based on machine learning
WO2022188773A1 (en) Text classification method and apparatus, device, computer-readable storage medium, and computer program product
CN107844533A (en) Intelligent question answering system and analysis method
WO2019232893A1 (en) Method and device for text emotion analysis, computer apparatus and storage medium
CN112509690B (en) Method, apparatus, device and storage medium for quality control
CN111783903B (en) Text processing method, text model processing method and device and computer equipment
WO2020232898A1 (en) Text classification method and apparatus, electronic device and computer non-volatile readable storage medium
CN113094578A (en) Deep learning-based content recommendation method, device, equipment and storage medium
WO2021169423A1 (en) Quality test method, apparatus and device for customer service recording, and storage medium
WO2023207096A1 (en) Entity linking method and apparatus, device, and nonvolatile readable storage medium
CN116992007B (en) Restricted question answering system based on question intent understanding

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20918257

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN EP: Public notification in the EP Bulletin, as the address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 110123)

122 EP: PCT application non-entry into the European phase

Ref document number: 20918257

Country of ref document: EP

Kind code of ref document: A1