CN111368037A - Text similarity calculation method and device based on Bert model
- Publication number: CN111368037A
- Application number: CN202010151330.6A
- Authority: CN (China)
- Prior art keywords: text, matrix, word, compared, sentence
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a text similarity calculation method and device based on a Bert model, computer equipment and a storage medium, and relates to the technical field of artificial intelligence. The text similarity calculation method based on the Bert model comprises the following steps: determining text sentence segments to be compared; obtaining a first text matrix based on text sentence segments to be compared by adopting a word frequency word occurrence rate algorithm; obtaining a second text matrix based on the text sentence segments to be compared through a pre-trained Bert model; splicing the first text matrix and the second text matrix to obtain a spliced text matrix; performing characteristic optimization on the spliced text matrix to obtain a target text matrix; and obtaining the text similarity between the text sentence segments to be compared according to the target text matrix by adopting a preset similarity algorithm. The text similarity calculation method based on the Bert model can improve the accuracy of text similarity calculation.
Description
[ technical field ]
The invention relates to the technical field of artificial intelligence, in particular to a text similarity calculation method and device based on a Bert model.
[ background of the invention ]
Text similarity calculation is one of the branches of the natural language processing field. At present, the problems of weak semantic recognition capability, weak correlation between words and texts and the like still exist in text similarity prediction and calculation. The accuracy of the text similarity calculation cannot meet the user's expectations.
[ summary of the invention ]
In view of this, embodiments of the present invention provide a text similarity calculation method and apparatus based on a Bert model, a computer device, and a storage medium, so as to solve the problem that the accuracy of the current text similarity calculation is low.
In a first aspect, an embodiment of the present invention provides a text similarity calculation method based on a Bert model, including:
determining text sentence segments to be compared;
obtaining a first text matrix based on the text sentence segments to be compared by adopting a word frequency word occurrence rate algorithm;
obtaining a second text matrix based on the text sentence segments to be compared through a pre-trained Bert model;
splicing the first text matrix and the second text matrix to obtain a spliced text matrix;
performing characteristic optimization on the spliced text matrix to obtain a target text matrix;
and obtaining the text similarity between the text sentence segments to be compared according to the target text matrix by adopting a preset similarity algorithm.
As to the above-mentioned aspects and any possible implementation manner, there is further provided an implementation manner, where the obtaining, by using a word frequency and word occurrence rate algorithm, a first text matrix based on the text segments to be compared includes:
establishing a word bag according to the text sentence segments to be compared, wherein the word bag comprises words appearing in the text sentence segments to be compared;
calculating the word frequency of each word in the text sentence segment to be compared according to the word bag, the word frequency being expressed as $tf_{t,d} = 1 + \log_{10} g_{t,d}$ if $g_{t,d} > 0$ and $tf_{t,d} = 0$ otherwise, where t represents a word, d represents a sentence segment, and $g_{t,d}$ represents the number of occurrences of the word t in the sentence segment d (taking 0 if the word does not appear), i.e. the proportion of the word in the text sentence segment to be compared;
calculating the reverse file frequency according to the word bag, the reverse file frequency being expressed as the formula $idf_t = \log_{10}(N / df_t)$, where N represents the total number of files, said files being predetermined, and $df_t$ represents the number n of files that contain the word t, n being an integer greater than or equal to 0;
obtaining the first text matrix according to the word frequency and the reverse file frequency, the elements of the first text matrix being calculated with the formula $w_{t,d} = tf_{t,d} \times idf_t$.
The foregoing aspect and any possible implementation manner further provide an implementation manner, where before obtaining the second text matrix based on the text sentence segments to be compared through a pre-trained Bert model, a training process of the Bert model is further included, and the method includes the following steps:
acquiring an original corpus;
performing character-level segmentation on the original corpus;
constructing a sentence pair according to the original corpus, wherein the sentence pair comprises a positive sample sentence pair and a negative sample sentence pair, the positive sample sentence pair has a context relationship between sentences, and the negative sample sentence pair does not have a context relationship between sentences;
connecting the sentence pairs based on the original corpus after the character-level segmentation;
randomly masking ten percent of characters in the sentence pair to obtain a training corpus;
and inputting the training corpus into an initial Bert model for training to obtain the Bert model.
The above-mentioned aspect and any possible implementation manner further provide an implementation manner, where the splicing is performed on the first text matrix and the second text matrix to obtain a spliced text matrix, and the splicing includes:
judging whether the matrix sizes of the first text matrix and the second text matrix are the same or not;
if the first text matrix and the second text matrix are the same, splicing the first text matrix and the second text matrix to obtain a spliced text matrix;
and if the first text matrix is different from the second text matrix, reducing the dimension of the first text matrix by adopting a principal component analysis method so that the matrix size of the first text matrix is equal to the matrix size of the second text matrix, and splicing the dimension-reduced first text matrix and the second text matrix to obtain the spliced text matrix.
The above-mentioned aspect and any possible implementation manner further provide an implementation manner, where performing feature optimization on the spliced text matrix to obtain a target text matrix includes:
calculating a word vector $v_s$ of the spliced text matrix based on a principal component analysis method, expressed by the formula $v_s = \frac{1}{|s|} \sum_{t \in s} \frac{\alpha}{\alpha + p_t} v_t$, where s is a sentence segment in the spliced text matrix S, $v_t$ is the vector of the word t in the spliced text matrix, $\alpha$ is a preset smoothing parameter, and $p_t$ is the probability of the word appearing in a document;
obtaining the principal component u of the word vector $v_s$ by adopting a truncated singular value decomposition method;
performing feature optimization on the word vector $v_s$ according to the word vector $v_s$ and said principal component u, the updated word vector being expressed by the formula $v_s' = v_s - u u^T v_s$, where T represents the matrix transpose operation;
obtaining the target text matrix according to the updated word vector $v_s'$.
The above-mentioned aspects and any possible implementation manner further provide an implementation manner, where obtaining the text similarity between the text paragraphs to be compared according to the target text matrix by using a preset similarity algorithm includes:
according to the updated word vectors $v_s'$ in the text sentence segments to be compared, obtaining the target text vectors corresponding to the text sentence segments to be compared from the target text matrix;
calculating the similarity between the target text vectors by adopting the preset similarity algorithm to obtain the similarity between the text sentence segments to be compared, expressed as $\text{sim}(q, c) = \frac{q \cdot c}{\|q\| \, \|c\|}$, where q represents the target text vector of a first text sentence segment to be compared, c represents the target text vector of a second text sentence segment to be compared, and the text sentence segments to be compared comprise the first text sentence segment to be compared and the second text sentence segment to be compared.
In a second aspect, an embodiment of the present invention provides a text similarity calculation apparatus based on a Bert model, including:
the text determining module is used for determining text sentence segments to be compared;
the first obtaining module is used for obtaining a first text matrix based on the text sentence segments to be compared by adopting a word frequency word occurrence rate algorithm;
the second acquisition module is used for acquiring a second text matrix based on the text sentence segments to be compared through a pre-trained Bert model;
the spliced text matrix acquisition module is used for splicing the first text matrix and the second text matrix to obtain a spliced text matrix;
the target text matrix acquisition module is used for carrying out feature optimization on the spliced text matrix to obtain a target text matrix;
and the text similarity calculation module is used for obtaining the text similarity between the text sentence segments to be compared according to the target text matrix by adopting a preset similarity calculation method.
In a third aspect, an embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the above text similarity calculation method based on the Bert model when executing the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, including: a computer program which, when being executed by a processor, implements the steps of the above-described text similarity calculation method based on the Bert model.
In the embodiment of the invention, firstly, based on text sentence segments to be compared, a word frequency word occurrence rate algorithm and a Bert model trained in advance are adopted to respectively obtain a first text matrix and a second text matrix, wherein the first text matrix can embody the word frequency and the word occurrence rate in the text sentence segments to be compared, and the second text matrix can embody the inter-word relationship, the word position relationship and the inter-sentence relationship in the text sentence segments to be compared; secondly, the first text matrix and the second text matrix are spliced to obtain a spliced text matrix, and the features embodied by the first text matrix and the second text matrix are combined to form a comprehensive matrix which embodies the features in multiple directions, so that the accuracy of text similarity can be improved subsequently; then, performing feature optimization on the spliced text matrix to obtain a target text matrix, so that the spliced text matrix can remove some noise data through the feature optimization, and the feature expression capability of the target text matrix is improved; and finally, a preset similarity calculation method is adopted, the text similarity between the text sentence segments to be compared is obtained according to the target text matrix, and calculation is carried out based on the word frequency and the word occurrence rate of the words in the text sentence segments to be compared embodied by the target text matrix, and the characteristics of the relationship between the words, the position relationship between the words, the relationship between the sentences and the like in the text sentence segments to be compared, so that the accuracy of the text similarity can be effectively improved.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.
FIG. 1 is a flowchart of a text similarity calculation method based on the Bert model according to an embodiment of the present invention;
FIG. 2 is a schematic block diagram of a text similarity calculation apparatus based on a Bert model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a computer device according to an embodiment of the invention.
[ detailed description of the embodiments ]
For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association relationship between associated objects, meaning that three relationships may exist; for example, A and/or B may indicate that: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
It should be understood that although the terms first, second, third, etc. may be used to describe preset ranges, etc. in embodiments of the present invention, these preset ranges should not be limited to these terms. These terms are only used to distinguish preset ranges from each other. For example, the first preset range may also be referred to as a second preset range, and similarly, the second preset range may also be referred to as the first preset range, without departing from the scope of the embodiments of the present invention.
The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
Fig. 1 shows a flowchart of a text similarity calculation method based on the Bert model in the present embodiment. The text similarity calculation method based on the Bert model can be applied to a text similarity calculation system, and can be realized by adopting the text similarity calculation system when text similarity calculation between sentences is carried out. The text similarity calculation system can be particularly applied to computer equipment, wherein the computer equipment is equipment capable of performing human-computer interaction with a user, and the equipment comprises but is not limited to computers, smart phones, tablets and the like. As shown in fig. 1, the text similarity calculation method based on the Bert model includes:
s10: and determining text sentence segments to be compared.
The text sentence segments to be compared may be a plurality of sentence segments used for text similarity comparison, and two sentence segments are taken from the text sentence segments to be compared for comparison each time text similarity comparison is performed.
It can be understood that if it is desired to compare any two sentences in a novel, the text sentence segments to be compared may be all the sentence segments contained in the novel; that is, all the sentence segments in the novel serve as the text sentence segments to be compared.
S20: and obtaining a first text matrix based on the text sentence segments to be compared by adopting a word frequency word occurrence rate algorithm.
The word frequency word occurrence rate algorithm is an algorithm for calculating the word occurrence frequency and the reverse file frequency (word occurrence rate), and can reflect the importance of words within sentence segments. In one embodiment, the word frequency and the word occurrence rate of each word in the text sentence segment to be compared are calculated by the word frequency word occurrence rate algorithm, and a first text matrix is constructed, so that the word frequency and word occurrence rate features of the words are reflected in matrix form.
Further, in step S20, a word frequency word occurrence rate algorithm is adopted, and a first text matrix is obtained based on the text periods to be compared, which specifically includes:
s21: and establishing a word bag according to the text sentence segments to be compared, wherein the word bag comprises words appearing in the text sentence segments to be compared.
It is understood that a bag of words refers to a collection of words.
In an embodiment, a bag of words corresponding to the text sentence segment to be compared is established, and word frequency and word occurrence rate of the words in the text sentence segment to be compared can be calculated.
S22: calculating the word frequency of each word in the text sentence segment to be compared according to the word bag, and expressing the word frequency asWhere t represents words, d represents sentence segments, tft,dRepresenting whether the word t appears in the sentence segment d, if it appears, it takes 1, if it does not appear, gt,dTake 0, gt,dRepresenting the proportion of a word in a text period to be compared.
In one embodiment, the word frequency of each word in the text period to be compared is calculated by the bag of words. Specifically, if the text period a to be compared includes a word a, a word B, a word c, and a word d, and the text period B to be compared includes a word a, a word B, a word e, and a word f, there is gA,a=1,gA,c=1,gB,a=1,gB,c=0。
In this embodiment, the word frequency of each word in the text period to be compared is calculated, so that the comparison condition of one word in the text period to be compared can be highlighted.
S23: calculating the frequency of the reverse file according to the word bag, and expressing the frequency asWhere N represents the total number of files, the files being predetermined, dftIt means that n files contain words t, and n is an integer greater than or equal to 0.
The file is predetermined, and for example, a chapter of a novel can be stored in a file manner. In one embodiment, the inverse file frequency is calculated according to the bag of words, so that the proportion of words in the file dimension can be reflected.
S24: obtaining a first text matrix according to the word frequency and the reverse file frequency, wherein elements in the first text matrix adopt a formulaAnd (4) calculating.
In particular, the first text matrix may be represented asFormula wt,dCalculated is each specific element in the first text matrix, where vijAnd the ith word in the jth text sentence segment to be compared is represented.
It will be appreciated that multiplying the word frequency by the word occurrence rate (inverse document frequency) will result in each particular element of the first text matrix being calculated. In this embodiment, the importance of the words is represented by the word frequency and the word occurrence rate by using the first text matrix.
The word frequency is represented by the frequency of occurrence in sentences, and the word occurrence rate is mainly represented by the frequency of occurrence in documents, so that the situation that a certain word appears in a large amount in a certain document and appears in a small amount in other documents is prevented. The word frequency and the reverse file frequency are multiplied to synthesize the word characteristics embodied by the word frequency and the reverse file frequency, and more accurately embody the importance degree of the words (which can be understood as different weights given to the words).
In steps S21-S24, a word frequency word occurrence rate algorithm is used to obtain a first text matrix based on the text segments to be compared, and the importance of the words in the text segments to be compared is represented in the first text matrix by the word frequency and the word occurrence rate, which is helpful to improve the accuracy of the subsequent calculation of the text similarity.
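As a rough illustration of steps S21-S24, the following Python sketch builds the first text matrix from whitespace-tokenized sentence segments. The function and variable names, the whitespace tokenization, and the exact form of the tf formula are assumptions made for illustration, not the patent's mandated implementation.

```python
import math
from collections import Counter

def build_first_text_matrix(segments, documents):
    """Sketch of S21-S24: word frequency / word occurrence rate (TF-IDF).
    `segments` are the text sentence segments to be compared;
    `documents` are the predetermined files used for the reverse
    file frequency."""
    # S21: build the bag of words from the segments to be compared
    vocab = sorted({t for seg in segments for t in seg.split()})

    # S23: reverse file frequency idf_t = log10(N / df_t)
    n_files = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc.split()))
    idf = {t: math.log10(n_files / df[t]) for t in vocab if df[t] > 0}

    # S22 + S24: tf_{t,d} = 1 + log10(g_{t,d}) if the word appears, else 0;
    # each matrix element is w_{t,d} = tf_{t,d} * idf_t
    matrix = []
    for seg in segments:
        g = Counter(seg.split())
        row = [(1 + math.log10(g[t])) * idf.get(t, 0.0) if g[t] > 0 else 0.0
               for t in vocab]
        matrix.append(row)
    return vocab, matrix  # one row per sentence segment to be compared
```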
S30: and obtaining a second text matrix based on the text sentence segments to be compared through a pre-trained Bert model.
In one embodiment, the text sentence segments to be compared are imported into a Bert model trained in advance, and a second text matrix is output, wherein the second text matrix can dynamically express the inter-word relationship, the word position relationship and the inter-sentence relationship in the text sentence segments to be compared by means of the characteristic of characteristic extraction of the Bert model, the characteristics of the text sentence segments to be compared are embodied from multiple aspects such as words and sentences, and the accuracy of subsequent text similarity calculation can be improved.
Specifically, in this embodiment, an initial Bert model with 24 training network layers and 1024 network nodes is used, where each training network layer includes Trm units, a Trm being a bidirectional Transformer encoder corresponding to each network node. Further, $E_1, E_2, \dots, E_n$ serve as the input layer of the training model and $T_1, T_2, \dots, T_n$ serve as the output layer of the training model, where $E_1, E_2, \dots, E_n$ and the Trm units are in a fully connected relationship.
It can be seen that the structure of the Bert model is not complex in itself, and the key point for training the Bert model is the preprocessing of the corpus.
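For step S30, a minimal sketch of producing the second text matrix is shown below. It assumes the publicly released bert-large-uncased checkpoint of the Hugging Face transformers library (which happens to have 24 layers and hidden size 1024) as a stand-in for the patent's own pre-trained Bert model; that substitution is an assumption, not what the patent specifies.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumption: bert-large-uncased (24 layers, hidden size 1024) stands in
# for the patent's own pre-trained Bert model.
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertModel.from_pretrained("bert-large-uncased")
model.eval()

def second_text_matrix(first_segment, second_segment):
    """Encode a pair of text sentence segments to be compared into the
    second text matrix: one 1024-dimensional vector per token."""
    inputs = tokenizer(first_segment, second_segment,
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # shape: (seq_len, 1024)
```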
Further, before step S30, that is, before obtaining the second text matrix based on the text sentence segments to be compared through the pre-trained Bert model, the method further includes a training process of the Bert model, where the training process of the Bert model specifically includes the following steps:
s31: and acquiring the original corpus.
The original corpus can be obtained from an open-source corpus.
Further, the original corpus may include corpora from the open-source corpus and an equal number of text sentence segments to be compared, each accounting for half of the original corpus. By acquiring the original corpus in this way, the Bert model retains a certain generalization capability while having a stronger feature extraction capability on the text sentence segments to be compared, which can improve the accuracy of the features extracted by the Bert model.
S32: and performing character-level segmentation on the original corpus.
Wherein the character-level segmentation includes segmenting characters such as punctuation marks. The character-level segmentation is mainly used for subsequent sentence pair connection and removal of non-word characters such as punctuation marks.
S33: and constructing sentence pairs according to the original corpus, wherein the sentence pairs comprise positive sample sentence pairs and negative sample sentence pairs, the positive sample sentence pairs have context relations among sentences, and the negative sample sentence pairs do not have context relations among sentences.
Wherein a sentence pair refers to a set of sentences. In this embodiment, the positive sample sentence pair and the negative sample sentence pair are constructed according to whether a context relationship exists between sentences, so that the initial Bert model has the capability of learning the relationship between sentences when training the initial Bert model, wherein the initial Bert model is the Bert model using the initialization weight.
S34: and connecting sentence pairs based on the original corpus after the character-level segmentation.
It is to be understood that the sentence pair positive and negative samples constructed in step S33 are used to determine whether a context relationship exists in the sentence pair. In this embodiment, the sentence pair connection is performed by using the character as the minimum unit, so that the initial Bert model can perform feature learning by using the character as the minimum unit in the training process. Specifically, [ SEP ] tags are adopted for connecting sentences, [ CLS ] is adopted as tags for the beginning of the sentence, and [ SEP ] is adopted as tags for the end of the sentence. The positions of sentences and the front-back relations among the sentences are marked in the form of labels, so that the initial Bert model can learn the characteristics when the initial Bert model is trained.
S35: and randomly covering ten percent of characters in the sentence pair to obtain the training corpus.
Here, a mask mechanism trained by the Bert model is used, wherein the masked character can be predicted by using a pre-trained prediction model. By adopting the random character masking mode, the model can be judged in the training process, so that the trained model has stronger generalization capability and stronger feature extraction capability.
S36: and inputting the training corpus into the initial Bert model for training to obtain the Bert model.
In steps S31-S36, a specific embodiment of training the Bert model is provided. It can be seen that, in this embodiment, the key point of training the Bert model is to preprocess the corpus, input the preprocessed training corpus into the initial Bert model for training, and update the initial weight in the initial Bert model, so as to obtain the Bert model.
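The corpus preprocessing of S32-S35 can be sketched in Python as follows. The function name, the way negative pairs are sampled, and the per-character masking probability are illustrative assumptions; the patent only states that sentence pairs are built from the character-segmented corpus and that ten percent of the characters are masked at random.

```python
import random

def build_training_corpus(sentences, mask_rate=0.10, mask_token="[MASK]"):
    """Sketch of S32-S35: character-level segmentation, sentence-pair
    construction, [CLS]/[SEP] connection, and random masking of roughly
    ten percent of the characters (in expectation)."""
    corpus = []
    for i in range(len(sentences) - 1):
        # S33: a positive pair uses the next sentence (context relationship);
        # a negative pair uses a randomly drawn sentence (no context).
        for second, label in ((sentences[i + 1], 1),
                              (random.choice(sentences), 0)):
            # S32 + S34: split into characters and connect the pair with tags
            tokens = (["[CLS]"] + list(sentences[i]) + ["[SEP]"]
                      + list(second) + ["[SEP]"])
            # S35: randomly mask characters at the given rate
            masked = [mask_token
                      if tok not in ("[CLS]", "[SEP]")
                      and random.random() < mask_rate
                      else tok
                      for tok in tokens]
            corpus.append((masked, label))
    return corpus
```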
S40: and splicing the first text matrix and the second text matrix to obtain a spliced text matrix.
Splicing refers to matrix splicing operation.
In an embodiment, the spliced text matrix obtained by splicing the first text matrix and the second text matrix simultaneously contains the characteristics of the first text matrix and the second text matrix, and can embody the characteristics of the word frequency and the word occurrence rate in the text period to be compared, the inter-word relationship, the word position relationship, the inter-sentence relationship and the like in the text period to be compared.
Further, in step S40, the first text matrix and the second text matrix are spliced to obtain a spliced text matrix, which specifically includes:
s41: and judging whether the matrix sizes of the first text matrix and the second text matrix are the same or not.
S42: and if the first text matrix and the second text matrix are the same, splicing the first text matrix and the second text matrix to obtain a spliced text matrix.
S43: and if the two text matrixes are different, reducing the dimension of the first text matrix by adopting a principal component analysis method to ensure that the matrix size of the first text matrix is equal to the matrix size of the second text matrix, and splicing the first file matrix and the second text matrix after dimension reduction to obtain a spliced text matrix.
It will be appreciated that the output of the Bert model may set the dimensions of the different output elements according to the task requirements. In general, the first text matrix generated by the word frequency and word rate algorithm is a sparse matrix, and may be dimension-reduced and adjusted to the same dimension size as the second text matrix output by the Bert model by using a principal component analysis method, where the principal component analysis method is the same as the method for performing feature optimization by using principal component analysis in steps S51-S54, and reference may be made to the method for using principal component analysis in steps S51-S54.
In steps S41-S43, the first text matrix and the second text matrix are ensured to be identical in dimension, thereby facilitating computer operation and subsequent further processing.
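A minimal sketch of S41-S43 follows, assuming NumPy arrays and scikit-learn's PCA; the splicing axis and the function name are assumptions, since the patent does not spell them out.

```python
import numpy as np
from sklearn.decomposition import PCA

def splice_matrices(first, second):
    """Sketch of S41-S43: splice the first and second text matrices,
    reducing the first one with principal component analysis when the
    matrix sizes differ."""
    first = np.asarray(first, dtype=float)
    second = np.asarray(second, dtype=float)
    if first.shape[1] != second.shape[1]:
        # S43: PCA dimension reduction to the column width of the second
        # matrix (assumes the first matrix has enough rows for that many
        # components)
        first = PCA(n_components=second.shape[1]).fit_transform(first)
    # S42: matrix splicing; here the rows of both matrices are stacked
    return np.concatenate([first, second], axis=0)
```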
S50: and performing characteristic optimization on the spliced text matrix to obtain a target text matrix.
In an embodiment, the feature optimization refers to performing dimension reduction on the spliced text matrix to remove noise data in the spliced text matrix, which is beneficial to improving the accuracy of text similarity calculation.
Further, in step S50, performing feature optimization on the spliced text matrix to obtain a target text matrix, which specifically includes:
s51: calculating word vector v of spliced text matrix based on principal component analysis methodsIs expressed by formula asWherein S is a spliced text matrix, vtFor the vector of the word t in the stitched text matrix, α is a preset smoothing parameter, ptIs the probability of a word appearing in a document.
Specifically, α may be 0.0001.
S52: word vector v obtained by adopting truncated singular value decomposition methodsThe main component u of (1).
S53: according to the word vector vsAnd principal component u to word vector vsOptimizing the characteristics to obtainThe updated word vector is expressed as v by adopting a formulas′=vs-u(uT)vsWhere T denotes a transpose matrix operation.
In one embodiment, feature optimization uses principal component analysis to remove noisy data in the word-spelling vector.
S54: according to the updated word vector vs' get the target text matrix.
Understandably, from the resulting word vector vsAnd constructing a target text matrix according to the text sentence segments to be compared and the index numbers of the fields in the text sentence segments to be compared.
In steps S51-S54, a specific embodiment of obtaining a target text matrix is provided, and the method of principal component analysis can remove noise data, which is helpful for improving the accuracy of text similarity calculation.
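The feature optimization of S51-S54 can be sketched as below. It reads the truncated singular value decomposition step as taking the first right singular vector of the stacked segment vectors as the principal component u, which is one plausible interpretation of the patent's wording; all function and variable names are illustrative.

```python
import numpy as np

def feature_optimize(word_vectors, word_probs, alpha=1e-4):
    """Sketch of S51-S54. `word_vectors[i]` holds the vectors v_t of the
    words in the i-th sentence segment; `word_probs[i]` holds the
    matching probabilities p_t."""
    # S51: v_s = (1/|s|) * sum over t of (alpha / (alpha + p_t)) * v_t
    segment_vecs = np.stack([
        np.mean([(alpha / (alpha + p)) * np.asarray(v, dtype=float)
                 for v, p in zip(vecs, probs)], axis=0)
        for vecs, probs in zip(word_vectors, word_probs)])
    # S52: principal component u via truncated SVD of the stacked vectors
    _, _, vt = np.linalg.svd(segment_vecs, full_matrices=False)
    u = vt[0]  # first right singular vector
    # S53: v_s' = v_s - u u^T v_s (remove each vector's projection onto u)
    optimized = segment_vecs - np.outer(segment_vecs @ u, u)
    # S54: the rows of `optimized` form the target text matrix
    return optimized
```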
S60: and obtaining the text similarity between the text sentence segments to be compared according to the target text matrix by adopting a preset similarity algorithm.
Further, in step S60, a preset similarity calculation method is adopted to obtain the text similarity between text paragraphs to be compared according to the target text matrix, which specifically includes:
s61: according to the updated word vector v in the text sentence segment to be compareds' obtaining a target text vector corresponding to the text sentence segment to be compared from the target text matrix.
It is understood that a text period to be compared includes different words, a word vector vs' corresponding to a word, it is necessary to compare the words included in the text segment to be compared with the updated word vector vs' selecting matrix elements related to the text sentence segments to be compared from the target text matrix to obtain target text vectors corresponding to the text sentence segments to be compared.
S62: calculating the similarity between target text vectors by adopting a pre-similarity calculation method to obtain the similarity between text sentence segments to be compared, and expressing the similarity asAnd q represents a target text vector of the first text sentence segment to be compared, c represents a target text vector of the second text sentence segment to be compared, and the text sentence segment to be compared comprises the first text sentence segment to be compared and the second text sentence segment to be compared.
In one embodiment, the similarity calculation method specifically adopts cosine similarity, and can highlight the similarity of the text by taking the direction as a key point for measuring the similarity, so that the accuracy of the similarity of the text is improved.
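Under the assumption that cosine similarity is the preset similarity algorithm, as this embodiment states, the final step of S62 reduces to a few lines; the function name is illustrative.

```python
import numpy as np

def text_similarity(q, c):
    """Sketch of S62: cosine similarity between the target text vectors
    of two text sentence segments to be compared."""
    q = np.asarray(q, dtype=float)
    c = np.asarray(c, dtype=float)
    return float(q @ c / (np.linalg.norm(q) * np.linalg.norm(c)))
```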
In the embodiment of the invention, firstly, based on text sentence segments to be compared, a word frequency word occurrence rate algorithm and a Bert model trained in advance are adopted to respectively obtain a first text matrix and a second text matrix, wherein the first text matrix can embody the word frequency and the word occurrence rate in the text sentence segments to be compared, and the second text matrix can embody the inter-word relationship, the word position relationship and the inter-sentence relationship in the text sentence segments to be compared; secondly, the first text matrix and the second text matrix are spliced to obtain a spliced text matrix, and the features embodied by the first text matrix and the second text matrix are combined to form a comprehensive matrix which embodies the features in multiple directions, so that the accuracy of text similarity can be improved subsequently; then, performing feature optimization on the spliced text matrix to obtain a target text matrix, so that the spliced text matrix can remove some noise data through the feature optimization, and the feature expression capability of the target text matrix is improved; and finally, a preset similarity calculation method is adopted, the text similarity between the text sentence segments to be compared is obtained according to the target text matrix, and calculation is carried out based on the word frequency and the word occurrence rate of the words in the text sentence segments to be compared embodied by the target text matrix, and the characteristics of the relationship between the words, the position relationship between the words, the relationship between the sentences and the like in the text sentence segments to be compared, so that the accuracy of the text similarity can be effectively improved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Based on the method for calculating text similarity based on the Bert model provided in the embodiment, the embodiment of the invention further provides an embodiment of a device for realizing the steps and the method in the embodiment of the method.
Fig. 2 is a schematic block diagram of a Bert model-based text similarity calculation apparatus corresponding one-to-one to the Bert model-based text similarity calculation method in the embodiment. As shown in fig. 2, the Bert model-based text similarity calculation apparatus includes a text determination module 10, a first obtaining module 20, a second obtaining module 30, a stitched text matrix obtaining module 40, a target text matrix obtaining module 50, and a text similarity calculation module 60. The functions implemented by these modules correspond one-to-one to the steps of the Bert model-based text similarity calculation method in the embodiment; to avoid repetition, they are not described in detail again here.
And the text determining module 10 is used for determining text sentence segments to be compared.
The text sentence segments to be compared may be a plurality of sentence segments used for text similarity comparison, and two sentence segments are taken from the text sentence segments to be compared for comparison each time text similarity comparison is performed.
It can be understood that if it is desired to compare any two sentences in a novel, the text sentence segments to be compared may be all the sentence segments contained in the novel; that is, all the sentence segments in the novel serve as the text sentence segments to be compared.
The first obtaining module 20 is configured to obtain a first text matrix based on text segments to be compared by using a word frequency and word occurrence rate algorithm.
The word frequency word occurrence rate algorithm is an algorithm for calculating the word occurrence frequency and the reverse file frequency (word occurrence rate), and can reflect the importance of words within sentence segments. In one embodiment, the word frequency and the word occurrence rate of each word in the text sentence segment to be compared are calculated by the word frequency word occurrence rate algorithm, and a first text matrix is constructed, so that the word frequency and word occurrence rate features of the words are reflected in matrix form.
And the second obtaining module 30 is configured to obtain a second text matrix based on the text sentence segments to be compared through a pre-trained Bert model.
In one embodiment, the text sentence segments to be compared are imported into a Bert model trained in advance, and a second text matrix is output, wherein the second text matrix can dynamically express the inter-word relationship, the word position relationship and the inter-sentence relationship in the text sentence segments to be compared by means of the characteristic of characteristic extraction of the Bert model, the characteristics of the text sentence segments to be compared are embodied from multiple aspects such as words and sentences, and the accuracy of subsequent text similarity calculation can be improved.
Specifically, in the present embodiment, an initial Bert model with 24 training network layers and 1024 network nodes is used, where each training network layer includes Trm units, a Trm being a bidirectional Transformer encoder corresponding to each network node. Further, $E_1, E_2, \dots, E_n$ serve as the input layer of the training model and $T_1, T_2, \dots, T_n$ serve as the output layer of the training model, where $E_1, E_2, \dots, E_n$ and the Trm units are in a fully connected relationship.
It can be seen that the structure of the Bert model is not complex in itself, and the key point for training the Bert model is the preprocessing of the corpus.
And the spliced text matrix obtaining module 40 is configured to splice the first text matrix and the second text matrix to obtain a spliced text matrix.
Splicing refers to matrix splicing operation.
In an embodiment, the spliced text matrix obtained by splicing the first text matrix and the second text matrix simultaneously contains the characteristics of the first text matrix and the second text matrix, and can embody the characteristics of the word frequency and the word occurrence rate in the text period to be compared, the inter-word relationship, the word position relationship, the inter-sentence relationship and the like in the text period to be compared.
And the target text matrix obtaining module 50 is configured to perform feature optimization on the spliced text matrix to obtain a target text matrix.
In an embodiment, the feature optimization specifically refers to performing dimension reduction on the spliced text matrix to remove noise data in the spliced text matrix, which is beneficial to improving the accuracy of text similarity calculation.
And the text similarity calculation module 60 is configured to obtain the text similarity between text sentence segments to be compared according to the target text matrix by using a preset similarity calculation method.
Optionally, the first obtaining module 20 is specifically configured to:
establishing a word bag according to the text sentence segments to be compared, wherein the word bag comprises words appearing in the text sentence segments to be compared;
calculating the word frequency of each word in the text sentence segment to be compared according to the word bag, the word frequency being expressed as $tf_{t,d} = 1 + \log_{10} g_{t,d}$ if $g_{t,d} > 0$ and $tf_{t,d} = 0$ otherwise, where t represents a word, d represents a sentence segment, and $g_{t,d}$ represents the number of occurrences of the word t in the sentence segment d, i.e. the proportion of the word in the text sentence segment to be compared;
calculating the reverse file frequency according to the word bag, the reverse file frequency being expressed as $idf_t = \log_{10}(N / df_t)$, where N represents the total number of files, the files being predetermined, and $df_t$ represents the number n of files containing the word t, n being an integer greater than or equal to 0;
obtaining the first text matrix according to the word frequency and the reverse file frequency, the elements of the first text matrix being calculated with the formula $w_{t,d} = tf_{t,d} \times idf_t$.
Optionally, the Bert model-based text similarity calculation apparatus further includes a training module, configured to:
acquiring an original corpus;
performing character-level segmentation on the original corpus;
constructing sentence pairs according to the original corpus, wherein the sentence pairs comprise positive sample sentence pairs and negative sample sentence pairs, the positive sample sentence pairs have context relations among sentences, and the negative sample sentence pairs do not have context relations among sentences;
connecting sentence pairs based on the original corpus after the character-level segmentation;
randomly covering ten percent of characters in the sentence pair to obtain a training corpus;
and inputting the training corpus into the initial Bert model for training to obtain the Bert model.
Optionally, the concatenated text matrix obtaining module 40 is specifically configured to:
judging whether the matrix sizes of the first text matrix and the second text matrix are the same or not;
if the first text matrix and the second text matrix are the same, splicing the first text matrix and the second text matrix to obtain a spliced text matrix;
and if the two text matrices are different, reducing the dimension of the first text matrix by adopting a principal component analysis method so that the matrix size of the first text matrix equals the matrix size of the second text matrix, and splicing the dimension-reduced first text matrix and the second text matrix to obtain the spliced text matrix.
Optionally, the target text matrix obtaining module 50 is specifically configured to:
calculating a word vector $v_s$ of the spliced text matrix based on a principal component analysis method, expressed by the formula $v_s = \frac{1}{|s|} \sum_{t \in s} \frac{\alpha}{\alpha + p_t} v_t$, where s is a sentence segment in the spliced text matrix S, $v_t$ is the vector of the word t in the spliced text matrix, $\alpha$ is a preset smoothing parameter, and $p_t$ is the probability of the word appearing in a document;
obtaining the principal component u of the word vector $v_s$ by adopting a truncated singular value decomposition method;
performing feature optimization on the word vector $v_s$ according to the word vector $v_s$ and the principal component u, the updated word vector being expressed by the formula $v_s' = v_s - u u^T v_s$, where T represents the matrix transpose operation;
obtaining the target text matrix according to the updated word vector $v_s'$.
Optionally, the text similarity calculation module 60 is specifically configured to:
according to the updated word vectors $v_s'$ in the text sentence segments to be compared, obtaining the target text vectors corresponding to the text sentence segments to be compared from the target text matrix;
calculating the similarity between the target text vectors by adopting the preset similarity algorithm to obtain the similarity between the text sentence segments to be compared, expressed as $\text{sim}(q, c) = \frac{q \cdot c}{\|q\| \, \|c\|}$, where q represents the target text vector of the first text sentence segment to be compared, c represents the target text vector of the second text sentence segment to be compared, and the text sentence segments to be compared comprise the first text sentence segment to be compared and the second text sentence segment to be compared.
In the embodiment of the invention, firstly, based on text sentence segments to be compared, a word frequency word occurrence rate algorithm and a Bert model trained in advance are adopted to respectively obtain a first text matrix and a second text matrix, wherein the first text matrix can embody the word frequency and the word occurrence rate in the text sentence segments to be compared, and the second text matrix can embody the inter-word relationship, the word position relationship and the inter-sentence relationship in the text sentence segments to be compared; secondly, the first text matrix and the second text matrix are spliced to obtain a spliced text matrix, and the features embodied by the first text matrix and the second text matrix are combined to form a comprehensive matrix which embodies the features in multiple directions, so that the accuracy of text similarity can be improved subsequently; then, performing feature optimization on the spliced text matrix to obtain a target text matrix, so that the spliced text matrix can remove some noise data through the feature optimization, and the feature expression capability of the target text matrix is improved; and finally, a preset similarity calculation method is adopted, the text similarity between the text sentence segments to be compared is obtained according to the target text matrix, and calculation is carried out based on the word frequency and the word occurrence rate of the words in the text sentence segments to be compared embodied by the target text matrix, and the characteristics of the relationship between the words, the position relationship between the words, the relationship between the sentences and the like in the text sentence segments to be compared, so that the accuracy of the text similarity can be effectively improved.
The present embodiment provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method for calculating text similarity based on a Bert model in the embodiments is implemented, and in order to avoid repetition, details of the method are not repeated here. Alternatively, the computer program, when executed by the processor, implements the functions of each module/unit in the text similarity calculation apparatus based on the Bert model in the embodiment, and in order to avoid repetition, the details are not repeated here.
Fig. 3 is a schematic diagram of a computer device according to an embodiment of the present invention. As shown in fig. 3, the computer device 70 of this embodiment includes: a processor 71, a memory 72 and a computer program 73 stored in the memory 72 and operable on the processor 71, the computer program 73, when executed by the processor 71, implementing the Bert model-based text similarity calculation method in the embodiments. Alternatively, the computer program 73, when executed by the processor 71, implements the functions of each model/unit in the Bert model-based text similarity calculation apparatus in one-to-one correspondence with the Bert model-based text similarity calculation method in the embodiment.
The computer device 70 may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or another computing device. The computer device 70 may include, but is not limited to, the processor 71 and the memory 72. Those skilled in the art will appreciate that fig. 3 is merely an example of the computer device 70 and does not limit the computer device 70, which may include more or fewer components than shown, or combine some components, or have different components; for example, the computer device may also include input and output devices, network access devices, buses, etc.
The Processor 71 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 72 may be an internal storage unit of the computer device 70, such as a hard disk or a memory of the computer device 70. The memory 72 may also be an external storage device of the computer device 70, such as a plug-in hard disk provided on the computer device 70, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 72 may also include both internal and external storage units of the computer device 70. The memory 72 is used to store computer programs and other programs and data required by the computer device. The memory 72 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.
Claims (10)
1. A text similarity calculation method based on a Bert model is characterized by comprising the following steps:
determining text sentence segments to be compared;
obtaining a first text matrix based on the text sentence segments to be compared by adopting a word frequency word occurrence rate algorithm;
obtaining a second text matrix based on the text sentence segments to be compared through a pre-trained Bert model;
splicing the first text matrix and the second text matrix to obtain a spliced text matrix;
performing characteristic optimization on the spliced text matrix to obtain a target text matrix;
and obtaining the text similarity between the text sentence segments to be compared according to the target text matrix by adopting a preset similarity algorithm.
2. The method of claim 1, wherein said obtaining a first text matrix based on said text sentence segments to be compared using a word frequency word occurrence rate algorithm comprises:
establishing a word bag according to the text sentence segments to be compared, wherein the word bag comprises words appearing in the text sentence segments to be compared;
calculating the word frequency of each word in the text sentence segment to be compared according to the word bag, the word frequency being expressed as $tf_{t,d} = 1 + \log_{10} g_{t,d}$ if $g_{t,d} > 0$ and $tf_{t,d} = 0$ otherwise, where t represents a word, d represents a sentence segment, and $g_{t,d}$ represents the number of occurrences of the word t in the sentence segment d, i.e. the proportion of the word in the text sentence segment to be compared;
calculating the reverse file frequency according to the word bag, the reverse file frequency being expressed as $idf_t = \log_{10}(N / df_t)$, where N represents the total number of files, said files being predetermined, and $df_t$ represents the number n of files that contain the word t, n being an integer greater than or equal to 0;
obtaining the first text matrix according to the word frequency and the reverse file frequency, the elements of the first text matrix being calculated with the formula $w_{t,d} = tf_{t,d} \times idf_t$.
3. The method as claimed in claim 1, further comprising a training process of the Bert model before obtaining the second text matrix based on the text sentence segments to be compared by the pre-trained Bert model, comprising the steps of:
acquiring an original corpus;
performing character-level segmentation on the original corpus;
constructing a sentence pair according to the original corpus, wherein the sentence pair comprises a positive sample sentence pair and a negative sample sentence pair, the positive sample sentence pair has a context relationship between sentences, and the negative sample sentence pair does not have a context relationship between sentences;
connecting the sentence pairs based on the original corpus after the character-level segmentation;
randomly masking ten percent of characters in the sentence pair to obtain a training corpus;
and inputting the training corpus into an initial Bert model for training to obtain the Bert model.
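A hypothetical sketch of the corpus construction in claim 3: adjacent sentences form positive sample pairs, randomly drawn sentences form negative sample pairs, and the connected, character-segmented pair is masked at the ten-percent rate the claim recites. The helper name, data layout, and 50/50 positive/negative split are assumptions.

```python
import random

def build_corpus(documents, mask_rate=0.10, seed=0):
    """Build training samples from `documents`, a list of lists of sentences."""
    rng = random.Random(seed)
    samples = []
    for doc in documents:
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:   # positive pair: sentences with a context relationship
                a, b, is_next = doc[i], doc[i + 1], 1
            else:                    # negative pair: no context relationship
                a, b, is_next = doc[i], rng.choice(rng.choice(documents)), 0
            # Character-level segmentation, then connect and randomly mask the pair.
            chars = list(a) + ["[SEP]"] + list(b)
            masked = [c if c == "[SEP]" or rng.random() >= mask_rate else "[MASK]"
                      for c in chars]
            samples.append((masked, is_next))
    return samples

print(build_corpus([["the cat sat.", "it purred."], ["rain fell."]])[0])
```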
4. The method of claim 1, wherein the concatenating the first text matrix and the second text matrix to obtain a concatenated text matrix comprises:
judging whether the matrix sizes of the first text matrix and the second text matrix are the same or not;
if the first text matrix and the second text matrix are the same, splicing the first text matrix and the second text matrix to obtain a spliced text matrix;
and if the first text matrix is different from the second text matrix, reducing the dimension of the first text matrix by adopting a principal component analysis method so that the matrix size of the first text matrix equals the matrix size of the second text matrix, and splicing the dimension-reduced first text matrix and the second text matrix to obtain the spliced text matrix.
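Claim 4's size check and dimension reduction might look like the sketch below, with scikit-learn's PCA as one concrete choice of principal component analysis method. Note that PCA yields at most min(n_samples, n_features) components, so matching a Bert width of 768 presumes at least that many sentence segments.

```python
import numpy as np
from sklearn.decomposition import PCA

def splice(first: np.ndarray, second: np.ndarray) -> np.ndarray:
    """Concatenate the two text matrices, reducing the first if the widths differ."""
    if first.shape[1] != second.shape[1]:
        # Reduce the (typically wider) word frequency matrix to the Bert width.
        first = PCA(n_components=second.shape[1]).fit_transform(first)
    return np.concatenate([first, second], axis=1)
```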
5. The method according to any one of claims 1 to 4, wherein the performing feature optimization on the spliced text matrix to obtain a target text matrix comprises:
calculating a word vector $v_S$ of the spliced text matrix based on a principal component analysis method, expressed by the formula $v_S = \frac{1}{|S|}\sum_{t \in S}\frac{\alpha}{\alpha + p_t}\,v_t$, wherein S is the spliced text matrix, $v_t$ is the vector of the word t in the spliced text matrix, $\alpha$ is a preset smoothing parameter, and $p_t$ is the probability of the word appearing in a document;
obtaining a principal component u of the word vector $v_S$ by adopting a truncated singular value decomposition method;
performing feature optimization on the word vector $v_S$ according to the word vector $v_S$ and the principal component u to obtain an updated word vector, expressed by the formula $v'_S = v_S - u\,u^{T} v_S$, where T represents the matrix transpose operation;
and obtaining the target text matrix according to the updated word vector $v'_S$.
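The feature optimization of claim 5 closely follows the well-known SIF ("smooth inverse frequency") recipe: a probability-weighted average of word vectors followed by removal of the first principal component. A minimal sketch, assuming scikit-learn's TruncatedSVD for the truncated singular value decomposition and illustrative input types:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def sif_vector(word_vecs, probs, alpha=1e-3):
    """Weighted average v_S = (1/|S|) * sum_t alpha / (alpha + p_t) * v_t,
    where word_vecs maps each word t to its vector v_t from the spliced
    matrix and probs maps it to p_t."""
    return np.mean([alpha / (alpha + probs[w]) * v for w, v in word_vecs.items()],
                   axis=0)

def remove_principal_component(sentence_vectors):
    """v'_S = v_S - u (u^T v_S), with u the first right singular vector
    of the stacked sentence vectors."""
    X = np.vstack(sentence_vectors)
    u = TruncatedSVD(n_components=1).fit(X).components_[0]
    return X - np.outer(X @ u, u)  # subtract each row's projection onto u
```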
6. The method of claim 5, wherein obtaining the text similarity between the text segments to be compared according to the target text matrix by using a preset similarity algorithm comprises:
obtaining, from the target text matrix, the target text vector corresponding to each text sentence segment to be compared according to its updated word vector $v'_S$;
calculating the similarity between the target text vectors by adopting the preset similarity algorithm to obtain the text similarity between the text sentence segments to be compared, expressed as $\mathrm{sim}(q, c) = \frac{q \cdot c}{\lVert q \rVert\,\lVert c \rVert}$, where q represents the target text vector of a first text sentence segment to be compared, c represents the target text vector of a second text sentence segment to be compared, and the text sentence segments to be compared comprise the first text sentence segment to be compared and the second text sentence segment to be compared.
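Assuming the similarity formula shown only as an image in claim 6 is the usual cosine form, the final step reduces to a few lines:

```python
import numpy as np

def text_similarity(q, c):
    """Cosine similarity sim(q, c) = (q . c) / (|q| |c|) between the two
    target text vectors."""
    return float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c)))

print(text_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))  # ~1.0
```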
7. An apparatus for calculating similarity of texts based on a Bert model, the apparatus comprising:
the text determining module is used for determining text sentence segments to be compared;
the first obtaining module is used for obtaining a first text matrix based on the text sentence segments to be compared by adopting a word frequency word occurrence rate algorithm;
the second acquisition module is used for acquiring a second text matrix based on the text sentence segments to be compared through a pre-trained Bert model;
the spliced text matrix acquisition module is used for splicing the first text matrix and the second text matrix to obtain a spliced text matrix;
the target text matrix acquisition module is used for carrying out feature optimization on the spliced text matrix to obtain a target text matrix;
and the text similarity calculation module is used for obtaining the text similarity between the text sentence segments to be compared according to the target text matrix by adopting a preset similarity calculation method.
8. The apparatus of claim 7, wherein the first obtaining module is specifically configured to:
establishing a word bag according to the text sentence segments to be compared, wherein the word bag comprises words appearing in the text sentence segments to be compared;
calculating the word frequency of each word in the text sentence segment to be compared according to the word bag, the word frequency being expressed as $wf_{t,d} = tf_{t,d} \cdot g_{t,d}$, where t represents a word, d represents a sentence segment, $tf_{t,d}$ represents whether the word t appears in the sentence segment d, taking 1 if it appears and 0 if it does not, and $g_{t,d}$ represents the proportion of the word in the text sentence segment to be compared;
calculating the reverse file frequency according to the word bag, the reverse file frequency being expressed by the formula $idf_t = \log\frac{N}{df_t + 1}$, where N represents the total number of files, said files being predetermined, and $df_t$ represents the number n of files that contain the word t, wherein n is an integer greater than or equal to 0;
and obtaining the first text matrix according to the word frequency and the reverse file frequency of each word.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the Bert model-based text similarity calculation method according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the Bert model-based text similarity calculation method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010151330.6A CN111368037A (en) | 2020-03-06 | 2020-03-06 | Text similarity calculation method and device based on Bert model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111368037A true CN111368037A (en) | 2020-07-03 |
Family
ID=71208783
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010151330.6A Pending CN111368037A (en) | 2020-03-06 | 2020-03-06 | Text similarity calculation method and device based on Bert model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111368037A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110287309A (en) * | 2019-06-21 | 2019-09-27 | 深圳大学 | The method of rapidly extracting text snippet |
CN110287494A (en) * | 2019-07-01 | 2019-09-27 | 济南浪潮高新科技投资发展有限公司 | A method of the short text Similarity matching based on deep learning BERT algorithm |
CN110532557A (en) * | 2019-08-29 | 2019-12-03 | 北京计算机技术及应用研究所 | A kind of unsupervised Text similarity computing method |
CN110705248A (en) * | 2019-10-09 | 2020-01-17 | 厦门今立方科技有限公司 | Text similarity calculation method, terminal device and storage medium |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112101030A (en) * | 2020-08-24 | 2020-12-18 | 沈阳东软智能医疗科技研究院有限公司 | Method, device and equipment for establishing term mapping model and realizing standard word mapping |
CN112101030B (en) * | 2020-08-24 | 2024-01-26 | 沈阳东软智能医疗科技研究院有限公司 | Method, device and equipment for establishing term mapping model and realizing standard word mapping |
CN112085091A (en) * | 2020-09-07 | 2020-12-15 | 中国平安财产保险股份有限公司 | Artificial intelligence-based short text matching method, device, equipment and storage medium |
CN112085091B (en) * | 2020-09-07 | 2024-04-26 | 中国平安财产保险股份有限公司 | Short text matching method, device, equipment and storage medium based on artificial intelligence |
WO2022061833A1 (en) * | 2020-09-27 | 2022-03-31 | 西门子股份公司 | Text similarity determination method and apparatus and industrial diagnosis method and system |
CN112651853A (en) * | 2020-11-17 | 2021-04-13 | 四川大学 | Judgment and opinion mining method and system based on referee document |
CN112632233A (en) * | 2021-03-09 | 2021-04-09 | 北京世纪好未来教育科技有限公司 | Method and device for improving problem solving capability of students |
CN114398665A (en) * | 2021-12-14 | 2022-04-26 | 杭萧钢构股份有限公司 | Data desensitization method, device, storage medium and terminal |
CN114358210A (en) * | 2022-01-14 | 2022-04-15 | 平安科技(深圳)有限公司 | Text similarity calculation method and device, computer equipment and storage medium |
CN115600575A (en) * | 2022-12-01 | 2023-01-13 | 北京语言大学(Cn) | Intelligent hierarchical recomposition method and device for Chinese text |
CN115600575B (en) * | 2022-12-01 | 2023-03-14 | 北京语言大学 | Intelligent hierarchical recomposition method and device for Chinese text |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |