CN112183111A - Long text semantic similarity matching method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112183111A
Authority
CN
China
Prior art keywords
text
semantic
long text
paragraph
long
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011042061.6A
Other languages
Chinese (zh)
Other versions
CN112183111B (en)
Inventor
徐晨兴
张雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Asiainfo Technologies China Inc
Original Assignee
Asiainfo Technologies China Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Asiainfo Technologies China Inc filed Critical Asiainfo Technologies China Inc
Priority to CN202011042061.6A priority Critical patent/CN112183111B/en
Publication of CN112183111A publication Critical patent/CN112183111A/en
Application granted granted Critical
Publication of CN112183111B publication Critical patent/CN112183111B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G06F 40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application provides a long text semantic similarity matching method and device, electronic equipment, and a storage medium. The method comprises the following steps: preprocessing a long text and a reference text respectively to obtain a plurality of first word vectors corresponding to the plurality of sentences of the long text and a plurality of second word vectors corresponding to the one sentence of the reference text; pooling the first word vectors and the second word vectors respectively to obtain a plurality of first semantic vectors respectively corresponding to the sentences of the long text and a second semantic vector corresponding to the one sentence of the reference text; inputting the first semantic vectors into a preset entity recognition model to determine the paragraph types of the paragraphs included in the long text; determining the weight corresponding to the first semantic vectors in each paragraph according to the paragraph type; and calculating the similarity of the long text relative to the reference text based on the first semantic vectors, the corresponding weights, and the second semantic vector.

Description

Long text semantic similarity matching method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of natural language processing, in particular to a long text semantic similarity matching method and device, electronic equipment and a storage medium.
Background
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field therefore involves natural language, that is, the language people use every day, and is closely related to the study of linguistics. In natural language processing, semantic similarity matching between different texts is sometimes required.
Existing semantic matching schemes handle matching between one short text and another short text; there is no existing scheme that can realize semantic matching between a long text and a short text.
Disclosure of Invention
The purpose of the present application is to solve at least one of the above technical drawbacks, and to provide the following solutions:
in a first aspect, a method for matching semantic similarity of long texts is provided, and the method includes:
respectively preprocessing a long text and a reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text, wherein the long text comprises a plurality of sentences, and the reference text comprises one sentence;
pooling the plurality of first word vectors and the plurality of second word vectors respectively to obtain a plurality of first semantic vectors respectively corresponding to the plurality of sentences of the long text and a second semantic vector corresponding to the one sentence of the reference text;
inputting the plurality of first semantic vectors into a preset entity recognition model to determine paragraph types of paragraphs included in the long text;
determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type;
and calculating the similarity of the long text relative to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector and the second semantic vector.
In a second aspect, an apparatus for semantic similarity matching of long texts is provided, the apparatus comprising:
the preprocessing module is used for respectively preprocessing the long text and the reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text;
the pooling module is used for respectively pooling the first word vectors and the second word vectors to obtain a plurality of first semantic vectors corresponding to the plurality of sentences of the long text and a second semantic vector corresponding to the one sentence of the reference text;
the classification module is used for determining paragraph types of paragraphs included in the long text according to a preset entity recognition model and the plurality of first semantic vectors;
the weighting module is used for determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type;
and the similarity calculation module is used for calculating the similarity of the long text relative to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector, the second semantic vector and a preset algorithm model.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors to perform the long text semantic similarity matching method according to the first aspect of the present application.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor, implements the long text semantic similarity matching method shown in the first aspect of the present application.
The beneficial effects brought by the technical scheme provided by this application are as follows: first word vectors corresponding to the long text and second word vectors corresponding to the reference text are obtained; the first and second word vectors are pooled respectively to obtain a first semantic vector for each sentence of the long text and a second semantic vector for the reference text; and the paragraph type of each paragraph is determined so as to determine the weight of each sentence of the long text. The semantic similarity between the long text and the reference text can thus be obtained, and because the obtained similarity is related to the paragraph types of the text, it is more accurate.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a long text semantic similarity matching method according to an embodiment of the present disclosure;
FIG. 2 is a detailed flowchart of step S101 in FIG. 1;
fig. 3 is a schematic structural diagram of a long text semantic similarity matching apparatus according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device for matching semantic similarity of long texts according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The long text semantic similarity matching method, device, electronic equipment and computer readable storage medium provided by the application aim to solve the above technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Those skilled in the art will understand that the "terminal" used in the present application may be a Mobile phone, a tablet computer, a PDA (Personal Digital Assistant), an MID (Mobile Internet Device), etc.; a "server" may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
Referring to fig. 1, an embodiment of the present application provides a long text semantic similarity matching method, where the long text semantic similarity matching method may be applied to a terminal or a server, and the method includes:
s101: the method comprises the steps of preprocessing a long text and a reference text respectively to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text respectively and a plurality of second word vectors corresponding to one sentence of the reference text, wherein the long text comprises a plurality of sentences, and the reference text comprises one sentence.
A long text here comprises at least two paragraph types, each paragraph type covers at least one paragraph, and each paragraph comprises one or more sentences. Neither the paragraph types included in the long text nor the method of classifying paragraphs into types is limited. For example, the paragraph types of a long text may include summary paragraphs and general paragraphs; paragraphs may also be classified into different types in other ways. The reference text comprises one sentence, which expresses one complete meaning; sub-sentences separated by punctuation marks may be included in that sentence. For example, one sentence of the reference text may be "I love dad, I love mom." or "I love my home."
The purpose of the pre-processing is to vectorize the long text and the reference text. During preprocessing, word segmentation and vectorization can be performed on the long text and the reference text respectively to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text respectively, and a plurality of second word vectors corresponding to one sentence of the reference text. Wherein each sentence of the long text comprises one or more first word vectors.
It can be understood that when the vectorization is performed on the long text and the reference text respectively, the same preset vectorization model is used for vectorization, so that the dimensions of the obtained first word vector and the second word vector are consistent.
S102: and performing pooling processing on the plurality of first word vectors and the plurality of second word vectors respectively to obtain a plurality of first semantic vectors respectively corresponding to the plurality of sentences of the long text and a second semantic vector corresponding to the one sentence of the reference text.
When the plurality of first word vectors and the plurality of second word vectors are respectively pooled, they may be pooled through a maximum pooling layer or an average pooling layer, i.e., a Max-Pooling layer or an Average-Pooling layer. Processing word vectors through a maximum or average pooling layer is prior art and is not described in detail in this application. If the plurality of first word vectors are pooled through the average pooling layer, they may be pooled as follows:
Z = (1/K) · Σ_{i=1}^{K} v_i

wherein Z is the first semantic vector corresponding to a sentence of the long text, K is the number of first word vectors included in the sentence, and v_i is the i-th word vector in the sentence. A first semantic vector can thus be obtained for each sentence of the long text, one sentence corresponding to one semantic vector, so that a plurality of semantic vectors are obtained. Similarly, the second semantic vector corresponding to the one sentence of the reference text can be obtained. How to pool the first and second word vectors through the maximum pooling layer is not described in this application.
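As a non-limiting sketch of the pooling above, assuming word vectors are represented as NumPy arrays (the function names and example values are illustrative, not part of the application):

```python
import numpy as np

def average_pool(word_vectors):
    """Average pooling: Z = (1/K) * sum of the K word vectors of a sentence."""
    return np.mean(np.asarray(word_vectors), axis=0)

def max_pool(word_vectors):
    """Max-Pooling alternative: element-wise maximum over the K word vectors."""
    return np.max(np.asarray(word_vectors), axis=0)

# A sentence represented by K = 3 word vectors of dimension 4.
sentence = [np.array([1.0, 0.0, 2.0, 4.0]),
            np.array([3.0, 2.0, 0.0, 0.0]),
            np.array([2.0, 4.0, 1.0, 2.0])]
z_avg = average_pool(sentence)  # first semantic vector: [2.0, 2.0, 1.0, 2.0]
z_max = max_pool(sentence)      # max-pooled alternative: [3.0, 4.0, 2.0, 4.0]
```

Either pooling collapses a variable number of word vectors into a single fixed-dimension semantic vector per sentence, which is what the subsequent steps rely on.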
S103: and inputting the plurality of first semantic vectors into a preset entity recognition model so as to determine the paragraph type of the paragraph included in the long text.
The entity recognition model comprises either of the following: a long short-term memory (LSTM) - conditional random field (CRF) model, or a bidirectional long short-term memory (BiLSTM) - conditional random field (CRF) model. Both the LSTM-CRF model and the BiLSTM-CRF model are prior art and are only briefly described in this application. A basic entity recognition model can be established and then trained on a large number of samples to obtain the entity recognition model. The paragraph types determined for the long text are not limited. For example, if the long text includes a first paragraph and a second paragraph, and the probability that the first paragraph is a summary paragraph is determined to be 0.9, which is greater than a preset probability threshold, such as 0.6, the first paragraph is determined to be a summary paragraph; if the probability that the second paragraph is a general paragraph is determined to be 0.8, likewise greater than the preset threshold of 0.6, the second paragraph is determined to be a general paragraph. Each paragraph of the long text corresponds to one paragraph type, and different paragraphs may correspond to the same paragraph type. For example, a long text may comprise 5 paragraphs, of which 1 is a summary paragraph and the other 4 are general paragraphs.
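The BiLSTM-CRF model itself is prior art; only the probability-thresholding step described in the example can be sketched simply. In the sketch below the probabilities, type names, and function name are hypothetical stand-ins for the entity recognition model's output:

```python
# Assign a paragraph type only when the model's probability for that type
# exceeds the preset threshold (0.6 in the example above).
PROB_THRESHOLD = 0.6

def classify_paragraph(type_probs, threshold=PROB_THRESHOLD):
    """type_probs: dict mapping paragraph type -> model probability.
    Returns the highest-probability type if it exceeds the threshold,
    otherwise None (no type assigned)."""
    best_type, best_prob = max(type_probs.items(), key=lambda kv: kv[1])
    return best_type if best_prob > threshold else None

# Mirrors the example: the first paragraph scores 0.9 as a summary
# paragraph, the second scores 0.8 as a general paragraph.
first = classify_paragraph({"summary": 0.9, "general": 0.1})   # "summary"
second = classify_paragraph({"summary": 0.2, "general": 0.8})  # "general"
```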
S104: and determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type.
A type weight correspondence table may be preset, where the type weight correspondence table includes a mapping relationship between paragraph types and weights, and each paragraph type corresponds to one weight. The weight size corresponding to the paragraph type can be set according to needs or experience. Namely, determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type includes: determining the weight corresponding to the first semantic vector in each paragraph according to a preset type weight corresponding table and the paragraph type, wherein the type weight corresponding table comprises a mapping relation between the paragraph type and the weight.
Different paragraph types correspond to different weights. For example, the weight of a summary paragraph may be 1 and the weight of a general paragraph 0.6. Once the weight corresponding to a paragraph type is determined, the weight corresponding to each paragraph of that type is determined, and so are the weights of all sentences in the paragraph: the weight corresponding to a sentence is the weight corresponding to the paragraph type of its paragraph. The scheme of the application can be applied to fields such as search engines, question-answering dialogue, and duplicate text matching.
Specifically, if the weight corresponding to a summary paragraph is 1 and the weight corresponding to a general paragraph is 0.6, then when paragraph D1 is a summary paragraph, the weights of all sentences in D1 are 1, and when paragraph D2 is a general paragraph, the weights of all sentences in D2 are 0.6.
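The type weight correspondence table and the propagation of a paragraph's weight to its sentences can be sketched as follows; the table values (1 and 0.6) are the example values above, set in practice by need or experience:

```python
# Hypothetical type weight correspondence table (paragraph type -> weight).
TYPE_WEIGHTS = {"summary": 1.0, "general": 0.6}

def sentence_weights(paragraphs):
    """paragraphs: list of (paragraph_type, sentence_count) pairs.
    Every sentence inherits the weight of its paragraph's type."""
    weights = []
    for p_type, n_sentences in paragraphs:
        weights.extend([TYPE_WEIGHTS[p_type]] * n_sentences)
    return weights

# Paragraph D1 is a summary paragraph with 2 sentences; paragraph D2 is a
# general paragraph with 3 sentences.
w = sentence_weights([("summary", 2), ("general", 3)])
# -> [1.0, 1.0, 0.6, 0.6, 0.6]
```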
S105: and calculating the similarity of the long text relative to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector and the second semantic vector.
And after the first semantic vector, the weight corresponding to the first semantic vector and the second semantic vector are determined, calculating to obtain the similarity of the long text relative to the reference text. The similarity of the long text to the reference text can be calculated based on the first semantic vector, the weight corresponding to the first semantic vector and the second semantic vector according to the following formula:
S = (1/L) · Σ_{i=1}^{L} λ_{m_i} · cos<N, m_i>

wherein N is the second semantic vector of the reference text, M is the set of first semantic vectors of the long text, L is the number of sentences of the long text, m_i is the first semantic vector of the i-th sentence in the long text, and λ_{m_i} is the weight corresponding to the first semantic vector of the i-th sentence. cos<N, m_i> is the cosine similarity between N and m_i, and its value lies in the range [-1, 1]. The closer S is to 1, the more similar the long text is to the reference text; the closer S is to -1, the more dissimilar they are. It can be understood that S may also be normalized to express the similarity in another way. Specifically, the similarity may be

(S + 1) / 2

so that when S is 0.5, the similarity is 75%.

Each cos<N, m_i> is calculated in the same way; for a first semantic vector m,

cos<N, m> = Σ_{j=1}^{k} N_j · m_j / ( sqrt(Σ_{j=1}^{k} N_j²) · sqrt(Σ_{j=1}^{k} m_j²) )

wherein k is the dimension of the vectors, N_j is the projection of the second semantic vector in the j-th dimension, and m_j is the projection of the first semantic vector in the j-th dimension. cos<N, m> is thus the cosine similarity between a sentence in the long text and the sentence in the reference text.
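Assuming the per-sentence weighted cosine similarities are averaged over the L sentences of the long text (a reading of the formula consistent with the surrounding definitions, not necessarily the only one), the calculation can be sketched as:

```python
import numpy as np

def cosine(a, b):
    """cos<a, b> = sum_j(a_j * b_j) / (||a|| * ||b||), a value in [-1, 1]."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def long_text_similarity(n, sentence_vectors, weights):
    """S = (1/L) * sum_i lambda_i * cos<N, m_i>, where n is the reference
    text's semantic vector and sentence_vectors are the L first semantic
    vectors of the long text."""
    L = len(sentence_vectors)
    return sum(w * cosine(n, m) for m, w in zip(sentence_vectors, weights)) / L

def normalize(s):
    """Map S from [-1, 1] to a percentage-style similarity: (S + 1) / 2,
    so S = 0.5 gives 0.75, i.e. 75%."""
    return (s + 1.0) / 2.0

# Two long-text sentences, both with weight 1, against a reference vector.
s = long_text_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
# s == 0.5; normalize(s) == 0.75
```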
It can be understood that, when determining the similarity of the long text relative to the reference text, a calculation model can be preset using the above formulas so as to determine the similarity quickly. So that the first semantic vectors, their corresponding weights, and the second semantic vector conform to the preset input format of the calculation model, the paragraphs of the long text are re-segmented after the paragraph type of each paragraph and the weight corresponding to the first semantic vectors in each paragraph have been determined. Each paragraph is limited to a preset first number of words, such as 512: when the number of words in a paragraph exceeds 512, the paragraph is split, i.e., re-segmented, so that each resulting paragraph contains at most 512 words. When the number of words in a paragraph is less than a preset second number, such as 256, the paragraph is filled with the paragraph immediately after it, so that its word count becomes greater than the second number and at most the first number. Paragraphs still shorter than the first number of words are padded with 0 so that the word count plus padding equals the first number.
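The re-segmentation rule just described (split paragraphs past the first number, fill short paragraphs from the following paragraph, pad with 0) can be sketched as follows. Small token counts are used for readability; the constants 512 and 256 are the example values from the text, and the function is an illustrative interpretation, not the application's exact procedure:

```python
MAX_TOKENS = 512   # preset first number
MIN_TOKENS = 256   # preset second number
PAD_ID = 0

def resegment(paragraphs, max_tokens=MAX_TOKENS, min_tokens=MIN_TOKENS):
    """paragraphs: list of token lists. Returns paragraphs of exactly
    max_tokens tokens each (padded with 0), none built from more than
    max_tokens real tokens."""
    # 1. Split over-long paragraphs into chunks of at most max_tokens.
    chunks = []
    for p in paragraphs:
        for i in range(0, len(p), max_tokens):
            chunks.append(list(p[i:i + max_tokens]))
    # 2. A chunk shorter than min_tokens borrows tokens from the chunk
    #    that follows it, never growing past max_tokens.
    out = []
    i = 0
    while i < len(chunks):
        cur = chunks[i]
        i += 1
        while len(cur) < min_tokens and i < len(chunks):
            need = max_tokens - len(cur)
            nxt = chunks[i]
            cur.extend(nxt[:need])
            if len(nxt) > need:
                chunks[i] = nxt[need:]
            else:
                i += 1
        out.append(cur)
    # 3. Pad every chunk with 0 up to max_tokens.
    return [c + [PAD_ID] * (max_tokens - len(c)) for c in out]
```

With max_tokens=4 and min_tokens=2, the paragraphs [1,2,3,4,5], [6], [7,8,9] become [1,2,3,4], [5,6,0,0], [7,8,9,0]: the first is split, the short remainder [5] borrows [6], and everything is padded to length 4.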
According to the long text semantic similarity matching method provided by the embodiment of the application, first word vectors corresponding to the long text and second word vectors corresponding to the reference text are obtained; the first and second word vectors are pooled respectively to obtain a first semantic vector for each sentence of the long text and a second semantic vector for the reference text; and the paragraph type of each paragraph is determined so as to determine the weight of each sentence of the long text. The semantic similarity between the long text and the reference text can thus be obtained, and because the obtained similarity is related to the paragraph types of the text, it is more accurate.
Referring to fig. 2, a possible implementation manner is further provided in an embodiment of the present application, where the preprocessing the long text and the reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to a sentence of the reference text respectively includes:
s201: and performing word segmentation processing on the long text and the reference text respectively through a preset word segmentation algorithm to obtain a plurality of first word segmentations corresponding to the long text and a plurality of second word segmentations corresponding to the reference text.
The word segmentation algorithm is not limited; for example, it may be the jieba word segmentation algorithm, NLPIR, and the like. The jieba segmentation algorithm and NLPIR are prior art and are not described in detail in this application. In the re-segmentation of the paragraphs of the long text described above, the number of words refers to the number of segmented words; that is, the number of segmented words in a paragraph cannot exceed the first number.
S202: and respectively carrying out vectorization processing on the first participle and the second participle through a preset vectorization model so as to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text.
The vectorization model comprises a Word2vec model, a Glove model or a Bert model. In the present application, the same vectorization model is used when the vectorization processing is performed on the first participle and the second participle, respectively. If a Word2vec model is adopted to carry out vectorization processing on the first participle and the second participle respectively, the obtained first Word vector and the second Word vector have consistent dimensions. Parameters in the Word2vec model, the Glove model or the Bert model are set as required, and are not limited in the application.
By performing word segmentation processing and vectorization processing on the long text and the reference text respectively, the subsequent determination of semantic vectors of the text and the determination of the weight of sentences in each paragraph can be facilitated, so that the similarity determination is more accurate.
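A minimal sketch of steps S201-S202, with a toy whitespace tokenizer and a hypothetical 2-dimensional embedding table standing in for jieba and a trained Word2vec/GloVe/BERT model. The point illustrated is the requirement above: both texts must pass through the same vectorization model so that the first and second word vectors have consistent dimensions:

```python
import numpy as np

# Hypothetical embedding table standing in for a trained vectorization model.
EMBEDDINGS = {
    "long":  np.array([0.1, 0.3]),
    "text":  np.array([0.2, 0.1]),
    "short": np.array([0.4, 0.0]),
}
UNK = np.array([0.0, 0.0])  # fallback for out-of-vocabulary tokens

def tokenize(sentence):
    # Whitespace split stands in for a real segmenter such as jieba.
    return sentence.split()

def vectorize(sentence):
    """Look every token up in the SAME embedding table, so the first and
    second word vectors come out with consistent dimensions."""
    return [EMBEDDINGS.get(tok, UNK) for tok in tokenize(sentence)]

first_vectors = vectorize("long text")    # word vectors of a long-text sentence
second_vectors = vectorize("short text")  # word vectors of the reference sentence
```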
Referring to fig. 3, an embodiment of the present application provides a long text semantic similarity matching apparatus 30, where the long text semantic similarity matching apparatus 30 may include:
a preprocessing module 301, configured to preprocess the long text and the reference text, respectively, to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text, respectively, and a plurality of second word vectors corresponding to one sentence of the reference text;
a pooling module 302, configured to pool the plurality of first word vectors and the plurality of second word vectors, respectively, so as to obtain a plurality of first semantic vectors respectively corresponding to the plurality of sentences of the long text, and a second semantic vector corresponding to the one sentence of the reference text;
a classification module 303, configured to determine a paragraph type of a paragraph included in the long text according to a preset entity recognition model and the plurality of first semantic vectors;
a weight module 304, configured to determine, according to the paragraph type, a weight corresponding to the first semantic vector in each paragraph;
a similarity calculation module 305, configured to calculate a similarity of the long text with respect to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector, the second semantic vector, and a preset algorithm model.
The long text semantic similarity matching device provided by the embodiment of the application obtains first word vectors corresponding to the long text and second word vectors corresponding to the reference text; pools the first and second word vectors respectively to obtain a first semantic vector for each sentence of the long text and a second semantic vector for the reference text; and determines the paragraph type of each paragraph so as to determine the weight of each sentence of the long text. The semantic similarity between the long text and the reference text can thus be obtained, and because the obtained similarity is related to the paragraph types of the text, it is more accurate.
The weight module 304 is specifically configured to determine a weight corresponding to the first semantic vector in each paragraph according to a preset type weight correspondence table and the paragraph type, where the type weight correspondence table includes a mapping relationship between the paragraph type and the weight.
The pooling module 302 is specifically configured to pool the plurality of first word vectors and the plurality of second word vectors through a maximum pooling layer or an average pooling layer.
Wherein, the preprocessing module 301 comprises:
the word segmentation unit is used for performing word segmentation processing on the long text and the reference text respectively through a preset word segmentation algorithm so as to obtain a plurality of first word segmentations corresponding to the long text and a plurality of second word segmentations corresponding to the reference text;
and the vectorization unit is used for respectively carrying out vectorization processing on the first participle and the second participle through a preset vectorization model so as to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text.
Referring to fig. 4, in an alternative embodiment, an electronic device 4000 is provided, which includes a processor 4001 and a memory 4003 coupled to each other, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004. Note that in practical applications the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that performs a computing function, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The memory 4003 may be a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing the application code that executes the scheme of the present application, and its execution is controlled by the processor 4001. The processor 4001 is configured to execute the application code stored in the memory 4003 to implement the content shown in the foregoing method embodiments.
Electronic devices include, but are not limited to, terminals and servers.
An embodiment of the present application provides an electronic device comprising a memory and a processor, with at least one program stored in the memory and executed by the processor to implement the corresponding aspects of the foregoing method embodiments. Compared with the prior art, the electronic device obtains first word vectors corresponding to a long text and second word vectors corresponding to a reference text, pools them to obtain a first semantic vector for each sentence of the long text and a second semantic vector for the reference text, and determines the paragraph type of each paragraph to weight each sentence of the long text; the resulting semantic similarity between the long text and the reference text is therefore related to the paragraph types of the text and is more accurate.
An embodiment of the present application provides a storage medium, namely a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to execute the corresponding content of the foregoing method embodiments. Compared with the prior art, the method obtains first word vectors corresponding to a long text and second word vectors corresponding to a reference text, pools them to obtain a first semantic vector for each sentence of the long text and a second semantic vector for the reference text, and determines the paragraph type of each paragraph to weight each sentence of the long text; the resulting semantic similarity is therefore related to the paragraph types of the text and is more accurate.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and whose execution order is not necessarily sequential; they may be executed in turn or in alternation with other steps, or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the present invention.

Claims (10)

1. A long text semantic similarity matching method, the method comprising:
respectively preprocessing a long text and a reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text, wherein the long text comprises a plurality of sentences, and the reference text comprises one sentence;
pooling the plurality of first word vectors and the plurality of second word vectors respectively to obtain a plurality of first semantic vectors corresponding to the plurality of sentences of the long text and a second semantic vector corresponding to the one sentence of the reference text;
inputting the plurality of first semantic vectors into a preset entity recognition model to determine paragraph types of paragraphs included in the long text;
determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type;
and calculating the similarity of the long text relative to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector and the second semantic vector.
2. The long text semantic similarity matching method according to claim 1, wherein the entity recognition model comprises any one of the following models:
long and short time memory LSTM-conditional random field CRF model; or
The BiLSTM-conditional random field CRF model is memorized in two directions and in long time.
3. The long text semantic similarity matching method according to claim 1, wherein the determining a weight corresponding to the first semantic vector in each paragraph according to the paragraph type comprises:
determining the weight corresponding to the first semantic vector in each paragraph according to a preset type weight corresponding table and the paragraph type, wherein the type weight corresponding table comprises a mapping relation between the paragraph type and the weight.
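The type-weight correspondence table of claim 3 can be sketched as a plain mapping lookup. The paragraph type names and weight values below are illustrative assumptions; the patent leaves the concrete table contents to the preset configuration.

```python
# Illustrative preset table mapping paragraph types to sentence weights.
TYPE_WEIGHTS = {"title": 1.5, "abstract": 1.2, "body": 1.0, "footnote": 0.3}

def sentence_weight(paragraph_type, default=1.0):
    """Look up the weight applied to every first semantic vector
    belonging to a paragraph of the given type."""
    return TYPE_WEIGHTS.get(paragraph_type, default)

print(sentence_weight("title"))    # 1.5
print(sentence_weight("unknown"))  # falls back to the default, 1.0
```

A default for unknown types keeps the lookup total, so sentences in unrecognized paragraphs still contribute to the similarity with a neutral weight.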
4. The long text semantic similarity matching method according to claim 1, wherein the similarity of the long text with respect to the reference text is calculated based on the first semantic vectors, the weights corresponding to the first semantic vectors, and the second semantic vector according to the following formula:

sim(N, M) = Σ_{i=1}^{L} λ_{m_i} · cos<N, m_i>

wherein N is the second semantic vector of the reference text, M is the set of first semantic vectors of the long text, L is the number of sentences of the long text, m_i is the first semantic vector of the i-th sentence in the long text, and λ_{m_i} is the weight corresponding to the first semantic vector of the i-th sentence in the long text;

wherein cos<N, m> is calculated based on the following formula:

cos<N, m> = (Σ_{j=1}^{k} N_j · m_j) / (√(Σ_{j=1}^{k} N_j²) · √(Σ_{j=1}^{k} m_j²))

wherein k is the dimension of the vectors, N_j is the projection of the second semantic vector in dimension j, and m_j is the projection of the first semantic vector in dimension j.
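The two formulas of claim 4 can be sketched directly in code. Since the published formula images are not reproduced in this text, the outer weighted sum, and in particular its normalization by the number of sentences L, is an assumption reconstructed from the variable definitions above; the inner cosine is the standard form.

```python
import math

def cosine(n, m):
    """cos<N, m>: dot product divided by the product of Euclidean norms."""
    dot = sum(nj * mj for nj, mj in zip(n, m))
    norm = (math.sqrt(sum(nj * nj for nj in n))
            * math.sqrt(sum(mj * mj for mj in m)))
    return dot / norm if norm else 0.0

def similarity(n_vec, m_vecs, weights):
    """Assumed form of claim 4: (1/L) * sum_i lambda_{m_i} * cos<N, m_i>."""
    L = len(m_vecs)  # number of sentences in the long text
    return sum(w * cosine(n_vec, m) for m, w in zip(m_vecs, weights)) / L

n = [1.0, 0.0]                                   # reference semantic vector
m = [[1.0, 0.0], [0.0, 1.0]]                     # two sentence vectors
print(similarity(n, m, [1.0, 1.0]))              # 0.5
```

With one sentence parallel to the reference and one orthogonal to it, equal weights yield an overall similarity of 0.5, matching the intuition of a per-sentence weighted average.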
5. The long text semantic similarity matching method according to claim 1, wherein the pooling of the plurality of first word vectors and the plurality of second word vectors respectively comprises:
and performing pooling processing on the plurality of first word vectors and the plurality of second word vectors through a maximum pooling layer or an average pooling layer respectively.
6. The method according to claim 1, wherein the preprocessing the long text and the reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to a sentence of the reference text respectively comprises:
performing word segmentation on the long text and the reference text respectively through a preset word segmentation algorithm to obtain a plurality of first segmented words corresponding to the long text and a plurality of second segmented words corresponding to the reference text;
and respectively vectorizing the first segmented words and the second segmented words through a preset vectorization model to obtain a plurality of first word vectors corresponding to the plurality of sentences of the long text and a plurality of second word vectors corresponding to the one sentence of the reference text.
7. The long text semantic similarity matching method according to claim 6, wherein the word segmentation algorithm comprises the Jieba segmentation algorithm, and the vectorization model comprises a Word2vec model, a GloVe model, or a BERT model.
8. A long text semantic similarity matching device, comprising:
the preprocessing module is used for respectively preprocessing the long text and the reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text;
the pooling module is used for respectively pooling the first word vectors and the second word vectors to obtain a plurality of first semantic vectors corresponding to the plurality of sentences of the long text and a second semantic vector corresponding to the one sentence of the reference text;
the classification module is used for determining paragraph types of paragraphs included in the long text according to a preset entity recognition model and the plurality of first semantic vectors;
the weighting module is used for determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type;
and the similarity calculation module is used for calculating the similarity of the long text relative to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector, the second semantic vector and a preset algorithm model.
9. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors so as to perform the long text semantic similarity matching method according to any one of claims 1 to 7.
10. A storage medium, which is a computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the long text semantic similarity matching method according to any one of claims 1 to 7.
CN202011042061.6A 2020-09-28 2020-09-28 Long text semantic similarity matching method, device, electronic equipment and storage medium Active CN112183111B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011042061.6A CN112183111B (en) 2020-09-28 2020-09-28 Long text semantic similarity matching method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011042061.6A CN112183111B (en) 2020-09-28 2020-09-28 Long text semantic similarity matching method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112183111A true CN112183111A (en) 2021-01-05
CN112183111B CN112183111B (en) 2024-08-23

Family

ID=73943871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011042061.6A Active CN112183111B (en) 2020-09-28 2020-09-28 Long text semantic similarity matching method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112183111B (en)


Citations (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7356461B1 (en) * 2002-01-14 2008-04-08 Nstein Technologies Inc. Text categorization method and apparatus
CN101620596A (en) * 2008-06-30 2010-01-06 东北大学 Multi-document auto-abstracting method facing to inquiry
US20120072220A1 (en) * 2010-09-20 2012-03-22 Alibaba Group Holding Limited Matching text sets
CN107273391A (en) * 2016-04-08 2017-10-20 北京国双科技有限公司 Document recommends method and apparatus
CN107463605A (en) * 2017-06-21 2017-12-12 北京百度网讯科技有限公司 The recognition methods and device of low-quality News Resources, computer equipment and computer-readable recording medium
CN107679144A (en) * 2017-09-25 2018-02-09 平安科技(深圳)有限公司 News sentence clustering method, device and storage medium based on semantic similarity
CN108256539A (en) * 2016-12-28 2018-07-06 北京智能管家科技有限公司 Man-machine interaction method, interactive system and Intelligent story device based on semantic matches
CN108628825A (en) * 2018-04-10 2018-10-09 平安科技(深圳)有限公司 Text message Similarity Match Method, device, computer equipment and storage medium
CN109117474A (en) * 2018-06-25 2019-01-01 广州多益网络股份有限公司 Calculation method, device and the storage medium of statement similarity
CN109213999A (en) * 2018-08-20 2019-01-15 成都佳发安泰教育科技股份有限公司 A kind of subjective item methods of marking
CN109388786A (en) * 2018-09-30 2019-02-26 武汉斗鱼网络科技有限公司 A kind of Documents Similarity calculation method, device, equipment and medium
CN109977203A (en) * 2019-03-07 2019-07-05 北京九狐时代智能科技有限公司 Statement similarity determines method, apparatus, electronic equipment and readable storage medium storing program for executing
CN110033022A (en) * 2019-03-08 2019-07-19 腾讯科技(深圳)有限公司 Processing method, device and the storage medium of text
CN110134942A (en) * 2019-04-01 2019-08-16 北京中科闻歌科技股份有限公司 Text hot spot extracting method and device
CN110298035A (en) * 2019-06-04 2019-10-01 平安科技(深圳)有限公司 Word vector based on artificial intelligence defines method, apparatus, equipment and storage medium
US10459962B1 (en) * 2018-09-19 2019-10-29 Servicenow, Inc. Selectively generating word vector and paragraph vector representations of fields for machine learning
CN110399484A (en) * 2019-06-25 2019-11-01 平安科技(深圳)有限公司 Sentiment analysis method, apparatus, computer equipment and the storage medium of long text
CN110704621A (en) * 2019-09-25 2020-01-17 北京大米科技有限公司 Text processing method and device, storage medium and electronic equipment
CN110941961A (en) * 2019-11-29 2020-03-31 秒针信息技术有限公司 Information clustering method and device, electronic equipment and storage medium
CN110968664A (en) * 2018-09-30 2020-04-07 北京国双科技有限公司 Document retrieval method, device, equipment and medium
US20200125804A1 (en) * 2017-06-30 2020-04-23 Fujitsu Limited Non-transitory computer readable recording medium, semantic vector generation method, and semantic vector generation device
WO2020111314A1 (en) * 2018-11-27 2020-06-04 한국과학기술원 Conceptual graph-based query-response apparatus and method
CN111444700A (en) * 2020-04-02 2020-07-24 山东山大鸥玛软件股份有限公司 Text similarity measurement method based on semantic document expression


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255369A (en) * 2021-06-10 2021-08-13 平安国际智慧城市科技股份有限公司 Text similarity analysis method and device and storage medium
CN113255369B (en) * 2021-06-10 2023-02-03 平安国际智慧城市科技股份有限公司 Text similarity analysis method and device and storage medium
CN113553848A (en) * 2021-07-19 2021-10-26 北京奇艺世纪科技有限公司 Long text classification method, system, electronic equipment and computer readable storage medium
CN113553848B (en) * 2021-07-19 2024-02-02 北京奇艺世纪科技有限公司 Long text classification method, system, electronic device, and computer-readable storage medium
CN114741499A (en) * 2022-06-08 2022-07-12 杭州费尔斯通科技有限公司 Text abstract generation method and system based on sentence semantic model
WO2024098636A1 (en) * 2022-11-08 2024-05-16 华院计算技术(上海)股份有限公司 Text matching method and apparatus, computer-readable storage medium, and terminal
CN117235546A (en) * 2023-11-14 2023-12-15 国泰新点软件股份有限公司 Multi-version file comparison method, device, system and storage medium
CN117235546B (en) * 2023-11-14 2024-03-12 国泰新点软件股份有限公司 Multi-version file comparison method, device, system and storage medium
CN117688138A (en) * 2024-02-02 2024-03-12 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division

Also Published As

Publication number Publication date
CN112183111B (en) 2024-08-23

Similar Documents

Publication Publication Date Title
CN112183111B (en) Long text semantic similarity matching method, device, electronic equipment and storage medium
CN110298035B (en) Word vector definition method, device, equipment and storage medium based on artificial intelligence
CN106570141B (en) Approximate repeated image detection method
CN113849648B (en) Classification model training method, device, computer equipment and storage medium
CN112329460B (en) Text topic clustering method, device, equipment and storage medium
CN110941951B (en) Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
US20200364216A1 (en) Method, apparatus and storage medium for updating model parameter
CN112085091B (en) Short text matching method, device, equipment and storage medium based on artificial intelligence
CN111368037A (en) Text similarity calculation method and device based on Bert model
CN114064852A (en) Method and device for extracting relation of natural language, electronic equipment and storage medium
CN111611796A (en) Hypernym determination method and device for hyponym, electronic device and storage medium
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN114818986A (en) Text similarity calculation duplication-removing method, system, medium and equipment
CN113326383B (en) Short text entity linking method, device, computing equipment and storage medium
CN113239693A (en) Method, device and equipment for training intention recognition model and storage medium
CN113435531A (en) Zero sample image classification method and system, electronic equipment and storage medium
CN113988085B (en) Text semantic similarity matching method and device, electronic equipment and storage medium
CN116089586A (en) Question generation method based on text and training method of question generation model
CN106021346B (en) Retrieval processing method and device
CN113761934B (en) Word vector representation method based on self-attention mechanism and self-attention model
CN112528646B (en) Word vector generation method, terminal device and computer-readable storage medium
CN113177406B (en) Text processing method, text processing device, electronic equipment and computer readable medium
CN112579774B (en) Model training method, model training device and terminal equipment
CN114417891A (en) Reply sentence determination method and device based on rough semantics and electronic equipment
CN112507081A (en) Similar sentence matching method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant