CN112183111A - Long text semantic similarity matching method and device, electronic equipment and storage medium - Google Patents
Long text semantic similarity matching method and device, electronic equipment and storage medium
- Publication number
- CN112183111A CN112183111A CN202011042061.6A CN202011042061A CN112183111A CN 112183111 A CN112183111 A CN 112183111A CN 202011042061 A CN202011042061 A CN 202011042061A CN 112183111 A CN112183111 A CN 112183111A
- Authority
- CN
- China
- Prior art keywords
- text
- semantic
- long text
- paragraph
- long
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 35
- 239000013598 vector Substances 0.000 claims abstract description 182
- 238000011176 pooling Methods 0.000 claims abstract description 32
- 238000007781 pre-processing Methods 0.000 claims abstract description 14
- 230000011218 segmentation Effects 0.000 claims description 21
- 238000012545 processing Methods 0.000 claims description 19
- 238000004422 calculation algorithm Methods 0.000 claims description 12
- 238000004364 calculation method Methods 0.000 claims description 5
- 238000004590 computer program Methods 0.000 claims description 4
- 238000013507 mapping Methods 0.000 claims description 4
- 238000003058 natural language processing Methods 0.000 description 5
- 230000003287 optical effect Effects 0.000 description 3
- 238000005034 decoration Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000002457 bidirectional effect Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000003203 everyday effect Effects 0.000 description 1
- 230000002093 peripheral effect Effects 0.000 description 1
- 230000001502 supplementing effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The embodiment of the application provides a long text semantic similarity matching method and device, electronic equipment and a storage medium. The method comprises the following steps: respectively preprocessing the long text and the reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text; pooling the plurality of first word vectors and the plurality of second word vectors respectively to obtain a plurality of first semantic vectors corresponding to the plurality of sentences of the long text and a second semantic vector corresponding to the one sentence of the reference text; inputting the plurality of first semantic vectors into a preset entity recognition model to determine paragraph types of paragraphs included in the long text; determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type; and calculating the similarity of the long text relative to the reference text based on the first semantic vectors, the weights corresponding to the first semantic vectors and the second semantic vector.
Description
Technical Field
The application relates to the technical field of natural language processing, in particular to a long text semantic similarity matching method and device, electronic equipment and a storage medium.
Background
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Research in this field therefore involves natural language, i.e. the language people use every day, and is thus closely related to the study of linguistics. In natural language processing, semantic similarity matching between different texts is sometimes required.
Existing semantic matching schemes match short text against short text; there is no existing scheme that realizes semantic matching between a long text and a short text.
Disclosure of Invention
The purpose of the present application is to solve at least one of the above technical drawbacks, and to provide the following solutions:
in a first aspect, a method for matching semantic similarity of long texts is provided, and the method includes:
respectively preprocessing a long text and a reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text, wherein the long text comprises a plurality of sentences, and the reference text comprises one sentence;
pooling the plurality of first word vectors and the plurality of second word vectors respectively to obtain a plurality of first semantic vectors corresponding to a plurality of sentences of the long text respectively and a second semantic vector corresponding to the one sentence of the reference text;
inputting the plurality of first semantic vectors into a preset entity recognition model to determine paragraph types of paragraphs included in the long text;
determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type;
and calculating the similarity of the long text relative to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector and the second semantic vector.
In a second aspect, an apparatus for semantic similarity matching of long texts is provided, the apparatus comprising:
the preprocessing module is used for respectively preprocessing the long text and the reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text;
the pooling module is used for respectively pooling the first word vectors and the second word vectors to obtain a plurality of first semantic vectors corresponding to a plurality of sentences of the long text and a second semantic vector corresponding to the one sentence of the reference text;
the classification module is used for determining paragraph types of paragraphs included in the long text according to a preset entity recognition model and the plurality of first semantic vectors;
the weighting module is used for determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type;
and the similarity calculation module is used for calculating the similarity of the long text relative to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector, the second semantic vector and a preset algorithm model.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the long text semantic similarity matching method according to the first aspect of the present application.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor, implements the long text semantic similarity matching method shown in the first aspect of the present application.
The beneficial effects of the technical scheme provided by the application are as follows: a first word vector corresponding to the long text and a second word vector corresponding to the reference text are obtained; the first word vectors and the second word vectors are pooled respectively to obtain a first semantic vector corresponding to each sentence of the long text and a second semantic vector corresponding to the reference text; and the paragraph type of each paragraph is determined so as to determine the weight of each sentence of the long text. The semantic similarity between the long text and the reference text can thereby be obtained, and because the obtained similarity is related to the paragraph types of the text, it is more accurate.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a long text semantic similarity matching method according to an embodiment of the present disclosure;
FIG. 2 is a detailed flowchart of step S101 in FIG. 1;
fig. 3 is a schematic structural diagram of a long text semantic similarity matching apparatus according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device for matching semantic similarity of long texts according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The long text semantic similarity matching method, device, electronic equipment and computer readable storage medium provided by the application aim to solve the above technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Those skilled in the art will understand that the "terminal" used in the present application may be a Mobile phone, a tablet computer, a PDA (Personal Digital Assistant), an MID (Mobile Internet Device), etc.; a "server" may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
Referring to fig. 1, an embodiment of the present application provides a long text semantic similarity matching method, where the long text semantic similarity matching method may be applied to a terminal or a server, and the method includes:
s101: the method comprises the steps of preprocessing a long text and a reference text respectively to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text respectively and a plurality of second word vectors corresponding to one sentence of the reference text, wherein the long text comprises a plurality of sentences, and the reference text comprises one sentence.
The long text comprises at least two paragraph types, each paragraph type covers at least one paragraph, and each paragraph comprises one or more sentences. Neither the paragraph types included in the long text nor the method of classifying paragraph types is limited. For example, the paragraph types of the long text may include summary paragraphs and general paragraphs; paragraphs may also be classified in other ways into different paragraph types. The reference text comprises one sentence that expresses a complete meaning, and this sentence may contain clauses separated by punctuation marks. For example, the sentence of the reference text may be "I love dad, and I love mom." or "I love my home."
The purpose of the pre-processing is to vectorize the long text and the reference text. During preprocessing, word segmentation and vectorization can be performed on the long text and the reference text respectively to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text respectively, and a plurality of second word vectors corresponding to one sentence of the reference text. Wherein each sentence of the long text comprises one or more first word vectors.
It can be understood that when the vectorization is performed on the long text and the reference text respectively, the same preset vectorization model is used for vectorization, so that the dimensions of the obtained first word vector and the second word vector are consistent.
S102: and performing pooling processing on the plurality of first word vectors and the plurality of second word vectors respectively to obtain a plurality of first semantic vectors corresponding to a plurality of sentences of the long text respectively and a second semantic vector corresponding to the one sentence of the reference text.
When the plurality of first word vectors and the plurality of second word vectors are respectively pooled, the pooling may be performed through a maximum pooling layer or an average pooling layer. The maximum pooling layer is a Max-Pooling layer, and the average pooling layer is an Average-Pooling layer. Processing word vectors through a maximum pooling layer or an average pooling layer is an existing technique and is not described in detail in this application. If the plurality of first word vectors are pooled by the average pooling layer, the first word vectors of one sentence may be pooled by:

$$Z = \frac{1}{K}\sum_{i=1}^{K} v_i$$

wherein Z is the first semantic vector corresponding to one sentence of the long text, K is the number of first word vectors included in the sentence, and $v_i$ is the i-th word vector in the sentence. A first semantic vector is obtained for each sentence of the long text, one sentence corresponding to one semantic vector, so that a plurality of first semantic vectors are obtained. Similarly, the second semantic vector corresponding to the one sentence of the reference text can be obtained. How the plurality of first word vectors and the plurality of second word vectors are pooled through the maximum pooling layer is not described in this application.
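By way of illustration, the average pooling formula above (and the maximum pooling alternative) can be sketched in a few lines. The function names and the toy vectors below are assumptions made for the example and are not part of the application:

```python
import numpy as np

def average_pool(word_vectors: np.ndarray) -> np.ndarray:
    """Average pooling: Z = (1/K) * sum_i v_i over the K word vectors
    of one sentence, giving one sentence-level semantic vector."""
    return word_vectors.mean(axis=0)

def max_pool(word_vectors: np.ndarray) -> np.ndarray:
    """Maximum pooling: element-wise maximum over the K word vectors."""
    return word_vectors.max(axis=0)

# A sentence with K=3 word vectors of dimension 4 (toy values).
sentence = np.array([[0.1, 0.2, 0.3, 0.4],
                     [0.5, 0.1, 0.0, 0.2],
                     [0.3, 0.3, 0.3, 0.3]])
first_semantic_vector = average_pool(sentence)  # shape (4,)
```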
S103: and inputting the plurality of first semantic vectors into a preset entity recognition model so as to determine the paragraph type of the paragraph included in the long text.
The entity recognition model comprises any one of the following models: a long short-term memory (LSTM)-conditional random field (CRF) model, or a bidirectional long short-term memory (BiLSTM)-CRF model. Both the LSTM-CRF model and the BiLSTM-CRF model are existing techniques and are only briefly described in this application. A basic entity recognition model can be established and then trained on a large number of samples to obtain the entity recognition model. The determined paragraph types of the long text are not limited. For example, if the long text includes a first paragraph and a second paragraph, and the probability that the first paragraph is a summary paragraph is determined to be 0.9, which is greater than a preset probability threshold such as 0.6, the first paragraph is determined to be a summary paragraph; if the probability that the second paragraph is a normal paragraph is determined to be 0.8, which is also greater than the threshold 0.6, the second paragraph is determined to be a normal paragraph. Each paragraph of the long text corresponds to one paragraph type, and different paragraphs may correspond to the same paragraph type. For example, a long text may comprise 5 paragraphs, of which 1 paragraph has the paragraph type summary paragraph and 4 paragraphs have the paragraph type general paragraph.
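As a rough sketch only, the step of mapping the sequence of first semantic vectors to per-sentence paragraph-type probabilities could look as follows. This sketch omits the CRF layer of the BiLSTM-CRF model for brevity, and the class name, vector dimensions and number of types are assumptions, not the application's actual model:

```python
import torch
import torch.nn as nn

class ParagraphTypeTagger(nn.Module):
    """Sketch: a BiLSTM reads the first semantic vectors (one per sentence)
    and predicts a paragraph-type probability per sentence; the CRF layer
    of a full BiLSTM-CRF model is omitted here."""
    def __init__(self, vec_dim: int = 128, hidden: int = 64, num_types: int = 2):
        super().__init__()
        self.bilstm = nn.LSTM(vec_dim, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_types)

    def forward(self, sentence_vectors: torch.Tensor) -> torch.Tensor:
        # sentence_vectors: (batch, num_sentences, vec_dim)
        states, _ = self.bilstm(sentence_vectors)
        return self.classifier(states).softmax(dim=-1)

tagger = ParagraphTypeTagger()
probs = tagger(torch.randn(1, 5, 128))  # 5 sentences -> (1, 5, 2) type probabilities
# A type is accepted when its probability exceeds a threshold, e.g. 0.6.
```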
S104: and determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type.
A type weight correspondence table may be preset, the table comprising a mapping relationship between paragraph types and weights, with each paragraph type corresponding to one weight. The weight corresponding to each paragraph type can be set as needed or according to experience. That is, determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type includes: determining the weight corresponding to the first semantic vector in each paragraph according to a preset type weight correspondence table and the paragraph type, wherein the type weight correspondence table comprises a mapping relationship between paragraph types and weights.
Different paragraph types correspond to different weights. For example, the weight of a summary paragraph may be 1 and the weight of a general paragraph 0.6. Once the weight corresponding to a paragraph type is determined, the weight corresponding to each paragraph of that type is determined, and so are the weights corresponding to all sentences in such a paragraph: the weight corresponding to a sentence in a paragraph is the weight corresponding to the paragraph type of that paragraph. The scheme of the application can be applied to fields such as search engines, question-answering dialogue and duplicate text matching.
Specifically, the weight corresponding to the summary paragraph is 1, and the weight corresponding to the common paragraph is 0.6, so that when a paragraph D1 is a summary paragraph, the weights of all sentences in the paragraph D1 are all 1, and when a paragraph D2 is a common paragraph, the weights of all sentences in the paragraph D2 are all 0.6.
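A minimal sketch of the type weight correspondence table and the per-sentence weight lookup, using the example values 1 and 0.6 from the text; the dictionary layout and the function name are assumptions made for illustration:

```python
# Hypothetical type-weight correspondence table; the values 1.0 and 0.6
# follow the example in the text, the dict layout is an assumption.
TYPE_WEIGHTS = {"summary": 1.0, "general": 0.6}

def sentence_weights(paragraph_types: list[str],
                     sentences_per_paragraph: list[int]) -> list[float]:
    """Expand per-paragraph types into one weight per sentence: every
    sentence inherits the weight of its paragraph's type."""
    weights: list[float] = []
    for p_type, n_sentences in zip(paragraph_types, sentences_per_paragraph):
        weights.extend([TYPE_WEIGHTS[p_type]] * n_sentences)
    return weights

# Paragraph D1 (summary, 3 sentences) and D2 (general, 2 sentences):
print(sentence_weights(["summary", "general"], [3, 2]))
# -> [1.0, 1.0, 1.0, 0.6, 0.6]
```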
S105: and calculating the similarity of the long text relative to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector and the second semantic vector.
After the first semantic vectors, the weights corresponding to the first semantic vectors and the second semantic vector are determined, the similarity of the long text relative to the reference text is calculated. Based on the first semantic vectors, their corresponding weights and the second semantic vector, the similarity can be calculated according to the following formula:

$$s = \frac{\sum_{i=1}^{L} \lambda_{m_i} \cdot \cos\langle N, m_i \rangle}{\sum_{i=1}^{L} \lambda_{m_i}}$$

wherein N is the second semantic vector of the reference text, M denotes the plurality of first semantic vectors of the long text, L is the number of sentences of the long text, $m_i$ is the first semantic vector of the i-th sentence in the long text, and $\lambda_{m_i}$ is the weight corresponding to the first semantic vector of the i-th sentence in the long text. $\cos\langle N, m_i \rangle$ is the cosine similarity between N and $m_i$, and its value lies in the range [-1, 1]. The closer s is to 1, the more similar the long text is to the reference text; the closer s is to -1, the more dissimilar the long text and the reference text are. It can be understood that s may also be normalized to express the similarity in another way. Specifically, the similarity may be

$$\text{similarity} = \frac{s + 1}{2}$$

so that when s is 0.5, the similarity is 75%.

Each $\cos\langle N, m_i \rangle$ is calculated based on:

$$\cos\langle N, m \rangle = \frac{\sum_{j=1}^{k} N_j \cdot m_j}{\sqrt{\sum_{j=1}^{k} N_j^2} \cdot \sqrt{\sum_{j=1}^{k} m_j^2}}$$

wherein k is the dimension of the vectors, $N_j$ is the projection of the second semantic vector in the j-th dimension, and $m_j$ is the projection of the first semantic vector in the j-th dimension. $\cos\langle N, m \rangle$ is the cosine similarity between one sentence of the long text and the sentence of the reference text.
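A minimal sketch of the similarity calculation implementing the formulas above; normalizing by the sum of the weights is an assumption chosen so that s stays in [-1, 1] as stated:

```python
import numpy as np

def cosine(n: np.ndarray, m: np.ndarray) -> float:
    """cos<N, m>: cosine similarity between two k-dimensional vectors."""
    return float(np.dot(n, m) / (np.linalg.norm(n) * np.linalg.norm(m)))

def long_text_similarity(n: np.ndarray,
                         sentence_vectors: list[np.ndarray],
                         weights: list[float]) -> float:
    """Weighted combination of per-sentence cosine similarities over the
    L sentences of the long text, normalized by the sum of the weights."""
    s = sum(w * cosine(n, m) for w, m in zip(weights, sentence_vectors))
    return s / sum(weights)

def as_percentage(s: float) -> float:
    """Map s in [-1, 1] to a similarity percentage: (s + 1) / 2,
    e.g. s = 0.5 gives 0.75, i.e. 75%."""
    return (s + 1.0) / 2.0
```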
It can be understood that, when determining the similarity of the long text relative to the reference text, a calculation model may be preset with the above formulas so as to determine the similarity quickly. So that the first semantic vectors, their corresponding weights and the second semantic vector conform to the preset input format of the calculation model, the paragraphs of the long text are re-segmented after the paragraph type of each paragraph and the weight corresponding to the first semantic vector in each paragraph have been determined. Each paragraph may include at most a preset first number of words, such as 512; when the number of words in a paragraph exceeds 512, the paragraph is split, i.e. re-segmented, so that each resulting paragraph contains no more than 512 words. If the number of words in a paragraph is less than a preset second number, such as 256, the paragraph is filled with the paragraph immediately following it, so that its word count is greater than the second number and less than or equal to the first number. Paragraphs still containing fewer than the first number of words are padded with zeros so that the padded length equals the first number.
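The re-segmentation described above might be sketched as follows; the padding token and the exact merging rule are assumptions made for illustration:

```python
PAD = "0"  # zero-padding token; the token choice is an assumption

def resegment(paragraphs: list[list[str]],
              first_n: int = 512, second_n: int = 256) -> list[list[str]]:
    """Sketch of the re-segmentation step: merge paragraphs shorter than
    second_n word segments with the paragraph that follows, split paragraphs
    longer than first_n, and zero-pad every paragraph to exactly first_n."""
    merged: list[list[str]] = []
    for p in paragraphs:
        if merged and len(merged[-1]) < second_n:
            merged[-1] = merged[-1] + p   # fill a short paragraph from its successor
        else:
            merged.append(list(p))
    out: list[list[str]] = []
    for p in merged:
        for i in range(0, len(p), first_n):          # split over-long paragraphs
            chunk = p[i:i + first_n]
            chunk += [PAD] * (first_n - len(chunk))  # zero-pad to first_n
            out.append(chunk)
    return out
```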
The long text semantic similarity matching method provided by the embodiment of the application comprises the steps of obtaining a first word vector corresponding to a long text and a second word vector corresponding to a reference text, performing pooling processing on the first word vector and the second word vector respectively to obtain a first semantic vector corresponding to each sentence of the long text and a second semantic vector corresponding to the reference text, and determining a paragraph type of each paragraph to determine the weight of each sentence of the long text, so that the semantic similarity between the long text and the reference text can be obtained, the obtained similarity is related to the paragraph type of the text, and the obtained similarity is more accurate.
Referring to fig. 2, a possible implementation manner is further provided in an embodiment of the present application, where the preprocessing the long text and the reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to a sentence of the reference text respectively includes:
s201: and performing word segmentation processing on the long text and the reference text respectively through a preset word segmentation algorithm to obtain a plurality of first word segmentations corresponding to the long text and a plurality of second word segmentations corresponding to the reference text.
The word segmentation algorithm is not limited; for example, it may include the Jieba segmentation algorithm, NLPIR, and the like. The Jieba segmentation algorithm and NLPIR are existing techniques and are not described in detail in this application. In the foregoing re-segmentation of the paragraphs of the long text, the number of words refers to the number of word segments; that is, the number of word segments in one paragraph cannot exceed the first number.
S202: and respectively carrying out vectorization processing on the first participle and the second participle through a preset vectorization model so as to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text.
The vectorization model comprises a Word2vec model, a Glove model or a Bert model. In the present application, the same vectorization model is used when the vectorization processing is performed on the first participle and the second participle, respectively. If a Word2vec model is adopted to carry out vectorization processing on the first participle and the second participle respectively, the obtained first Word vector and the second Word vector have consistent dimensions. Parameters in the Word2vec model, the Glove model or the Bert model are set as required, and are not limited in the application.
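For illustration, word segmentation with Jieba followed by Word2vec vectorization (gensim 4.x API) might look as follows; the toy corpus and the training parameters are assumptions, not the application's configuration:

```python
import jieba                        # Jieba word segmentation
import numpy as np
from gensim.models import Word2Vec

# Toy corpus: segmented sentences for illustration only.
corpus = [jieba.lcut("我爱爸爸，我爱妈妈。"), jieba.lcut("我爱我的家。")]
w2v = Word2Vec(sentences=corpus, vector_size=100, min_count=1)  # gensim >= 4.0

def sentence_to_word_vectors(sentence: str) -> np.ndarray:
    """Segment one sentence with Jieba, then map each in-vocabulary word
    segment to its Word2vec word vector (one row per segment)."""
    segments = jieba.lcut(sentence)
    return np.stack([w2v.wv[s] for s in segments if s in w2v.wv])

first_word_vectors = sentence_to_word_vectors("我爱爸爸，我爱妈妈。")
```

Using the same trained model for both the long text and the reference text keeps the first and second word vectors in the same vector space with consistent dimensions, as required above.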
By performing word segmentation processing and vectorization processing on the long text and the reference text respectively, the subsequent determination of semantic vectors of the text and the determination of the weight of sentences in each paragraph can be facilitated, so that the similarity determination is more accurate.
Referring to fig. 3, an embodiment of the present application provides a long text semantic similarity matching apparatus 30, where the long text semantic similarity matching apparatus 30 may include:
a preprocessing module 301, configured to preprocess the long text and the reference text, respectively, to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text, respectively, and a plurality of second word vectors corresponding to one sentence of the reference text;
a pooling module 302, configured to pool the plurality of first word vectors and the plurality of second word vectors, respectively, so as to obtain a plurality of first semantic vectors corresponding to a plurality of sentences of the long text, respectively, and a second semantic vector corresponding to the one sentence of the reference text;
a classification module 303, configured to determine a paragraph type of a paragraph included in the long text according to a preset entity recognition model and the plurality of first semantic vectors;
a weight module 304, configured to determine, according to the paragraph type, a weight corresponding to the first semantic vector in each paragraph;
a similarity calculation module 305, configured to calculate a similarity of the long text with respect to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector, the second semantic vector, and a preset algorithm model.
The long text semantic similarity matching device provided by the embodiment of the application obtains a first word vector corresponding to a long text and a second word vector corresponding to a reference text, performs pooling processing on the first word vector and the second word vector respectively to obtain a first semantic vector corresponding to each sentence of the long text and a second semantic vector corresponding to the reference text, and determines a paragraph type of each paragraph to determine the weight of each sentence of the long text, so that the semantic similarity between the long text and the reference text can be obtained, the obtained similarity is related to the paragraph type of the text, and the obtained similarity is more accurate.
The weight module 304 is specifically configured to determine a weight corresponding to the first semantic vector in each paragraph according to a preset type weight correspondence table and the paragraph type, where the type weight correspondence table includes a mapping relationship between the paragraph type and the weight.
The pooling module 302 is specifically configured to pool the plurality of first word vectors and the plurality of second word vectors through a maximum pooling layer or an average pooling layer.
Wherein, the preprocessing module 301 comprises:
the word segmentation unit is used for performing word segmentation processing on the long text and the reference text respectively through a preset word segmentation algorithm so as to obtain a plurality of first word segmentations corresponding to the long text and a plurality of second word segmentations corresponding to the reference text;
and the vectorization unit is used for respectively carrying out vectorization processing on the first participle and the second participle through a preset vectorization model so as to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text.
Referring to fig. 4, in an alternative embodiment, an electronic device is provided, and an electronic device 4000 includes: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in the foregoing method embodiments.
Among them, electronic devices include but are not limited to: a terminal and a server.
An embodiment of the present application provides an electronic device, including: a memory and a processor; at least one program stored in the memory for execution by the processor to implement the corresponding aspects of the foregoing method embodiments, compared with the prior art, can implement: the method comprises the steps of obtaining a first word vector corresponding to a long text and a second word vector corresponding to a reference text, performing pooling processing on the first word vector and the second word vector respectively to obtain a first semantic vector corresponding to each sentence of the long text and a second semantic vector corresponding to the reference text, and determining a paragraph type of each paragraph to determine the weight of each sentence of the long text, so that the semantic similarity between the long text and the reference text can be obtained, the obtained similarity is related to the paragraph type of the text, and the obtained similarity is more accurate.
The embodiment of the present application provides a storage medium, which is a computer-readable storage medium, and a computer program is stored on the computer-readable storage medium, and when the computer program runs on a computer, the computer is enabled to execute the corresponding content in the foregoing method embodiment. Compared with the prior art, the method comprises the steps of obtaining a first word vector corresponding to the long text and a second word vector corresponding to the reference text, performing pooling processing on the first word vector and the second word vector respectively to obtain a first semantic vector corresponding to each sentence of the long text and a second semantic vector corresponding to the reference text, and determining the paragraph type of each paragraph to determine the weight of each sentence of the long text, so that the semantic similarity between the long text and the reference text can be obtained, the obtained similarity is related to the paragraph type of the text, and the obtained similarity is more accurate.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not performed in a strict order and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing is only a partial embodiment of the present invention. It should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A long text semantic similarity matching method, the method comprising:
respectively preprocessing a long text and a reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text, wherein the long text comprises a plurality of sentences, and the reference text comprises one sentence;
pooling the plurality of first word vectors and the plurality of second word vectors respectively to obtain a plurality of first semantic vectors corresponding to a plurality of sentences of the long text respectively and a second semantic vector corresponding to the one sentence of the reference text;
inputting the plurality of first semantic vectors into a preset entity recognition model to determine paragraph types of paragraphs included in the long text;
determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type;
and calculating the similarity of the long text relative to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector and the second semantic vector.
2. The long text semantic similarity matching method according to claim 1, wherein the entity recognition model comprises any one of the following models:
long and short time memory LSTM-conditional random field CRF model; or
The BiLSTM-conditional random field CRF model is memorized in two directions and in long time.
3. The long text semantic similarity matching method according to claim 1, wherein the determining a weight corresponding to the first semantic vector in each paragraph according to the paragraph type comprises:
determining the weight corresponding to the first semantic vector in each paragraph according to a preset type weight correspondence table and the paragraph type, wherein the type weight correspondence table comprises a mapping relationship between paragraph types and weights.
4. The long text semantic similarity matching method according to claim 1, wherein the similarity of the long text relative to the reference text is calculated based on the first semantic vector, the weight corresponding to the first semantic vector and the second semantic vector according to the following formula:

$$s = \frac{\sum_{i=1}^{L} \lambda_{m_i} \cdot \cos\langle N, m_i \rangle}{\sum_{i=1}^{L} \lambda_{m_i}}$$

wherein N is the second semantic vector of the reference text, M denotes the plurality of first semantic vectors of the long text, L is the number of sentences of the long text, $m_i$ is the first semantic vector of the i-th sentence in the long text, and $\lambda_{m_i}$ is the weight corresponding to the first semantic vector of the i-th sentence in the long text;

wherein $\cos\langle N, m \rangle$ is calculated based on the following formula:

$$\cos\langle N, m \rangle = \frac{\sum_{j=1}^{k} N_j \cdot m_j}{\sqrt{\sum_{j=1}^{k} N_j^2} \cdot \sqrt{\sum_{j=1}^{k} m_j^2}}$$

wherein k is the dimension of the vectors, $N_j$ is the projection of the second semantic vector in the j-th dimension, and $m_j$ is the projection of the first semantic vector in the j-th dimension.
5. The long text semantic similarity matching method according to claim 1, wherein the pooling of the plurality of first word vectors and the plurality of second word vectors respectively comprises:
and performing pooling processing on the plurality of first word vectors and the plurality of second word vectors through a maximum pooling layer or an average pooling layer respectively.
6. The method according to claim 1, wherein the preprocessing the long text and the reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to a sentence of the reference text respectively comprises:
performing word segmentation processing on the long text and the reference text respectively through a preset word segmentation algorithm to obtain a plurality of first word segmentations corresponding to the long text and a plurality of second word segmentations corresponding to the reference text;
and respectively carrying out vectorization processing on the first participle and the second participle through a preset vectorization model so as to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text.
7. The long text semantic similarity matching method according to claim 6, wherein the word segmentation algorithm comprises a Jieba segmentation algorithm, and the vectorization model comprises a Word2vec model, a Glove model or a Bert model.
8. A long text semantic similarity matching device, comprising:
the preprocessing module is used for respectively preprocessing the long text and the reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text;
the pooling module is used for respectively pooling the first word vectors and the second word vectors to obtain a plurality of first semantic vectors corresponding to a plurality of sentences of the long text and a second semantic vector corresponding to the one sentence of the reference text;
the classification module is used for determining paragraph types of paragraphs included in the long text according to a preset entity recognition model and the plurality of first semantic vectors;
the weighting module is used for determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type;
and the similarity calculation module is used for calculating the similarity of the long text relative to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector, the second semantic vector and a preset algorithm model.
9. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to execute the long text semantic similarity matching method according to any one of claims 1 to 7.
10. A storage medium, which is a computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the long text semantic similarity matching method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011042061.6A CN112183111B (en) | 2020-09-28 | 2020-09-28 | Long text semantic similarity matching method, device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011042061.6A CN112183111B (en) | 2020-09-28 | 2020-09-28 | Long text semantic similarity matching method, device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112183111A true CN112183111A (en) | 2021-01-05 |
CN112183111B CN112183111B (en) | 2024-08-23 |
Family
ID=73943871
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011042061.6A Active CN112183111B (en) | 2020-09-28 | 2020-09-28 | Long text semantic similarity matching method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112183111B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113255369A (en) * | 2021-06-10 | 2021-08-13 | 平安国际智慧城市科技股份有限公司 | Text similarity analysis method and device and storage medium |
CN113553848A (en) * | 2021-07-19 | 2021-10-26 | 北京奇艺世纪科技有限公司 | Long text classification method, system, electronic equipment and computer readable storage medium |
CN114741499A (en) * | 2022-06-08 | 2022-07-12 | 杭州费尔斯通科技有限公司 | Text abstract generation method and system based on sentence semantic model |
CN117235546A (en) * | 2023-11-14 | 2023-12-15 | 国泰新点软件股份有限公司 | Multi-version file comparison method, device, system and storage medium |
CN117688138A (en) * | 2024-02-02 | 2024-03-12 | 中船凌久高科(武汉)有限公司 | Long text similarity comparison method based on paragraph division |
WO2024098636A1 (en) * | 2022-11-08 | 2024-05-16 | 华院计算技术(上海)股份有限公司 | Text matching method and apparatus, computer-readable storage medium, and terminal |
Citations (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7356461B1 (en) * | 2002-01-14 | 2008-04-08 | Nstein Technologies Inc. | Text categorization method and apparatus |
CN101620596A (en) * | 2008-06-30 | 2010-01-06 | 东北大学 | Multi-document auto-abstracting method facing to inquiry |
US20120072220A1 (en) * | 2010-09-20 | 2012-03-22 | Alibaba Group Holding Limited | Matching text sets |
CN107273391A (en) * | 2016-04-08 | 2017-10-20 | 北京国双科技有限公司 | Document recommends method and apparatus |
CN107463605A (en) * | 2017-06-21 | 2017-12-12 | 北京百度网讯科技有限公司 | The recognition methods and device of low-quality News Resources, computer equipment and computer-readable recording medium |
CN107679144A (en) * | 2017-09-25 | 2018-02-09 | 平安科技(深圳)有限公司 | News sentence clustering method, device and storage medium based on semantic similarity |
CN108256539A (en) * | 2016-12-28 | 2018-07-06 | 北京智能管家科技有限公司 | Man-machine interaction method, interactive system and Intelligent story device based on semantic matches |
CN108628825A (en) * | 2018-04-10 | 2018-10-09 | 平安科技(深圳)有限公司 | Text message Similarity Match Method, device, computer equipment and storage medium |
CN109117474A (en) * | 2018-06-25 | 2019-01-01 | 广州多益网络股份有限公司 | Calculation method, device and the storage medium of statement similarity |
CN109213999A (en) * | 2018-08-20 | 2019-01-15 | 成都佳发安泰教育科技股份有限公司 | A kind of subjective item methods of marking |
CN109388786A (en) * | 2018-09-30 | 2019-02-26 | 武汉斗鱼网络科技有限公司 | A kind of Documents Similarity calculation method, device, equipment and medium |
CN109977203A (en) * | 2019-03-07 | 2019-07-05 | 北京九狐时代智能科技有限公司 | Statement similarity determines method, apparatus, electronic equipment and readable storage medium storing program for executing |
CN110033022A (en) * | 2019-03-08 | 2019-07-19 | 腾讯科技(深圳)有限公司 | Processing method, device and the storage medium of text |
CN110134942A (en) * | 2019-04-01 | 2019-08-16 | 北京中科闻歌科技股份有限公司 | Text hot spot extracting method and device |
CN110298035A (en) * | 2019-06-04 | 2019-10-01 | 平安科技(深圳)有限公司 | Word vector based on artificial intelligence defines method, apparatus, equipment and storage medium |
US10459962B1 (en) * | 2018-09-19 | 2019-10-29 | Servicenow, Inc. | Selectively generating word vector and paragraph vector representations of fields for machine learning |
CN110399484A (en) * | 2019-06-25 | 2019-11-01 | 平安科技(深圳)有限公司 | Sentiment analysis method, apparatus, computer equipment and the storage medium of long text |
CN110704621A (en) * | 2019-09-25 | 2020-01-17 | 北京大米科技有限公司 | Text processing method and device, storage medium and electronic equipment |
CN110941961A (en) * | 2019-11-29 | 2020-03-31 | 秒针信息技术有限公司 | Information clustering method and device, electronic equipment and storage medium |
CN110968664A (en) * | 2018-09-30 | 2020-04-07 | 北京国双科技有限公司 | Document retrieval method, device, equipment and medium |
US20200125804A1 (en) * | 2017-06-30 | 2020-04-23 | Fujitsu Limited | Non-transitory computer readable recording medium, semantic vector generation method, and semantic vector generation device |
WO2020111314A1 (en) * | 2018-11-27 | 2020-06-04 | 한국과학기술원 | Conceptual graph-based query-response apparatus and method |
CN111444700A (en) * | 2020-04-02 | 2020-07-24 | 山东山大鸥玛软件股份有限公司 | Text similarity measurement method based on semantic document expression |
Patent Citations (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7356461B1 (en) * | 2002-01-14 | 2008-04-08 | Nstein Technologies Inc. | Text categorization method and apparatus |
CN101620596A (en) * | 2008-06-30 | 2010-01-06 | 东北大学 | Multi-document auto-abstracting method facing to inquiry |
US20120072220A1 (en) * | 2010-09-20 | 2012-03-22 | Alibaba Group Holding Limited | Matching text sets |
CN107273391A (en) * | 2016-04-08 | 2017-10-20 | 北京国双科技有限公司 | Document recommends method and apparatus |
CN108256539A (en) * | 2016-12-28 | 2018-07-06 | 北京智能管家科技有限公司 | Man-machine interaction method, interactive system and Intelligent story device based on semantic matches |
CN107463605A (en) * | 2017-06-21 | 2017-12-12 | 北京百度网讯科技有限公司 | The recognition methods and device of low-quality News Resources, computer equipment and computer-readable recording medium |
US20200125804A1 (en) * | 2017-06-30 | 2020-04-23 | Fujitsu Limited | Non-transitory computer readable recording medium, semantic vector generation method, and semantic vector generation device |
CN107679144A (en) * | 2017-09-25 | 2018-02-09 | 平安科技(深圳)有限公司 | News sentence clustering method, device and storage medium based on semantic similarity |
CN108628825A (en) * | 2018-04-10 | 2018-10-09 | 平安科技(深圳)有限公司 | Text message Similarity Match Method, device, computer equipment and storage medium |
CN109117474A (en) * | 2018-06-25 | 2019-01-01 | 广州多益网络股份有限公司 | Calculation method, device and the storage medium of statement similarity |
CN109213999A (en) * | 2018-08-20 | 2019-01-15 | 成都佳发安泰教育科技股份有限公司 | A kind of subjective item methods of marking |
US10459962B1 (en) * | 2018-09-19 | 2019-10-29 | Servicenow, Inc. | Selectively generating word vector and paragraph vector representations of fields for machine learning |
CN110968664A (en) * | 2018-09-30 | 2020-04-07 | 北京国双科技有限公司 | Document retrieval method, device, equipment and medium |
CN109388786A (en) * | 2018-09-30 | 2019-02-26 | 武汉斗鱼网络科技有限公司 | A kind of Documents Similarity calculation method, device, equipment and medium |
WO2020111314A1 (en) * | 2018-11-27 | 2020-06-04 | 한국과학기술원 | Conceptual graph-based query-response apparatus and method |
CN109977203A (en) * | 2019-03-07 | 2019-07-05 | 北京九狐时代智能科技有限公司 | Statement similarity determines method, apparatus, electronic equipment and readable storage medium storing program for executing |
CN110033022A (en) * | 2019-03-08 | 2019-07-19 | 腾讯科技(深圳)有限公司 | Processing method, device and the storage medium of text |
CN110134942A (en) * | 2019-04-01 | 2019-08-16 | 北京中科闻歌科技股份有限公司 | Text hot spot extracting method and device |
CN110298035A (en) * | 2019-06-04 | 2019-10-01 | 平安科技(深圳)有限公司 | Word vector based on artificial intelligence defines method, apparatus, equipment and storage medium |
CN110399484A (en) * | 2019-06-25 | 2019-11-01 | 平安科技(深圳)有限公司 | Sentiment analysis method, apparatus, computer equipment and the storage medium of long text |
CN110704621A (en) * | 2019-09-25 | 2020-01-17 | 北京大米科技有限公司 | Text processing method and device, storage medium and electronic equipment |
CN110941961A (en) * | 2019-11-29 | 2020-03-31 | 秒针信息技术有限公司 | Information clustering method and device, electronic equipment and storage medium |
CN111444700A (en) * | 2020-04-02 | 2020-07-24 | 山东山大鸥玛软件股份有限公司 | Text similarity measurement method based on semantic document expression |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113255369A (en) * | 2021-06-10 | 2021-08-13 | 平安国际智慧城市科技股份有限公司 | Text similarity analysis method and device and storage medium |
CN113255369B (en) * | 2021-06-10 | 2023-02-03 | 平安国际智慧城市科技股份有限公司 | Text similarity analysis method and device and storage medium |
CN113553848A (en) * | 2021-07-19 | 2021-10-26 | 北京奇艺世纪科技有限公司 | Long text classification method, system, electronic equipment and computer readable storage medium |
CN113553848B (en) * | 2021-07-19 | 2024-02-02 | 北京奇艺世纪科技有限公司 | Long text classification method, system, electronic device, and computer-readable storage medium |
CN114741499A (en) * | 2022-06-08 | 2022-07-12 | 杭州费尔斯通科技有限公司 | Text abstract generation method and system based on sentence semantic model |
WO2024098636A1 (en) * | 2022-11-08 | 2024-05-16 | 华院计算技术(上海)股份有限公司 | Text matching method and apparatus, computer-readable storage medium, and terminal |
CN117235546A (en) * | 2023-11-14 | 2023-12-15 | 国泰新点软件股份有限公司 | Multi-version file comparison method, device, system and storage medium |
CN117235546B (en) * | 2023-11-14 | 2024-03-12 | 国泰新点软件股份有限公司 | Multi-version file comparison method, device, system and storage medium |
CN117688138A (en) * | 2024-02-02 | 2024-03-12 | 中船凌久高科(武汉)有限公司 | Long text similarity comparison method based on paragraph division |
Also Published As
Publication number | Publication date |
---|---|
CN112183111B (en) | 2024-08-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112183111B (en) | Long text semantic similarity matching method, device, electronic equipment and storage medium | |
CN110298035B (en) | Word vector definition method, device, equipment and storage medium based on artificial intelligence | |
CN106570141B (en) | Approximate repeated image detection method | |
CN113849648B (en) | Classification model training method, device, computer equipment and storage medium | |
CN112329460B (en) | Text topic clustering method, device, equipment and storage medium | |
CN110941951B (en) | Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment | |
US20200364216A1 (en) | Method, apparatus and storage medium for updating model parameter | |
CN112085091B (en) | Short text matching method, device, equipment and storage medium based on artificial intelligence | |
CN111368037A (en) | Text similarity calculation method and device based on Bert model | |
CN114064852A (en) | Method and device for extracting relation of natural language, electronic equipment and storage medium | |
CN111611796A (en) | Hypernym determination method and device for hyponym, electronic device and storage medium | |
CN113723077B (en) | Sentence vector generation method and device based on bidirectional characterization model and computer equipment | |
CN114818986A (en) | Text similarity calculation duplication-removing method, system, medium and equipment | |
CN113326383B (en) | Short text entity linking method, device, computing equipment and storage medium | |
CN113239693A (en) | Method, device and equipment for training intention recognition model and storage medium | |
CN113435531A (en) | Zero sample image classification method and system, electronic equipment and storage medium | |
CN113988085B (en) | Text semantic similarity matching method and device, electronic equipment and storage medium | |
CN116089586A (en) | Question generation method based on text and training method of question generation model | |
CN106021346B (en) | Retrieval processing method and device | |
CN113761934B (en) | Word vector representation method based on self-attention mechanism and self-attention model | |
CN112528646B (en) | Word vector generation method, terminal device and computer-readable storage medium | |
CN113177406B (en) | Text processing method, text processing device, electronic equipment and computer readable medium | |
CN112579774B (en) | Model training method, model training device and terminal equipment | |
CN114417891A (en) | Reply sentence determination method and device based on rough semantics and electronic equipment | |
CN112507081A (en) | Similar sentence matching method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |