CN112183111A - Long text semantic similarity matching method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112183111A
Authority
CN
China
Prior art keywords
text
semantic
long text
paragraph
long
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011042061.6A
Other languages
Chinese (zh)
Other versions
CN112183111B (en)
Inventor
徐晨兴
张雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Asiainfo Technologies China Inc
Original Assignee
Asiainfo Technologies China Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Asiainfo Technologies China Inc filed Critical Asiainfo Technologies China Inc
Priority to CN202011042061.6A priority Critical patent/CN112183111B/en
Publication of CN112183111A publication Critical patent/CN112183111A/en
Application granted granted Critical
Publication of CN112183111B publication Critical patent/CN112183111B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G06F 40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application provides a long text semantic similarity matching method and device, electronic equipment, and a storage medium. The method comprises the following steps: preprocessing a long text and a reference text respectively to obtain a plurality of first word vectors corresponding to the plurality of sentences of the long text and a plurality of second word vectors corresponding to the one sentence of the reference text; pooling the first word vectors and the second word vectors respectively to obtain a plurality of first semantic vectors respectively corresponding to the sentences of the long text and a second semantic vector corresponding to the one sentence of the reference text; inputting the first semantic vectors into a preset entity recognition model to determine the paragraph types of the paragraphs included in the long text; determining the weight corresponding to the first semantic vectors in each paragraph according to the paragraph type; and calculating the similarity of the long text relative to the reference text based on the first semantic vectors, the corresponding weights, and the second semantic vector.

Description

Long text semantic similarity matching method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of natural language processing, in particular to a long text semantic similarity matching method and device, electronic equipment and a storage medium.
Background
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field therefore involves natural language, that is, the language people use every day, and is closely related to the study of linguistics. In natural language processing, semantic similarity matching between different texts is sometimes required.
Existing semantic matching schemes handle matching between one short text and another short text; there is no existing scheme that can realize semantic matching between a long text and a short text.
Disclosure of Invention
The purpose of the present application is to solve at least one of the above technical drawbacks, and to provide the following solutions:
in a first aspect, a method for matching semantic similarity of long texts is provided, and the method includes:
respectively preprocessing a long text and a reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text, wherein the long text comprises a plurality of sentences, and the reference text comprises one sentence;
pooling the plurality of first word vectors and the plurality of second word vectors respectively to obtain a plurality of first semantic vectors respectively corresponding to the plurality of sentences of the long text and a second semantic vector corresponding to the one sentence of the reference text;
inputting the plurality of first semantic vectors into a preset entity recognition model to determine paragraph types of paragraphs included in the long text;
determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type;
and calculating the similarity of the long text relative to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector and the second semantic vector.
In a second aspect, an apparatus for semantic similarity matching of long texts is provided, the apparatus comprising:
the preprocessing module is used for respectively preprocessing the long text and the reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text;
the pooling module is used for respectively pooling the first word vectors and the second word vectors to obtain a plurality of first semantic vectors corresponding to the plurality of sentences of the long text and a second semantic vector corresponding to the one sentence of the reference text;
the classification module is used for determining paragraph types of paragraphs included in the long text according to a preset entity recognition model and the plurality of first semantic vectors;
the weighting module is used for determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type;
and the similarity calculation module is used for calculating the similarity of the long text relative to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector, the second semantic vector and a preset algorithm model.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors to perform the long text semantic similarity matching method according to the first aspect of the present application.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor, implements the long text semantic similarity matching method shown in the first aspect of the present application.
The beneficial effects brought by the technical scheme provided by this application are as follows: first word vectors corresponding to the long text and second word vectors corresponding to the reference text are obtained; the first and second word vectors are pooled respectively to obtain a first semantic vector for each sentence of the long text and a second semantic vector for the reference text; and the paragraph type of each paragraph is determined so as to determine the weight of each sentence of the long text. The semantic similarity between the long text and the reference text can thus be obtained, and because the obtained similarity is related to the paragraph types of the text, it is more accurate.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a long text semantic similarity matching method according to an embodiment of the present disclosure;
FIG. 2 is a detailed flowchart of step S101 in FIG. 1;
fig. 3 is a schematic structural diagram of a long text semantic similarity matching apparatus according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device for matching semantic similarity of long texts according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The long text semantic similarity matching method, device, electronic equipment and computer readable storage medium provided by the application aim to solve the above technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Those skilled in the art will understand that the "terminal" used in the present application may be a Mobile phone, a tablet computer, a PDA (Personal Digital Assistant), an MID (Mobile Internet Device), etc.; a "server" may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
Referring to fig. 1, an embodiment of the present application provides a long text semantic similarity matching method, where the long text semantic similarity matching method may be applied to a terminal or a server, and the method includes:
s101: the method comprises the steps of preprocessing a long text and a reference text respectively to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text respectively and a plurality of second word vectors corresponding to one sentence of the reference text, wherein the long text comprises a plurality of sentences, and the reference text comprises one sentence.
A long text here comprises at least two paragraph types, each paragraph type covers at least one paragraph, and each paragraph comprises one or more sentences. Neither the paragraph types included in the long text nor the method of classifying paragraphs into types is limited. For example, the paragraph types of a long text may include summary paragraphs and general paragraphs; paragraphs may also be classified into different types in other ways. The reference text comprises one sentence, which expresses one complete meaning; sub-sentences separated by punctuation marks may be included in that sentence. For example, one sentence of the reference text may be "I love dad, I love mom." or "I love my home."
The purpose of the pre-processing is to vectorize the long text and the reference text. During preprocessing, word segmentation and vectorization can be performed on the long text and the reference text respectively to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text respectively, and a plurality of second word vectors corresponding to one sentence of the reference text. Wherein each sentence of the long text comprises one or more first word vectors.
It can be understood that when the vectorization is performed on the long text and the reference text respectively, the same preset vectorization model is used for vectorization, so that the dimensions of the obtained first word vector and the second word vector are consistent.
S102: and performing pooling processing on the plurality of first word vectors and the plurality of second word vectors respectively to obtain a plurality of first semantic vectors respectively corresponding to the plurality of sentences of the long text and a second semantic vector corresponding to the one sentence of the reference text.
When the plurality of first word vectors and the plurality of second word vectors are respectively pooled, they may be pooled through a maximum pooling layer or an average pooling layer, i.e., a Max-Pooling layer or an Average-Pooling layer. Processing word vectors through a maximum or average pooling layer is prior art and is not described in detail in this application. If the plurality of first word vectors are pooled through the average pooling layer, they may be pooled as follows:
Z = (1/K) · Σ_{i=1}^{K} v_i

wherein Z is the first semantic vector corresponding to a sentence of the long text, K is the number of first word vectors included in the sentence, and v_i is the i-th word vector in the sentence. A first semantic vector can thus be obtained for each sentence of the long text, one sentence corresponding to one semantic vector, so that a plurality of semantic vectors are obtained. Similarly, the second semantic vector corresponding to the one sentence of the reference text can be obtained. How to pool the first and second word vectors through the maximum pooling layer is not described in this application.
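As a non-limiting sketch of the pooling above, assuming word vectors are represented as NumPy arrays (the function names and example values are illustrative, not part of the application):

```python
import numpy as np

def average_pool(word_vectors):
    """Average pooling: Z = (1/K) * sum of the K word vectors of a sentence."""
    return np.mean(np.asarray(word_vectors), axis=0)

def max_pool(word_vectors):
    """Max-Pooling alternative: element-wise maximum over the K word vectors."""
    return np.max(np.asarray(word_vectors), axis=0)

# A sentence represented by K = 3 word vectors of dimension 4.
sentence = [np.array([1.0, 0.0, 2.0, 4.0]),
            np.array([3.0, 2.0, 0.0, 0.0]),
            np.array([2.0, 4.0, 1.0, 2.0])]
z_avg = average_pool(sentence)  # first semantic vector: [2.0, 2.0, 1.0, 2.0]
z_max = max_pool(sentence)      # max-pooled alternative: [3.0, 4.0, 2.0, 4.0]
```

Either pooling collapses a variable number of word vectors into a single fixed-dimension semantic vector per sentence, which is what the subsequent steps rely on.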
S103: and inputting the plurality of first semantic vectors into a preset entity recognition model so as to determine the paragraph type of the paragraph included in the long text.
The entity recognition model comprises either of the following: a long short-term memory (LSTM) - conditional random field (CRF) model, or a bidirectional long short-term memory (BiLSTM) - conditional random field (CRF) model. Both the LSTM-CRF model and the BiLSTM-CRF model are prior art and are only briefly described in this application. A basic entity recognition model can be established and then trained on a large number of samples to obtain the entity recognition model. The paragraph types determined for the long text are not limited. For example, if the long text includes a first paragraph and a second paragraph, and the probability that the first paragraph is a summary paragraph is determined to be 0.9, which is greater than a preset probability threshold, such as 0.6, the first paragraph is determined to be a summary paragraph; if the probability that the second paragraph is a general paragraph is determined to be 0.8, likewise greater than the preset threshold of 0.6, the second paragraph is determined to be a general paragraph. Each paragraph of the long text corresponds to one paragraph type, and different paragraphs may correspond to the same paragraph type. For example, a long text may comprise 5 paragraphs, of which 1 is a summary paragraph and the other 4 are general paragraphs.
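The BiLSTM-CRF model itself is prior art; only the probability-thresholding step described in the example can be sketched simply. In the sketch below the probabilities, type names, and function name are hypothetical stand-ins for the entity recognition model's output:

```python
# Assign a paragraph type only when the model's probability for that type
# exceeds the preset threshold (0.6 in the example above).
PROB_THRESHOLD = 0.6

def classify_paragraph(type_probs, threshold=PROB_THRESHOLD):
    """type_probs: dict mapping paragraph type -> model probability.
    Returns the highest-probability type if it exceeds the threshold,
    otherwise None (no type assigned)."""
    best_type, best_prob = max(type_probs.items(), key=lambda kv: kv[1])
    return best_type if best_prob > threshold else None

# Mirrors the example: the first paragraph scores 0.9 as a summary
# paragraph, the second scores 0.8 as a general paragraph.
first = classify_paragraph({"summary": 0.9, "general": 0.1})   # "summary"
second = classify_paragraph({"summary": 0.2, "general": 0.8})  # "general"
```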
S104: and determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type.
A type weight correspondence table may be preset, where the type weight correspondence table includes a mapping relationship between paragraph types and weights, and each paragraph type corresponds to one weight. The weight size corresponding to the paragraph type can be set according to needs or experience. Namely, determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type includes: determining the weight corresponding to the first semantic vector in each paragraph according to a preset type weight corresponding table and the paragraph type, wherein the type weight corresponding table comprises a mapping relation between the paragraph type and the weight.
Different paragraph types correspond to different weights. For example, the weight of a summary paragraph may be 1 and the weight of a general paragraph 0.6. Once the weight corresponding to a paragraph type is determined, the weight corresponding to each paragraph of that type is determined, and so are the weights of all sentences in the paragraph: the weight corresponding to a sentence is the weight corresponding to the paragraph type of its paragraph. The scheme of the application can be applied to fields such as search engines, question-answering dialogue, and duplicate text matching.
Specifically, if the weight corresponding to a summary paragraph is 1 and the weight corresponding to a general paragraph is 0.6, then when paragraph D1 is a summary paragraph, the weights of all sentences in D1 are 1, and when paragraph D2 is a general paragraph, the weights of all sentences in D2 are 0.6.
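The type weight correspondence table and the propagation of a paragraph's weight to its sentences can be sketched as follows; the table values (1 and 0.6) are the example values above, set in practice by need or experience:

```python
# Hypothetical type weight correspondence table (paragraph type -> weight).
TYPE_WEIGHTS = {"summary": 1.0, "general": 0.6}

def sentence_weights(paragraphs):
    """paragraphs: list of (paragraph_type, sentence_count) pairs.
    Every sentence inherits the weight of its paragraph's type."""
    weights = []
    for p_type, n_sentences in paragraphs:
        weights.extend([TYPE_WEIGHTS[p_type]] * n_sentences)
    return weights

# Paragraph D1 is a summary paragraph with 2 sentences; paragraph D2 is a
# general paragraph with 3 sentences.
w = sentence_weights([("summary", 2), ("general", 3)])
# -> [1.0, 1.0, 0.6, 0.6, 0.6]
```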
S105: and calculating the similarity of the long text relative to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector and the second semantic vector.
And after the first semantic vector, the weight corresponding to the first semantic vector and the second semantic vector are determined, calculating to obtain the similarity of the long text relative to the reference text. The similarity of the long text to the reference text can be calculated based on the first semantic vector, the weight corresponding to the first semantic vector and the second semantic vector according to the following formula:
S = (1/L) · Σ_{i=1}^{L} λ_{m_i} · cos<N, m_i>

wherein N is the second semantic vector of the reference text, M is the set of first semantic vectors of the long text, L is the number of sentences of the long text, m_i is the first semantic vector of the i-th sentence in the long text, and λ_{m_i} is the weight corresponding to the first semantic vector of the i-th sentence. cos<N, m_i> is the cosine similarity between N and m_i, and its value lies in the range [-1, 1]. The closer S is to 1, the more similar the long text is to the reference text; the closer S is to -1, the more dissimilar they are. It can be understood that S may also be normalized to express the similarity in another way. Specifically, the similarity may be

(S + 1) / 2

so that when S is 0.5, the similarity is 75%.

Each cos<N, m_i> is calculated in the same way; for a first semantic vector m,

cos<N, m> = Σ_{j=1}^{k} N_j · m_j / ( sqrt(Σ_{j=1}^{k} N_j²) · sqrt(Σ_{j=1}^{k} m_j²) )

wherein k is the dimension of the vectors, N_j is the projection of the second semantic vector in the j-th dimension, and m_j is the projection of the first semantic vector in the j-th dimension. cos<N, m> is thus the cosine similarity between a sentence in the long text and the sentence in the reference text.
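Assuming the per-sentence weighted cosine similarities are averaged over the L sentences of the long text (a reading of the formula consistent with the surrounding definitions, not necessarily the only one), the calculation can be sketched as:

```python
import numpy as np

def cosine(a, b):
    """cos<a, b> = sum_j(a_j * b_j) / (||a|| * ||b||), a value in [-1, 1]."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def long_text_similarity(n, sentence_vectors, weights):
    """S = (1/L) * sum_i lambda_i * cos<N, m_i>, where n is the reference
    text's semantic vector and sentence_vectors are the L first semantic
    vectors of the long text."""
    L = len(sentence_vectors)
    return sum(w * cosine(n, m) for m, w in zip(sentence_vectors, weights)) / L

def normalize(s):
    """Map S from [-1, 1] to a percentage-style similarity: (S + 1) / 2,
    so S = 0.5 gives 0.75, i.e. 75%."""
    return (s + 1.0) / 2.0

# Two long-text sentences, both with weight 1, against a reference vector.
s = long_text_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
# s == 0.5; normalize(s) == 0.75
```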
It can be understood that, when determining the similarity of the long text relative to the reference text, a calculation model can be preset using the above formulas so as to determine the similarity quickly. So that the first semantic vectors, their corresponding weights, and the second semantic vector conform to the preset input format of the calculation model, the paragraphs of the long text are re-segmented after the paragraph type of each paragraph and the weight corresponding to the first semantic vectors in each paragraph have been determined. Each paragraph is limited to a preset first number of words, such as 512: when the number of words in a paragraph exceeds 512, the paragraph is split, i.e., re-segmented, so that each resulting paragraph contains at most 512 words. When the number of words in a paragraph is less than a preset second number, such as 256, the paragraph is filled with the paragraph immediately after it, so that its word count becomes greater than the second number and at most the first number. Paragraphs still shorter than the first number of words are padded with 0 so that the word count plus padding equals the first number.
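The re-segmentation rule just described (split paragraphs past the first number, fill short paragraphs from the following paragraph, pad with 0) can be sketched as follows. Small token counts are used for readability; the constants 512 and 256 are the example values from the text, and the function is an illustrative interpretation, not the application's exact procedure:

```python
MAX_TOKENS = 512   # preset first number
MIN_TOKENS = 256   # preset second number
PAD_ID = 0

def resegment(paragraphs, max_tokens=MAX_TOKENS, min_tokens=MIN_TOKENS):
    """paragraphs: list of token lists. Returns paragraphs of exactly
    max_tokens tokens each (padded with 0), none built from more than
    max_tokens real tokens."""
    # 1. Split over-long paragraphs into chunks of at most max_tokens.
    chunks = []
    for p in paragraphs:
        for i in range(0, len(p), max_tokens):
            chunks.append(list(p[i:i + max_tokens]))
    # 2. A chunk shorter than min_tokens borrows tokens from the chunk
    #    that follows it, never growing past max_tokens.
    out = []
    i = 0
    while i < len(chunks):
        cur = chunks[i]
        i += 1
        while len(cur) < min_tokens and i < len(chunks):
            need = max_tokens - len(cur)
            nxt = chunks[i]
            cur.extend(nxt[:need])
            if len(nxt) > need:
                chunks[i] = nxt[need:]
            else:
                i += 1
        out.append(cur)
    # 3. Pad every chunk with 0 up to max_tokens.
    return [c + [PAD_ID] * (max_tokens - len(c)) for c in out]
```

With max_tokens=4 and min_tokens=2, the paragraphs [1,2,3,4,5], [6], [7,8,9] become [1,2,3,4], [5,6,0,0], [7,8,9,0]: the first is split, the short remainder [5] borrows [6], and everything is padded to length 4.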
According to the long text semantic similarity matching method provided by the embodiment of the application, first word vectors corresponding to the long text and second word vectors corresponding to the reference text are obtained; the first and second word vectors are pooled respectively to obtain a first semantic vector for each sentence of the long text and a second semantic vector for the reference text; and the paragraph type of each paragraph is determined so as to determine the weight of each sentence of the long text. The semantic similarity between the long text and the reference text can thus be obtained, and because the obtained similarity is related to the paragraph types of the text, it is more accurate.
Referring to fig. 2, a possible implementation manner is further provided in an embodiment of the present application, where the preprocessing the long text and the reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to a sentence of the reference text respectively includes:
s201: and performing word segmentation processing on the long text and the reference text respectively through a preset word segmentation algorithm to obtain a plurality of first word segmentations corresponding to the long text and a plurality of second word segmentations corresponding to the reference text.
The word segmentation algorithm is not limited; for example, it may be the jieba word segmentation algorithm, NLPIR, and the like. The jieba segmentation algorithm and NLPIR are prior art and are not described in detail in this application. In the re-segmentation of the paragraphs of the long text described above, the number of words refers to the number of segmented words; that is, the number of segmented words in a paragraph cannot exceed the first number.
S202: and respectively carrying out vectorization processing on the first participle and the second participle through a preset vectorization model so as to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text.
The vectorization model comprises a Word2vec model, a Glove model or a Bert model. In the present application, the same vectorization model is used when the vectorization processing is performed on the first participle and the second participle, respectively. If a Word2vec model is adopted to carry out vectorization processing on the first participle and the second participle respectively, the obtained first Word vector and the second Word vector have consistent dimensions. Parameters in the Word2vec model, the Glove model or the Bert model are set as required, and are not limited in the application.
By performing word segmentation processing and vectorization processing on the long text and the reference text respectively, the subsequent determination of semantic vectors of the text and the determination of the weight of sentences in each paragraph can be facilitated, so that the similarity determination is more accurate.
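A minimal sketch of steps S201-S202, with a toy whitespace tokenizer and a hypothetical 2-dimensional embedding table standing in for jieba and a trained Word2vec/GloVe/BERT model. The point illustrated is the requirement above: both texts must pass through the same vectorization model so that the first and second word vectors have consistent dimensions:

```python
import numpy as np

# Hypothetical embedding table standing in for a trained vectorization model.
EMBEDDINGS = {
    "long":  np.array([0.1, 0.3]),
    "text":  np.array([0.2, 0.1]),
    "short": np.array([0.4, 0.0]),
}
UNK = np.array([0.0, 0.0])  # fallback for out-of-vocabulary tokens

def tokenize(sentence):
    # Whitespace split stands in for a real segmenter such as jieba.
    return sentence.split()

def vectorize(sentence):
    """Look every token up in the SAME embedding table, so the first and
    second word vectors come out with consistent dimensions."""
    return [EMBEDDINGS.get(tok, UNK) for tok in tokenize(sentence)]

first_vectors = vectorize("long text")    # word vectors of a long-text sentence
second_vectors = vectorize("short text")  # word vectors of the reference sentence
```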
Referring to fig. 3, an embodiment of the present application provides a long text semantic similarity matching apparatus 30, where the long text semantic similarity matching apparatus 30 may include:
a preprocessing module 301, configured to preprocess the long text and the reference text, respectively, to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text, respectively, and a plurality of second word vectors corresponding to one sentence of the reference text;
a pooling module 302, configured to pool the plurality of first word vectors and the plurality of second word vectors, respectively, so as to obtain a plurality of first semantic vectors respectively corresponding to the plurality of sentences of the long text, and a second semantic vector corresponding to the one sentence of the reference text;
a classification module 303, configured to determine a paragraph type of a paragraph included in the long text according to a preset entity recognition model and the plurality of first semantic vectors;
a weight module 304, configured to determine, according to the paragraph type, a weight corresponding to the first semantic vector in each paragraph;
a similarity calculation module 305, configured to calculate a similarity of the long text with respect to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector, the second semantic vector, and a preset algorithm model.
The long text semantic similarity matching device provided by the embodiment of the application obtains first word vectors corresponding to the long text and second word vectors corresponding to the reference text; pools the first and second word vectors respectively to obtain a first semantic vector for each sentence of the long text and a second semantic vector for the reference text; and determines the paragraph type of each paragraph so as to determine the weight of each sentence of the long text. The semantic similarity between the long text and the reference text can thus be obtained, and because the obtained similarity is related to the paragraph types of the text, it is more accurate.
The weight module 304 is specifically configured to determine a weight corresponding to the first semantic vector in each paragraph according to a preset type weight correspondence table and the paragraph type, where the type weight correspondence table includes a mapping relationship between the paragraph type and the weight.
The pooling module 302 is specifically configured to pool the plurality of first word vectors and the plurality of second word vectors through a maximum pooling layer or an average pooling layer.
Wherein, the preprocessing module 301 comprises:
the word segmentation unit is used for performing word segmentation processing on the long text and the reference text respectively through a preset word segmentation algorithm so as to obtain a plurality of first word segmentations corresponding to the long text and a plurality of second word segmentations corresponding to the reference text;
and the vectorization unit is used for respectively carrying out vectorization processing on the first participle and the second participle through a preset vectorization model so as to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text.
Referring to fig. 4, in an alternative embodiment, an electronic device 4000 is provided, which includes a processor 4001 and a memory 4003 coupled to each other, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004. Note that in practical applications the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that performs a computing function, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The memory 4003 may be a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing the application code that executes the scheme of the present application, and its execution is controlled by the processor 4001. The processor 4001 is configured to execute the application code stored in the memory 4003 to implement the content shown in the foregoing method embodiments.
Electronic devices include, but are not limited to, terminals and servers.
An embodiment of the present application provides an electronic device comprising a memory and a processor, with at least one program stored in the memory and executed by the processor to implement the corresponding aspects of the foregoing method embodiments. Compared with the prior art, the electronic device obtains first word vectors corresponding to a long text and second word vectors corresponding to a reference text, pools them to obtain a first semantic vector for each sentence of the long text and a second semantic vector for the reference text, and determines the paragraph type of each paragraph to weight each sentence of the long text; the resulting semantic similarity between the long text and the reference text is therefore related to the paragraph types of the text and is more accurate.
An embodiment of the present application provides a storage medium, namely a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to execute the corresponding content of the foregoing method embodiments. Compared with the prior art, the method obtains first word vectors corresponding to a long text and second word vectors corresponding to a reference text, pools them to obtain a first semantic vector for each sentence of the long text and a second semantic vector for the reference text, and determines the paragraph type of each paragraph to weight each sentence of the long text; the resulting semantic similarity is therefore related to the paragraph types of the text and is more accurate.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and whose execution order is not necessarily sequential; they may be executed in turn or in alternation with other steps, or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the present invention.

Claims (10)

1. A long text semantic similarity matching method, the method comprising:
respectively preprocessing a long text and a reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text, wherein the long text comprises a plurality of sentences, and the reference text comprises one sentence;
pooling the plurality of first word vectors and the plurality of second word vectors respectively to obtain a plurality of first semantic vectors corresponding to the plurality of sentences of the long text and a second semantic vector corresponding to the one sentence of the reference text;
inputting the plurality of first semantic vectors into a preset entity recognition model to determine paragraph types of paragraphs included in the long text;
determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type;
and calculating the similarity of the long text relative to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector and the second semantic vector.
2. The long text semantic similarity matching method according to claim 1, wherein the entity recognition model comprises any one of the following models:
long and short time memory LSTM-conditional random field CRF model; or
The BiLSTM-conditional random field CRF model is memorized in two directions and in long time.
3. The long text semantic similarity matching method according to claim 1, wherein the determining a weight corresponding to the first semantic vector in each paragraph according to the paragraph type comprises:
determining the weight corresponding to the first semantic vector in each paragraph according to a preset type weight corresponding table and the paragraph type, wherein the type weight corresponding table comprises a mapping relation between the paragraph type and the weight.
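The type-weight correspondence table of claim 3 can be sketched as a plain mapping lookup. The paragraph type names and weight values below are illustrative assumptions; the patent leaves the concrete table contents to the preset configuration.

```python
# Illustrative preset table mapping paragraph types to sentence weights.
TYPE_WEIGHTS = {"title": 1.5, "abstract": 1.2, "body": 1.0, "footnote": 0.3}

def sentence_weight(paragraph_type, default=1.0):
    """Look up the weight applied to every first semantic vector
    belonging to a paragraph of the given type."""
    return TYPE_WEIGHTS.get(paragraph_type, default)

print(sentence_weight("title"))    # 1.5
print(sentence_weight("unknown"))  # falls back to the default, 1.0
```

A default for unknown types keeps the lookup total, so sentences in unrecognized paragraphs still contribute to the similarity with a neutral weight.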
4. The long text semantic similarity matching method according to claim 1, wherein the similarity of the long text with respect to the reference text is calculated based on the first semantic vectors, the weights corresponding to the first semantic vectors, and the second semantic vector according to the following formula:

sim(N, M) = Σ_{i=1}^{L} λ_{m_i} · cos<N, m_i>

wherein N is the second semantic vector of the reference text, M is the set of first semantic vectors of the long text, L is the number of sentences of the long text, m_i is the first semantic vector of the i-th sentence in the long text, and λ_{m_i} is the weight corresponding to the first semantic vector of the i-th sentence in the long text;

wherein cos<N, m> is calculated based on the following formula:

cos<N, m> = (Σ_{j=1}^{k} N_j · m_j) / (√(Σ_{j=1}^{k} N_j²) · √(Σ_{j=1}^{k} m_j²))

wherein k is the dimension of the vectors, N_j is the projection of the second semantic vector in dimension j, and m_j is the projection of the first semantic vector in dimension j.
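The two formulas of claim 4 can be sketched directly in code. Since the published formula images are not reproduced in this text, the outer weighted sum, and in particular its normalization by the number of sentences L, is an assumption reconstructed from the variable definitions above; the inner cosine is the standard form.

```python
import math

def cosine(n, m):
    """cos<N, m>: dot product divided by the product of Euclidean norms."""
    dot = sum(nj * mj for nj, mj in zip(n, m))
    norm = (math.sqrt(sum(nj * nj for nj in n))
            * math.sqrt(sum(mj * mj for mj in m)))
    return dot / norm if norm else 0.0

def similarity(n_vec, m_vecs, weights):
    """Assumed form of claim 4: (1/L) * sum_i lambda_{m_i} * cos<N, m_i>."""
    L = len(m_vecs)  # number of sentences in the long text
    return sum(w * cosine(n_vec, m) for m, w in zip(m_vecs, weights)) / L

n = [1.0, 0.0]                                   # reference semantic vector
m = [[1.0, 0.0], [0.0, 1.0]]                     # two sentence vectors
print(similarity(n, m, [1.0, 1.0]))              # 0.5
```

With one sentence parallel to the reference and one orthogonal to it, equal weights yield an overall similarity of 0.5, matching the intuition of a per-sentence weighted average.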
5. The long text semantic similarity matching method according to claim 1, wherein the pooling of the plurality of first word vectors and the plurality of second word vectors respectively comprises:
and performing pooling processing on the plurality of first word vectors and the plurality of second word vectors through a maximum pooling layer or an average pooling layer respectively.
6. The method according to claim 1, wherein the preprocessing the long text and the reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to a sentence of the reference text respectively comprises:
performing word segmentation on the long text and the reference text respectively through a preset word segmentation algorithm to obtain a plurality of first segmented words corresponding to the long text and a plurality of second segmented words corresponding to the reference text;
and respectively vectorizing the first segmented words and the second segmented words through a preset vectorization model to obtain a plurality of first word vectors corresponding to the plurality of sentences of the long text and a plurality of second word vectors corresponding to the one sentence of the reference text.
7. The long text semantic similarity matching method according to claim 6, wherein the word segmentation algorithm comprises the Jieba segmentation algorithm, and the vectorization model comprises a Word2vec model, a GloVe model, or a BERT model.
8. A long text semantic similarity matching device, comprising:
the preprocessing module is used for respectively preprocessing the long text and the reference text to obtain a plurality of first word vectors corresponding to a plurality of sentences of the long text and a plurality of second word vectors corresponding to one sentence of the reference text;
the pooling module is used for respectively pooling the first word vectors and the second word vectors to obtain a plurality of first semantic vectors corresponding to the plurality of sentences of the long text and a second semantic vector corresponding to the one sentence of the reference text;
the classification module is used for determining paragraph types of paragraphs included in the long text according to a preset entity recognition model and the plurality of first semantic vectors;
the weighting module is used for determining the weight corresponding to the first semantic vector in each paragraph according to the paragraph type;
and the similarity calculation module is used for calculating the similarity of the long text relative to the reference text based on the first semantic vector, the weight corresponding to the first semantic vector, the second semantic vector and a preset algorithm model.
9. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors so as to perform the long text semantic similarity matching method according to any one of claims 1 to 7.
10. A storage medium, which is a computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the long text semantic similarity matching method according to any one of claims 1 to 7.
CN202011042061.6A 2020-09-28 2020-09-28 Long text semantic similarity matching method, device, electronic equipment and storage medium Active CN112183111B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011042061.6A CN112183111B (en) 2020-09-28 2020-09-28 Long text semantic similarity matching method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011042061.6A CN112183111B (en) 2020-09-28 2020-09-28 Long text semantic similarity matching method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112183111A true CN112183111A (en) 2021-01-05
CN112183111B CN112183111B (en) 2024-08-23

Family

ID=73943871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011042061.6A Active CN112183111B (en) 2020-09-28 2020-09-28 Long text semantic similarity matching method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112183111B (en)


Citations (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7356461B1 (en) * 2002-01-14 2008-04-08 Nstein Technologies Inc. Text categorization method and apparatus
CN101620596A (en) * 2008-06-30 2010-01-06 东北大学 Multi-document auto-abstracting method facing to inquiry
US20120072220A1 (en) * 2010-09-20 2012-03-22 Alibaba Group Holding Limited Matching text sets
CN107273391A (en) * 2016-04-08 2017-10-20 北京国双科技有限公司 Document recommends method and apparatus
CN107463605A (en) * 2017-06-21 2017-12-12 北京百度网讯科技有限公司 The recognition methods and device of low-quality News Resources, computer equipment and computer-readable recording medium
CN107679144A (en) * 2017-09-25 2018-02-09 平安科技(深圳)有限公司 News sentence clustering method, device and storage medium based on semantic similarity
CN108256539A (en) * 2016-12-28 2018-07-06 北京智能管家科技有限公司 Man-machine interaction method, interactive system and Intelligent story device based on semantic matches
CN108628825A (en) * 2018-04-10 2018-10-09 平安科技(深圳)有限公司 Text message Similarity Match Method, device, computer equipment and storage medium
CN109117474A (en) * 2018-06-25 2019-01-01 广州多益网络股份有限公司 Calculation method, device and the storage medium of statement similarity
CN109213999A (en) * 2018-08-20 2019-01-15 成都佳发安泰教育科技股份有限公司 A kind of subjective item methods of marking
CN109388786A (en) * 2018-09-30 2019-02-26 武汉斗鱼网络科技有限公司 A kind of Documents Similarity calculation method, device, equipment and medium
CN109977203A (en) * 2019-03-07 2019-07-05 北京九狐时代智能科技有限公司 Statement similarity determines method, apparatus, electronic equipment and readable storage medium storing program for executing
CN110033022A (en) * 2019-03-08 2019-07-19 腾讯科技(深圳)有限公司 Processing method, device and the storage medium of text
CN110134942A (en) * 2019-04-01 2019-08-16 北京中科闻歌科技股份有限公司 Text hot spot extracting method and device
CN110298035A (en) * 2019-06-04 2019-10-01 平安科技(深圳)有限公司 Word vector based on artificial intelligence defines method, apparatus, equipment and storage medium
US10459962B1 (en) * 2018-09-19 2019-10-29 Servicenow, Inc. Selectively generating word vector and paragraph vector representations of fields for machine learning
CN110399484A (en) * 2019-06-25 2019-11-01 平安科技(深圳)有限公司 Sentiment analysis method, apparatus, computer equipment and the storage medium of long text
CN110704621A (en) * 2019-09-25 2020-01-17 北京大米科技有限公司 Text processing method and device, storage medium and electronic equipment
CN110941961A (en) * 2019-11-29 2020-03-31 秒针信息技术有限公司 Information clustering method and device, electronic equipment and storage medium
CN110968664A (en) * 2018-09-30 2020-04-07 北京国双科技有限公司 Document retrieval method, device, equipment and medium
US20200125804A1 (en) * 2017-06-30 2020-04-23 Fujitsu Limited Non-transitory computer readable recording medium, semantic vector generation method, and semantic vector generation device
WO2020111314A1 (en) * 2018-11-27 2020-06-04 한국과학기술원 Conceptual graph-based query-response apparatus and method
CN111444700A (en) * 2020-04-02 2020-07-24 山东山大鸥玛软件股份有限公司 Text similarity measurement method based on semantic document expression


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255369A (en) * 2021-06-10 2021-08-13 平安国际智慧城市科技股份有限公司 Text similarity analysis method and device and storage medium
CN113255369B (en) * 2021-06-10 2023-02-03 平安国际智慧城市科技股份有限公司 Text similarity analysis method and device and storage medium
CN113553848A (en) * 2021-07-19 2021-10-26 北京奇艺世纪科技有限公司 Long text classification method, system, electronic equipment and computer readable storage medium
CN113553848B (en) * 2021-07-19 2024-02-02 北京奇艺世纪科技有限公司 Long text classification method, system, electronic device, and computer-readable storage medium
CN114741499A (en) * 2022-06-08 2022-07-12 杭州费尔斯通科技有限公司 Text abstract generation method and system based on sentence semantic model
WO2024098636A1 (en) * 2022-11-08 2024-05-16 华院计算技术(上海)股份有限公司 Text matching method and apparatus, computer-readable storage medium, and terminal
CN117235546A (en) * 2023-11-14 2023-12-15 国泰新点软件股份有限公司 Multi-version file comparison method, device, system and storage medium
CN117235546B (en) * 2023-11-14 2024-03-12 国泰新点软件股份有限公司 Multi-version file comparison method, device, system and storage medium
CN117688138A (en) * 2024-02-02 2024-03-12 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division

Also Published As

Publication number Publication date
CN112183111B (en) 2024-08-23

Similar Documents

Publication Publication Date Title
CN112183111B (en) Long text semantic similarity matching method, device, electronic equipment and storage medium
CN110298035B (en) Word vector definition method, device, equipment and storage medium based on artificial intelligence
CN106570141B (en) Approximate repeated image detection method
CN113849648B (en) Classification model training method, device, computer equipment and storage medium
CN112329460B (en) Text topic clustering method, device, equipment and storage medium
CN110941951B (en) Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
US20200364216A1 (en) Method, apparatus and storage medium for updating model parameter
CN112085091B (en) Short text matching method, device, equipment and storage medium based on artificial intelligence
CN111368037A (en) Text similarity calculation method and device based on Bert model
CN114064852A (en) Method and device for extracting relation of natural language, electronic equipment and storage medium
CN111611796A (en) Hypernym determination method and device for hyponym, electronic device and storage medium
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN114818986A (en) Text similarity calculation duplication-removing method, system, medium and equipment
CN113326383B (en) Short text entity linking method, device, computing equipment and storage medium
CN113239693A (en) Method, device and equipment for training intention recognition model and storage medium
CN113435531A (en) Zero sample image classification method and system, electronic equipment and storage medium
CN113988085B (en) Text semantic similarity matching method and device, electronic equipment and storage medium
CN116089586A (en) Question generation method based on text and training method of question generation model
CN106021346B (en) Retrieval processing method and device
CN113761934B (en) Word vector representation method based on self-attention mechanism and self-attention model
CN112528646B (en) Word vector generation method, terminal device and computer-readable storage medium
CN113177406B (en) Text processing method, text processing device, electronic equipment and computer readable medium
CN112579774B (en) Model training method, model training device and terminal equipment
CN114417891A (en) Reply sentence determination method and device based on rough semantics and electronic equipment
CN112507081A (en) Similar sentence matching method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant