CN110750977B - Text similarity calculation method and system - Google Patents

Info

Publication number
CN110750977B
Authority
CN
China
Prior art keywords
text
difference
feature vector
similarity
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911009970.7A
Other languages
Chinese (zh)
Other versions
CN110750977A (en)
Inventor
陈晓军
温周伏土
崔恒斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN201911009970.7A priority Critical patent/CN110750977B/en
Publication of CN110750977A publication Critical patent/CN110750977A/en
Application granted granted Critical
Publication of CN110750977B publication Critical patent/CN110750977B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3347 - Query execution using vector based model
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this specification disclose a text similarity calculation method and system. The method comprises the following steps: acquiring a first text and a second text, where a first edit distance exists between the first text and the second text and is smaller than a preset first edit distance threshold; extracting a first difference text and a second difference text according to the difference between the first text and the second text; extracting a corresponding first feature vector at least according to the first text and the second text; extracting a corresponding second feature vector at least according to the first difference text and the second difference text; obtaining a third feature vector based on the first feature vector and the second feature vector; and determining a similarity between the first text and the second text based on the third feature vector.

Description

Text similarity calculation method and system
Technical Field
One or more embodiments of the present disclosure relate to the field of natural language processing, and in particular, to a method and system for calculating text similarity.
Background
When existing models are used for sentence similarity matching, most work from a statistical point of view: a deep model learns from a corpus and is then used for prediction. However, when matching two sentences there is a common problem: when the sentences differ in only a few characters or words, most models judge the two sentences to be similar, yet in some cases those few differing characters or words change the meaning of the two sentences.
Therefore, it is desirable to provide a text similarity calculation method and system that can correctly recognize the similarity of two sentences when the edit distance between them is small.
Disclosure of Invention
One aspect of the embodiments of the present specification provides a text similarity calculation method. The method may include: acquiring a first text and a second text, where a first edit distance exists between the first text and the second text and is smaller than a preset first edit distance threshold; extracting a first difference text and a second difference text according to the difference between the first text and the second text; extracting a corresponding first feature vector at least according to the first text and the second text; extracting a corresponding second feature vector at least according to the first difference text and the second difference text; obtaining a third feature vector based on the first feature vector and the second feature vector; and determining a similarity between the first text and the second text based on the third feature vector.
Another aspect of the embodiments of the present description provides a text similarity calculation system, which may include: an acquisition module configured to acquire a first text and a second text, where a first edit distance exists between the first text and the second text and is smaller than a preset first edit distance threshold; a difference extraction module configured to extract a first difference text and a second difference text according to the difference between the first text and the second text; a first feature extraction module configured to extract a corresponding first feature vector at least according to the first text and the second text; a second feature extraction module configured to extract a corresponding second feature vector at least according to the first difference text and the second difference text; and a similarity determination module configured to obtain a third feature vector based on the first feature vector and the second feature vector, and to determine a similarity between the first text and the second text based on the third feature vector.
An aspect of embodiments of the present description provides a text similarity calculation device that may include at least one processor and at least one memory; the at least one memory is configured to store computer instructions; the at least one processor is configured to execute at least some of the computer instructions to implement a text similarity calculation method as described herein.
Drawings
The present specification will be further elucidated by way of example embodiments, which will be described in detail by means of the accompanying drawings. The embodiments are not limiting, in which like numerals represent like structures, wherein:
FIG. 1 is a block diagram of a text similarity calculation system according to some embodiments of the present description;
FIG. 2 is an exemplary flow chart of a text similarity calculation method according to some embodiments of the present description;
FIG. 3 is an exemplary block diagram of a text similarity model shown in accordance with some embodiments of the present description; and
FIG. 4 is an exemplary schematic diagram of extracting difference text according to some embodiments of the present description.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present specification, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only some examples or embodiments of the present specification, and it is possible for those of ordinary skill in the art to apply the present specification to other similar situations according to the drawings without inventive effort. Unless otherwise apparent from the context of the language or otherwise specified, like reference numerals in the figures refer to like structures or operations.
It will be appreciated that "system," "apparatus," "unit" and/or "module" as used herein is one method for distinguishing between different components, elements, parts, portions or assemblies of different levels. However, if other words can achieve the same purpose, the words can be replaced by other expressions.
As used in this specification and the claims, the singular forms "a," "an," and "the" do not denote a singular referent but may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; they do not constitute an exclusive list, and a method or apparatus may also include other steps or elements.
A flowchart is used in this specification to describe the operations performed by the system according to embodiments of the present specification. It should be appreciated that these operations are not necessarily performed precisely in the order shown. Rather, the steps may be processed in reverse order or simultaneously, and other operations may be added to or removed from these processes.
FIG. 1 is a block diagram of a text similarity calculation system according to some embodiments of the present description.
As shown in fig. 1, the text similarity calculation system may include an acquisition module 110, a difference extraction module 120, a first feature extraction module 130, a second feature extraction module 140, and a similarity determination module 150.
The acquisition module 110 may be configured to acquire the first text and the second text, where a first edit distance exists between the first text and the second text and is smaller than a preset first edit distance threshold. For a detailed description of acquiring the first text and the second text, see fig. 2; it is not repeated here.
The difference extraction module 120 may be configured to extract a first difference text and a second difference text according to the difference between the first text and the second text. For a detailed description of extracting the first difference text and the second difference text, see fig. 2; it is not repeated here.
The first feature extraction module 130 may be configured to extract a corresponding first feature vector at least according to the first text and the second text. For a detailed description of extracting the first feature vector, see fig. 2; it is not repeated here.
The second feature extraction module 140 may be configured to extract a corresponding second feature vector at least according to the first difference text and the second difference text. For a detailed description of extracting the second feature vector, see fig. 2; it is not repeated here.
The similarity determination module 150 may be configured to obtain a third feature vector based on the first feature vector and the second feature vector, and to determine a similarity between the first text and the second text based on the third feature vector. For a detailed description of determining the similarity, see fig. 2; it is not repeated here.
It should be understood that the system shown in fig. 1 and its modules may be implemented in a variety of ways. For example, in some embodiments, the system and its modules may be implemented in hardware, software, or a combination of the two. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special-purpose hardware. Those skilled in the art will appreciate that the methods and systems described above may be implemented using computer-executable instructions and/or processor control code, provided for example on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read-only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The system and its modules in one or more embodiments of this specification may be implemented not only in hardware circuitry such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices, but also in software executed by various types of processors, or in a combination of such hardware circuitry and software (e.g., firmware).
It should be noted that the above description of the text similarity calculation system and its modules is for convenience of description only and does not limit one or more embodiments of the present disclosure to the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, given the principles of the system, modules may be combined arbitrarily or a subsystem may be constructed to connect with other modules without departing from these principles. Two or more of the above modules may be combined into a single module that performs the functions of the combined modules, and a module may be split into different modules that perform different functions respectively. For example, in some embodiments, the acquisition module 110, the difference extraction module 120, the first feature extraction module 130, the second feature extraction module 140, and the similarity determination module 150 disclosed in fig. 1 may be different modules in one system, or one module may implement the functions of two or more of these modules. For example, the acquisition module 110 and the difference extraction module 120 may be two modules, or may be one module having both the acquisition and difference extraction functions. As another example, the modules may share one memory module, or each module may have its own memory module. Such variations are within the scope of one or more embodiments of the present description.
FIG. 2 is an exemplary flow chart of a text similarity calculation method according to some embodiments of the present description.
As shown in fig. 2, the text similarity calculation method may include steps 210, 220, 230, 240, and 250.
Step 210, acquiring a first text and a second text, where a first edit distance exists between the first text and the second text and is smaller than a preset first edit distance threshold. Specifically, this step may be implemented by the acquisition module 110.
In some embodiments, a text refers to a string of characters. The characters may include Chinese characters, letters, symbols, numbers, and other words. The symbols may include punctuation marks, line-feed characters, or other identification symbols. The first text refers to one text defined in this specification; "first" merely distinguishes it from other texts described later, such as the second text.
In some embodiments, the edit distance may be defined as the minimum number of operations needed to transform one text into the other using only three operations: "insert", "delete", and "replace". The smaller the edit distance, the closer the two texts are.
The first edit distance refers to the edit distance between the first text and the second text. For example, processing the first text "Beijing Provident Fund Center" into the second text "Shanghai Provident Fund Center" requires replacing "Beijing" with "Shanghai", so the first edit distance is 1.
The first edit distance threshold may be a value set in advance. When a deep learning model is used for sentence similarity matching, the model can accurately judge the similarity between two texts when the edit distance between them is large. However, when the edit distance between two texts is small, the model easily judges the two sentences to be similar, although in practice the result may be inaccurate. In some embodiments, two texts between which an edit distance exists and is smaller than the preset value may be taken as the first text and the second text. For example, if the first edit distance threshold is set to 3, two texts whose edit distance is smaller than 3 may be taken as the first text and the second text.
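The screening described above can be sketched as follows. This is a minimal dynamic-programming implementation of the edit distance; the threshold value 3 is taken from the example above, and `FIRST_EDIT_DISTANCE_THRESHOLD` and `is_candidate_pair` are hypothetical names, since the specification does not fix an actual value or API.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insert/delete/replace operations turning a into b."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                                # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost)  # replace (or keep)
    return dp[m][n]

# Hypothetical threshold; the example in the text uses 3.
FIRST_EDIT_DISTANCE_THRESHOLD = 3

def is_candidate_pair(text1: str, text2: str) -> bool:
    """True when a nonzero edit distance exists and is below the threshold."""
    d = edit_distance(text1, text2)
    return 0 < d < FIRST_EDIT_DISTANCE_THRESHOLD
```

Pairs passing this filter would then be processed by the remaining steps of the method.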
Step 220, extracting a first difference text and a second difference text according to the difference between the first text and the second text. Specifically, this step may be implemented by the difference extraction module 120.
In some embodiments, as shown in fig. 4, the first difference text may be the characters and/or words of the first text that differ from the second text. In particular, the first difference text may be characters and/or words that appear in the first text but do not appear in the second text. For example, comparing the first text "Beijing Provident Fund Center" with the second text "Shanghai Provident Fund Center", the differing word is "Beijing", so the first difference text is "Beijing".
In some embodiments, as shown in fig. 4, the second difference text may be the characters and/or words of the second text that differ from the first text, i.e., characters and/or words that appear in the second text but do not appear in the first text. For example, comparing the second text "Shanghai Provident Fund Center" with the first text "Beijing Provident Fund Center", the differing word is "Shanghai", so the second difference text is "Shanghai".
In some embodiments, obtaining the first difference text may further include expanding the differing text forward or backward. For example, comparing the first text "My cat is yellow" with the second text "My dog is yellow", the differing word is "cat"; expanding it forward and backward by one word each gives the first difference text "My cat is".
In some embodiments, obtaining the second difference text may further include expanding the differing text forward or backward. For example, comparing the second text "My dog is yellow" with the first text "My cat is yellow", the differing word is "dog"; expanding it forward and backward gives the second difference text "My dog is".
In some embodiments, if the differing characters and/or words between two sentences exist in only one of the sentences, only one difference text can be extracted in the above manner, so the other difference text must be represented using a set character.
In some embodiments, if no differing character and/or word exists in the first text compared with the second text, the first difference text may include at least one set character; likewise, if no differing character and/or word exists in the second text compared with the first text, the second difference text may include at least one set character. The set characters may include, but are not limited to: "#", "$", and the like.
For example, if the first text is "I like drinking coffee" and the second text is "I do not like drinking coffee", the first difference text may be "#" and the second difference text may be "not".
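The extraction of the two difference texts can be sketched at the character level with Python's standard `difflib`. The specification does not name the alignment algorithm, so `SequenceMatcher` here is an assumption; the `"#"` fallback mirrors the set-character rule above, and the forward/backward expansion of the "My cat is" example would be an additional step on top of this alignment.

```python
import difflib

def difference_texts(text1: str, text2: str, pad: str = "#"):
    """Return the characters unique to each text, falling back to a set
    character (here "#") when one side has no difference."""
    matcher = difflib.SequenceMatcher(None, text1, text2)
    diff1, diff2 = [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            diff1.append(text1[i1:i2])   # present in text1 only
        if op in ("replace", "insert"):
            diff2.append(text2[j1:j2])   # present in text2 only
    return "".join(diff1) or pad, "".join(diff2) or pad
```

For example, `difference_texts("my cat is yellow", "my dog is yellow")` yields `("cat", "dog")`.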
Step 230, extracting a corresponding first feature vector at least according to the first text and the second text. Specifically, this step may be implemented by the first feature extraction module 130.
In some embodiments, the first feature vector may refer to semantic features, represented as vectors, corresponding to the first text and the second text. Semantic features may be used to represent the meaning of a sentence and/or word; words, phrases, and sentences may all serve as semantic features of a text, and the semantic features may be represented in vector form. For example, if the first text is "I like drinking black tea" and the second text is "I like drinking green tea", the first feature vector may be a feature vector extracted from the first text and the second text using a deep learning model. In some embodiments, the first feature vector may also be based on other features, such as word features, statistical features, or combined features, without limitation in this specification.
In some embodiments, a text similarity model, such as a BERT model, may be used to extract the first feature vector corresponding to the first text and the second text. For a more detailed description of the text similarity model, see fig. 3; it is not repeated here.
Step 240, extracting a corresponding second feature vector at least according to the first difference text and the second difference text. Specifically, this step may be implemented by the second feature extraction module 140.
In some embodiments, the second feature vector may be a semantic feature, represented as a vector, corresponding to the first difference text and the second difference text. For example, if the first difference text is "black tea" and the second difference text is "green tea", the second feature vector may be a feature vector extracted from the two difference texts using a deep learning model.
In some embodiments, a text similarity model, such as a BERT model, may be used to extract the second feature vector corresponding to the first difference text and the second difference text. For a more detailed description of the text similarity model, see fig. 3; it is not repeated here.
Step 250, obtaining a third feature vector based on the first feature vector and the second feature vector; and determining a similarity between the first text and the second text based on the third feature vector. Specifically, it may be implemented by the similarity determination module 150.
In some embodiments, the first feature vector and the second feature vector may be linearly transformed to obtain the third feature vector. A linear transformation is a linear mapping of a linear space to itself, and a linear mapping is a mapping from one vector space to another. The linear transformation may include, but is not limited to: element-wise summation and end-to-end concatenation.
For example, a summation operation performed on the 100-dimensional first feature vector and the 100-dimensional second feature vector yields a 100-dimensional third feature vector. As another example, concatenating the 100-dimensional first feature vector and the 100-dimensional second feature vector end to end yields a 200-dimensional third feature vector.
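The two combinations just described can be sketched in plain Python; the 100-dimensional vectors here are placeholders standing in for the extracted feature vectors.

```python
def combine_sum(v1, v2):
    """Element-wise summation: the output keeps the input dimensionality."""
    return [a + b for a, b in zip(v1, v2)]

def combine_concat(v1, v2):
    """End-to-end concatenation: the output dimensionality is the sum."""
    return list(v1) + list(v2)

first_feature = [0.1] * 100   # hypothetical 100-dim first feature vector
second_feature = [0.2] * 100  # hypothetical 100-dim second feature vector

third_sum = combine_sum(first_feature, second_feature)        # 100 dimensions
third_concat = combine_concat(first_feature, second_feature)  # 200 dimensions
```

Either result can serve as the third feature vector passed to the output layer.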
Through the linear transformation of the above example, the weight of the few differing characters and/or words is added to the third feature vector, so that the model can pay attention to the influence of these few differing characters and/or words on the text semantics.
For example: the first text 'Mycat is yellow' corresponds to the first feature vector, and the second text 'Mydog is yellow' corresponds to the second feature vector. By the above linear transformation, the third feature vector contains both the first feature vector and the second feature vector. The semantic features corresponding to the words "cat" and "dog" in the third feature vector are weighted more heavily in the overall feature vector than the first feature vector.
In some embodiments, a text similarity model may be used to determine similarity between the first text and the second text based on the third feature vector. For more detailed description of the text similarity model, see fig. 3, which is not repeated here.
It should be noted that the above description of the process 200 is for purposes of example and illustration only and is not intended to limit the scope of applicability of one or more embodiments of the present disclosure. Various modifications and changes to flow 200 may be made by those skilled in the art in light of one or more embodiments of the present description. However, such modifications and variations are still within the scope of one or more embodiments of the present description. For example, step 220 and step 230 may exchange execution order.
FIG. 3 is an exemplary block diagram of a text similarity model shown in accordance with some embodiments of the present description.
In some embodiments, a deep learning model may be utilized to determine the similarity of two texts. The deep learning model may include, but is not limited to: the BERT model (Bidirectional Encoder Representations from Transformers), the Recurrent Neural Network (RNN) model, the Convolutional Neural Network (CNN) model, and the like.
In some embodiments, a text similarity model may be used to determine similarity between the first text and the second text.
The following description takes the BERT model as the text similarity model and the core layer (hidden layers) of the BERT model as the feature extraction layer.
The BERT model is built up of a number of Transformer Encoder layers. Each Transformer Encoder can be understood as a black box that converts the semantic vectors of the individual words in the input text into enhanced semantic vectors of the same length that incorporate the semantics of the entire context. For example, if the BERT model converts the input text into 100-dimensional semantic vectors, then when these are input to a Transformer Encoder, the Transformer Encoder outputs 100-dimensional semantic vectors enriched with contextual information.
After the core layer of BERT (a stack of Transformer Encoders), an output layer is added as required to process the extracted feature vectors, so that the model can be used for various natural language processing tasks. For example, adding a fully connected layer after the core layer allows text similarity determination.
As shown in fig. 3, the process of extracting text features using the feature extraction model includes:
step 310, inputting the first text and the second text into a BERT model, and extracting the first feature by using the BERT model.
The BERT model performs word segmentation and word embedding itself, so the input may be lightly preprocessed text. For example, if the first text is "Beijing Provident Fund Center" and the second text is "Shanghai Provident Fund Center", a separator [SEP] can be added between the two texts to obtain the spliced text "Beijing Provident Fund Center [SEP] Shanghai Provident Fund Center", which is used as the input of the BERT model.
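The splicing step above can be sketched as follows. This is a simplification of real BERT preprocessing, which would also prepend a [CLS] token and tokenize the spliced string; `splice_texts` is a hypothetical helper name.

```python
def splice_texts(text1: str, text2: str, sep: str = "[SEP]") -> str:
    """Join two texts with BERT's separator token, as in the
    'Beijing Provident Fund Center [SEP] Shanghai Provident Fund Center'
    example."""
    return f"{text1} {sep} {text2}"
```

The same helper applies in step 320 to the two difference texts.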
In some embodiments, the feature vector output by the feature extraction layer is taken as the first feature vector.
Step 320, inputting the first difference text and the second difference text extracted in step 220 into the BERT model, and extracting the second feature vector using the BERT model.
In some embodiments, for the first text and the second text described in step 310, the corresponding first difference text is "Beijing" and the second difference text is "Shanghai". Preprocessing the two difference texts gives the spliced text "Beijing [SEP] Shanghai", which is used as the input of the BERT model.
In some embodiments, the feature vector output by the feature extraction layer is taken as the second feature vector.
Step 330, linearly transforming the first feature vector and the second feature vector to obtain the third feature vector.
Regarding the linear transformation of the first feature vector and the second feature vector to obtain the third feature vector, see the description of fig. 2; it is not repeated here.
Step 340, inputting the third feature vector into the output layer of the BERT model, and obtaining the similarity output by the output layer.
In some embodiments, the third feature vector obtained in step 330 may be input into the output layer of the BERT model, which consists of fully connected layers, and the score output by the output layer may be used as the similarity between the first text and the second text.
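A minimal sketch of such an output head, assuming a single fully connected unit followed by a sigmoid that squashes the score into (0, 1); the specification only says the output layer consists of fully connected layers, so the shape and the `weights`/`bias` parameters here are illustrative placeholders for learned parameters.

```python
import math

def similarity_score(v3, weights, bias):
    """One fully connected unit plus a sigmoid: maps the third feature
    vector v3 to a similarity score in (0, 1). weights and bias stand in
    for the learned parameters of the BERT output layer."""
    logit = sum(w * x for w, x in zip(weights, v3)) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

A score near 1 would indicate the two texts are similar; a score near 0, dissimilar.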
Possible benefits of one or more embodiments of the present description include, but are not limited to: effectively improving the accuracy with which a deep learning model recognizes the similarity of two texts whose edit distance is small. It should be noted that different embodiments may produce different benefits; in different embodiments, the benefits may be any one or a combination of the above, or any other benefit that may be obtained.
While the basic concepts have been described above, it will be apparent to those skilled in the art that the foregoing detailed disclosure is intended to be presented by way of example only and is not limiting on one or more embodiments of the present specification. Although not explicitly stated herein, various modifications, improvements, and adaptations of one or more embodiments of the present specification may occur to those skilled in the art. Such modifications, improvements, and adaptations are suggested within one or more embodiments of the present specification and are therefore within the spirit and scope of the exemplary embodiments of this specification.
Meanwhile, one or more embodiments of the present specification use specific words to describe one or more embodiments of the present specification. Reference to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic is associated with at least one embodiment of the present description. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the present description may be combined as suitable.
Furthermore, those of skill in the art will appreciate that aspects of one or more embodiments of the specification may be illustrated and described in terms of several patentable categories or circumstances, including any novel and useful process, machine, product, or composition of matter, or any novel and useful improvement thereof. Accordingly, aspects of one or more embodiments of the present description may be implemented entirely in hardware, entirely in software (including firmware, resident software, microcode, etc.), or in a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of one or more embodiments of the present description may take the form of a computer product, comprising computer-readable program code, embodied in one or more computer-readable media.
The computer storage medium may contain a propagated data signal with the computer program code embodied therein, for example, in baseband or as part of a carrier wave. The propagated signal may take a variety of forms, including electromagnetic, optical, etc., or any suitable combination thereof. A computer storage medium may be any computer-readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code on a computer storage medium may be propagated through any suitable medium, including radio, electrical cable, fiber-optic cable, RF, or the like, or any combination of the foregoing.
Computer program code necessary for the operation of portions of one or more embodiments of the present disclosure may be written in any one or more programming languages, including object-oriented languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python; conventional procedural languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP; dynamic languages such as Python, Ruby, and Groovy; or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a local area network (LAN) or a wide area network (WAN); the connection may also be made to an external computer (for example, through the Internet) or provided as a service in a cloud computing environment, such as software as a service (SaaS).
Furthermore, the order in which elements and sequences are recited in one or more embodiments of the specification, and the use of numbers, letters, or other designations, are not intended to limit the order of the claimed processes and methods unless expressly recited in the claims. While the foregoing disclosure discusses, through various examples, certain embodiments presently considered useful, it is to be understood that such details are merely illustrative and that the appended claims are not limited to the disclosed embodiments but, on the contrary, are intended to cover all modifications and equivalent arrangements within the spirit and scope of one or more embodiments of the present disclosure. For example, while the system components described above may be implemented by hardware devices, they may also be implemented solely in software, such as by installing the described system on an existing server or mobile device.
Likewise, it should be noted that, to simplify the presentation of the disclosure of one or more embodiments of the present specification and thereby aid understanding of one or more inventive embodiments, various features are sometimes combined into a single embodiment, drawing, or description thereof. This method of disclosure, however, does not imply that the subject matter of this specification requires more features than are recited in the claims. Indeed, the claimed subject matter may lie in less than all features of a single embodiment disclosed above.
In some embodiments, numbers are used to describe quantities of components and attributes; it should be understood that such numbers used in the description of embodiments are, in some examples, modified by the qualifier "about," "approximately," or "substantially." Unless otherwise indicated, "about," "approximately," or "substantially" indicates that a variation of 20% in the stated number is allowed. Accordingly, in some embodiments, the numerical parameters set forth in the specification and claims are approximations that may vary depending on the desired properties sought by individual embodiments. In some embodiments, numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Although the numerical ranges and parameters used to confirm the breadth of the ranges in some embodiments are approximations, in particular embodiments such numerical values are set as precisely as practicable.
Each patent, patent application publication, and other material, such as articles, books, specifications, publications, documents, etc., referred to in this specification is incorporated herein by reference in its entirety. Excluded are application history documents that are inconsistent with or conflict with the content of this specification, as well as any documents, currently or later attached to this specification, that would limit the broadest scope of the claims of this specification. It is noted that if the description, definition, and/or use of a term in material attached to this specification is inconsistent with or conflicts with what is described in this specification, the description, definition, and/or use of the term in this specification controls.
Finally, it should be understood that the embodiments described in this specification are merely illustrative of the principles of the embodiments of this specification. Other variations are possible within the scope of this description. Thus, by way of example, and not limitation, alternative configurations of embodiments of the present specification may be considered as consistent with the teachings of the present specification. Accordingly, the embodiments of the present specification are not limited to only the embodiments explicitly described and depicted in the present specification.

Claims (19)

1. A text similarity calculation method, the method comprising:
acquiring a first text and a second text; wherein a first edit distance exists between the first text and the second text, and the first edit distance is smaller than a preset first edit distance threshold;
extracting a first difference text and a second difference text according to the difference between the first text and the second text; wherein the first difference text is the characters and/or words of the first text that differ when the first text is compared with the second text; the second difference text is the characters and/or words of the second text that differ when the second text is compared with the first text;
extracting corresponding first feature vectors at least according to the first text and the second text;
extracting a corresponding second feature vector at least according to the first difference text and the second difference text;
obtaining a third feature vector based on the first feature vector and the second feature vector; and determining a similarity between the first text and the second text based on the third feature vector.
2. The method of claim 1, wherein the first difference text and/or the second difference text further comprises difference text expanded forward or backward.
3. The method of claim 2, wherein if the first text is compared with the second text and there is no differing character and/or word, the first difference text includes at least one set character; and if the second text is compared with the first text and there is no differing character and/or word, the second difference text includes at least one set character.
4. The method of claim 1, wherein the extracting the corresponding first feature vector from at least the first text and the second text comprises:
inputting the first text and the second text into a text similarity model;
and obtaining, as the first feature vector, at least one vector output by a feature extraction layer of the text similarity model.
5. The method of claim 4, wherein the extracting the corresponding second feature vector from at least the first difference text and the second difference text comprises:
inputting the first difference text and the second difference text into the text similarity model;
and obtaining, as the second feature vector, at least one vector output by the feature extraction layer of the text similarity model.
6. The method of claim 5, wherein the determining a similarity between the first text and the second text based on the third feature vector comprises:
and inputting the third feature vector into an output layer of the text similarity model, and obtaining the similarity output by the output layer.
7. The method of claim 6, wherein the text similarity model is a BERT model.
8. The method of claim 6, wherein the obtaining a third feature vector based on the first feature vector and the second feature vector comprises:
and performing a linear transformation on the first feature vector and the second feature vector to obtain the third feature vector.
9. The method of claim 8, wherein the linear transformation comprises at least one of:
summing, and end-to-end concatenation (splicing).
10. A text similarity calculation system, the system comprising:
the acquisition module is used for acquiring the first text and the second text; wherein a first edit distance exists between the first text and the second text, and the first edit distance is smaller than a preset first edit distance threshold;
the difference extraction module is used for extracting a first difference text and a second difference text according to the difference between the first text and the second text; wherein the first difference text is the characters and/or words of the first text that differ when the first text is compared with the second text; the second difference text is the characters and/or words of the second text that differ when the second text is compared with the first text;
the first feature extraction module is used for extracting corresponding first feature vectors at least according to the first text and the second text;
the second feature extraction module is used for extracting a corresponding second feature vector at least according to the first difference text and the second difference text;
the similarity determining module is used for obtaining a third feature vector based on the first feature vector and the second feature vector; and determining a similarity between the first text and the second text based on the third feature vector.
11. The system of claim 10, wherein the first difference text and/or the second difference text further comprises difference text expanded forward or backward.
12. The system of claim 11, wherein if the first text is compared with the second text and there is no differing character and/or word, the first difference text includes at least one set character; and if the second text is compared with the first text and there is no differing character and/or word, the second difference text includes at least one set character.
13. The system of claim 10, wherein the extracting the corresponding first feature vector from at least the first text and the second text comprises:
inputting the first text and the second text into a text similarity model;
and obtaining, as the first feature vector, at least one vector output by a feature extraction layer of the text similarity model.
14. The system of claim 13, wherein the extracting the corresponding second feature vector from at least the first difference text and the second difference text comprises:
inputting the first difference text and the second difference text into the text similarity model;
and obtaining, as the second feature vector, at least one vector output by the feature extraction layer of the text similarity model.
15. The system of claim 14, wherein the determining a similarity between the first text and the second text based on the third feature vector comprises:
and inputting the third feature vector into an output layer of the text similarity model, and obtaining the similarity output by the output layer.
16. The system of claim 15, wherein the text similarity model is a BERT model.
17. The system of claim 15, wherein the obtaining a third feature vector based on the first feature vector and the second feature vector comprises:
and performing a linear transformation on the first feature vector and the second feature vector to obtain the third feature vector.
18. The system of claim 17, wherein the linear transformation comprises at least one of:
summing, and end-to-end concatenation (splicing).
19. A text similarity calculation device, wherein the device comprises at least one processor and at least one memory;
the at least one memory is configured to store computer instructions;
the at least one processor is configured to execute at least some of the computer instructions to implement the method of any one of claims 1-9.
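As a non-authoritative illustration of the edit-distance screening and difference-text extraction recited in claims 1 to 3, the following Python sketch uses only the standard library. `difflib.SequenceMatcher` stands in for whatever alignment the claimed implementation uses, and the `[PAD]` placeholder is a hypothetical choice for the "set character" of claim 3:

```python
import difflib

def edit_distance(a, b):
    # Levenshtein distance via one-row dynamic programming:
    # the minimum number of insertions, deletions, and
    # substitutions needed to turn a into b.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete ca
                                     dp[j - 1] + 1,    # insert cb
                                     prev + (ca != cb))  # substitute
    return dp[len(b)]

def difference_texts(a, b):
    # Collect the characters/words of each text that are NOT part
    # of a matching block, i.e. the parts that differ between the
    # two texts. When one side has no difference, return a set
    # character ("[PAD]" here, a hypothetical choice).
    sm = difflib.SequenceMatcher(None, a, b)
    diff_a, diff_b = [], []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            diff_a.append(a[i1:i2])
            diff_b.append(b[j1:j2])
    return "".join(diff_a) or "[PAD]", "".join(diff_b) or "[PAD]"

t1 = "how do I freeze my card"
t2 = "how do I unfreeze my card"
d = edit_distance(t1, t2)                      # small edit distance
first_diff, second_diff = difference_texts(t1, t2)
```

In the claimed method, texts passing the edit-distance threshold and their difference texts would then be fed to the text similarity model; the sketch covers only the preprocessing steps.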
CN201911009970.7A 2019-10-23 2019-10-23 Text similarity calculation method and system Active CN110750977B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911009970.7A CN110750977B (en) 2019-10-23 2019-10-23 Text similarity calculation method and system


Publications (2)

Publication Number Publication Date
CN110750977A CN110750977A (en) 2020-02-04
CN110750977B true CN110750977B (en) 2023-06-02

Family

ID=69279478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911009970.7A Active CN110750977B (en) 2019-10-23 2019-10-23 Text similarity calculation method and system

Country Status (1)

Country Link
CN (1) CN110750977B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368081A (en) * 2020-03-03 2020-07-03 支付宝(杭州)信息技术有限公司 Method and system for determining selected text content
CN111159415B (en) * 2020-04-02 2020-07-14 成都数联铭品科技有限公司 Sequence labeling method and system, and event element extraction method and system
CN111401076B (en) * 2020-04-09 2023-04-25 支付宝(杭州)信息技术有限公司 Text similarity determination method and device and electronic equipment
CN111858925B (en) * 2020-06-04 2023-08-18 国家计算机网络与信息安全管理中心 Script extraction method and device of telecommunication phishing event
CN111666755A (en) * 2020-06-24 2020-09-15 深圳前海微众银行股份有限公司 Method and device for recognizing repeated sentences
CN112528894B (en) * 2020-12-17 2024-05-31 科大讯飞股份有限公司 Method and device for discriminating difference term

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006139708A (en) * 2004-11-15 2006-06-01 Ricoh Co Ltd Text data similarity calculation method, text data similarity calculation apparatus, and text data similarity calculation program
CN103034627A (en) * 2011-10-09 2013-04-10 北京百度网讯科技有限公司 Method and device for calculating sentence similarity and method and device for machine translation
CN104850537A (en) * 2014-02-17 2015-08-19 腾讯科技(深圳)有限公司 Method and device for screening text content
CN108170684A (en) * 2018-01-22 2018-06-15 京东方科技集团股份有限公司 Text similarity computing method and system, data query system and computer product
WO2018153217A1 (en) * 2017-02-27 2018-08-30 芋头科技(杭州)有限公司 Method for determining sentence similarity

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101400944B1 (en) * 2012-10-08 2014-06-27 한국과학기술정보연구원 method of evaluating a value for a referenced information, apparatus thereof, storage medium for storing a program evaluating a value for a referenced information
CN106485193A (en) * 2015-09-02 2017-03-08 富士通株式会社 The direction detection device of file and picture and method
US10026020B2 (en) * 2016-01-15 2018-07-17 Adobe Systems Incorporated Embedding space for images with multiple text labels


Also Published As

Publication number Publication date
CN110750977A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
CN110750977B (en) Text similarity calculation method and system
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
US10664660B2 (en) Method and device for extracting entity relation based on deep learning, and server
US10650192B2 (en) Method and device for recognizing domain named entity
CN109670191B (en) Calibration optimization method and device for machine translation and electronic equipment
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
CN109408824B (en) Method and device for generating information
US11327971B2 (en) Assertion-based question answering
CN111291570A (en) Method and device for realizing element identification in judicial documents
CN107943911A (en) Data pick-up method, apparatus, computer equipment and readable storage medium storing program for executing
CN110532573A (en) A kind of interpretation method and system
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
CN104933158B (en) The training method and device of mathematical problem solving model, inference method and device
CN111858913A (en) Method and system for automatically generating text abstract
CN111046660B (en) Method and device for identifying text professional terms
CN113849623A (en) Text visual question answering method and device
CN113836303A (en) Text type identification method and device, computer equipment and medium
CN110866390B (en) Method and device for recognizing Chinese grammar error, computer equipment and storage medium
CN111368066A (en) Method, device and computer readable storage medium for acquiring dialogue abstract
CN112417093A (en) Model training method and device
Lysak et al. Optimized Table Tokenization for Table Structure Recognition
CN113705207A (en) Grammar error recognition method and device
CN115130437B (en) Intelligent document filling method and device and storage medium
CN110717029A (en) Information processing method and system
CN114398492B (en) Knowledge graph construction method, terminal and medium in digital field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant