CN113204629A - Text matching method and device, computer equipment and readable storage medium - Google Patents

Text matching method and device, computer equipment and readable storage medium

Info

Publication number
CN113204629A
Authority
CN
China
Prior art keywords
text
matched
sentence
target
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110603418.1A
Other languages
Chinese (zh)
Inventor
肖京
赵盟盟
王磊
杨怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110603418.1A priority Critical patent/CN113204629A/en
Publication of CN113204629A publication Critical patent/CN113204629A/en
Priority to PCT/CN2022/072189 priority patent/WO2022252638A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application belongs to the technical field of natural language processing, and provides a text matching method and apparatus, a computer device, and a readable storage medium. The method comprises: acquiring a target text and a text set to be matched corresponding to the target text; obtaining, through a trained BERT model, a first sentence vector corresponding to the target text and a second sentence vector corresponding to each text to be matched in the set; performing noise reduction on the first sentence vector and on each second sentence vector to obtain a noise-reduced first sentence vector and noise-reduced second sentence vectors; determining the matching degree between each text to be matched and the target text according to the noise-reduced first sentence vector and each noise-reduced second sentence vector; and determining, according to the matching degrees, a target matching text of the target text in the text set to be matched. The method can improve the matching precision of text matching.

Description

Text matching method and device, computer equipment and readable storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a text matching method and apparatus, a computer device, and a readable storage medium.
Background
Text matching is an important and fundamental area of natural language processing: a large number of NLP tasks, such as information retrieval, machine translation, and question answering systems, are based on text matching and are in essence text matching problems.
Among traditional text matching algorithms, the TF-IDF method based on word-frequency statistics is widely used because its principle is simple and easy to implement. Its main idea is to measure the importance of a word in a text by comparing the word's frequency in the sentence with its frequency across a given corpus, extract several keywords of the text to form a set, vectorize the keyword sets, and then compute their similarity. However, this method has limitations: it depends heavily on the corpus, ignores the interaction between words, and performs poorly on strongly interfering text data. For example, "machine learning" and "learning machine" share exactly the same words but express different meanings; the traditional TF-IDF method handles such cases poorly and with low accuracy.
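For reference, below is a minimal sketch of the TF-IDF baseline described above (not the method of this application), using scikit-learn; the example texts are hypothetical:

```python
# A minimal sketch of the TF-IDF baseline described above (not the patent's
# method). Assumes scikit-learn is installed; the example texts are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

target = "machine learning"
candidates = ["learning machine", "statistical machine learning"]

# Fit TF-IDF on the target plus candidates (the "corpus" here is just these texts).
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([target] + candidates)

# Cosine similarity of the target (row 0) against each candidate row.
scores = cosine_similarity(matrix[0], matrix[1:])[0]
for text, score in zip(candidates, scores):
    print(f"{score:.3f}  {text}")
# Note how "learning machine" scores 1.0 despite meaning something different --
# the word-overlap limitation this application is addressing.
```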
Disclosure of Invention
The present application mainly aims to provide a text matching method and apparatus, a computer device, and a readable storage medium, so as to solve the technical problem in the related art that text matching accuracy is not high.
In a first aspect, the present application provides a text matching method, including:
acquiring a target text and a text set to be matched corresponding to the target text;
respectively obtaining a first sentence vector corresponding to the target text and a second sentence vector corresponding to each text to be matched in the text set to be matched through a trained BERT model;
respectively carrying out noise reduction processing on the first sentence vectors and each second sentence vector to obtain noise reduction first sentence vectors and each noise reduction second sentence vector;
determining the matching degree of each text to be matched and the target text according to the noise-reduced first sentence vectors and each noise-reduced second sentence vector;
and determining a target matching text of the target text in the text set to be matched according to the matching degrees.
In a second aspect, the present application further provides a text matching apparatus, including:
the acquisition module is used for acquiring a target text and a text set to be matched corresponding to the target text;
an obtaining module, configured to obtain, through a trained BERT model, a first sentence vector corresponding to the target text and a second sentence vector corresponding to each text to be matched in the text set to be matched, respectively;
the denoising module is used for respectively denoising the first sentence vector and each second sentence vector to obtain a denoising first sentence vector and each denoising second sentence vector;
the first determining module is used for determining the matching degree of each text to be matched and the target text according to the noise-reduced first sentence vector and each noise-reduced second sentence vector;
and the second determining module is used for determining a target matching text of the target text in the text set to be matched according to each matching degree.
In a third aspect, the present application also provides a computer device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the text matching method as described above.
In a fourth aspect, the present application further provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the text matching method as described above.
The text matching method first acquires a target text and a text set to be matched corresponding to the target text; it then uses a trained BERT model to obtain a first sentence vector corresponding to the target text and a second sentence vector corresponding to each text to be matched in the set; next, it performs noise reduction on the first sentence vector and on each second sentence vector to obtain a noise-reduced first sentence vector and noise-reduced second sentence vectors; it further determines the matching degree between each text to be matched and the target text from these noise-reduced vectors; and finally, it determines the target matching text of the target text in the set according to the matching degrees. Unlike the TF-IDF method, the text matching method provided by the application is not affected by the corpus: the sentence vectors obtained from the BERT model for the target text and for each text to be matched represent the semantics of the texts more accurately, and the noise reduction applied to these sentence vectors further strengthens their feature representation capability. Even when the text set to be matched contains strongly interfering data, the method remains robust and can match synonymous texts quickly and accurately, improving matching precision with wider applicability and better robustness.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a text matching method according to the present application;
FIG. 2 is a schematic flow chart diagram illustrating another embodiment of a text matching method according to the present application;
FIG. 3 is a diagram illustrating an embodiment of a text matching method according to the present application, which relates to a text matching example;
fig. 4 is a schematic block diagram of a text matching apparatus according to an embodiment of the present application;
fig. 5 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiments of the present application provide a text matching method and apparatus, a computer device, and a readable storage medium. The text matching method is mainly applied on a text matching device, which may be any device with a data processing function, such as a mobile terminal, a personal computer (PC), a portable computer, or a server.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a flowchart illustrating a text matching method according to an embodiment of the present application.
As shown in fig. 1, the text matching method includes steps S101 to S105.
Step S101, a target text and a text set to be matched corresponding to the target text are obtained.
The target text is the text data for which matching is required: a sentence, or a combination of sentences, formed of words arranged in a specific semantic order. The target text may be of several types, such as Chinese, English, or mixed Chinese and English; the following description of text matching takes an English-type target text (denoted as A) as an example. The text set to be matched is the set of candidate texts to be matched against the target text, denoted (A1, ..., An).
For example, the target text and the text set to be matched may be acquired by the text matching device loading an input interface and obtaining the target text and text set to be matched entered by a user on that interface, or by receiving the target text and text set to be matched sent by another device.
Step S102, respectively obtaining, through a trained BERT (Bidirectional Encoder Representations from Transformers) model, a first sentence vector corresponding to the target text and a second sentence vector corresponding to each text to be matched in the text set to be matched.
The trained BERT (Bidirectional Encoder Representations from Transformers) model is a pre-trained end-to-end sentence-level language model that can directly produce a unique vector representation of a whole sentence. The sentence vector corresponding to the target text (defined as the first sentence vector) and the sentence vector corresponding to each text to be matched in the text set to be matched (defined as a second sentence vector) are obtained through the trained BERT model.
In some embodiments, step S102 specifically includes: respectively inputting the target text and each text to be matched in the text set to be matched into a trained BERT model for embedding operation to obtain a first sentence embedding vector corresponding to the target text and a second sentence embedding vector corresponding to each text to be matched; and respectively inputting the first sentence embedding vector and each second sentence embedding vector into a Transformer of the trained BERT model for coding operation and decoding operation to obtain a first sentence vector corresponding to the target text and a second sentence vector corresponding to each text to be matched.
The trained BERT model adopts a Transformer as its main model structure; the Transformer comprises several encoding (Encoder) layers and decoding (Decoder) layers.
Inputting the target text into a trained BERT model for embedding operation to obtain a sentence embedding vector (defined as a first sentence embedding vector) corresponding to the target text; and inputting the texts to be matched in the text set to be matched into the trained BERT model one by one for embedding operation to obtain sentence embedding vectors (defined as second sentence embedding vectors) corresponding to the texts to be matched.
In some embodiments, the inputting the target text and each text to be matched in the text set to be matched into a trained BERT model respectively to perform an embedding operation, to obtain a first sentence embedding vector corresponding to the target text and a second sentence embedding vector corresponding to each text to be matched specifically includes: respectively inputting the target text and each text to be matched into an embedding layer of the trained BERT model for embedding operation to obtain a word embedding vector, a word position information embedding vector and a phrase segmentation information embedding vector corresponding to the target text, and a word embedding vector, a word position information embedding vector and a phrase segmentation information embedding vector corresponding to each text to be matched; adding the word embedding vector, the position information embedding vector of the word and the phrase segmentation information embedding vector corresponding to the target text to obtain a first sentence embedding vector corresponding to the target text, and adding the word embedding vector, the position information embedding vector of the word and the phrase segmentation information embedding vector corresponding to each text to be matched to obtain a second sentence embedding vector corresponding to each text to be matched.
That is, the target text is input into the trained BERT model for the embedding operation. In its embedding layer, the trained BERT model embeds the target text from three angles: word embedding, word-position-information embedding, and phrase-segmentation-information embedding. This yields an embedding vector containing the word information, an embedding vector containing the word-position information, and an embedding vector containing the phrase-segmentation information; adding the three embedding vectors gives the first sentence embedding vector corresponding to the target text. The first sentence embedding vector therefore carries the words of the target text, their positions, and the phrase segmentation.

Similarly, each text to be matched in the text set to be matched is input into the trained BERT model one by one for the embedding operation; the same three embeddings are computed in the embedding layer and added, yielding the second sentence embedding vector corresponding to each text to be matched.
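A minimal sketch of this three-way embedding sum, mirroring BERT's input layer; the vocabulary size, sequence length, and hidden size below are standard BERT-base values used as assumptions, and a real model would load pre-trained weights:

```python
# A minimal sketch of the three-way embedding sum described above, mirroring
# BERT's input layer. Vocabulary size, sequence length, and hidden size are
# assumed (BERT-base defaults); a real model would load pre-trained weights.
import torch
import torch.nn as nn

vocab_size, max_len, num_segments, hidden = 30522, 128, 2, 768
tok_emb = nn.Embedding(vocab_size, hidden)      # word (token) embedding
pos_emb = nn.Embedding(max_len, hidden)         # word-position embedding
seg_emb = nn.Embedding(num_segments, hidden)    # phrase-segmentation embedding

token_ids = torch.tensor([[101, 2054, 2003, 2115, 5440, 3185, 102]])  # toy ids
positions = torch.arange(token_ids.size(1)).unsqueeze(0)
segments = torch.zeros_like(token_ids)          # single-sentence input

# Sentence embedding vector = per-token sum of the three embeddings.
sentence_embedding = tok_emb(token_ids) + pos_emb(positions) + seg_emb(segments)
print(sentence_embedding.shape)  # torch.Size([1, 7, 768])
```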
After the first sentence embedding vector corresponding to the target text is obtained, the trained BERT model inputs it into the Transformer: the Encoder layers perform deep encoding of the first sentence embedding vector, and the Encoder output is then input to the Decoder layers for deep decoding, yielding the first sentence vector corresponding to the target text.

Similarly, after the second sentence embedding vectors corresponding to the texts to be matched are obtained, the trained BERT model inputs each of them into the Transformer; deep encoding in the Encoder layers followed by deep decoding in the Decoder layers yields the second sentence vector corresponding to each text to be matched.
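For illustration, sentence vectors of this kind can be obtained as sketched below, using the Hugging Face transformers library with mean pooling over the encoder output; this is one common choice, not necessarily the patent's exact procedure:

```python
# A minimal sketch of obtaining sentence vectors with a trained BERT model.
# The patent does not name a library; mean pooling over the encoder output is
# an assumption here, one common way to get a single vector per sentence.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def sentence_vector(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)             # mean-pool to (768,)

first_vec = sentence_vector("What is your favorite movie?")    # target text
second_vec = sentence_vector("Which movie do you like best?")  # candidate
print(first_vec.shape, second_vec.shape)  # torch.Size([768]) twice
```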
Step S103, performing noise reduction on the first sentence vector and on each second sentence vector, respectively, to obtain a noise-reduced first sentence vector and noise-reduced second sentence vectors.
Noise reduction is then applied to the first sentence vector corresponding to the target text and to the second sentence vector corresponding to each text to be matched, producing the noise-reduced first sentence vector and the noise-reduced second sentence vectors. This filters out the noise implicit in the sentence vectors and strengthens the feature expression capability of the first sentence vector and of each second sentence vector.
In some embodiments, step S103 specifically includes: and performing low-pass filtering processing on the first sentence vectors and each second sentence vector respectively to obtain noise-reduced first sentence vectors and each noise-reduced second sentence vector.
The noise reduction of the first sentence vector corresponding to the target text and of the second sentence vector corresponding to each text to be matched may be low-pass filtering: low-pass filtering each vector yields the noise-reduced first sentence vector and the noise-reduced second sentence vectors, strengthening their feature expression capability. Low-pass filtering is a noise-filtering method whose rule is that low-frequency components pass through normally while high-frequency components above a set cutoff are blocked or attenuated.
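The patent does not fix a particular filter; below is a minimal sketch of one plausible realization, low-pass filtering the components of a sentence vector via the FFT, with a hypothetical cutoff fraction:

```python
# A minimal sketch of low-pass filtering a sentence vector, treating the
# vector's components as a 1-D signal. The cutoff fraction is a hypothetical
# parameter; the patent does not specify one.
import numpy as np

def low_pass(vec: np.ndarray, keep_fraction: float = 0.5) -> np.ndarray:
    spectrum = np.fft.rfft(vec)                 # frequency-domain representation
    cutoff = int(len(spectrum) * keep_fraction)
    spectrum[cutoff:] = 0                       # zero out high-frequency components
    return np.fft.irfft(spectrum, n=len(vec))   # back to the vector domain

rng = np.random.default_rng(0)
sentence_vec = rng.standard_normal(768)         # stand-in for a BERT sentence vector
denoised_vec = low_pass(sentence_vec)
print(denoised_vec.shape)  # (768,)
```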
Step S104, determining the matching degree between each text to be matched and the target text according to the noise-reduced first sentence vector and each noise-reduced second sentence vector.
After the noise-reduction first sentence vector corresponding to the target text and the noise-reduction second sentence vector corresponding to each text to be matched are obtained, the matching degree between each text to be matched and the target text is determined according to the noise-reduction first sentence vector corresponding to the target text and the noise-reduction second sentence vector corresponding to each text to be matched.
In some embodiments, as shown in fig. 2, step S104 may specifically include sub-step S1041 and sub-step S1042.
And a substep S1041 of calculating the similarity between each of the noise-reduced second sentence vectors and the noise-reduced first sentence vector, respectively.
That is, the similarity between the noise-reduced second sentence vector corresponding to each text to be matched and the noise-reduced first sentence vector corresponding to the target text is calculated respectively.
Illustratively, cosine similarity is used to measure the similarity between the noise-reduced second sentence vector corresponding to each text to be matched and the noise-reduced first sentence vector corresponding to the target text. The cosine similarity may be computed with the preset formula

$$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

where $x_i$ denotes the $i$-th component of the noise-reduced first sentence vector corresponding to the target text and $y_i$ the $i$-th component of the noise-reduced second sentence vector corresponding to a single text to be matched.
And a substep S1042 of determining the matching degree of each text to be matched and the target text according to the calculated similarity.
That is, the similarity between the noise-reduced second sentence vector corresponding to each text to be matched and the noise-reduced first sentence vector corresponding to the target text represents the matching degree between that text to be matched and the target text: the higher the similarity, the higher the matching degree.
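As a minimal sketch of this computation (a toy illustration with hypothetical three-dimensional vectors, not the patent's implementation):

```python
# A sketch of the matching degree as cosine similarity between a noise-reduced
# first sentence vector x and a noise-reduced second sentence vector y.
import numpy as np

def matching_degree(x: np.ndarray, y: np.ndarray) -> float:
    # cos(x, y) = (sum_i x_i * y_i) / (||x|| * ||y||)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([0.2, -0.5, 0.8])   # toy noise-reduced first sentence vector
y = np.array([0.1, -0.4, 0.9])   # toy noise-reduced second sentence vector
print(f"{matching_degree(x, y):.4f}")  # close to 1.0 for near-parallel vectors
```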
Step S105, determining a target matching text of the target text in the text set to be matched according to the matching degrees.
After the matching degree between each text to be matched and the target text is obtained, the target matching text of the target text can be determined in each text to be matched in the text set to be matched according to the matching degree between each text to be matched and the target text.
In some embodiments, step S105 specifically includes: sorting the matching degrees; and taking the text to be matched corresponding to the highest matching degree in the text set to be matched as the target matching text of the target text.
That is, the matching degrees between the noise-reduced second sentence vectors corresponding to the texts to be matched and the noise-reduced first sentence vector corresponding to the target text are sorted, and the text to be matched with the highest matching degree is determined as the target matching text, i.e. the candidate that maximizes sim(noise-reduced sentence vector of the target text, noise-reduced sentence vector of the text to be matched).
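A minimal sketch of this selection step follows, combining the hypothetical sentence_vector, low_pass, and matching_degree helpers from the earlier sketches; it is an illustration, not the patent's implementation:

```python
# A sketch of selecting the target matching text by highest matching degree.
# Assumes the sentence_vector, low_pass, and matching_degree helpers defined
# in the earlier sketches.
import numpy as np

def best_match(target: str, candidates: list[str]) -> str:
    target_vec = low_pass(sentence_vector(target).numpy())
    scores = [
        matching_degree(target_vec, low_pass(sentence_vector(c).numpy()))
        for c in candidates
    ]
    return candidates[int(np.argmax(scores))]  # candidate with highest degree
```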
For a better understanding of the above embodiments, reference is made to fig. 3, which illustrates the same in detail.
For example, for the target text A, "What is your favorite movie?", the corresponding text set to be matched includes candidates such as "What is your favorite opera?" and "What is your favorite spot?", which look very similar to the target text but have completely different semantics, as well as candidates such as "Which movie do you like best?", which differ greatly from the target text in grammatical structure but are synonymous with it. In the process of determining the target matching text, candidates that resemble the target text in form but differ in meaning act as interference data. Even with such interference data in the set, the sentence vectors obtained by embedding the target text A and each text to be matched with the BERT model represent the semantics of the texts well, and the noise reduction applied to the sentence vector of the target text and to the sentence vectors of the texts to be matched further strengthens their feature representation, so that the target text "What is your favorite movie?" can be accurately matched to its synonymous target matching text "Which movie do you like best?".
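As a usage illustration, the FIG. 3 example could be run with the hypothetical best_match sketch above (helper names are assumptions from the earlier sketches, not the patent's code):

```python
# Running the FIG. 3 example with the hypothetical best_match sketch above.
# Which candidate wins in practice depends on the trained model; the method
# aims to select the synonymous "Which movie do you like best?".
target = "What is your favorite movie?"
candidates = [
    "What is your favorite opera?",
    "What is your favorite spot?",
    "Which movie do you like best?",
]
print(best_match(target, candidates))
```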
Compared with the TF-IDF method, the text matching method provided by this embodiment considers not only word embeddings when obtaining a text's sentence vector from the BERT model, but also features such as word-position information and phrase segmentation in the sentence embedding stage, so the semantics of the whole text are represented better. In addition, a noise reduction operation is added after the BERT model; it filters out noise from the sentence vectors and strengthens their feature representation capability, improving matching precision when the data set contains strongly interfering data and giving the method wider applicability and better robustness. Moreover, compared with the traditional TF-IDF method, the model precision is less affected by the distribution of the corpus, fewer computing resources are consumed, and text matching tasks such as information retrieval and machine translation are completed better.
In some embodiments, before step S101, a trained BERT model is obtained, and specifically, before step S101, the method includes: acquiring a pre-training BERT model based on a Transformer; and training the pre-trained BERT model according to a preset training set so as to update the parameters of the pre-trained BERT model and obtain the trained BERT model.
First, a Transformer-based pre-trained BERT model is obtained through pre-training. The pre-trained BERT model is then trained on a preset training set comprising a number of text samples, updating the parameters of the pre-trained BERT model until the model converges, which yields the trained BERT model. The trained BERT model is an end-to-end language model whose input is a text and whose output is the sentence vector corresponding to that text; that is, the trained BERT model can produce the sentence vector of a text directly, which improves the convenience of obtaining text sentence vectors.
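A minimal sketch of this training step follows; the patent does not specify the training objective, so labeled sentence pairs with a cosine-embedding loss are assumed here as one common choice:

```python
# A minimal sketch of updating a pre-trained BERT model's parameters on a
# preset training set. The objective is an assumption: labeled sentence pairs
# with a cosine-embedding loss, which is one common choice.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")   # pre-trained BERT
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.CosineEmbeddingLoss()

# Hypothetical training set: (text_a, text_b, label), label 1 = match, -1 = no match.
train_set = [("What is your favorite movie?", "Which movie do you like best?", 1)]

for text_a, text_b, label in train_set:
    enc_a = tokenizer(text_a, return_tensors="pt")
    enc_b = tokenizer(text_b, return_tensors="pt")
    vec_a = model(**enc_a).last_hidden_state.mean(dim=1)   # (1, 768)
    vec_b = model(**enc_b).last_hidden_state.mean(dim=1)   # (1, 768)
    loss = loss_fn(vec_a, vec_b, torch.tensor([label]))
    loss.backward()           # update the pre-trained parameters
    optimizer.step()
    optimizer.zero_grad()
```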
In summary, the text matching method provided by this embodiment first acquires a target text and a text set to be matched corresponding to the target text; then uses a trained BERT model to obtain a first sentence vector corresponding to the target text and a second sentence vector corresponding to each text to be matched; then performs noise reduction on the first sentence vector and on each second sentence vector to obtain the noise-reduced first sentence vector and noise-reduced second sentence vectors; further determines the matching degree between each text to be matched and the target text from these noise-reduced vectors; and finally determines the target matching text of the target text in the set according to the matching degrees. Unlike the TF-IDF method, this method is not affected by the corpus: the sentence vectors obtained from the BERT model for the target text and for the texts to be matched represent the semantics of the texts better, and the noise reduction applied to the sentence vectors further strengthens their feature representation capability.
Referring to fig. 4, fig. 4 is a schematic block diagram of a text matching apparatus according to an embodiment of the present application.
As shown in fig. 4, the text matching apparatus 400 includes: an acquisition module 401, an obtaining module 402, a denoising module 403, a first determining module 404, and a second determining module 405.
An acquisition module 401, configured to acquire a target text and a text set to be matched corresponding to the target text;
an obtaining module 402, configured to obtain, through a trained BERT model, a first sentence vector corresponding to the target text and a second sentence vector corresponding to each text to be matched in the text set to be matched, respectively;
a denoising module 403, configured to perform denoising processing on the first sentence vector and each of the second sentence vectors respectively to obtain a denoising first sentence vector and each of the denoising second sentence vectors;
a first determining module 404, configured to determine, according to the noise-reduced first sentence vectors and the noise-reduced second sentence vectors, matching degrees between the texts to be matched and the target text;
a second determining module 405, configured to determine, according to each matching degree, a target matching text of the target text in the to-be-matched text set.
It should be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatus and each module and unit described above may refer to the corresponding processes in the foregoing text matching method embodiment, and are not described herein again.
The apparatus provided by the above embodiments may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 5.
Referring to fig. 5, fig. 5 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device may be a personal computer (PC), a server, or another device having a data processing function.
As shown in fig. 5, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any of the text matching methods.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by a processor, causes the processor to perform any of the text matching methods.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring a target text and a text set to be matched corresponding to the target text; respectively obtaining a first sentence vector corresponding to the target text and a second sentence vector corresponding to each text to be matched in the text set to be matched through a trained BERT model; respectively carrying out noise reduction processing on the first sentence vectors and each second sentence vector to obtain noise reduction first sentence vectors and each noise reduction second sentence vector; determining the matching degree of each text to be matched and the target text according to the noise-reduced first sentence vectors and each noise-reduced second sentence vector; and determining a target matching text of the target text in the text set to be matched according to the matching degrees.
In some embodiments, the processor is configured to, when obtaining, through a trained BERT model, a first sentence vector corresponding to the target text and a second sentence vector corresponding to each text to be matched in the text set to be matched, implement:
respectively inputting the target text and each text to be matched in the text set to be matched into a trained BERT model for embedding operation to obtain a first sentence embedding vector corresponding to the target text and a second sentence embedding vector corresponding to each text to be matched;
and respectively inputting the first sentence embedding vector and each second sentence embedding vector into a Transformer of the trained BERT model for coding operation and decoding operation to obtain a first sentence vector corresponding to the target text and a second sentence vector corresponding to each text to be matched.
In some embodiments, when the processor implements the denoising processing on the first sentence vector and each of the second sentence vectors, respectively, to obtain a denoised first sentence vector and each of the denoised second sentence vectors, the processor is configured to implement:
and performing low-pass filtering processing on the first sentence vectors and each second sentence vector respectively to obtain noise-reduced first sentence vectors and each noise-reduced second sentence vector.
In some embodiments, the processor is configured to, when determining the matching degree between each text to be matched and the target text according to the noise-reduced first sentence vector and each noise-reduced second sentence vector, implement:
respectively calculating the similarity between each noise-reduced second sentence vector and the noise-reduced first sentence vector;
and determining the matching degree of each text to be matched and the target text according to each similarity.
In some embodiments, the processor is configured to, when determining a target matching text of the target text in the set of texts to be matched according to each matching degree, implement:
sorting the matching degrees;
and taking the text to be matched corresponding to the highest matching degree in the text set to be matched as the target matching text of the target text.
In some embodiments, the processor is configured to implement that, when the target text and each text to be matched in the text set to be matched are respectively input into a trained BERT model to perform embedding operation, and a first sentence embedding vector corresponding to the target text and a second sentence embedding vector corresponding to each text to be matched are obtained, the processor is configured to implement:
respectively inputting the target text and each text to be matched into an embedding layer of the trained BERT model for embedding operation to obtain a word embedding vector, a word position information embedding vector and a phrase segmentation information embedding vector corresponding to the target text, and a word embedding vector, a word position information embedding vector and a phrase segmentation information embedding vector corresponding to each text to be matched;
adding the word embedding vector, the position information embedding vector of the word and the phrase segmentation information embedding vector corresponding to the target text to obtain a first sentence embedding vector corresponding to the target text, and adding the word embedding vector, the position information embedding vector of the word and the phrase segmentation information embedding vector corresponding to each text to be matched to obtain a second sentence embedding vector corresponding to each text to be matched.
In some embodiments, before the processor implements the obtaining of the target text and the text set to be matched corresponding to the target text, the following steps are implemented:
acquiring a pre-training BERT model based on a Transformer;
and training the pre-trained BERT model according to a preset training set so as to update the parameters of the pre-trained BERT model and obtain the trained BERT model.
Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, where the computer program includes program instructions, and a method implemented when the program instructions are executed may refer to various embodiments of the text matching method of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptography, in which each data block contains information on a batch of network transactions and serves to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A text matching method, characterized in that the method comprises the steps of:
acquiring a target text and a text set to be matched corresponding to the target text;
respectively obtaining, through a trained Bidirectional Encoder Representations from Transformers (BERT) model, a first sentence vector corresponding to the target text and a second sentence vector corresponding to each text to be matched in the text set to be matched;
respectively carrying out noise reduction processing on the first sentence vectors and each second sentence vector to obtain noise reduction first sentence vectors and each noise reduction second sentence vector;
determining the matching degree of each text to be matched and the target text according to the noise-reduced first sentence vectors and each noise-reduced second sentence vector;
and determining a target matching text of the target text in the text set to be matched according to the matching degrees.
2. The text matching method according to claim 1, wherein the obtaining, through the trained BERT model, the first sentence vector corresponding to the target text and the second sentence vector corresponding to each text to be matched in the text set to be matched respectively comprises:
respectively inputting the target text and each text to be matched in the text set to be matched into a trained BERT model for embedding operation to obtain a first sentence embedding vector corresponding to the target text and a second sentence embedding vector corresponding to each text to be matched;
and respectively inputting the first sentence embedding vector and each second sentence embedding vector into a Transformer of the trained BERT model for coding operation and decoding operation to obtain a first sentence vector corresponding to the target text and a second sentence vector corresponding to each text to be matched.
3. The text matching method of claim 1, wherein the denoising the first sentence vector and each of the second sentence vectors to obtain a denoised first sentence vector and each of the denoised second sentence vectors comprises:
and performing low-pass filtering processing on the first sentence vectors and each second sentence vector respectively to obtain noise-reduced first sentence vectors and each noise-reduced second sentence vector.
4. The text matching method according to claim 1, wherein the determining the matching degree between each text to be matched and the target text according to the noise-reduced first sentence vector and each noise-reduced second sentence vector comprises:
respectively calculating the similarity between each noise-reduced second sentence vector and the noise-reduced first sentence vector;
and determining the matching degree of each text to be matched and the target text according to each similarity.
5. The text matching method according to claim 1, wherein the determining the target matching text of the target text in the set of texts to be matched according to each matching degree comprises:
sorting the matching degrees;
and taking the text to be matched corresponding to the highest matching degree in the text set to be matched as the target matching text of the target text.
6. The text matching method according to claim 2, wherein the step of inputting the target text and each text to be matched in the text set to be matched into a trained BERT model for embedding to obtain a first sentence embedding vector corresponding to the target text and a second sentence embedding vector corresponding to each text to be matched comprises:
respectively inputting the target text and each text to be matched into an embedding layer of the trained BERT model for embedding operation to obtain a word embedding vector, a word position information embedding vector and a phrase segmentation information embedding vector corresponding to the target text, and a word embedding vector, a word position information embedding vector and a phrase segmentation information embedding vector corresponding to each text to be matched;
adding the word embedding vector, the position information embedding vector of the word and the phrase segmentation information embedding vector corresponding to the target text to obtain a first sentence embedding vector corresponding to the target text, and adding the word embedding vector, the position information embedding vector of the word and the phrase segmentation information embedding vector corresponding to each text to be matched to obtain a second sentence embedding vector corresponding to each text to be matched.
7. The text matching method according to claim 1, wherein before the obtaining of the target text and the set of texts to be matched corresponding to the target text, the method comprises:
acquiring a pre-training BERT model based on a Transformer;
and training the pre-trained BERT model according to a preset training set so as to update the parameters of the pre-trained BERT model and obtain the trained BERT model.
8. A text matching apparatus, characterized in that the text matching apparatus comprises:
the acquisition module is used for acquiring a target text and a text set to be matched corresponding to the target text;
an obtaining module, configured to obtain, through a trained BERT model, a first sentence vector corresponding to the target text and a second sentence vector corresponding to each text to be matched in the text set to be matched, respectively;
the denoising module is used for respectively denoising the first sentence vector and each second sentence vector to obtain a denoising first sentence vector and each denoising second sentence vector;
the first determining module is used for determining the matching degree of each text to be matched and the target text according to the noise-reduced first sentence vector and each noise-reduced second sentence vector;
and the second determining module is used for determining a target matching text of the target text in the text set to be matched according to each matching degree.
9. A computer arrangement, characterized in that the computer arrangement comprises a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the text matching method according to any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, wherein the computer program, when being executed by a processor, carries out the steps of the text matching method according to any one of claims 1 to 7.
CN202110603418.1A 2021-05-31 2021-05-31 Text matching method and device, computer equipment and readable storage medium Pending CN113204629A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110603418.1A CN113204629A (en) 2021-05-31 2021-05-31 Text matching method and device, computer equipment and readable storage medium
PCT/CN2022/072189 WO2022252638A1 (en) 2021-05-31 2022-01-14 Text matching method and apparatus, computer device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110603418.1A CN113204629A (en) 2021-05-31 2021-05-31 Text matching method and device, computer equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113204629A (en) 2021-08-03

Family

ID=77023971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110603418.1A Pending CN113204629A (en) 2021-05-31 2021-05-31 Text matching method and device, computer equipment and readable storage medium

Country Status (2)

Country Link
CN (1) CN113204629A (en)
WO (1) WO2022252638A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114077670A (en) * 2021-11-19 2022-02-22 深圳思为科技有限公司 Text labeling method and software product
WO2022252638A1 (en) * 2021-05-31 2022-12-08 平安科技(深圳)有限公司 Text matching method and apparatus, computer device and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160378769A1 (en) * 2015-06-23 2016-12-29 Microsoft Technology Licensing, Llc Preliminary ranker for scoring matching documents
CN110377714A (en) * 2019-07-18 2019-10-25 泰康保险集团股份有限公司 Text matching technique, device, medium and equipment based on transfer learning
CN111259113A (en) * 2020-01-15 2020-06-09 腾讯科技(深圳)有限公司 Text matching method and device, computer readable storage medium and computer equipment
CN112000805A (en) * 2020-08-24 2020-11-27 平安国际智慧城市科技股份有限公司 Text matching method, device, terminal and storage medium based on pre-training model
CN112183078A (en) * 2020-10-22 2021-01-05 上海风秩科技有限公司 Text abstract determining method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287494A (en) * 2019-07-01 2019-09-27 济南浪潮高新科技投资发展有限公司 A method of the short text Similarity matching based on deep learning BERT algorithm
CN111241242B (en) * 2020-01-09 2023-05-30 北京百度网讯科技有限公司 Method, device, equipment and computer readable storage medium for determining target content
CN111539212A (en) * 2020-04-13 2020-08-14 腾讯科技(武汉)有限公司 Text information processing method and device, storage medium and electronic equipment
CN113204629A (en) * 2021-05-31 2021-08-03 平安科技(深圳)有限公司 Text matching method and device, computer equipment and readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160378769A1 (en) * 2015-06-23 2016-12-29 Microsoft Technology Licensing, Llc Preliminary ranker for scoring matching documents
CN110377714A (en) * 2019-07-18 2019-10-25 泰康保险集团股份有限公司 Text matching technique, device, medium and equipment based on transfer learning
CN111259113A (en) * 2020-01-15 2020-06-09 腾讯科技(深圳)有限公司 Text matching method and device, computer readable storage medium and computer equipment
CN112000805A (en) * 2020-08-24 2020-11-27 平安国际智慧城市科技股份有限公司 Text matching method, device, terminal and storage medium based on pre-training model
CN112183078A (en) * 2020-10-22 2021-01-05 上海风秩科技有限公司 Text abstract determining method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022252638A1 (en) * 2021-05-31 2022-12-08 平安科技(深圳)有限公司 Text matching method and apparatus, computer device and readable storage medium
CN114077670A (en) * 2021-11-19 2022-02-22 深圳思为科技有限公司 Text labeling method and software product

Also Published As

Publication number Publication date
WO2022252638A1 (en) 2022-12-08

Similar Documents

Publication Publication Date Title
CN109165380B (en) Neural network model training method and device and text label determining method and device
KR20210151281A (en) Textrank based core sentence extraction method and device using bert sentence embedding vector
CN111368037A (en) Text similarity calculation method and device based on Bert model
CN112256822A (en) Text search method and device, computer equipment and storage medium
CN112528637A (en) Text processing model training method and device, computer equipment and storage medium
CN114417865B (en) Description text processing method, device and equipment for disaster event and storage medium
WO2022252638A1 (en) Text matching method and apparatus, computer device and readable storage medium
WO2020252935A1 (en) Voiceprint verification method, apparatus and device, and storage medium
CN112686049A (en) Text auditing method, device, equipment and storage medium
CN111061877A (en) Text theme extraction method and device
CN114491018A (en) Construction method of sensitive information detection model, and sensitive information detection method and device
CN110298038A (en) A kind of text scoring method and device
CN113886601A (en) Electronic text event extraction method, device, equipment and storage medium
CN111241843B (en) Semantic relation inference system and method based on composite neural network
CN111401034B (en) Semantic analysis method, semantic analysis device and terminal for text
WO2022022049A1 (en) Long difficult text sentence compression method and apparatus, computer device, and storage medium
CN116561298A (en) Title generation method, device, equipment and storage medium based on artificial intelligence
CN112307738A (en) Method and device for processing text
WO2023088278A1 (en) Method and apparatus for verifying authenticity of expression, and device and medium
CN112528646B (en) Word vector generation method, terminal device and computer-readable storage medium
CN111177378B (en) Text mining method and device and electronic equipment
CN113204965B (en) Keyword extraction method, keyword extraction device, computer equipment and readable storage medium
CN116306612A (en) Word and sentence generation method and related equipment
CN113724738A (en) Voice processing method, decision tree model training method, device, equipment and storage medium
CN113704452A (en) Data recommendation method, device, equipment and medium based on Bert model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40055796

Country of ref document: HK