CN113297353A - Text matching method, device, equipment and storage medium - Google Patents


Info

Publication number
CN113297353A
Authority
CN
China
Prior art keywords: text, text data, data, abstract, model
Prior art date
Legal status: Pending
Application number
CN202110667338.2A
Other languages
Chinese (zh)
Inventor
周楠楠
汤耀华
杨海军
徐倩
Current Assignee: WeBank Co Ltd
Original Assignee: WeBank Co Ltd
Priority date: 2021-06-16
Filing date: 2021-06-16
Application filed by WeBank Co Ltd
Priority to CN202110667338.2A
Publication of CN113297353A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/3331: Query processing
    • G06F 16/334: Query execution
    • G06F 16/34: Browsing; Visualisation therefor
    • G06F 16/345: Summarisation for human users

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a text matching method, apparatus, device, and storage medium applicable to the field of intelligent question answering. The method comprises: acquiring first text data to be matched; determining abstract data of the first text data through a text abstract model, the abstract data being the core data of the first text data; and determining a text matching result of the first text data from a text database according to the abstract data, the text matching result comprising the matching text data with the highest similarity to the first text data. The text abstract model is trained using a BERT model and an MLP. By performing data analysis on the feature vectors of the text data, the model fully understands its semantic information, so the abstract data of the first text data can be acquired accurately, providing data support for text matching and thereby improving the accuracy of text matching.

Description

Text matching method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a text matching method, apparatus, device, and storage medium.
Background
The intelligent question-answering system mainly comprises two parts, text matching and answer extraction. Text matching is an important link of the system: it matches the standard document passages corresponding to the question input by the user.
Current text matching schemes mainly match standard documents by keywords, so they can only match text that is literally identical to the query and cannot match text that expresses the same meaning in different words. The accuracy of text matching is therefore low, and the feedback effect of the question-answering system is poor.
Disclosure of Invention
The present disclosure is mainly directed to providing a text matching method, apparatus, device, and storage medium for improving the accuracy of text matching and thereby improving the feedback effect of a question-answering system.
In order to achieve the above object, in a first aspect, the present disclosure provides a text matching method, including:
acquiring first text data to be matched;
carrying out spoken language removal processing on the first text data through a text abstract model to obtain second text data, wherein the second text data is abstract data of the first text data, and the text abstract model is obtained by adopting a BERT model and multi-layer perceptron MLP training;
and determining a text matching result of the first text data from a text database according to the second text data.
In an optional embodiment of the present disclosure, the performing spoken language removal processing on the first text data through the text summarization model to obtain the second text data includes:
performing character-level segmentation on the first text data to obtain segmented first text data;
converting the segmented first text data into a digital sequence according to a preset dictionary;
and inputting the digital sequence into the text abstract model to obtain the second text data.
In an optional embodiment of the disclosure, the inputting the number sequence into the text summarization model to obtain the second text data includes:
inputting the number sequence into the BERT model of the text abstract model to obtain a first feature vector corresponding to each number in the number sequence;
inputting the first feature vector into the MLP of the text abstract model to obtain a second feature vector, wherein the second feature vector is used for indicating whether a text corresponding to each number in the number sequence is an abstract or not;
determining the second text data from a plurality of second feature vectors of the sequence of numbers.
In an optional embodiment of the present disclosure, the determining a text matching result of the first text data from a text database according to the second text data includes:
acquiring feature vectors of the second text data and the standard text data;
determining matched text data from the text database by calculating the similarity of the feature vectors of the second text data and the standard text data, wherein the similarity of the feature vectors of the matched text data and the second text data is greater than or equal to a preset threshold value;
and taking the matched text data as a text matching result of the first text data.
In an optional embodiment of the present disclosure, the obtaining the feature vectors of the second text data and the standard text data includes:
and acquiring the feature vectors of the second text data and the standard text data through the BERT model.
In an optional embodiment of the present disclosure, the training process of the text summarization model includes:
acquiring a text data sample and a labeling result of the text data sample, wherein the labeling result is used for indicating whether each text in the text data sample belongs to an abstract or not;
performing character-level segmentation on the text data samples, and converting the segmented text data samples into digital sequence samples according to a preset dictionary;
constructing an initial text abstract model;
taking the digital sequence sample corresponding to the text data sample as the input of the initial text abstract model, taking the labeling result of the text data sample as the output of the initial text abstract model, and training the initial text abstract model;
and when the loss function is converged, obtaining the trained text abstract model.
In a second aspect, the present disclosure provides a text matching apparatus, including:
the acquisition module is used for acquiring first text data to be matched;
the processing module is used for carrying out spoken language removal processing on the first text data through a text abstract model to obtain second text data, wherein the second text data are abstract data of the first text data, and the text abstract model is obtained by adopting a BERT model and multi-layer perceptron MLP training;
the processing module is further configured to determine a text matching result of the first text data from a text database according to the second text data.
In an optional embodiment of the present disclosure, the processing module is specifically configured to:
performing character-level segmentation on the first text data to obtain segmented first text data;
converting the segmented first text data into a digital sequence according to a preset dictionary;
and inputting the digital sequence into the text abstract model to obtain the second text data.
In an optional embodiment of the present disclosure, the processing module is specifically configured to:
inputting the number sequence into the BERT model of the text abstract model to obtain a first feature vector corresponding to each number in the number sequence;
inputting the first feature vector into the MLP of the text abstract model to obtain a second feature vector, wherein the second feature vector is used for indicating whether a text corresponding to each number in the number sequence is an abstract or not;
determining the second text data from a plurality of second feature vectors of the sequence of numbers.
In an optional embodiment of the present disclosure, the obtaining module is further configured to:
acquiring feature vectors of the second text data and the standard text data;
the processing module is specifically configured to determine matching text data from the text database by calculating similarity of feature vectors of the second text data and the standard text data, where the similarity of the feature vectors of the matching text data and the second text data is greater than or equal to a preset threshold;
and taking the matched text data as a text matching result of the first text data.
In an optional embodiment of the present disclosure, the obtaining module is specifically configured to: and acquiring the feature vectors of the second text data and the standard text data through the BERT model.
In an optional embodiment of the present disclosure, the obtaining module is further configured to obtain a text data sample and a labeling result of the text data sample, where the labeling result is used to indicate whether each text in the text data sample belongs to a summary;
the processing module is also used for carrying out character-level segmentation on the text data samples and converting the segmented text data samples into digital sequence samples according to a preset dictionary;
constructing an initial text abstract model;
taking the digital sequence sample corresponding to the text data sample as the input of the initial text abstract model, taking the labeling result of the text data sample as the output of the initial text abstract model, and training the initial text abstract model;
and when the loss function is converged, obtaining the trained text abstract model.
In a third aspect, the present disclosure provides an electronic device comprising:
memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the method according to any one of the first aspects of the disclosure.
In a fourth aspect, the present disclosure provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any one of the first aspects of the present disclosure.
In a fifth aspect, the present disclosure provides a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the first aspects of the present disclosure.
An embodiment of the present disclosure provides a text matching method, apparatus, device, and storage medium applicable to the field of intelligent question answering. The method comprises: acquiring first text data to be matched; determining abstract data of the first text data through a text abstract model, the abstract data being the core data of the first text data; and determining a text matching result of the first text data from a text database according to the abstract data, the text matching result comprising the matching text data with the highest similarity to the first text data. The text abstract model is trained using a BERT model and an MLP. By performing data analysis on the feature vectors of the text data, the model fully understands its semantic information, so the abstract data of the first text data can be acquired accurately, providing data support for text matching and thereby improving the accuracy of text matching.
Drawings
Fig. 1 is a schematic view of a first application scenario of the text matching method according to an embodiment of the present disclosure;
Fig. 2 is a schematic view of a second application scenario of the text matching method according to an embodiment of the present disclosure;
Fig. 3 is a first flowchart of the text matching method according to an embodiment of the present disclosure;
Fig. 4 is a second flowchart of the text matching method according to an embodiment of the present disclosure;
Fig. 5 is a schematic flowchart of a training method of the text abstract model according to an embodiment of the present disclosure;
Fig. 6 is a block diagram of a text matching apparatus according to an embodiment of the present disclosure;
Fig. 7 is a block diagram of an electronic device according to an embodiment of the present disclosure.
The implementation, functional features, and advantages of the objects of the present disclosure will be further explained with reference to the accompanying drawings.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terms "first," "second," and the like in the description, in the claims, and in the above-described figures of the disclosed embodiments are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in other sequences than described or illustrated herein.
It will be understood that the terms "comprises" and "comprising," and any variations thereof, as used herein, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the description of the embodiments of the present disclosure, the term "correspond" may indicate a direct or an indirect correspondence between two items, an association between them, or relationships such as indicating and being indicated or configuring and being configured.
The rapid development of Internet technology in recent years has brought a rapid increase of network information. Because so much of that information is repetitive, it is difficult for users to find the desired content accurately and quickly with a conventional search engine. In addition, since search requests vary from person to person, a user's search intention is difficult to express clearly with simple keywords, and a conventional search engine naturally struggles to find information that satisfies the request. Although keyword searching and matching is simple and easy to operate, it stays at the surface of the language and does not deeply mine syntactic and semantic information, so the retrieval efficiency of this search mode is not high.
The shift is to establish a connection between people and computers in the form of natural-language question answering: a user can express an intention to explore knowledge through dialogue, and can thus retrieve and locate, within large amounts of data and information, the information that matches his or her own retrieval intention.
The question answering (QA) system emerged against this background. An automatic question-answering system adopts natural language processing technology: it can process the user's question and provide the correct answer, so that the user can conveniently and quickly retrieve the content that meets his or her needs from massive Internet information.
One kind of question-answering system is based on a set of frequently asked questions: a certain number of question-answer pairs are stored in advance in the knowledge base (database) of the system. When the question input by the user matches a question sentence in the knowledge base, the system directly returns the answer to the user; when it does not, the system either cannot provide a correct answer or further collects the user's input to confirm the information again.
Fig. 1 is a schematic view of a first application scenario of the text matching method according to an embodiment of the present disclosure. As shown in Fig. 1, the scenario includes a server 11 and a terminal device 12. A user accesses the server 11 through the terminal device 12; the server 11 is provided with an intelligent question-answering system and a database corresponding to the question-answering system, and a certain number of question-answer pairs are stored in the database.
The server 11 may be a server of an e-commerce platform, a social platform, a search platform, or the like; the embodiment of the present disclosure does not limit this in any way. The terminal device 12 may be any device having a display function or a voice collection function, such as a smartphone, a smart television, a tablet computer, or a smartwatch.
Exemplarily, Fig. 2 is a schematic view of a second application scenario of the text matching method according to an embodiment of the present disclosure. As shown in Fig. 2, the scenario includes an intelligent robot comprising a voice recognition device and/or a touch display screen, as well as a communication device for connecting to the Internet. The intelligent robot can store a certain number of question-answer pairs locally in advance, and can therefore provide intelligent question-and-answer service to users whether networked or offline.
Based on either scenario, a user can input a question by text or by voice. The question-answering system on the server or the intelligent robot determines, through text matching, the standard question corresponding to the input from the database of the question-answering system, acquires the reply information of that standard question, and returns the reply to the user by voice or text.
At present, existing question-answering systems are mainly based on keyword matching technology: they can only match questions whose wording is literally identical, so the success rate of text matching is low. Furthermore, user input is often colloquial; because the semantic information of the input cannot be understood, the probability of text matching failure greatly increases and the feedback effect of the question-answering system is poor.
In view of the above problems, the embodiments of the present disclosure provide a text matching method that improves the matching accuracy of an intelligent question-answering system. Considering that user input is often colloquial, abstract data is first extracted from the acquired text data; that is, the spoken-language data is removed to obtain the keywords of the text. A matching text whose similarity to the abstract data is greater than a preset threshold is then determined from the knowledge base of the question-answering system through text-vector similarity calculation, the reply data corresponding to the matching text is acquired, and the reply data is finally fed back to the user.
To fully understand the semantics of the text data rather than simply delete words, a pre-trained text abstract model performs data processing on the feature vectors of the text data and determines its abstract data. The text abstract model can adopt the language model BERT: when processing a single token (a character or a word), BERT considers the tokens before and after it to obtain its meaning in context. The abstract determined by the text abstract model therefore accurately represents the real meaning of the user's input while the spoken text is eliminated, providing data support for the subsequent text matching.
The technical solution of the present disclosure is explained in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 3 is a first flowchart of a text matching method according to an embodiment of the present disclosure. The text matching method provided by the embodiment can be applied to the server shown in fig. 1 or the intelligent robot shown in fig. 2.
As shown in fig. 3, the method may include:
step 101, obtaining first text data to be matched.
In this embodiment, the first text data generally includes spoken-language text, i.e., words that do not affect the sentence meaning of the text, such as "okay", "then", "I feel", and the like.
In one example of the present disclosure, a user inputs first text data by means of text input. For example, a user accesses the server through a smartphone, entering a question "why my repayment display did not succeed" in the question and answer bar of the server page.
In one example of the disclosure, a user inputs voice data, and the first text data corresponding to the voice data is acquired through speech recognition. For example, when a user initiates a voice call facing the intelligent robot, the robot's voice recognition device acquires the voice data and converts it into text data.
Step 102, performing spoken language removal processing on the first text data through a text abstract model to obtain second text data, wherein the second text data is the abstract data of the first text data.
In this embodiment, the text abstract model is used to extract abstract data of the first text data.
The abstract data is the core data of the text data: the components, such as the subject, predicate, and object of a sentence, that play a key role in expressing the sentence's meaning and that can be used for sentence-meaning analysis.
The purpose of the spoken-language removal performed by the text abstract model of this embodiment is to remove the non-core data that plays an auxiliary role, or no role at all, in expressing the meaning of the sentence. The non-core data includes spoken-language data. Removing the non-core data, such as the attributives and adverbials in the sentence, does not affect what the sentence expresses. It should be noted that not all spoken-language data is non-core data; the model must comprehensively analyze each word in the sentence together with the words before and after it.
Illustratively, the first text data is the earlier example "why my repayment display did not succeed", and the second text data obtained through the data processing of the text abstract model is "repayment unsuccessful". The de-colloquialization performed by the text abstract model thus eliminates the non-core text, including spoken-language words such as "why", "my", and "show".
Optionally, in some examples, before the first text data is subjected to the spoken language removal processing by the text summarization model, text preprocessing such as punctuation removal, case conversion, segmentation, and the like is further included.
In one example of the present disclosure, the text summarization model is trained by using a BERT model and a Multi-Layer perceptron (MLP).
The BERT model is a language model in the Natural Language Processing (NLP) field, and may be used to encode text data to obtain a numerical feature vector corresponding to the text data. The MLP comprises an input layer, a hidden layer and an output layer, and the different layers of the MLP are fully connected. The input of the BERT model is used as the input of the text abstract model, the output of the BERT model is used as the input of the MLP, and the output of the MLP is used as the output of the text abstract model.
It should be noted that, before the first text data is input into the text abstract model, the Chinese and/or English words in the first text data are converted into machine-recognizable natural-language-processing input, i.e., a number sequence. The number sequence is then input into the BERT model of the text abstract model, and the abstract data of the first text data is obtained through the MLP of the text abstract model.
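As a concrete illustration of this architecture, the following is a minimal sketch of such a text abstract model in Python. The use of PyTorch, the HuggingFace transformers library, and the bert-base-chinese checkpoint are assumptions made for the example; the patent does not prescribe an implementation.

```python
# Minimal sketch of the text abstract model: a BERT encoder whose per-token
# outputs feed a fully connected MLP head. Library and checkpoint choices
# (PyTorch, transformers, bert-base-chinese) are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import BertModel

class TextAbstractModel(nn.Module):
    def __init__(self, bert_name: str = "bert-base-chinese", hidden: int = 256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)   # encoder
        dim = self.bert.config.hidden_size                 # e.g. 768
        self.mlp = nn.Sequential(                          # input/hidden/output layers
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),   # per token: [not-abstract, abstract] scores
        )

    def forward(self, input_ids, attention_mask=None):
        # First feature vectors: one per number in the input sequence.
        first = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        # Second feature vectors: an L x 2 score matrix per sequence.
        return self.mlp(first)
```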
Step 103, determining a text matching result of the first text data from the text database according to the second text data.
In this embodiment, the text matching result is used to indicate the text data in the text database that matches the first text data. Specifically, the text matching result includes matching text data with the highest similarity to the first text data in the text database.
In one example of the present disclosure, the text matching result of the first text data may be determined from the text database according to the second text data by means of vector-similarity calculation on the text data. Specifically, the text matching result can be determined through the following steps:
and step 1031, obtaining feature vectors of the second text data and the standard text data.
Specifically, the feature vectors of the second text data and the standard text data can be obtained through the BERT model. It should be understood that neither the second text data nor the standard text data contains spoken text. Therefore, the second text data (or the standard text data) can be directly converted into a machine-recognizable number sequence through the preset dictionary of the BERT model, and the number sequence is then input into the BERT model to obtain the feature vector corresponding to the second text data (or to the standard text data).
The preset dictionary of the BERT model includes the correspondence between texts and numbers. Illustratively, for the text data "payment unsuccessful", each character of the text corresponds to one number, e.g., the sequence 42, 43, 24, 25, 12, 13.
It should be noted that the preset dictionary includes a corresponding relationship between an english text and a number in addition to a corresponding relationship between a chinese text and a number.
For an English text, a Byte Pair Encoding (BPE) algorithm may be used to split an English word into several parts, each of which corresponds to a number. Illustratively, the English word "inaffable" can be split into three parts, ["in", "##aff", "##able"], which correspond to the numbers 3, 9, and 7 respectively.
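The greedy longest-match segmentation behind such subword splitting can be sketched as follows; the three-piece toy vocabulary is an assumption chosen purely to reproduce the example above.

```python
def subword_split(word: str, vocab: set) -> list:
    """Greedy longest-match segmentation into subword units; continuation
    pieces carry the '##' prefix, as in the example above."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:            # no vocabulary piece matched
            return ["[UNK]"]
        start = end
    return pieces

print(subword_split("inaffable", {"in", "##aff", "##able"}))
# ['in', '##aff', '##able']
```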
As can be seen from the above description, the BERT model is used to generate the feature vectors corresponding to text data. The feature vectors of the text data include the feature vector of each word, and the feature vector of each word includes at least one of a word vector, a part-of-speech vector, and a position vector.
The word vector is obtained by vectorizing each word; the part-of-speech vector may represent the part-of-speech of each word, such as a vector corresponding to the part-of-speech of verbs, nouns, etc.; the position vector may represent the position of each word in the text.
For example, the feature vector of the i-th word $X_i$ can be written as:

$$E_{X_i} = E_{w_i} + E_{t_i} + E_{p_i}$$

where $E_{w_i}$ denotes the word vector of the word $X_i$, $E_{t_i}$ denotes its part-of-speech vector, and $E_{p_i}$ denotes its position vector.
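In code, these three components are simply rows of three embedding tables summed per token. The sketch below uses PyTorch; the table sizes are assumptions for illustration, since the patent does not specify the dictionary size, tag set, or fixed sequence length.

```python
import torch
import torch.nn as nn

# Illustrative sizes only: the real values depend on the preset dictionary,
# the part-of-speech tag set, and the fixed input length.
VOCAB_SIZE, NUM_POS_TAGS, MAX_LEN, DIM = 21128, 16, 128, 768

word_emb = nn.Embedding(VOCAB_SIZE, DIM)     # E_w: word vectors
tag_emb = nn.Embedding(NUM_POS_TAGS, DIM)    # E_t: part-of-speech vectors
pos_emb = nn.Embedding(MAX_LEN, DIM)         # E_p: position vectors

def token_vector(word_id: int, tag_id: int, position: int) -> torch.Tensor:
    """E_{X_i} = E_{w_i} + E_{t_i} + E_{p_i} for one token."""
    i = lambda n: torch.tensor([n])
    return word_emb(i(word_id)) + tag_emb(i(tag_id)) + pos_emb(i(position))
```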
Step 1032, determining matched text data from the text database by calculating the similarity of the feature vectors of the second text data and the standard text data.
In one example of the present disclosure, the similarity of the feature vectors of the second text data and the standard text data may be determined by calculating cosine similarity between the feature vectors.
In an example of the present disclosure, the similarity of the feature vector of the second text data and the standard text data may also be determined by a calculation formula such as euclidean distance, manhattan distance, and the like.
In one example of the present disclosure, the similarity of the feature vectors of the matching text data and the second text data is greater than or equal to a preset threshold.
Alternatively, in an example, if there are a plurality of standard text data having similarity greater than or equal to a preset threshold (for example, 0.8) with the feature vector of the second text data, the standard text data having the similarity value closest to 1 may be used as the matching text data.
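Combining the cosine-similarity measure with the threshold rule just described, a minimal matching routine might look as follows. The 0.8 threshold is taken from the example above; the function shape itself is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def match(query_vec: torch.Tensor, standard_vecs: torch.Tensor, threshold: float = 0.8):
    """query_vec: (D,) feature vector of the second text data;
    standard_vecs: (N, D) feature vectors of the standard texts.
    Returns the index of the matching standard text, or None if no
    candidate reaches the preset threshold."""
    sims = F.cosine_similarity(query_vec.unsqueeze(0), standard_vecs)  # (N,)
    best = int(torch.argmax(sims))            # similarity value closest to 1
    return best if float(sims[best]) >= threshold else None
```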
Step 1033, the matching text data is used as the text matching result of the first text data.
Optionally, after the text matching result of the first text data is determined, the reply information corresponding to the matched text data may be acquired from the text database according to the text matching result. If the execution subject is the server, the server may send the reply message to the terminal device of the user. If the execution subject of this embodiment is an intelligent robot, the intelligent robot may feed back reply information to the user by voice or text, for example, the reply information is broadcasted by voice, or the reply information is displayed on the display.
Optionally, in some embodiments, other pre-training models may be used to replace the BERT model, for example, a GPT model, an XLNet model, and the like, which is not limited in this embodiment.
In the text matching method provided in the above embodiment, first text data to be matched is acquired; the abstract data of the first text data, i.e., its core data, is determined through the text abstract model; and the text matching result of the first text data, comprising the matching text data with the highest similarity to the first text data, is determined from the text database according to the abstract data. The text abstract model of this embodiment is trained using a BERT model and an MLP; by performing data analysis on the feature vectors of the text data, it fully understands their semantic information, so the abstract data of the first text data can be acquired accurately, providing data support for text matching and further improving the accuracy of text matching.
On the basis of the above embodiment, the following describes in detail the internal processing procedure of the text abstract model of the above embodiment by a specific embodiment.
Fig. 4 is a flowchart illustrating a second text matching method according to an embodiment of the present disclosure. The method of the present embodiment is equally applicable to the server shown in fig. 1 or the intelligent robot shown in fig. 2.
As shown in fig. 4, the method may include:
step 201, performing character level segmentation on the first text data to obtain segmented first text data.
Step 202, converting the segmented first text data into a digital sequence according to a preset dictionary.
In this embodiment, before the character-level segmentation is performed on the first text data, start and end tags need to be added to it. Specifically, a [CLS] tag is set at the beginning of the first text data and a [SEP] tag at its end; character-level segmentation is then performed, and the segmented first text data is converted into a number sequence according to the preset dictionary.
Illustratively, the first text data is "how to make payment setting", and after adding start and end tags and performing character-level segmentation, "[ CLS ] how to make payment setting [ SEP ]" can be obtained.
The preset dictionary comprises the correspondence between texts and numbers, and also the numbers corresponding to the start and end tags; for example, [CLS] corresponds to the number 101 and [SEP] corresponds to the number 102. Illustratively, the "[CLS] how to make payment setting [SEP]" of the above example can be converted through the preset dictionary into the number sequence: [101, 35, 36, 61, 62, 54, 55, 81, 82, 102].
Optionally, in some examples, before the character-level segmentation is performed on the first text data, text processing such as punctuation removal and case conversion is further included. Punctuation marks contribute little to semantic features and should be removed. Case conversion mainly concerns English text, e.g., converting "GOOD" to "good", so that feature expression is processed uniformly.
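Steps 201 and 202 can be sketched as a single preprocessing routine. The toy dictionary below is an assumption that reproduces the structure of the example sequence (101 for [CLS], 102 for [SEP], one number per character of the Chinese source text of "how to make payment setting"); real BERT vocabularies use different numbers.

```python
import string

def text_to_ids(text: str, dictionary: dict) -> list:
    """Strip punctuation (ASCII only in this sketch), lower-case, wrap the
    text with [CLS]/[SEP], split at character level, and map each token to
    its number via the preset dictionary."""
    text = "".join(c for c in text.lower() if c not in string.punctuation)
    tokens = ["[CLS]"] + list(text) + ["[SEP]"]
    return [dictionary[t] for t in tokens]

# Toy dictionary reproducing the structure of the example above; the
# per-character numbers are assumptions, not values from a real vocabulary.
demo_dict = {"[CLS]": 101, "[SEP]": 102, "如": 35, "何": 36, "进": 61,
             "行": 62, "支": 54, "付": 55, "设": 81, "置": 82}
print(text_to_ids("如何进行支付设置", demo_dict))
# [101, 35, 36, 61, 62, 54, 55, 81, 82, 102]
```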
Step 203, inputting the number sequence into a BERT model of the text abstract model to obtain a first feature vector corresponding to each number in the number sequence.
In this embodiment, the dimension of the first feature vector is determined by the model parameters of the constructed BERT model. For example, each number in the number sequence corresponds to a 768-dimensional, or a 1024-dimensional, feature vector.
Step 204, inputting the first feature vector into the MLP of the text abstract model to obtain a second feature vector, wherein the second feature vector is used for indicating whether the text corresponding to each number in the number sequence belongs to the abstract.
A plurality of first feature vectors corresponding to the number sequence is obtained through step 203. These first feature vectors are input into the MLP of the text abstract model, giving a plurality of second feature vectors, each of which is a two-dimensional vector. The second feature vectors of the number sequence can be regarded as an L × 2 matrix, where L is the length of the number sequence (including the [CLS] and [SEP] positions).
Exemplarily, the text data "[CLS] how to make payment setting [SEP]" is converted into the number sequence [101, 35, 36, 61, 62, 54, 55, 81, 82, 102] and input into the text abstract model. Through the BERT model and the MLP, the output is ([0.3, 0.7], [0.9, 0.1], [0.9, 0.1], [0.8, 0.2], [0.8, 0.2], [0.1, 0.9], [0.1, 0.9], [0.2, 0.8], [0.2, 0.8], [0.7, 0.3]), where the first number of each array is the probability that the current word does not belong to the abstract and the second number is the probability that it does. It should be noted that, since the number sequence includes the numbers corresponding to the start and end tags (101 and 102 above), the arrays at the two ends of the MLP output ([0.3, 0.7] and [0.7, 0.3] above) should be eliminated during data analysis. The abstract words are determined from the middle eight arrays; for example, the sixth array [0.1, 0.9] corresponds to the character "支" (of "支付", payment), which is determined to be an abstract word.
Step 205, determining the second text data from the plurality of second feature vectors of the number sequence.
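Steps 203 to 205 amount to one forward pass followed by a per-position comparison of the two scores. Below is a minimal sketch reusing the TextAbstractModel shown earlier; the id_to_char reverse dictionary is an assumed helper, not something the patent defines.

```python
import torch

def extract_abstract(model, ids: list, id_to_char: dict) -> str:
    """Feed one number sequence through the text abstract model and keep the
    characters whose 'is-abstract' score dominates. The first and last output
    rows correspond to [CLS]/[SEP] and are discarded, as described above."""
    with torch.no_grad():
        scores = model(torch.tensor([ids]))[0]       # (L, 2) second feature vectors
    probs = torch.softmax(scores, dim=-1)[1:-1]      # drop the [CLS]/[SEP] rows
    chars = [id_to_char[n] for n in ids[1:-1]]
    return "".join(c for c, p in zip(chars, probs) if p[1] > p[0])
```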
In the text matching method provided by the above embodiment, feature-vector analysis is performed on the text data through the text abstract model to determine its abstract data. The text abstract model comprises a BERT model and an MLP: the BERT model converts the number sequence corresponding to the text data into a plurality of multi-dimensional feature vectors, and the MLP determines the abstract data of the text data according to these feature vectors. The text abstract model of this embodiment fully understands the semantic information of the text data, so the abstract data of the first text data can be acquired accurately, providing data support for text matching and further improving its accuracy.
Based on the above embodiments, the following describes the training process of the text summarization model in the above embodiments in detail through a specific implementation.
Exemplarily, fig. 5 is a flowchart illustrating a training method of a text summarization model according to an embodiment of the present disclosure, and as shown in fig. 5, the training method may include:
step 301, obtaining a text data sample and a labeling result of the text data sample.
And the marking result is used for indicating whether each text in the text data sample belongs to the abstract or not.
Before training the text abstract model, a certain number of text data samples and the labeling result of each sample need to be collected. The labeling result of a text data sample is produced by manual annotation: the annotator labels whether each text unit in the sample belongs to the abstract text, and each text unit corresponds to one label value.
Illustratively, a label value of 1 indicates that the text belongs to the abstract text, and a label value of 0 indicates that the text does not belong to the abstract text. For example, the text data sample is "want to make payment settings again" with the label result of "0000001111".
Step 302, performing character-level segmentation on the text data sample, and converting the segmented text data sample into a digital sequence sample according to a preset dictionary.
In this embodiment, before character-level segmentation is performed on the text data sample, start and end tags need to be added to it. Specifically, a [CLS] tag is set at the beginning of the sentence of the text data sample and a [SEP] tag at its end; character-level segmentation is then performed, and the segmented text data sample is converted into a number sequence sample according to the preset dictionary.
Since the text abstract model requires the input number-sequence samples to be of a fixed length, number-sequence samples shorter than that length are padded with the value 0, and samples exceeding the fixed length are truncated.
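A one-line helper suffices for this fixed-length constraint; the function name and signature are assumptions for illustration.

```python
def pad_or_truncate(ids: list, fixed_len: int, pad_value: int = 0) -> list:
    """Bring a number-sequence sample to the model's fixed input length:
    short samples are padded with the value 0, long ones are truncated."""
    return ids[:fixed_len] + [pad_value] * max(0, fixed_len - len(ids))
```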
Optionally, in some examples, before performing the character-level segmentation on the text data sample, text processing such as punctuation removal, case conversion, and the like is further included.
Step 303, constructing an initial text abstract model.
Step 304, taking the number sequence sample corresponding to the text data sample as the input of the initial text abstract model, taking the labeling result of the text data sample as the expected output of the initial text abstract model, and training the initial text abstract model.
Step 305, obtaining the trained text abstract model when the loss function converges.
In this embodiment, the number sequence sample corresponding to the text data sample is input into the initial text abstract model to obtain the output result that the model predicts for the sample. The prediction error of the model is calculated from the output result and the labeling result, and the model parameters are updated through error back-propagation until the model converges. In this embodiment, the model parameters include the BERT model parameters and the MLP parameters.
Specifically, the convergence condition of the text summarization model may be: the model error is smaller than a preset error threshold; or the change of the weight value between two adjacent iterations is smaller than a set change threshold value; or a set maximum number of iterations is reached. And when the text abstract model meets the convergence condition, stopping training to obtain the trained text abstract model.
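Putting steps 303 to 305 together, the following is a minimal training sketch for the model defined earlier. The optimizer, learning rate, and convergence tolerance are assumptions, not values given in the patent; the patent's convergence conditions (loss convergence, or a maximum number of iterations) map onto the two stopping checks below.

```python
import torch
import torch.nn as nn

def train(model, batches, epochs: int = 10, lr: float = 2e-5, tol: float = 1e-4):
    """Minimal training sketch for the text abstract model. Per-character
    labels (1 = abstract, 0 = not) supervise a cross-entropy loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    prev = float("inf")
    for _ in range(epochs):                  # maximum-iterations condition
        total = 0.0
        for ids, mask, labels in batches:    # labels: (B, L) LongTensor of 0/1
            opt.zero_grad()
            logits = model(ids, attention_mask=mask)         # (B, L, 2)
            loss = ce(logits.reshape(-1, 2), labels.reshape(-1))
            loss.backward()                  # error back-propagation
            opt.step()
            total += loss.item()
        if abs(prev - total) < tol:          # loss-convergence condition
            break
        prev = total
    return model
```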
The text abstract model constructed through the above training process can accurately acquire the abstract data in text data. The abstract data represents the real meaning of the user's input and contains no spoken-language data, providing data support for the subsequent database text matching.
In the embodiments of the present disclosure, the text matching apparatus may be divided into functional modules according to the above method embodiments; for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in the form of hardware or of a software functional module. It should be noted that the division of modules in the embodiments of the present disclosure is illustrative and is only a division of logic functions; there may be other divisions in actual implementations. The following description takes the case where each functional module is divided according to a corresponding function as an example.
Exemplarily, fig. 6 is a block diagram of a structure of a text matching apparatus provided in an embodiment of the present disclosure. For ease of illustration, only portions that are relevant to embodiments of the present disclosure are shown.
As shown in fig. 6, the text matching apparatus 400 provided in this embodiment includes: an acquisition module 401 and a processing module 402.
An obtaining module 401, configured to obtain first text data to be matched;
a processing module 402, configured to perform spoken language removal processing on the first text data through a text abstract model to obtain second text data, where the second text data is abstract data of the first text data, and the text abstract model is obtained by training a BERT model and a multi-layer perceptron MLP;
the processing module 402 is further configured to determine a text matching result of the first text data from a text database according to the second text data.
In an optional embodiment of the present disclosure, the processing module 402 is specifically configured to:
performing character-level segmentation on the first text data to obtain segmented first text data;
converting the segmented first text data into a digital sequence according to a preset dictionary;
and inputting the digital sequence into the text abstract model to obtain the second text data.
In an optional embodiment of the present disclosure, the processing module 402 is specifically configured to:
inputting the number sequence into the BERT model of the text abstract model to obtain a first feature vector corresponding to each number in the number sequence;
inputting the first feature vector into the MLP of the text abstract model to obtain a second feature vector, wherein the second feature vector is used for indicating whether a text corresponding to each number in the number sequence is an abstract or not;
determining the second text data from a plurality of second feature vectors of the sequence of numbers.
In an optional embodiment of the present disclosure, the obtaining module 401 is further configured to:
acquiring feature vectors of the second text data and the standard text data;
the processing module 402 is specifically configured to determine matching text data from the text database by calculating similarity between the feature vectors of the second text data and the standard text data, where the similarity between the feature vectors of the matching text data and the second text data is greater than or equal to a preset threshold;
and taking the matched text data as a text matching result of the first text data.
In an optional embodiment of the present disclosure, the obtaining module 401 is specifically configured to: and acquiring the feature vectors of the second text data and the standard text data through the BERT model.
In an optional embodiment of the present disclosure, the obtaining module 401 is further configured to obtain a text data sample and a labeling result of the text data sample, where the labeling result is used to indicate whether each text in the text data sample belongs to a summary;
the processing module 402 is further configured to perform character-level segmentation on the text data sample, and convert the segmented text data sample into a digital sequence sample according to a preset dictionary;
constructing an initial text abstract model;
taking the digital sequence sample corresponding to the text data sample as the input of the initial text abstract model, taking the labeling result of the text data sample as the output of the initial text abstract model, and training the initial text abstract model;
and when the loss function is converged, obtaining the trained text abstract model.
The text matching device provided in the embodiment of the present disclosure is used for executing the technical solution provided in any one of the foregoing method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 7 is a block diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 7, the electronic device 500 of the present embodiment may include:
at least one processor 501 (only one processor is shown in FIG. 7); and
a memory 502 communicatively coupled to the at least one processor 501; wherein
the memory 502 stores a computer program executable by the at least one processor 501, the computer program being executable by the at least one processor 501 to enable the electronic device 500 to perform the solution of the first device in any of the method embodiments described above.
Alternatively, the memory 502 may be separate or integrated with the processor 501.
When the memory 502 is a separate device from the processor 501, the electronic device 500 further comprises: a bus 503 for connecting the memory 502 and the processor 501.
The electronic device provided by the embodiment of the present disclosure may execute the technical solution provided by any of the foregoing method embodiments, and the implementation principle and technical effect are similar, which are not described herein again.
The embodiment of the present disclosure further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program is used to implement the technical solution provided by any one of the foregoing method embodiments.
An embodiment of the present disclosure further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the technical solution provided by any of the foregoing method embodiments.
The embodiment of the present disclosure further provides a chip, including: a processing module and a communication interface, wherein the processing module can execute the technical scheme provided by any one of the method embodiments.
Further, the chip further includes a storage module (e.g., a memory), where the storage module is configured to store instructions, and the processing module is configured to execute the instructions stored in the storage module, and the execution of the instructions stored in the storage module causes the processing module to execute the technical solution provided by any one of the foregoing method embodiments.
It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile storage NVM, such as at least one disk memory, and may also be a usb disk, a removable hard disk, a read-only memory, a magnetic or optical disk, etc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present disclosure are not limited to only one bus or one type of bus.
The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in an electronic device.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present disclosure, and not for limiting the same; while the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present disclosure.

Claims (15)

1. A text matching method, comprising:
acquiring first text data to be matched;
carrying out spoken language removal processing on the first text data through a text abstract model to obtain second text data, wherein the second text data is abstract data of the first text data, and the text abstract model is obtained by adopting a BERT model and multi-layer perceptron MLP training;
and determining a text matching result of the first text data from a text database according to the second text data.
2. The method of claim 1, wherein the performing a spoken language removal process on the first text data through the text summarization model to obtain the second text data comprises:
performing character-level segmentation on the first text data to obtain segmented first text data;
converting the segmented first text data into a digital sequence according to a preset dictionary;
and inputting the digital sequence into the text abstract model to obtain the second text data.
3. The method of claim 2, wherein said entering said sequence of numbers into said text summarization model resulting in said second text data comprises:
inputting the number sequence into the BERT model of the text abstract model to obtain a first feature vector corresponding to each number in the number sequence;
inputting the first feature vector into the MLP of the text abstract model to obtain a second feature vector, wherein the second feature vector is used for indicating whether a text corresponding to each number in the number sequence is an abstract or not;
determining the second text data according to a plurality of second feature vectors of the number sequence.
4. The method of claim 1, wherein determining the text matching result for the first text data from a text database according to the second text data comprises:
acquiring feature vectors of the second text data and the standard text data;
determining matched text data from the text database by calculating the similarity of the feature vectors of the second text data and the standard text data, wherein the similarity of the feature vectors of the matched text data and the second text data is greater than or equal to a preset threshold value;
and taking the matched text data as a text matching result of the first text data.
5. The method of claim 4, wherein obtaining the feature vectors of the second text data and the standard text data comprises:
and acquiring the feature vectors of the second text data and the standard text data through the BERT model.
6. The method of claim 1,
the training process of the text abstract model comprises the following steps:
acquiring a text data sample and a labeling result of the text data sample, wherein the labeling result is used for indicating whether each text in the text data sample belongs to an abstract or not;
performing character-level segmentation on the text data samples, and converting the segmented text data samples into digital sequence samples according to a preset dictionary;
constructing an initial text abstract model;
taking the digital sequence sample corresponding to the text data sample as the input of the initial text abstract model, taking the labeling result of the text data sample as the output of the initial text abstract model, and training the initial text abstract model;
and when the loss function is converged, obtaining the trained text abstract model.
7. A text matching apparatus, comprising:
the acquisition module is used for acquiring first text data to be matched;
the processing module is used for carrying out spoken language removal processing on the first text data through a text abstract model to obtain second text data, wherein the second text data are abstract data of the first text data, and the text abstract model is obtained by adopting a BERT model and multi-layer perceptron MLP training;
the processing module is further configured to determine a text matching result of the first text data from a text database according to the second text data.
8. The apparatus of claim 7, wherein the processing module is specifically configured to:
perform character-level segmentation on the first text data to obtain segmented first text data;
convert the segmented first text data into a number sequence according to a preset dictionary;
and input the number sequence into the text abstract model to obtain the second text data.
9. The apparatus of claim 8, wherein the processing module is specifically configured to:
input the number sequence into the BERT model of the text abstract model to obtain a first feature vector corresponding to each number in the number sequence;
input the first feature vector into the MLP of the text abstract model to obtain a second feature vector, wherein the second feature vector indicates whether the character corresponding to each number in the number sequence belongs to the abstract;
and determine the second text data according to the plurality of second feature vectors of the number sequence.
10. The apparatus of claim 7, wherein the obtaining module is further configured to:
acquire feature vectors of the second text data and of standard text data;
the processing module is specifically configured to determine matched text data from the text database by calculating the similarity between the feature vectors of the second text data and the standard text data, wherein the similarity between the feature vectors of the matched text data and the second text data is greater than or equal to a preset threshold;
and take the matched text data as the text matching result of the first text data.
11. The apparatus of claim 10, wherein the obtaining module is specifically configured to acquire the feature vectors of the second text data and the standard text data through the BERT model.
12. The apparatus of claim 7, wherein:
the obtaining module is further configured to acquire a text data sample and a labeling result of the text data sample, where the labeling result indicates whether each character in the text data sample belongs to the abstract;
the processing module is further configured to: perform character-level segmentation on the text data sample, and convert the segmented text data sample into a number sequence sample according to the preset dictionary;
construct an initial text abstract model;
take the number sequence sample corresponding to the text data sample as the input of the initial text abstract model, take the labeling result of the text data sample as the expected output of the initial text abstract model, and train the initial text abstract model;
and obtain the trained text abstract model when the loss function converges.
13. An electronic device, characterized in that the electronic device comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the method of any one of claims 1-6.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1-6.
15. A computer program product, characterized in that it comprises a computer program which, when executed by a processor, implements the method of any one of claims 1-6.
CN202110667338.2A 2021-06-16 2021-06-16 Text matching method, device, equipment and storage medium Pending CN113297353A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110667338.2A CN113297353A (en) 2021-06-16 2021-06-16 Text matching method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113297353A true CN113297353A (en) 2021-08-24

Family

ID=77328446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110667338.2A Pending CN113297353A (en) 2021-06-16 2021-06-16 Text matching method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113297353A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657054A (en) * 2018-12-13 2019-04-19 北京百度网讯科技有限公司 Abstraction generating method, device, server and storage medium
CN109635103A (en) * 2018-12-17 2019-04-16 北京百度网讯科技有限公司 Abstraction generating method and device
WO2021000497A1 (en) * 2019-07-03 2021-01-07 平安科技(深圳)有限公司 Retrieval method and apparatus, and computer device and storage medium
CN111460135A (en) * 2020-03-31 2020-07-28 北京百度网讯科技有限公司 Method and device for generating text abstract
CN111966917A (en) * 2020-07-10 2020-11-20 电子科技大学 Event detection and summarization method based on pre-training language model
CN111859986A (en) * 2020-07-27 2020-10-30 中国平安人寿保险股份有限公司 Semantic matching method, device, equipment and medium based on multitask twin network
CN112182166A (en) * 2020-10-29 2021-01-05 腾讯科技(深圳)有限公司 Text matching method and device, electronic equipment and storage medium
CN112417142A (en) * 2020-11-23 2021-02-26 浙江工业大学 Auxiliary method and system for generating word meaning and abstract based on eye movement tracking
CN112861543A (en) * 2021-02-04 2021-05-28 吴俊 Deep semantic matching method and system for matching research and development supply and demand description texts
CN112883182A (en) * 2021-03-05 2021-06-01 海信电子科技(武汉)有限公司 Question-answer matching method and device based on machine reading

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RADFORD A: "Improving language understanding by generative pre-training", URL: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 31 December 2018 (2018-12-31) *
程艳芳: "Research on extractive automatic text summarization based on optimized selection", China Master's Theses Full-text Database, Information Science and Technology, 30 August 2020 (2020-08-30) *

Similar Documents

Publication Publication Date Title
US11403680B2 (en) Method, apparatus for evaluating review, device and storage medium
CN111241237B (en) Intelligent question-answer data processing method and device based on operation and maintenance service
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN112100354A (en) Man-machine conversation method, device, equipment and storage medium
CN110096572B (en) Sample generation method, device and computer readable medium
CN111599340A (en) Polyphone pronunciation prediction method and device and computer readable storage medium
CN113268576B (en) Deep learning-based department semantic information extraction method and device
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN112699686A (en) Semantic understanding method, device, equipment and medium based on task type dialog system
CN112380866A (en) Text topic label generation method, terminal device and storage medium
CN114218945A (en) Entity identification method, device, server and storage medium
CN113836938A (en) Text similarity calculation method and device, storage medium and electronic device
CN113326702A (en) Semantic recognition method and device, electronic equipment and storage medium
CN112231537A (en) Intelligent reading system based on deep learning and web crawler
CN111859950A (en) Method for automatically generating lecture notes
CN110795544A (en) Content search method, device, equipment and storage medium
CN113705207A (en) Grammar error recognition method and device
CN111783424A (en) Text clause dividing method and device
CN110750967A (en) Pronunciation labeling method and device, computer equipment and storage medium
CN114528851B (en) Reply sentence determination method, reply sentence determination device, electronic equipment and storage medium
CN116304046A (en) Dialogue data processing method and device, storage medium and electronic equipment
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN115292492A (en) Method, device and equipment for training intention classification model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination