CN112632232B - Text matching method, device, equipment and medium - Google Patents

Text matching method, device, equipment and medium

Info

Publication number
CN112632232B
CN112632232B (granted publication of application CN202110253010.6A)
Authority
CN
China
Prior art keywords
text
matching
target
long
short
Prior art date
Legal status
Active
Application number
CN202110253010.6A
Other languages
Chinese (zh)
Other versions
CN112632232A (en)
Inventor
傅玮萍
许国伟
丁文彪
刘子韬
Current Assignee
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202110253010.6A
Publication of CN112632232A
Application granted
Publication of CN112632232B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the disclosure relate to a text matching method, apparatus, device, and medium, wherein the method comprises the following steps: acquiring a target short text and a target long text; determining candidate long texts matched with the target short text; matching the target long text and the candidate long texts by adopting a first matching model, and determining a first matching result; and obtaining a matching result of the target short text and the target long text based on the first matching result. By adopting this technical scheme, the matching between a short text and a long text is converted into matching between two long texts, so that inaccurate matching caused by the short text carrying too little information, or by misalignment between the short-text and long-text information, can be avoided, and the matching accuracy between short and long texts is improved.

Description

Text matching method, device, equipment and medium
Technical Field
The present disclosure relates to the field of semantic processing technologies, and in particular, to a text matching method, apparatus, device, and medium.
Background
Text matching is a core problem in the field of natural language processing: by computing the correlation between texts, it serves as a core support in many applications, such as information retrieval, question answering systems, dialog systems, and recommendation systems.
In existing text matching technology, the similarity between texts is usually calculated, or their relatedness judged, by extracting text features and applying some similarity measure. Alternatively, deep learning is used: semantic vector representations of the texts are obtained by neural network training and then matched, or end-to-end text matching is performed directly with a deep learning method. However, all of these text matching methods suffer from inaccurate matching caused by the short text carrying too little information, misalignment between the short-text and long-text information, and the like.
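As a minimal illustration of the conventional feature-plus-similarity-measure approach described above, the following sketch computes a bag-of-words cosine similarity between two texts. The whitespace tokenization and raw term counts are simplifications for illustration, not the method of this disclosure:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity between two whitespace-tokenized texts."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    # Dot product over the shared vocabulary only; absent words contribute 0.
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Such a measure works reasonably when the two texts share vocabulary, which is exactly where it fails for very short texts, motivating the scheme below.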
Disclosure of Invention
To solve the above technical problem or at least partially solve the above technical problem, the present disclosure provides a text matching method, apparatus, device, and medium.
The embodiment of the disclosure provides a text matching method, which comprises the following steps:
acquiring a target short text and a target long text;
determining candidate long texts matched with the target short texts;
matching the target long text and the candidate long text by adopting a first matching model, and determining a first matching result;
and obtaining a matching result of the target short text and the target long text based on the first matching result.
The embodiment of the present disclosure further provides a text matching apparatus, the apparatus includes:
the text acquisition module is used for acquiring a target short text and a target long text;
the candidate module is used for determining candidate long texts matched with the target short texts;
the first matching module is used for matching the target long text with the candidate long text by adopting a first matching model and determining a first matching result;
and the result module is used for obtaining the matching result of the target short text and the target long text based on the first matching result.
An embodiment of the present disclosure further provides an electronic device, which includes: a processor; a memory for storing the processor-executable instructions; the processor is used for reading the executable instructions from the memory and executing the instructions to realize the text matching method provided by the embodiment of the disclosure.
The embodiment of the present disclosure also provides a computer-readable storage medium, which stores a computer program for executing the text matching method provided by the embodiment of the present disclosure.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages: according to the text matching scheme provided by the embodiment of the disclosure, a target short text and a target long text are obtained; determining candidate long texts matched with the target short texts; matching the target long text and the candidate long text by adopting a first matching model, and determining a first matching result; and obtaining a matching result of the target short text and the target long text based on the first matching result. By adopting the technical scheme, the matching between the short text and the long text is converted into the matching between the two long texts, so that the problem of inaccurate matching caused by too little short text information or the misalignment between the short text and the long text information can be avoided, and the matching accuracy between the short text and the long text is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
To illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below; it is apparent that those skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of a text matching method according to an embodiment of the present disclosure;
fig. 2 is a schematic flow chart of another text matching method provided in the embodiment of the present disclosure;
FIG. 3 is a schematic diagram of model training provided by embodiments of the present disclosure;
FIG. 4 is a schematic diagram of text matching provided by embodiments of the present disclosure;
fig. 5 is a schematic structural diagram of a text matching apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
Fig. 1 is a flowchart of a text matching method provided in an embodiment of the present disclosure, where the method may be executed by a text matching apparatus, where the apparatus may be implemented by software and/or hardware, and may be generally integrated in an electronic device. As shown in fig. 1, the method includes:
step 101, obtaining a target short text and a target long text.
The target short text can be any text of relatively short length, and the target long text any text of relatively long length; the short text and the long text are two texts of different lengths, and "long" and "short" are relative concepts rather than limitations on the actual length of the texts.
In the embodiment of the disclosure, the text matching device may acquire the target short text and the target long text sent by the user, and may also acquire the target short text and the target long text from the internet. The specific sources of the target short text and the target long text are not limited by the embodiments of the present disclosure. For example, the target short text may be a composition title input by the user, and the target long text may be a composition full text or at least one text paragraph in the composition input by the user; the target short text can be a search keyword, and the target long text can be a search result text; the target short text can be a keyword, and the target long text can be any long text which needs to be judged whether the target short text is related to the keyword.
And 102, determining candidate long texts matched with the target short texts.
The candidate long texts are long texts matched with the target short text; there can be multiple candidate long texts, and the specific number is not limited.
In the embodiment of the present disclosure, determining the candidate long texts matched with the target short text may include: searching in a preset text matching library to determine the candidate long texts matched with the target short text, where the number of candidate long texts is at least one. The text matching library is constructed from short texts and their matched long texts and is stored as text pairs of matching short and long texts; each entry in the text matching library is represented in the dictionary form {short text, long text list}, storing all long texts matched with that short text.
Optionally, the determining the candidate long text matched with the target short text by searching in a preset text matching library may include: determining the text similarity between the target short text and each short text in the text matching library; and determining the long text corresponding to the short text with the text similarity larger than or equal to the similarity threshold as the candidate long text. Wherein, the similarity threshold value can be set according to the actual situation.
Specifically, the text similarity between the target short text and the short text included in the text matching library is determined, the text similarity is compared with a similarity threshold, the short text with the text similarity larger than or equal to the similarity threshold is determined, and the long text matched with the short text is determined as the candidate long text.
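The retrieval step above can be sketched as follows. The matching library is the {short text, long text list} dictionary described earlier; `text_similarity` is a hypothetical placeholder (a toy Jaccard measure over character sets here) for whatever short-text similarity measure is actually used:

```python
def text_similarity(a, b):
    """Toy Jaccard similarity over character sets; a stand-in for the real measure."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve_candidates(target_short, matching_library, threshold=0.5):
    """Collect long texts whose short-text key is similar enough to the target.

    matching_library: dict mapping each short text to its list of matched long texts.
    """
    candidates = []
    for short_text, long_texts in matching_library.items():
        if text_similarity(target_short, short_text) >= threshold:
            candidates.extend(long_texts)
    return candidates
```

The threshold value and similarity function are illustrative assumptions; the disclosure only requires that similarity be compared against a configurable threshold.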
And 103, matching the target long text and the candidate long text by adopting a first matching model, and determining a first matching result.
The first matching model is a pre-trained model for matching long texts. The embodiments of the present disclosure do not limit the underlying model adopted by the first matching model; for example, the first matching model may be a Gradient Boosting Decision Tree (GBDT) classification model, and may also be referred to as a text classification model.
In this embodiment of the present disclosure, before step 103, the text matching method may further include training the first matching model, where the first matching model is obtained by training as follows: acquiring long text sample pairs, and obtaining semantic vectors of each long text sample pair based on a text representation model; fusing the two semantic vectors to obtain a sample fusion feature vector; and training a basic first matching model based on the sample fusion feature vectors and the matching labels of the long text sample pairs to obtain the first matching model.
Based on the text matching library, two long texts under the same short text can be randomly extracted as a positive sample of a long text sample pair, and two long texts under different short texts as a negative sample. The long text sample pairs serve as the training set; the number of samples in the training set is not limited, and, for example, 400,000 pairs of positive and negative samples can be constructed as the training set. A basic representation model can be trained on this training set to obtain the text representation model; the basic representation model can adopt a twin neural network (Siamese Network), i.e., a coupled pair of neural networks sharing weights.
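The construction of positive and negative long text sample pairs from the matching library might look like the following sketch (function and parameter names are illustrative, not from the disclosure; it assumes at least one short text has two or more matched long texts):

```python
import random

def sample_pairs(matching_library, n_pos, n_neg, rng=None):
    """Build (di, dj, label) pairs: label 1 for two long texts under the same
    short text, label 0 for long texts under different short texts."""
    rng = rng or random.Random(0)
    keys = list(matching_library)
    pairs = []
    while len(pairs) < n_pos:  # positive samples: same short text
        docs = matching_library[rng.choice(keys)]
        if len(docs) >= 2:
            di, dj = rng.sample(docs, 2)
            pairs.append((di, dj, 1))
    while len(pairs) < n_pos + n_neg:  # negative samples: different short texts
        ka, kb = rng.sample(keys, 2)
        pairs.append((rng.choice(matching_library[ka]),
                      rng.choice(matching_library[kb]), 0))
    return pairs
```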
Corresponding semantic vectors are obtained for the long text sample pairs with the text representation model; the two semantic vectors are fused to obtain a sample fusion feature vector, and a basic matching model is trained with the sample fusion feature vector as input and the matching label of the long text sample pair as output, thereby obtaining the first matching model. The fusion process may include at least one of addition, multiplication, division, and nonlinear transformation. Illustratively, for a long text sample pair (di, dj) of a positive or negative sample, the text characterization model represents di and dj as semantic vectors Si and Sj respectively; the results of applying transformations such as addition, multiplication, and division to Si and Sj are concatenated into a new sample fusion feature vector, and the basic matching model is trained with the sample fusion feature vectors to obtain the first matching model.
In the embodiment of the present disclosure, matching the target long text and the candidate long text by using the first matching model, and determining the first matching result may include: respectively inputting the target long text and the candidate long text into a text representation model to obtain corresponding semantic vectors; determining a fusion feature vector according to the semantic vectors corresponding to the target long text and the candidate long text, and inputting the fusion feature vector into the first matching model to obtain a first matching result.
Optionally, determining a fusion feature vector according to the semantic vectors corresponding to the target long text and the candidate long text, including: and fusing semantic vectors corresponding to the target long text and the candidate long text to obtain a fused feature vector, wherein the fusing comprises at least one of addition, multiplication, division and nonlinear change.
Specifically, the text matching device may input the target long text and the candidate long text into the previously trained text representation model respectively to obtain two corresponding semantic vectors, apply at least one of addition, multiplication, division, nonlinear transformation, and the like to the two semantic vectors and fuse the results into a fusion feature vector, input the fusion feature vector into the first matching model to judge whether the two texts are related, and output a probability value. If the probability value is greater than a first probability threshold, the first matching result is determined to be a successful match; otherwise, the first matching result is a failed match. The first probability threshold may be set according to actual conditions; for example, it may be 50%.
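A minimal sketch of the fusion and thresholding steps, assuming plain Python lists as semantic vectors and a safe element-wise division (the real model's set of transformations may differ):

```python
def fuse(si, sj):
    """Concatenate element-wise sum, product, and (safe) quotient of two
    semantic vectors into a single fusion feature vector."""
    assert len(si) == len(sj)
    add = [a + b for a, b in zip(si, sj)]
    mul = [a * b for a, b in zip(si, sj)]
    div = [a / b if b else 0.0 for a, b in zip(si, sj)]  # guard division by zero
    return add + mul + div

def first_match(prob, threshold=0.5):
    """Map the classifier's relatedness probability to a match decision."""
    return "success" if prob > threshold else "failure"
```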
And 104, obtaining a matching result of the target short text and the target long text based on the first matching result.
When the first matching result is that matching is successful, it can be determined that matching between the target short text and the target long text is successful, and when the first matching result is that matching is failed, it can be determined that matching between the target short text and the target long text is failed. In the above scheme, since the candidate long texts are all long texts related to the target short text, if the target long text to be matched is related to the candidate long text, it can be determined that the target long text is also related to the target short text.
According to the text matching scheme provided by the embodiment of the disclosure, a target short text and a target long text are obtained; determining candidate long texts matched with the target short texts; matching the target long text and the candidate long text by adopting a first matching model, and determining a first matching result; and obtaining a matching result of the target short text and the target long text based on the first matching result. By adopting the technical scheme, the matching between the short text and the long text is converted into the matching between the two long texts, so that the problem of inaccurate matching caused by too little short text information or the misalignment between the short text and the long text information can be avoided, and the matching accuracy between the short text and the long text is improved.
In some embodiments, prior to determining the candidate long text that matches the target short text, the method further comprises: matching the target short text and the target long text by adopting a second matching model, and determining a second matching result; determining candidate long texts matching the target short text is performed if the second matching result indicates that the target short text and the target long text fail to match.
The second matching model is a deep learning model trained in advance and used for judging whether the two texts are matched, and the second matching model can be trained based on samples in advance. The basic neural network adopted by the second matching model is not limited in the embodiments of the present disclosure, for example, the second matching model may be a Gradient Boosting Decision Tree (GBDT) classification model.
In some embodiments, the second matching model is obtained based on training in the following way: acquiring a sample short text and a sample long text; respectively extracting text characteristics of the sample short text and the sample long text; and training the basic second matching model based on the text features of the sample short text and the sample long text and the matching labeling information between the sample short text and the sample long text to obtain a second matching model.
The text features may include at least one of explicit features, semantic features, and interactive features. And training a basic second matching model by taking the text features extracted from the sample short text and the sample long text as input and the matching marking information between the sample short text and the sample long text as output, so as to determine the second matching model. The matching marking information is information indicating whether the sample short text and the sample long text are matched, that is, whether the sample short text and the sample long text are related.
Optionally, according to the first matching result, the target short text and the target long text may be used as a positive sample or a negative sample and returned to continue training the second matching model, so as to improve the accuracy of text matching of the second matching model.
In this embodiment of the present disclosure, matching the target short text and the target long text by using the second matching model, and determining the second matching result may include: extracting text characteristics of the target short text and the target long text; and inputting the text characteristics of the target short text and the target long text into a second matching model to obtain a second matching result. Optionally, the text features include at least one of explicit features, semantic features and interactive features, where the explicit features are used for characterizing visible characteristics of the text, the semantic features are used for characterizing semantic characteristics of the text, and the interactive features are used for characterizing association characteristics between two texts.
The explicit features may include text length, number of words, number of single sentences, and so on. The semantic features may include a semantic vector of the text, a topic distribution vector of the long text, and the like; the semantic expression vector uses the average of the Word2vec word vectors of the words in the text, and the topic distribution vector uses the distribution vector output by a Latent Dirichlet Allocation (LDA) document topic generation model. Word2vec here may be a model that represents words as word vectors. The interactive features may include keyword information of the short text contained in the long text, the K sentences in the long text most similar (by cosine similarity) to the semantic vector of the short text, and the like.
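The explicit and interactive feature families can be illustrated with simple stand-ins; the sentence splitting and keyword overlap below are toy approximations for illustration, not the disclosure's exact feature definitions:

```python
def explicit_features(text):
    """Explicit (surface) features: character length, word count, sentence count."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return {"length": len(text),
            "num_words": len(text.split()),
            "num_sentences": len(sentences)}

def interactive_features(short_text, long_text):
    """A simple interaction feature: which short-text words the long text shares."""
    short_words, long_words = set(short_text.split()), set(long_text.split())
    overlap = short_words & long_words
    return {"shared_keywords": sorted(overlap),
            "overlap_ratio": len(overlap) / len(short_words) if short_words else 0.0}
```

In practice these dictionaries would be flattened into a numeric feature vector and fed, together with the semantic features, to the GBDT classifier.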
Specifically, the text matching device may extract text features of the target short text and the target long text, respectively, input the text features of the target short text and the target long text into the second matching model, output a probability value, and determine that the second matching result is a successful matching if the probability value is greater than a second probability threshold; otherwise, the second matching result is matching failure. The second probability threshold may be set according to actual conditions, and for example, the second probability threshold may be 50%.
In the above scheme, the target short text and the target long text containing the same keyword information may be matched, but if the following situations occur: 1) the short text is too short, and the contained information amount is too small; 2) the short text does not contain any key words (such as key nouns, verbs and the like); 3) the long text does not contain the keywords in the short text, and the like, so that the short text cannot be accurately matched, that is, the second matching result of the target short text and the target long text is matching failure.
Specifically, if the second matching result indicates that the matching between the target short text and the target long text fails, determining candidate long texts matched with the target short text and performing matching in another mode; and if the second matching result indicates that the target short text is successfully matched with the target long text, namely the target short text is related to the target long text, directly returning the matching result to the user.
In the scheme, matching can be performed on the short text and the long text first, matching between the short text and the long text can be converted into matching between the two long texts after matching fails, and a matching result can be returned after matching succeeds, so that matching efficiency is improved.
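The two-stage cascade described above can be sketched as a single function; the three callables passed in stand for the second matching model, the first matching model, and candidate retrieval, all hypothetical stubs here:

```python
def match(short_text, long_text, second_model, first_model, retrieve):
    """Two-stage cascade: try direct short-long matching first; on failure,
    fall back to long-long matching against retrieved candidate long texts."""
    if second_model(short_text, long_text):      # stage 1: direct match
        return True
    for candidate in retrieve(short_text):       # stage 2: long-long match
        if first_model(long_text, candidate):
            return True
    return False
```

A match found in stage 1 returns immediately, which is the efficiency gain the paragraph above describes; stage 2 runs only when stage 1 fails.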
Fig. 2 is a schematic flow chart of another text matching method provided in the embodiment of the present disclosure, and the embodiment specifically describes the text matching method on the basis of the above embodiment. As shown in fig. 2, the method includes:
step 201, obtaining a target short text and a target long text.
Step 202, matching the target short text and the target long text by adopting a second matching model, and determining a second matching result.
Specifically, matching the target short text and the target long text by using a second matching model, and determining a second matching result, including: extracting text characteristics of the target short text and the target long text; and inputting the text characteristics of the target short text and the target long text into a second matching model to obtain a second matching result. Optionally, the text features include at least one of explicit features, semantic features and interactive features, where the explicit features are used for characterizing visible characteristics of the text, the semantic features are used for characterizing semantic characteristics of the text, and the interactive features are used for characterizing association characteristics between two texts.
Optionally, the second matching model is obtained by training based on the following method: acquiring a sample short text and a sample long text; respectively extracting text characteristics of the sample short text and the sample long text; and training the basic second matching model based on the text features of the sample short text and the sample long text and the matching labeling information between the sample short text and the sample long text to obtain a second matching model.
Step 203, whether the second matching result indicates that the matching between the target short text and the target long text fails is judged, and if yes, step 204 is executed; otherwise, step 206 is performed.
And step 204, determining candidate long texts matched with the target short texts.
Specifically, determining candidate long texts matched with the target short texts includes: and searching in a preset text matching library to determine candidate long texts matched with the target short texts, wherein the number of the candidate long texts is at least one. Optionally, the determining the candidate long text matched with the target short text by searching in a preset text matching library includes: determining the text similarity between the target short text and each short text in the text matching library; and determining the long text corresponding to the short text with the text similarity larger than or equal to the similarity threshold as the candidate long text. And the text matching library is stored according to the mode of the text pair matched with the short text and the long text.
And step 205, matching the target long text and the candidate long text by adopting a first matching model, and determining a first matching result.
Specifically, matching the target long text and the candidate long text by using a first matching model, and determining a first matching result, includes: respectively inputting the target long text and the candidate long text into a text representation model to obtain corresponding semantic vectors; determining a fusion feature vector according to the semantic vectors corresponding to the target long text and the candidate long text, and inputting the fusion feature vector into the first matching model to obtain a first matching result. Optionally, determining a fusion feature vector according to the semantic vectors corresponding to the target long text and the candidate long text, including: and fusing semantic vectors corresponding to the target long text and the candidate long text to obtain a fused feature vector, wherein the fusing comprises at least one of addition, multiplication, division and nonlinear change.
Optionally, the first matching model is obtained by training through the following method: acquiring a long text sample pair, and acquiring a semantic vector of the long text sample pair based on a text representation model; fusing the two semantic vectors to obtain a sample fusion feature vector; and training the basic first matching model based on the sample fusion feature vector and the matching label of the long text sample pair to obtain a first matching model.
The basic characterization model may adopt a twin (Siamese) neural network, which takes two samples as input and outputs embeddings of the two samples in a high-dimensional space so that their similarity can be compared. Exemplarily, fig. 3 is a schematic diagram of model training provided by an embodiment of the present disclosure. The twin neural network in fig. 3 comprises two networks, network 1 and network 2, which share weights; each network receives one input, with "input 1" and "input 2" fed into network 1 and network 2 respectively and mapped into a high-dimensional feature space, and the corresponding characterizations are optimized against a loss function.
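A minimal sketch of weight sharing in a twin network: both branches call the same encoder with the same weight matrix, and the two characterizations are compared by cosine similarity. A single tanh layer stands in here for the real deep network, and the loss function is omitted:

```python
import math

def encode(x, weights):
    """Shared encoder: one linear layer followed by tanh. Both branches of the
    twin network call this same function with the same `weights`."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def twin_forward(x1, x2, weights):
    """Map both inputs through the shared encoder and score their similarity."""
    s1, s2 = encode(x1, weights), encode(x2, weights)
    dot = sum(a * b for a, b in zip(s1, s2))
    n1 = sum(a * a for a in s1) ** 0.5
    n2 = sum(b * b for b in s2) ** 0.5
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Because the weights are shared, identical inputs always map to identical characterizations, which is the property a Siamese architecture is built to guarantee.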
And step 206, determining a matching result of the target short text and the target long text.
And determining the first matching result or the second matching result as the matching result of the target short text and the target long text.
The text matching method in the embodiment of the present disclosure is further explained by a specific example. Exemplarily, fig. 4 is a schematic diagram of text matching provided by the embodiment of the present disclosure, and a specific process of text matching may include:
1. the short text matches directly with the long text.
First, the explicit features and semantic features of the long text and the short text, and the interactive features between them, are extracted; these features are then used to train a GBDT classification model, denoted GBDT1, which judges whether the two texts are related.
The extracted features are: 1) explicit features, including text length, word count, etc.; 2) semantic features, including the semantic expression vector of each text, the topic distribution vector of the long text, etc. — the semantic expression vector is the average of the Word2vec word vectors of the words in the text, and the topic distribution vector is the distribution output by an LDA topic model; 3) interactive features, including the keyword information of the short text contained in the long text, the K sentences of the long text most similar (by cosine similarity) to the semantic vector of the short text, etc.
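The three feature families above can be sketched as follows. This is an illustrative toy, with hand-made word vectors standing in for a trained Word2vec model; the function names and the tiny vocabulary are assumptions, not from the patent.

```python
import numpy as np

def explicit_features(text):
    # explicit (visible) characteristics: character length and word count
    return [len(text), len(text.split())]

def avg_word_vector(text, word_vecs, dim=3):
    # semantic feature: average of the (here, toy) Word2vec vectors of the words
    vecs = [word_vecs[w] for w in text.split() if w in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na > 0 and nb > 0 else 0.0

def top_k_sentences(short_text, long_sentences, word_vecs, k=2):
    # interactive feature: the K sentences of the long text closest to the short text
    sv = avg_word_vector(short_text, word_vecs)
    scored = [(cosine(sv, avg_word_vector(s, word_vecs)), s) for s in long_sentences]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [s for _, s in scored[:k]]

# toy word vectors standing in for a trained Word2vec model
word_vecs = {"cat": np.array([1.0, 0.0, 0.0]),
             "dog": np.array([0.9, 0.1, 0.0]),
             "car": np.array([0.0, 0.0, 1.0])}
```

In practice the word vectors would come from a Word2vec model and the topic distribution from an LDA model trained on the corpus.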
This step matches a long text and a short text accurately when they contain the same keyword information, but it fails in the following situations: 1) the short text is too short and carries too little information; 2) the short text contains no keywords (e.g., key nouns, verbs, etc.); 3) the long text does not contain the keywords of the short text.
2. Matching between long texts based on a matching library.
For two texts that were not matched in the first step, further matching is performed. Suppose the short text to be matched is ds and the long text is dq.
A large text matching library is constructed from historical matching data. Each sample in the library is stored as a dictionary of the form {short text: long text list}, holding all long texts matched to that short text. A text representation model and a text relevance classification model are then trained. To train the text representation model: based on the matching library, two long texts attached to the same short text are randomly drawn as a positive pair, and long texts attached to different short texts are drawn as negative pairs; 400,000 such long-text pairs (di, dj) are constructed as the training set, and a deep text representation model Model_S is trained for subsequent long-text matching. To train the text relevance classification model, the same 400,000 pairs are used; for each pair (di, dj), new features are obtained as follows: first, the semantic vectors Si and Sj are computed with Model_S; second, the results of transformations of Si and Sj such as addition, multiplication and division are concatenated into a new feature vector. A GBDT text relevance classification model is trained on these features and denoted GBDT2.
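The library structure and the positive/negative pair sampling can be sketched as follows. The toy library contents and function names are illustrative assumptions.

```python
import random

# toy matching library in the {short text: long text list} form described above
library = {
    "s1": ["l1a", "l1b", "l1c"],
    "s2": ["l2a", "l2b"],
    "s3": ["l3a", "l3b"],
}

def sample_pairs(library, n_pos, n_neg, seed=0):
    """Positive pair: two long texts matched to the SAME short text.
    Negative pair: long texts drawn from DIFFERENT short texts."""
    rng = random.Random(seed)
    shorts = [s for s in library if len(library[s]) >= 2]
    pairs = []
    for _ in range(n_pos):
        s = rng.choice(shorts)
        di, dj = rng.sample(library[s], 2)
        pairs.append((di, dj, 1))
    for _ in range(n_neg):
        s1, s2 = rng.sample(list(library), 2)
        pairs.append((rng.choice(library[s1]), rng.choice(library[s2]), 0))
    return pairs

pairs = sample_pairs(library, n_pos=4, n_neg=4)
```

In the patent's setting this sampling would be repeated until 400,000 labeled pairs (di, dj) are collected.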
Using the short text ds to be matched, all long-text lists whose stored short text is consistent with ds are looked up in the matching library, forming a candidate library D = [d1, d2, …, dn].
The matching process between the long texts is as follows: 1) represent the long text dq to be matched as a semantic vector Sq using the text representation model Model_S; 2) for each long text di in the candidate library: represent it as a semantic vector Si using Model_S; concatenate the results of transformations of Sq and Si such as addition, multiplication and division into a feature vector; input this feature vector into the model GBDT2, which outputs a probability that Sq and Si are related; use this probability to decide whether dq and di are related, and hence whether the short text ds is related to the long text dq.
Since the long texts in the candidate library are all related to the short text ds, if the long text dq to be matched is related to a long text di in the candidate library, it can be concluded that dq is also related to ds.
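The fused feature vector fed to GBDT2 can be sketched as follows. A minimal NumPy illustration: addition, multiplication and division are named in the text; the tanh component is one assumed example of a nonlinear transformation, and the epsilon guard against division by zero is an implementation assumption.

```python
import numpy as np

def fuse(sq, si, eps=1e-8):
    """Concatenate element-wise addition, multiplication and division of the
    two semantic vectors; tanh of the difference illustrates one possible
    nonlinear component."""
    return np.concatenate([sq + si, sq * si, sq / (si + eps), np.tanh(sq - si)])

sq = np.array([0.5, -0.2, 0.1])   # semantic vector of the long text to match
si = np.array([0.4, -0.1, 0.3])   # semantic vector of a candidate long text
features = fuse(sq, si)           # this vector would be the input to GBDT2
```

Each fusion operation contributes one block of the final feature vector, so the classifier sees several views of how the two semantic vectors relate.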
In the embodiment of the disclosure, the matching task between a short text and a long text is converted into a matching task between two long texts. This mitigates the inaccuracy caused by the short text carrying too little information, or by the short text or the long text lacking keywords, thereby improving the matching accuracy between short and long texts.
According to the text matching scheme provided by the embodiment of the disclosure, a target short text and a target long text are obtained; the target short text and the target long text are matched with a second matching model to determine a second matching result; if the second matching result indicates that the matching failed, candidate long texts matched with the target short text are determined; the target long text and the candidate long texts are matched with a first matching model to determine a first matching result; and the matching result of the target short text and the target long text is then determined. With this scheme, the short text and the long text are matched first, and if that matching fails, the task is converted into matching between two long texts. This avoids inaccurate matching caused by the short text carrying too little information or by misalignment between the short-text and long-text information, improving the matching accuracy between short and long texts.
Fig. 5 is a schematic structural diagram of a text matching apparatus provided in an embodiment of the present disclosure, where the apparatus may be implemented by software and/or hardware, and may be generally integrated in an electronic device. As shown in fig. 5, the apparatus includes:
a text obtaining module 301, configured to obtain a target short text and a target long text;
a candidate module 302, configured to determine a candidate long text matching the target short text;
a first matching module 303, configured to match the target long text and the candidate long text by using a first matching model, and determine a first matching result;
a result module 304, configured to obtain a matching result between the target short text and the target long text based on the first matching result.
According to the text matching scheme provided by the embodiment of the disclosure, a target short text and a target long text are obtained; candidate long texts matched with the target short text are determined; the target long text and the candidate long texts are matched with a first matching model to determine a first matching result; and the matching result of the target short text and the target long text is obtained from the first matching result. With this scheme, matching between a short text and a long text is converted into matching between two long texts, which avoids inaccurate matching caused by the short text carrying too little information or by misalignment between the short-text and long-text information, improving the matching accuracy between short and long texts.
Optionally, the candidate module 302 is specifically configured to:
search a preset text matching library to determine candidate long texts matched with the target short text, where the number of candidate long texts is at least one.
Optionally, the candidate module 302 is specifically configured to:
determine the text similarity between the target short text and each short text in the text matching library; and
determine the long texts corresponding to short texts whose text similarity is greater than or equal to a similarity threshold as the candidate long texts.
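The similarity-threshold lookup can be sketched as follows. Jaccard word overlap is only one illustrative choice of text similarity (the disclosure does not fix the measure), and the toy library contents are assumptions.

```python
def jaccard(a, b):
    # one simple text-similarity choice; the embodiment does not fix the measure
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def candidate_long_texts(target_short, match_library, threshold=0.5):
    """Collect the long texts whose stored short text is similar enough to the target."""
    candidates = []
    for short, longs in match_library.items():
        if jaccard(target_short, short) >= threshold:
            candidates.extend(longs)
    return candidates

match_library = {
    "area of a triangle": ["long text about triangle area"],
    "prime factorization": ["long text about primes"],
}
cands = candidate_long_texts("area of the triangle", match_library, threshold=0.5)
```

Raising the threshold trades recall for precision: fewer, but more reliable, candidate long texts reach the first matching model.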
Optionally, the text matching library stores the short text and the long text in a matching text pair manner.
Optionally, the first matching module 303 is specifically configured to:
respectively inputting the target long text and the candidate long text into a text representation model to obtain corresponding semantic vectors;
determining a fusion feature vector according to the semantic vectors corresponding to the target long text and the candidate long text, and inputting the fusion feature vector into the first matching model to obtain a first matching result.
Optionally, the first matching module 303 is specifically configured to:
obtain the fused feature vector by fusing the semantic vectors corresponding to the target long text and the candidate long text, where the fusion processing comprises at least one of addition, multiplication, division, and a nonlinear transformation.
Optionally, the apparatus further comprises a first training module, configured to:
obtaining a long text sample pair, and obtaining a semantic vector of the long text sample pair based on a text representation model;
fusing the two semantic vectors to obtain a sample fusion feature vector;
training a basic first matching model based on the sample fusion feature vector and the matching label of the long text sample pair to obtain the first matching model.
Optionally, the apparatus further includes a second matching module configured to, prior to the determination of the candidate long texts matched with the target short text:
match the target short text and the target long text with a second matching model, and determine a second matching result; and
if the second matching result indicates that the target short text and the target long text failed to match, trigger the determination of the candidate long texts matched with the target short text.
Optionally, the second matching module is specifically configured to:
extracting text features of the target short text and the target long text;
and inputting the text features of the target short text and the target long text into the second matching model to obtain a second matching result.
Optionally, the text features include at least one of an explicit feature, a semantic feature and an interactive feature, where the explicit feature is used to characterize a visible characteristic of the text, the semantic feature is used to characterize a semantic characteristic of the text, and the interactive feature is used to characterize an association characteristic between two texts.
Optionally, the apparatus further comprises a second training module, configured to:
acquiring a sample short text and a sample long text;
respectively extracting text characteristics of the sample short text and the sample long text;
and training a basic second matching model based on the text features of the sample short text and the sample long text and the matching labeling information between the sample short text and the sample long text to obtain the second matching model.
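The training of such a GBDT matching classifier can be sketched as follows. This is an assumed illustration using scikit-learn's `GradientBoostingClassifier` (the disclosure does not name a library), with random toy features standing in for the extracted text features and stand-in matching labels.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# toy feature matrix standing in for the text features of sample short/long pairs
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in matching labels

gbdt = GradientBoostingClassifier(n_estimators=50, random_state=0)
gbdt.fit(X, y)
proba = gbdt.predict_proba(X[:1])[0, 1]   # probability that the pair matches
```

In the patent's pipeline, `X` would be the concatenated explicit, semantic and interactive features, and the trained model would play the role of GBDT1 (or GBDT2 for long-text pairs).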
The text matching device provided by the embodiment of the disclosure can execute the text matching method provided by any embodiment of the disclosure, and has corresponding functional modules and beneficial effects of the execution method.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 6, the electronic device 400 includes one or more processors 401 and memory 402.
The processor 401 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 400 to perform desired functions.
Memory 402 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by processor 401 to implement the text matching methods of the embodiments of the present disclosure described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 400 may further include: an input device 403 and an output device 404, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device 403 may also include, for example, a keyboard, a mouse, and the like.
The output device 404 may output various information to the outside, including the determined distance information, direction information, and the like. The output devices 404 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 400 relevant to the present disclosure are shown in fig. 6, omitting components such as buses, input/output interfaces, and the like. In addition, electronic device 400 may include any other suitable components depending on the particular application.
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the text matching methods provided by embodiments of the present disclosure.
The computer program product may write program code for carrying out operations of embodiments of the present disclosure in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the text matching method provided by embodiments of the present disclosure.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (13)

1. A text matching method, comprising:
acquiring a target short text and a target long text to be matched;
determining candidate long texts matched with the target short texts;
matching the target long text and the candidate long text by adopting a first matching model, and determining a first matching result;
obtaining a matching result of the target short text and the target long text based on the first matching result;
wherein matching the target long text and the candidate long text by using the first matching model and determining the first matching result comprises:
respectively inputting the target long text and the candidate long text into a text representation model to obtain corresponding semantic vectors;
determining a fusion feature vector according to the semantic vectors corresponding to the target long text and the candidate long text, and inputting the fusion feature vector into the first matching model to obtain a first matching result;
obtaining a matching result of the target short text and the target long text based on the first matching result, wherein the obtaining of the matching result of the target short text and the target long text comprises:
when the first matching result indicates that the matching is successful, determining that the target short text and the target long text are successfully matched;
when the first matching result is a matching failure, determining that the target short text and the target long text are matched unsuccessfully;
the first matching model is a pre-trained deep learning model used for matching long texts.
2. The method of claim 1, wherein determining candidate long texts that match the target short text comprises:
and searching in a preset text matching library to determine candidate long texts matched with the target short texts, wherein the number of the candidate long texts is at least one.
3. The method of claim 2, wherein determining the candidate long text matching the target short text by searching in a preset text matching library comprises:
determining the text similarity between the target short text and each short text in the text matching library;
and determining the long text corresponding to the short text with the text similarity larger than or equal to the similarity threshold as the candidate long text.
4. The method of claim 2, wherein the text matching library is stored as pairs of text where short text matches long text.
5. The method of claim 1, wherein determining a fused feature vector according to semantic vectors corresponding to the target long text and the candidate long text comprises:
obtaining the fused feature vector by fusing the semantic vectors corresponding to the target long text and the candidate long text, wherein the fusion processing comprises at least one of addition, multiplication, division, and a nonlinear transformation.
6. The method of claim 1, wherein the first matching model is obtained by training:
obtaining a long text sample pair, and obtaining a semantic vector of the long text sample pair based on a text representation model;
fusing the two semantic vectors to obtain a sample fusion feature vector;
training a basic first matching model based on the sample fusion feature vector and the matching label of the long text sample pair to obtain the first matching model.
7. The method of claim 1, wherein prior to determining the candidate long text that matches the target short text, the method further comprises:
matching the target short text and the target long text by adopting a second matching model, and determining a second matching result;
and if the second matching result indicates that the target short text and the target long text fail to be matched, executing the determination of the candidate long text matched with the target short text.
8. The method of claim 7, wherein matching the target short text with the target long text using a second matching model, and determining a second matching result comprises:
extracting text features of the target short text and the target long text;
and inputting the text features of the target short text and the target long text into the second matching model to obtain a second matching result.
9. The method of claim 8, wherein the text features comprise at least one of explicit features for characterizing visible characteristics of the text, semantic features for characterizing semantic characteristics of the text, and interactive features for characterizing association characteristics between two texts.
10. The method of claim 7, wherein the second matching model is obtained based on training as follows:
acquiring a sample short text and a sample long text;
respectively extracting text characteristics of the sample short text and the sample long text;
and training a basic second matching model based on the text features of the sample short text and the sample long text and the matching labeling information between the sample short text and the sample long text to obtain the second matching model.
11. A text matching apparatus, comprising:
the text acquisition module is used for acquiring a target short text and a target long text to be matched;
the candidate module is used for determining candidate long texts matched with the target short texts;
the first matching module is used for matching the target long text with the candidate long text by adopting a first matching model and determining a first matching result;
the result module is used for obtaining a matching result of the target short text and the target long text based on the first matching result;
the first matching module is specifically configured to:
respectively inputting the target long text and the candidate long text into a text representation model to obtain corresponding semantic vectors;
determining a fusion feature vector according to the semantic vectors corresponding to the target long text and the candidate long text, and inputting the fusion feature vector into the first matching model to obtain a first matching result;
the result module is specifically configured to:
when the first matching result indicates that the matching is successful, determining that the target short text and the target long text are successfully matched;
when the first matching result is a matching failure, determining that the target short text and the target long text are matched unsuccessfully;
the first matching model is a pre-trained deep learning model used for matching long texts.
12. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the text matching method of any of claims 1-10.
13. A computer-readable storage medium, characterized in that the storage medium stores a computer program for executing the text matching method of any of the preceding claims 1-10.
CN202110253010.6A 2021-03-09 2021-03-09 Text matching method, device, equipment and medium Active CN112632232B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110253010.6A CN112632232B (en) 2021-03-09 2021-03-09 Text matching method, device, equipment and medium


Publications (2)

Publication Number Publication Date
CN112632232A CN112632232A (en) 2021-04-09
CN112632232B true CN112632232B (en) 2022-03-15

Family

ID=75297763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110253010.6A Active CN112632232B (en) 2021-03-09 2021-03-09 Text matching method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112632232B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113988085B (en) * 2021-12-29 2022-04-01 深圳市北科瑞声科技股份有限公司 Text semantic similarity matching method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245216A (en) * 2019-06-13 2019-09-17 出门问问信息科技有限公司 For the semantic matching method of question answering system, device, equipment and storage medium
CN110413988A (en) * 2019-06-17 2019-11-05 平安科技(深圳)有限公司 Method, apparatus, server and the storage medium of text information matching measurement
CN111897930A (en) * 2020-06-13 2020-11-06 南京奥拓电子科技有限公司 Automatic question answering method and system, intelligent device and storage medium
CN112182180A (en) * 2020-09-27 2021-01-05 京东方科技集团股份有限公司 Question and answer processing method, electronic equipment and computer readable medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10747427B2 (en) * 2017-02-01 2020-08-18 Google Llc Keyboard automatic language identification and reconfiguration
CN110222707A (en) * 2019-04-28 2019-09-10 平安科技(深圳)有限公司 A kind of text data Enhancement Method and device, electronic equipment
CN111368058B (en) * 2020-03-09 2023-05-02 昆明理工大学 Question-answer matching method based on transfer learning
CN112131338B (en) * 2020-06-05 2024-02-09 支付宝(杭州)信息技术有限公司 Method and device for establishing question-answer pairs
CN112037905A (en) * 2020-07-16 2020-12-04 朱卫国 Medical question answering method, equipment and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant