CN111914067A

CN111914067A - Chinese text matching method and system

Info

Publication number: CN111914067A
Application number: CN202010837271.8A
Authority: CN
Inventors: 俞凯; 吕波尔; 陈露; 朱苏
Original assignee: AI Speech Ltd
Current assignee: AI Speech Ltd
Priority date: 2020-08-19
Filing date: 2020-08-19
Publication date: 2020-11-10
Anticipated expiration: 2040-08-19
Also published as: CN111914067B

Abstract

The embodiment of the invention provides a Chinese text matching method. The method comprises the following steps: carrying out character-level coding on the Chinese sentence pairs by using a plurality of word segmentation tools to obtain initial character vectors of the Chinese sentence pairs; inputting an initial word vector of a Chinese sentence pair into an input layer, and determining semantic representation of the word vector based on a knowledge network external knowledge base; respectively carrying out iterative updating on word lattices of the semantic representation and the word vectors through a multidimensional graph attention network, and outputting the semantic word vectors with the semantic representation; inputting the semantic word vector into a sentence matching layer, and determining a final feature representation semantic word vector of the Chinese sentence pair; a match probability is determined for the feature representations of the chinese sentence pair based on the final feature representation semantic word vector of the chinese sentence pair and the plurality of word segmentation tools. The embodiment of the invention also provides a Chinese text matching system. According to the embodiment of the invention, the semantic information in the HowNet external knowledge base is integrated into the model, so that the semantic information in sentences is better utilized, and the matching effect is obviously improved.

Description

Chinese text matching method and system

Technical Field

The invention relates to the field of text matching, in particular to a Chinese text matching method and a Chinese text matching system.

Background

Text matching is an important basic problem in Natural Language Processing, and can be applied to a large number of NLP (Natural Language Processing) tasks, such as information retrieval, question-answering system, dialogue system, machine translation, and the like, which can be abstracted to a great extent as text matching problems. Word lattice convolutional neural network, two-way multi-angle matching in natural language sentences is commonly used for text matching.

Word lattice convolutional neural network: the word lattice is used as input, multiple CNN (Convolutional Neural Networks) convolution kernels are used on different n-gram texts to extract features, and the features are fused through a pooling mechanism to be used for text matching.

Bidirectional multi-angle matching in natural language sentences. The method uses words as input, each sentence is coded by a BilSTM (Bi-directional Long Short-Term Memory, bidirectional Long-Short Term Memory network), the characteristics of the two sentences are interacted by a plurality of methods, and a plurality of kinds of interactive information are combined for classification.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:

when the word lattice convolutional neural network is used, the features are derived from local information, the global information cannot be fused, and the model may lose remote information when the features of a certain position in a sentence are extracted. In addition, this technique uses only representations of words, and does not utilize semantic information.

When bidirectional multi-angle matching in natural language sentences is used, although interactive information between sentences can be obtained, because input is simple word segmentation, influence caused by inaccurate word segmentation can be introduced. In addition, this technique also does not utilize semantic information of the terms.

Disclosure of Invention

In order to solve at least the problem that in the prior art, only local information can be obtained by convolution of n-gram texts of a word lattice convolution neural network, word vector representation is used, and explicit semantic information is not included. When the two-way multi-angle is matched, the input unit uses a word segmentation tool to segment words, and the word segmentation tool cannot guarantee that the words are completely accurate.

In a first aspect, an embodiment of the present invention provides a method for matching a chinese text, including:

using a plurality of word segmentation tools to encode Chinese sentence pairs at a character level to obtain initial character vectors of the Chinese sentence pairs;

inputting the initial word vector of the Chinese sentence pair to an input layer, determining a word vector of the Chinese sentence pair, obtaining an antigen corresponding to the word vector based on a knowledge network external knowledge base, and determining semantic representation of the word vector;

inputting the word vectors and semantic representations of the Chinese sentence pairs into a graph transformation layer capable of sensing semantics, respectively carrying out iterative updating on the semantic representations and word lattices of the word vectors through a multidimensional graph attention network, and outputting semantic word vectors with semantic representations;

inputting the semantic word vector into a sentence matching layer, connecting the semantic word vector of the obtained Chinese sentence pair with an interactive semantic word vector, and determining a final feature representation semantic word vector of the Chinese sentence pair;

determining a match probability based on the final feature representation semantic word vector of the Chinese sentence pair and the feature representations of the Chinese sentence pair by the plurality of word segmentation tools.

In a second aspect, an embodiment of the present invention provides a chinese text matching system, including:

the coding program module is used for coding Chinese sentence pairs at a character level by using a plurality of word segmentation tools to obtain initial character vectors of the Chinese sentence pairs;

a semantic representation determining program module, configured to input an initial word vector of the chinese sentence pair to an input layer, determine a word vector of the chinese sentence pair, obtain an semantic corresponding to the word vector based on a knowledge network external knowledge base, and determine a semantic representation of the word vector;

an updating iterative program module, which is used for inputting the word vectors and semantic representations of the Chinese sentence pairs to a graph transformation layer capable of sensing semantics, respectively carrying out iterative updating on the semantic representations and word lattices of the word vectors through a multidimensional graph attention network, and outputting semantic word vectors with semantic representations;

a matching program module for inputting the semantic word vector to a sentence matching layer, connecting the semantic word vector of the Chinese sentence pair with an interactive semantic word vector, and determining a final feature representation semantic word vector of the Chinese sentence pair;

a probability determination program module for determining a matching probability based on the final feature representation semantic word vector of the Chinese sentence pair and the feature representations of the Chinese sentence pair by the plurality of word segmentation tools.

In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the chinese text matching method of any of the embodiments of the present invention.

In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the chinese text matching method according to any embodiment of the present invention.

The embodiment of the invention has the beneficial effects that: combining the word segmentation results of various word segmentation tools to construct a word lattice diagram as model input. Semantic information in an external knowledge base of HowNet is fused into the model, so that the model can better utilize the semantic information in sentences. The graph transformation model with enhanced semantic knowledge is used, and experiments prove that compared with a word lattice convolutional neural network, a sentence bidirectional multi-angle matching base line and the like, the model has obvious performance improvement.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flow chart of a method for matching Chinese text according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating an overall product structure of a method for matching Chinese texts according to an embodiment of the present invention;

FIG. 3 is a diagram of a semantic information structure of a Chinese text matching method according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating updated semantic representations and word vectors of a Chinese text matching method according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of word segmentation and potential word ambiguity of a Chinese text matching method according to an embodiment of the present invention;

FIG. 6 is a graph of performance data of different models of a Chinese text matching method on LCQMC and BQ test data sets according to an embodiment of the present invention;

FIG. 7 is a graph of performance data using different segments on the LCQMC test data set for a Chinese text matching method according to an embodiment of the present invention;

fig. 8 is a schematic structural diagram of a chinese text matching system according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart of a method for matching a chinese text according to an embodiment of the present invention, which includes the following steps:

s11: using a plurality of word segmentation tools to encode Chinese sentence pairs at a character level to obtain initial character vectors of the Chinese sentence pairs;

s12: inputting the initial word vector of the Chinese sentence pair to an input layer, determining a word vector of the Chinese sentence pair, obtaining an antigen corresponding to the word vector based on a knowledge network external knowledge base, and determining semantic representation of the word vector;

s13: inputting the word vectors and semantic representations of the Chinese sentence pairs into a graph transformation layer capable of sensing semantics, respectively carrying out iterative updating on the semantic representations and word lattices of the word vectors through a multidimensional graph attention network, and outputting semantic word vectors with semantic representations;

s14: inputting the semantic word vector into a sentence matching layer, connecting the semantic word vector of the obtained Chinese sentence pair with an interactive semantic word vector, and determining a final feature representation semantic word vector of the Chinese sentence pair;

s15: determining a match probability based on the final feature representation semantic word vector and feature representations of the Chinese sentence pair.

In the embodiment, the overall model for Chinese text matching comprises an input layer and a graph transformation layer of a graph transformation network, so that the model can notice the graph information formed by the whole sentence and not only the local information; meanwhile, an external knowledge base of HowNet (HowNet) is introduced, so that semantic information can be merged into the model, and the matching performance of the model is further improved as shown in FIG. 2.

For step S11, a plurality of word segmentation tools, such as jieba, pkuseeg, and thulac, are prepared in advance. Since text matching is involved, typically one sentence of the sentence pair is a question entered by the user and the other sentence is a question in the question text library. Thereby determining whether the two sentences match. And performing multiple word segmentation on the sentence pairs by using a plurality of prepared word segmentation tools to obtain all word segmentation results. When encoding, a pre-training model BERT (Bidirectional Encoder Representation from converters) can be used to encode two sentences at a character level, so as to obtain an initial word vector Representation.

For step S12, it is then necessary to obtain word vectors, each word containing several words, and the weights of each word are obtained through a feed-forward neural network and weighted to obtain the word vectors.

Semantic information in the HowNet external knowledge base is then introduced. As shown in FIG. 3, each word may have multiple meanings, each meaning being modified by some sememe. HowNet takes the sememe as the smallest unit of the expression, and has a total of 1985 sememes in the knowledge base. The vector representation of each meaning is obtained by pooling weighting also from the vector representation of the sememe, which is written as semantic representation below. There may be multiple semantic representations for each word because some chinese words may be ambiguous words.

For step S13, in the graph transformation layer module capable of sensing semantics, the semantic representation and the word vector representation are updated iteratively in sequence. The update uses MD-GAT (multi-dimensional graph attention transform network). Among these, the multidimensional graph attention transforms the network, which is a network used many times in the method, and is applied as a function below. Both word lattice and sequence of sememes are considered as a graph.

To obtain the representation after the first update, the node x is used_jConnected node representation and x_jUpdating the representation of the previous round by itself. The update is the weighting of the connected nodes and their representations.

As an embodiment, the method further comprises: and iteratively updating the semantic representation through an reachable node of a word node corresponding to the semantic representation, iteratively updating a word lattice of the word vector through a semantic node corresponding to the word node, and outputting the semantic word vector with the semantic representation.

In the present embodiment, as shown in fig. 4, when updating the semantic representation, reachable node representation of the word node corresponding to the semantic representation is used as information, and a Gate loop Unit (GRU) is used as a control Unit for history information and new information. When the word vector is updated, the semantic node representation corresponding to the word node is used as information, and the GRU is also used as a control unit for history information and updated new information. The GRU may thus retain a portion of the useful history information. The module can be iterated for a plurality of times in the model, and experiments prove that the effect of twice iteration is the best.

For step S14, inputting the semantic word vector determined in step S13 to the sentence matching layer includes:

performing pooling weighting on the semantic word vector to obtain a weighted semantic word vector;

normalizing the weighted semantic word vector and the initial word vector to determine a semantic word vector;

interacting the semantic word vectors of the Chinese sentence pairs through a multidimensional graph attention network to obtain interactive semantic word vectors;

and inputting the semantic word vector and the interactive semantic word vector of the Chinese sentence pair into a feed-forward neural network to generate a final feature expression semantic word vector.

In the present embodiment, a word-level vector representation is first obtained by pooling weighting the updated word vectors, and the pooled input is all the words including the word. And then, normalizing the word vector and the initial word vector representation obtained by BERT coding to obtain a new word vector. In the module, not only the operation of multidimensional graph attention transformation is carried out on sentences, but also interaction is carried out among the sentences, and the interaction also uses the multidimensional graph attention transformation. The sentence information and the interactive information are connected, and a final word vector representation is obtained through a two-layer feedforward neural network. And obtaining sentence vectors through pooling weighting.

For step S15, the vectors of the two sentences and their dot products and absolute values are connected, and then the BERT encoded feature representation is connected, and together, the classification probability is obtained by a two-layer feedforward neural network and an activation function.

p＝FFN([c^CLS，r^a，r^b，r^a⊙r^b，|r^a-r^b|])

Wherein r is^aAnd r^bAre vectors of two sentences, respectively, c^CLSIs a characteristic representation after BERT coding.

It can be seen from this embodiment that the segmentation results of multiple segmentation tools are combined to construct a word lattice diagram as the model input. Semantic information in an external knowledge base of HowNet is fused into the model, so that the model can better utilize the semantic information in sentences. The graph transformation model with enhanced semantic knowledge is used, and experiments prove that compared with a word lattice convolutional neural network, a sentence bidirectional multi-angle matching base line and the like, the model has obvious performance improvement.

The method is described in detail, wherein a pre-trained language model, such as BERT, has shown powerful performance on various natural language processing tasks, including text matching. For Chinese text matching, BERT takes as input a pair of sentences, each Chinese character being a separate input token. It ignores the word information. To address this problem, some chinese variants of the original BERT have been proposed, such as BERT-wwm and ERNIE (two prior art). However, the pre-training process of BERT considering words requires a lot of time and resources. Therefore, the model of the method adopts a pre-trained language model as initialization, and fine-tunes the model by using word information.

The Hownet is an external knowledge base that manually labels each Chinese word sense with one or more related senses. The Hownet treats an sememe as an atomic semantic unit. Unlike WordNet, it emphasizes that various parts and attributes of the concept can be well represented in the sense. Hownet finds wide application in many natural language processing tasks, such as word similarity calculation, emotion analysis, word representation learning, and language modeling. However, people have less research on their effectiveness in short text matching tasks, especially in combination with pre-trained language models.

GAT (graph attention networks) is a special type of network that handles graph structure data through an attention mechanism. Given a graph

Where V and are respectively node x_iA set of edges and a set of edges. N is a radical of^(xi)Is a collection comprising nodes xi themselves and nodes directly connected by xi.

Each node xi in the graph has an initial feature vector

Where d is the feature dimension. The representation of each node is iteratively updated by a graph attention operation. In the l-th step, each node xi aggregates context information by aggregating its neighbors and itself. Updated representation

Is calculated from the weighted average of the connected nodes,

wherein

Is a learnable parameter, while σ () is a non-linear activation function, such as ReLU. Attention coefficient

Is two nodes x in a unified space_iAnd x_jEmbedded normalized similarity between them.

Wherein the content of the first and second substances,

and

are the projected learnable parameters.

Note that, in the above-described formula,

is a scalar, which means that all units

Are treated equally. This may limit the ability to model complex dependencies. Instead of general attention, multidimensional attention has proven useful in dealing with context changes and ambiguous problems in many NLP tasks. For each embedded

It does not compute a single scalar score

Instead, a feature score vector is first computed, then normalized with a feature-sensitive multidimensional softmax,

wherein

Is a function of the similarity in the equation

A scalar of the calculation.

Is a vector. The addition in the above equation means that a scalar is added to each element of the vector.

For modeling the pairwise dependencies of two nodes, and

for estimating

The contribution of each of the feature dimensions of (a),

wherein the content of the first and second substances,

and

are learnable parameters. According to the score vector

The corresponding modifications are as follows:

wherein |, indicates the element-by-element product of two vectors. For simplicity, the update process using the multidimensional attention mechanism is represented using MD-GAT (·), as follows,

after L steps of updating, each node finally has a context-aware representation

To obtain a stable training process, a residual join is also used, followed by layer normalization between the two graph attention layers.

In the specific implementation, a Chinese short text matching task is defined, and two Chinese sentences are given.

And

text matching model

Is to predict C^aAnd C^bWhether the semantics of (c) are equal. In this case, the amount of the solvent to be used,

and

respectively representing the tth and tth' Chinese characters in two sentences, and T_aAnd T_bRepresenting the number of characters in the sentence.

In the method, a matching model with enhanced language knowledge is provided. Rather than dividing each sentence into a sequence of words, we have all possible paths of segmentation to form a word lattice diagram

V is a set of nodes and is a set of edges. Each node

Corresponding to the word w_iWord w_iIs from the t-th in the sentence₁Character to tth₂A sub-sequence of characters starting with a character. As shown in FIG. 5, w can be obtained by retrieving HowNet_iAll the meanings of (a).

For two nodes

And

if x_iWith x in the original sentence_jAdjacent, there is an edge between them.

Is composed of x_iItself and the set of all reachable nodes in its forward direction, and

is composed of x_iItself and all reachable nodes backward from it.

Using two graphs

And

for two sentences C^aAnd C^bThe graph matching model predicts the similarity between the two to judge the original sentence C^aAnd C^bWhether or not to have the same meaning. As shown in fig. 2, the model consists of four parts: an input module, a semantic perception graph converter (SaGT), a sentence matching layer and a relationship classifier. The input module outputs each word w_iAnd an initial semantic representation of each meaning. The semantic perception map converter iteratively updates the word representations and the semantic representations and fuses the useful information of each other. The sentence matching layer firstly blends word representation into the characters, and then adopts a bilateral multi-view matching mechanism to match two character sequences. The relationship classifier takes a sentence vector as input and predicts the relationship between two sentences.

Context embedding in input Module for each node x in the graph_iWord w_iIs a centralized pool of context character representations. Specifically, the original character-level sentences are first concatenated to form a new sequence

They are then provided to the BERT model to obtain a contextual representation of each character

Hypothesis word w_iConsisting of successive character marks

Using for each role c_k(t₁≤k≤t₂) A feed-forward network (FFN) with two layers computes feature-oriented fractional vectors, i.e., then normalized using feature-oriented multidimensional softmax (MD-softmax),

u_k＝MD-softmax_k(FFN(c_k))

corresponding character embedding c_kBy normalized fraction u_kThe weighting results in a context word embedding,

for simplicity, the above formula abbreviation is rewritten using Att-Pooling (. cndot.), i.e.:

v_i＝Att-Pooling({|c_k|t₁≤k≤t₂})

and (4) embedding the sememes, wherein the method uses the Hownet as an external knowledge base to express the semantic information of the vocabulary. In view of ambiguity, HowNet distinguishes between different meanings for each word in an ambiguate. An example is given in figure 3. The word "Apple (Apple)" has two meanings, including "Brand (Apple Brand)" and "fruit (Apple)". The term "brand (apple brand)" has five phonetic names including "computer", "pattern value", "energy", "carrying (Bring)" and "specific brand".

For each word w_iBy using

s_i,kIs w_iIn the k-th sense, we represent its corresponding sememes as a set

To obtain each semantic s_i，kIs embedded in_i，kWe first get each of the sememes using a multidimensional attention function

The expression of (a):

wherein

Is sense generated by SAT model

The embedded vector of (2). For each semantic s_i,kWhich is embedded in s_i,kObtained by an attentive aggregation of all phoneme representations,

semantically aware graph converter, for each node x in the graph_iInsert word v_iContaining only context information, without explicit knowledge of language, but embeddedS is_i,kOnly containing linguistic knowledge and no contextual information. In order to obtain useful information therefrom, a semantic perception map converter (SaGT) is proposed. The algorithm first finds vi and s_i，kRespectively as word w_iInitial word representation of

And initial sense representation of word sense

The word representation and meaning representation are then iteratively updated in two sub-steps.

Updating sensory representation in the first iteration, the first substep is to remove the sensory representation from

Is updated to

For words with multiple meanings, it is usually determined which meaning should be used depending on the context in the sentence. Thus, when updating the representation, each sensation will first be from x_iTo aggregate useful information in the words in the front-rear direction of the word,

where the two multidimensional attention functions MD-GAT (-) have different parameters. Based on

Each sensory unit updates its representation with a Gate Round Unit (GRU),

the detailed update function of the GRU is as follows:

wherein W_z，W_r，W_g，b_z，b_rAnd b_gAre learnable parameters.

Notably, we do not use it directly

New representation form as semantic information

Middle S_i,kThe reason is that

Only the context information is contained, and a gate such as a GRU is needed to control the fusion of the context information and the semantic information.

Updating the word representation the second sub-step is based on the updated meaning representation

Representing words from

Is updated to

w_iThis word first obtains semantic information from its semantic representation,

its representation is then updated with GRU:

said GRU function and said

The GRU function of (a) has different parameters.

After multiple iterations, the final word representation

Not only contains context word information, but also contains semantic knowledge. For each sentence, we use

And

to represent the final representation.

Sentence matching layer, in obtaining semantic knowledge enhanced word representation of each sentence

And

we then incorporate this word information into the character. Without loss of generality, we will use sentence C^aThe character in (a) to describe the process. For each character

By including characters

All the representations of the words are combined, and useful word information can be obtained

Thus, semantic knowledge enhanced character representation y_tCan be obtained by:

wherein LayerNorm (·) indicates layer normalization, and

is a context character representation obtained using BERT described in Sec.

For each character

It uses multidimensional attention to get from C respectively^aAnd C^bThe information is aggregated in the statement,

the above multidimensional attention function MD-GAT (-) shares the same parameters. With this sharing mechanism, the model has very good properties, having when two sentences match completely

Where k ∈ {1, 2, …, P } (P is the number of views).

Is a parameter vector that assigns different weights to different angles of the message. Using P distances d₁，d₂，…，d_PWe can get the final character representation,

wherein

And FFN (-) is a feed-forward network with two layers.

Similarly, we can obtain sentence C^bEach character in

Final character representation of

Note that the final character representation contains three types of information: context information, knowledge representation of words and word senses, and similarity at the character level. For each sentence C^aOr C^bObtaining a sentence representation vector r using attention weighting of all final character representations of the sentence^aOr r^b。

A relation classifier using two sentence vectors r^a，r^bAnd vector c obtained with BERT^CLSThe model will predict the similarity of the two sentences,

where FFN (-) is a feed-forward network with two hidden layers and a Sigmoid activation function is used after the output layer. With N training samples

The training goal is to minimize the loss of binary mutual entropy,

wherein y is_i∈[0，1]Is the label of the ith training sample, p_i∈[0，1]Is that our model is paired in sentences

As input for the prediction.

The above method was tested, and the data set we performed semantic text similarity experiments on two Chinese data sets: LCQMC and BQ.

LCQMC is a problem-matching corpus with large-scale open domains. It consists of 260068 chinese sentence pairs, including 238766 training samples, 8802 validation samples, and 12500 test samples. Each pair is associated with a binary label indicating that the two sentences have the same meaning or the same intent. The positive samples were 30% more than the negative samples.

BQ is a domain-specific large-scale corpus used for bank problem matching. It is composed of 12 ten thousand Chinese sentence pairs, including 10 ten thousand training samples, 1 ten thousand verification samples and 1 ten thousand test samples. Each pair is also associated with a binary label indicating whether or not two sentences have the same meaning. The number of positive and negative samples is the same.

The evaluation index, Accuracy (ACC), and F1 score of each data set were used as evaluation indexes. Accuracy is the percentage of examples correctly classified. The F1 score of the match is the harmonic mean of accuracy and recall.

The hyper-parametric input word graph is formed by combining three word segmentation tools (jieba, pkuseeg and thulac). We use 200-dimensional pre-training antigen embedding provided by open HowNet. The number of graph update steps/layers L is 2 on both data sets. The dimensions of the word and word sense representation are 128. The discard rate for all hidden layers is 0.2. Each model was trained by RMSProp with an initial learning rate of 0.0005, the learning rate of the BERT layer multiplied by an additional factor of 0.1. For batch size, LCQMC used 32 and BQ used 64.

The model of the method was compared to three types of baselines: representation-based models, interaction-based models, and BERT-based models. The results are summarized in fig. 6. In order to ensure the reliability of the experimental results, the same experiment was performed five times, and an average score was given. Experiments were performed on their own using the parameters.

The representation-based model includes text CNN, BilSTM, Lattice-CNN, and model LET-1. The text CNN is a connected structure whose Convolutional Neural Networks (CNNs) are used to encode each sentence. BilSTM is another Siamese architecture with a bidirectional long short term memory (Bi-LSTM) for encoding each sentence. Lattice-CNN has also been proposed to address the potential Chinese word segmentation problem. The algorithm takes a word lattice as input, and combines feature vectors generated by a plurality of CNN cores into different n-element contexts of each node in a lattice graph by using a pool mechanism. LET-1 is a variant of the model we propose. Here, the BERT is replaced with a conventional character-level converter encoder and the interaction between sentences is eliminated at the sentence matching layer.

From FIG. 6, we can find that our LET-1 outperforms all baseline models on both data sets. More specifically, LET-1 has better performance than Lattice-CNN. Although they all utilize word lattices, Lattice-CNN focuses only on local information, while our model can utilize global information. In addition, the semantic messages are added between sentences in the model, so that the performance of the model is greatly improved. Furthermore, both Lattice-based models (Lattice-CNN and LET-1) performed better than the other two baselines (Text-CNN and BilSTM), indicating that word lattices must be used on this task.

The interaction-based model includes two benchmarks BiMPM, ESIM and our model LET-2. BiMPM is a bilateral multi-view matching model. It encodes each sentence using BiLSTM and matches two sentences from multiple angles. BiMPM performs well on certain natural language reasoning (NLI) tasks. There are two BilSTMs in the ESIM. The first is to encode a sentence, and the other is to fuse word alignment information between two sentences. The ESIM can achieve up-to-date results on various matching tasks. LET-2 is also a variant of LET. BERT is replaced by a conventional character-level converter encoder, similar to LET-1, but here we introduce an interaction mechanism.

The results of the above three models are shown in the second part of fig. 6. Our LET-2 performs better than the other models. In particular, LET-2 performs better than BiMPM, although they both use a multi-view matching mechanism. This suggests that our neural networks are powerful with word lattices. Furthermore, LET-2 performs better than LET-1, indicating the usefulness of character level comparisons in two sentences.

The BERT based model includes four baselines: BERT, BERT wwm ext and ERNIE. We compared them with our proposed model LET-3. BERT is the chinese official BERT model issued by Google. BERT-wwm is a Chinese BERT that uses a full word masking mechanism during the pre-training process. BERT wwm ext is a variation of BERT wwm, using more training data and training steps. ERNIE is designed to learn linguistic representations enhanced by knowledge masking strategies, including entity-level masking and phrase-level masking. LET-3 is the LET model we propose with BERT as a character-level encoder.

The results are shown in the third part of fig. 6. We can find that all three variants of BERT (BERTwwm, BERT-wwn-ext, ERNIE) exceed the original BERT, indicating that the use of word-level information during pre-training is very important for the Chinese matching task. The performance of our LET-3 model is superior to all of these BERT-based models. The result shows that the semantic information is an effective method for improving the Chinese semantic matching of BERT when LET is used for fine tuning the phrase.

To verify the validity of expressing lexical semantic information in conjunction with the Hownet, we performed experiments on the LCQMC test set using LET-3. In a comparison model without the knowledge of the Hownet, a word sense updating module in the SaGT is removed, and word representation is updated through multi-dimensional self-attention. Fig. 7 lists the results of the experiments in the two models for the three word segmentation methods and their combinations (lattices). For various types of segmentation, the performance of integrating semantic information is superior to that of simple word representation. The LET can acquire semantic information from the Hownet, and the performance of the model is improved. More specifically, when using semantic information, the accuracy and F1 score of the word lattice based model improved on average by 0.71% and 0.43%, respectively. Thus, the semantic information performs better on a lattice-based model. The possible reason is that the lattice-based model contains more possible words so that the model can perceive more meaning.

Furthermore, we designed an experiment to explore the impact of using different segment inputs. Using a graph of performance data of different segments on the LCQMC test data set as shown in FIG. 7, a significant improvement can be seen between the lattice-based model (lattice) and the word-based model, pkuseg and thulac. We believe this is because the lattice-based model can reduce the impact of word segmentation errors, making the prediction more accurate.

We also investigated the role of GRU in the SaGT. The average accuracy of GRU removal in the model is 87.82%, indicating that the GRU can control and integrate the history information with the current information. Through experimentation, we found that the model with two layers of SaGT achieved the best results. This indicates that multiple information fusion will optimize the message and make the model more robust.

The method provides a new graph converter with enhanced language knowledge, which is used for matching Chinese short texts. The model takes two word format graphs as input and integrates the semantic information of the Hownet to solve the word ambiguity problem to a certain extent. The method was evaluated on two reference datasets to achieve the best performance. Ablation analysis also shows that both semantic information and multi-granular information are important for text matching modeling.

Fig. 8 is a schematic structural diagram of a chinese text matching system according to an embodiment of the present invention, which can execute the chinese text matching method according to any of the above embodiments and is configured in a terminal.

The system for matching Chinese texts provided by the embodiment comprises: an encoding program module 11, a semantic representation determining program module 12, an update iterator module 13, a matching program module 14 and a probability determining program module 15.

The encoding program module 11 is configured to perform word-level encoding on a chinese sentence pair using a plurality of word segmentation tools to obtain an initial word vector of the chinese sentence pair; the semantic representation determining program module 12 is configured to input an initial word vector of the chinese sentence pair to an input layer, determine a word vector of the chinese sentence pair, obtain an semantic corresponding to the word vector based on a knowledge network external knowledge base, and determine a semantic representation of the word vector; the updating iterative program module 13 is configured to input the word vectors and semantic representations of the chinese sentence pairs to a graph transformation layer capable of perceiving semantics, perform iterative updating on the word lattices of the semantic representations and the word vectors respectively through a multidimensional graph attention network, and output semantic word vectors with semantic representations; the matching program module 14 is configured to input the semantic word vector to a sentence matching layer, connect the semantic word vector of the obtained chinese sentence pair with an interactive semantic word vector, and determine a final feature representation semantic word vector of the chinese sentence pair; the probability determination program module 15 is configured to determine a matching probability based on the final feature representation semantic word vector of the chinese sentence pair and the feature representations of the chinese sentence pair by the plurality of word segmentation tools.

Further, the matching program module is configured to:

Further, the update iterator module is configured to:

and iteratively updating the semantic representation through an reachable node of a word node corresponding to the semantic representation, iteratively updating a word lattice of the word vector through a semantic node corresponding to the word node, and outputting the semantic word vector with the semantic representation.

Further, the semantic representation determiner module is configured to:

determining the weight of each word in the initial word vector through a feedforward neural network, and weighting the initial word vector based on the weight to obtain the word vector.

Further, the system is also configured to: and performing iterative updating on the semantic representation and the word vector through a gating circulation unit.

The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the Chinese text matching method in any method embodiment;

as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

determining a match probability based on the final feature representation semantic word vector and feature representations of the Chinese sentence pair.

As a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the methods in embodiments of the present invention. One or more program instructions are stored in a non-transitory computer readable storage medium that, when executed by a processor, perform the chinese text matching method in any of the method embodiments described above.

The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the chinese text matching method of any of the embodiments of the present invention.

The client of the embodiment of the present application exists in various forms, including but not limited to:

(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.

(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.

(3) Portable entertainment devices such devices may display and play multimedia content. The devices comprise audio and video players, handheld game consoles, electronic books, intelligent toys and portable vehicle-mounted navigation devices.

(4) Other electronic devices with data processing capabilities.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A Chinese text matching method comprises the following steps:

2. The method of claim 1, wherein the inputting the semantic word vector to a sentence matching layer, concatenating a semantic word vector and an interactive semantic word vector resulting in the chinese sentence pair comprises:

3. The method of claim 1, wherein the iteratively updating the semantic representation, the word lattice of the word vector, respectively, through a multidimensional graph attention network comprises:

4. The method of claim 1, wherein the inputting an initial word vector of the pair of chinese sentences to an input layer, determining a word vector of the pair of chinese sentences comprises:

5. The method of claim 1, wherein the method comprises: and performing iterative updating on the semantic representation and the word vector through a gating circulation unit.

6. A chinese text matching system, comprising:

7. The system of claim 6, wherein the matching program module is to:

8. The system of claim 6, wherein the update iterator module is to:

9. The system of claim 6, wherein the semantic representation determiner module is to:

10. The system of claim 6, wherein the system is further configured to: and performing iterative updating on the semantic representation and the word vector through a gating circulation unit.