CN116304745A - Text topic matching method and system based on deep semantic information - Google Patents

Text topic matching method and system based on deep semantic information

Info

Publication number
CN116304745A
CN116304745A (application CN202310324759.4A)
Authority
CN
China
Prior art keywords
text
semantic information
deep semantic
entity
target news
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310324759.4A
Other languages
Chinese (zh)
Other versions
CN116304745B (en)
Inventor
纪科
张秀
杨波
陈贞翔
马坤
孙润元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Jinan
Original Assignee
University of Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Jinan filed Critical University of Jinan
Priority to CN202310324759.4A priority Critical patent/CN116304745B/en
Publication of CN116304745A publication Critical patent/CN116304745A/en
Application granted granted Critical
Publication of CN116304745B publication Critical patent/CN116304745B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text topic matching method and system based on deep semantic information, belonging to the technical field of text matching. The method extracts entities from a text using a named entity recognition model and screens them through feature engineering to obtain the key entities of the text; performs text summarization through a BART model to obtain the main information of the text; and finally carries out feature fusion on the text abstract and the key entities to obtain a deep semantic information feature vector, inputs the deep semantic information and the target news text into a preset text topic matching model, and obtains a text topic matching result. This improves text topic matching accuracy and addresses three problems in the prior art: external knowledge irrelevant to the text topic easily misleads judgment of the current topic, key text information is easily lost, and long-text matching performs poorly.

Description

Text topic matching method and system based on deep semantic information
Technical Field
The invention relates to the technical field of text matching, in particular to a text topic matching method and system based on deep semantic information.
Background
The statements in this section merely relate to the background of the present disclosure and may not necessarily constitute prior art.
With the continuous development of modern technology, people's lives are closely tied to the internet. Text semantic matching is the basis of many natural language processing tasks, and text semantic matching techniques are required in many scenarios, such as information retrieval. Different software products have different requirements, so semantic matching must serve different dimensions of need; general-purpose text matching cannot be applied to all products, which makes topic matching particularly important.
Topic detection for text is still immature, and the enormous workload is difficult to handle by manual inspection alone. Realizing automatic matching of text topics through an algorithmic model has therefore become a hot research problem. Topic matching of texts of different lengths is treated as a text matching problem: the similarity of texts is judged by representing their semantic information. Text matching algorithms have undergone a transition from shallow statistical learning models to deep learning models. In recent years, researchers have modeled text matching tasks with representation-based and interaction-based models such as LSTM, ESIM, and BERT, improving both matching performance and speed. Compared with shallow statistical learning, deep learning has better learning capability, avoids manually designed rules and features, and can learn feature representations directly from the input.
In daily life, people browse news every day; each news item has its own topic and specific description. When a user searches for news, identifying the specific topic first and ranking similar texts greatly improves search accuracy and brings great convenience to the user.
However, although neural network models achieve good results in text matching, their shortcomings remain non-negligible. Deep learning text matching models have developed toward fine-grained matching of words and sentences; sentence semantics are mined too deeply, and the matching effect on relatively specific topics is poor. A large amount of information irrelevant to the news topic increases data noise, and external knowledge irrelevant to the text's subject matter easily misleads judgment of the current topic, which may affect the topic matching result. Because pre-trained language models such as BERT are used, long texts are generally truncated to the first 512 tokens, so key text information is easily lost and long-text matching performs poorly.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a text topic matching method, system, electronic device, and computer-readable storage medium based on deep semantic information. Entities are extracted from a text through a named entity recognition model and then screened through feature engineering to obtain the key entities of the text; text summarization is performed through a BART model to obtain the main information of the text; finally, feature fusion is carried out on the text abstract and the key entities to obtain a deep semantic information feature vector, and the deep semantic information and the target news text are input into a preset text topic matching model to obtain a text topic matching result. On the premise of ensuring accuracy, text topic matching is performed rapidly and efficiently.
In a first aspect, the invention provides a text topic matching method based on deep semantic information;
a text topic matching method based on deep semantic information comprises the following steps:
acquiring a target news text, inputting the target news text into a preset named entity recognition model for processing, and acquiring an entity of the target news text;
screening the entities through feature engineering to obtain key entities;
inputting the target news text into a preset pre-training language model for processing to obtain a text abstract;
and carrying out feature fusion on the text abstract and the key entity to obtain a feature vector of the deep semantic information, inputting the deep semantic information and the target news text into a preset text topic matching model, and obtaining a text topic matching result.
Further, inputting the target news text into a preset named entity recognition model for processing and acquiring the entities of the target news text includes:
vectorizing the target news text to obtain an initial representation vector of each word in the target news text;
extracting features of the initial representation vectors to obtain feature vectors of each sentence in the target news text;
constructing an information matrix, and carrying out convolutional encoding on the feature vector based on the information matrix to obtain different grid characterizations;
and predicting word pair relations from the grid representations through a predictor, and acquiring the entities of the target news text.
Preferably, the information matrix includes a distance information matrix for representing a distance between each word in the word pair, a word pair information matrix for representing the word pair output through the norm layer, and a region information matrix for representing a region where the word pair is located.
Further, the screening the entity through the feature engineering, and obtaining the key entity includes:
acquiring a first weight of each entity according to the entity; calculating the word frequency of each entity, and acquiring a second weight of each entity according to the word frequency; identifying outlier words among the entities (entities that do not group with the others), and acquiring a third weight of each entity; calculating the similarity between sentences in the target news text and the entities, and acquiring a fourth weight of each entity;
and acquiring the combined characteristic weight of each entity according to the first weight, the second weight, the third weight and the fourth weight, and sequencing the entities according to the combined characteristic weight to acquire the key entity.
Further, the pre-trained language model is a BART model.
Further, the feature fusion of the text abstract and the key entity is performed to obtain a deep semantic information feature vector, the deep semantic information feature vector and the target news text are input into a preset text topic matching model, and the obtaining of the text topic matching result comprises:
performing feature fusion on the text abstract and the key entity through an LSTM network to obtain a deep semantic information feature vector;
acquiring a deep semantic information splicing vector according to the deep semantic information feature vector;
acquiring a text splicing vector according to the target news text;
splicing the deep semantic information splicing vector and the text splicing vector to obtain a splicing vector;
and inputting the spliced vector into a softmax layer for processing, and obtaining a text topic matching result.
Preferably, the obtaining the deep semantic information splicing vector according to the deep semantic information feature vector includes:
performing element-wise subtraction on the deep semantic information feature vectors and taking the absolute value to obtain the difference between the deep semantic information feature vectors, and splicing this difference with the deep semantic information feature vectors to obtain the deep semantic information splicing vector.
In a second aspect, the invention provides a text topic matching system based on deep semantic information;
a text topic matching system based on deep semantic information, comprising:
a key entity acquisition module configured to: acquiring a target news text, inputting the target news text into a preset named entity recognition model for processing, and acquiring an entity of the target news text; screening the entities through feature engineering to obtain key entities;
a text summary acquisition module configured to: inputting the target news text into a preset pre-training language model for processing to obtain a text abstract;
a text topic matching module configured to: and carrying out feature fusion on the text abstract and the key entity to obtain a feature vector of the deep semantic information, inputting the deep semantic information and the target news text into a preset text topic matching model, and obtaining a text topic matching result.
In a third aspect, the present invention provides an electronic device;
an electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the above-described text topic matching method based on deep semantic information.
In a fourth aspect, the present invention provides a computer-readable storage medium;
a computer readable storage medium storing computer instructions that, when executed by a processor, perform the steps of the above-described text topic matching method based on deep semantic information.
Compared with the prior art, the invention has the beneficial effects that:
1. according to the technical scheme, aiming at the problem of poor matching effect of the text topics, the text semantic information is deeply mined by abstracting the main content of the text and extracting the main description objects of the text, namely key entities, so that matching between short texts can be processed, matching between long texts can be processed, topic matching detection can be performed even between texts with different lengths, and the matching effect of the text topics is improved.
2. According to the technical scheme provided by the invention, based on a named entity recognition technology, the extracted entity is further screened through feature engineering to find the key entity which is more in line with text description, and the main description object is effectively distinguished from irrelevant content, so that the summarization capability of the model on topics is improved.
3. According to the technical scheme provided by the invention, key point information in the text is extracted, summarized, and refined based on the text summarization technique, which effectively shortens the text. Since the pre-trained language model BERT can only process the first 512 tokens of text, the main information of the text is summarized through the text summarization technique and then encoded, improving matching precision and solving the problem of poor topic matching between long texts and between texts of different lengths.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
Fig. 1 is a flow diagram of a text topic matching method based on deep semantic information provided by an embodiment of the present invention;
fig. 2 is a schematic diagram of an overall network architecture according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a network architecture of a named entity recognition model according to an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the present invention. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. Furthermore, the terms "comprises" and "comprising" and any variations thereof are intended to cover non-exclusive inclusion: processes, methods, systems, products, or devices that comprise a series of steps or units are not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to them.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Term interpretation:
text matching: describing the relation between the two text sections, and whether the relation points to the same semantic;
text topic matching: whether the main topics described by the two sections of text are consistent;
named entity: entities with specific meaning or strong meaning in the text generally comprise names of people, places, institutions, dates and times and proper nouns;
named entity identification: named entities in the text are identified.
Example 1
The neural network models in the prior art have poor effect on text topic matching: they are easily influenced by interference factors, easily lose key text information, and their topic judgment is easily misled. Therefore, the invention provides a text topic matching method based on deep semantic information.
Next, a text topic matching method based on deep semantic information disclosed in this embodiment is described in detail with reference to fig. 1 to 3, and includes the following steps:
s1, acquiring a target news text, inputting the target news text into a preset named entity recognition model for processing, and acquiring an entity of the target news text. The named entity model is a named entity model based on W2NER, and a tag sequence of the target news is predicted and generated through operations such as feature extraction of the expression vector; and extracting entity labels in the label sequence to obtain an entity of the target news text. The method comprises the following specific steps:
s101, vectorizing the target news text through the BERT model to obtain an initial representation vector of each word of the target news text.
Specifically, a BERT pre-training model is used to obtain a representation vector of text.
Splitting the target news text into a sequence of word units (tokens), the input vector of each token consists of three parts: a word vector (token embedding), a clause vector (segment embedding), and a position vector (position embedding).
Token embedding converts each word in the text into a vector of fixed dimension. In the BERT pre-training model, each word is converted into a 768-dimensional vector representation. For each token in the text, the corresponding index is found in a pre-established index dictionary, and the index is looked up in a lookup table to obtain the token's embedding.
Segment embedding is used to distinguish the two sentences in a sentence pair. During tokenization, a [CLS] identifier is added at the beginning of a sentence and a [SEP] identifier at its end. This embedding layer has only two vector representations, 0 and 1: tokens of the first sentence in the pair are assigned 0, and tokens of the second sentence are assigned 1. If only one sentence is input, its segment embedding is all 0.
Since the Transformer, unlike an RNN (recurrent neural network), has no inherent ability to capture the order of the whole sentence, the BERT model adds a position embedding to each position so that word order is better understood.
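The composition of BERT's input vectors described above can be sketched as follows; the vocabulary size, sequence length, and dimensions here are illustrative toy values, not the real BERT configuration (which uses 768-dimensional embeddings and a ~30k-token vocabulary):

```python
import numpy as np

VOCAB, MAX_LEN, SEGMENTS, DIM = 10, 8, 2, 4
rng = np.random.default_rng(42)
tok_emb = rng.normal(size=(VOCAB, DIM))      # token embedding lookup table
seg_emb = rng.normal(size=(SEGMENTS, DIM))   # segment embedding: 0 = first sentence, 1 = second
pos_emb = rng.normal(size=(MAX_LEN, DIM))    # position embedding, one vector per position

def bert_input_vectors(token_ids, segment_ids):
    """Each token's input vector is the sum of its token, segment, and position embeddings."""
    return np.stack([tok_emb[t] + seg_emb[s] + pos_emb[i]
                     for i, (t, s) in enumerate(zip(token_ids, segment_ids))])

# "[CLS] w1 w2 [SEP] w3 [SEP]": segment 0 for the first sentence, 1 for the second
inputs = bert_input_vectors([1, 3, 4, 2, 5, 2], [0, 0, 0, 0, 1, 1])
```

In the real model all three tables are learned jointly during pre-training; the element-wise sum is what enters the first Transformer layer.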
For the BERT pre-training model, a key component is the Transformer encoder based on the self-attention mechanism. It obtains the representation vector of each word by adjusting the weight coefficient matrix according to the degree of association between the words in the sentence, namely:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
where Q, K, V are word vector matrices and d_k is the embedding dimension. The multi-head attention mechanism projects Q, K, and V through several different linear transformations and finally concatenates the different attention results, thereby obtaining information from multiple subspaces.
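A minimal NumPy sketch of the scaled dot-product attention formula above (matrix shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # association degree between word pairs
    weights = softmax(scores, axis=-1)   # each row is a distribution over words
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))  # 3 tokens, d_k = 4
out, w = attention(Q, K, V)
```

Multi-head attention simply runs this computation on several linearly projected copies of Q, K, and V and concatenates the results.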
S102, extracting features of the initial representation vectors through a bidirectional LSTM neural network, and obtaining feature vectors of each sentence in the target news text.
The long short-term memory network (LSTM) is a type of recurrent neural network (RNN). LSTM adds three gate structures, namely a forget gate, an input gate, and an output gate, to the hidden layer h, and adds a cell state. The cell state is used for information storage; the input gate determines how much of the network's input at the current moment is stored into the cell state; the forget gate forgets or discards some information, its task being to accept the long-term memory (the output passed from the previous unit) and decide which parts to retain and which to forget; the output gate determines the output value based on the cell state. In short, LSTM selectively extracts features through its gating mechanism.
The Bi-LSTM network consists of two independent LSTMs; the input sequence is fed to the two LSTM networks in forward and reverse order respectively for feature extraction, and the word vector formed by concatenating the two output vectors (i.e., the extracted feature vectors) is used as the final feature representation of the word. The design idea of Bi-LSTM is that the feature data obtained at time t should carry information from both the past and the future, thereby capturing the context.
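The forward/backward passes and per-word concatenation described above can be sketched in NumPy; this is a single-layer toy implementation with random, untrained weights, purely to show the data flow:

```python
import numpy as np

def lstm_pass(xs, Wx, Wh, b, H):
    """One direction of an LSTM; gate pre-activations are stacked as [input, forget, output, cell]."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h, c, outs = np.zeros(H), np.zeros(H), []
    for x in xs:
        z = Wx @ x + Wh @ h + b
        i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
        g = np.tanh(z[3*H:])
        c = f * c + i * g          # forget gate drops old memory, input gate writes new
        h = o * np.tanh(c)         # output gate decides what the cell emits
        outs.append(h)
    return outs

def bilstm(xs, fwd_params, bwd_params, H):
    fwd = lstm_pass(xs, *fwd_params, H)
    bwd = lstm_pass(xs[::-1], *bwd_params, H)[::-1]   # reverse-order pass, realigned
    # concatenate forward and backward hidden states per word
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

D, H, T = 4, 3, 5   # input dim, hidden dim, sequence length (illustrative)
rng = np.random.default_rng(1)
params = lambda: (rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)), rng.normal(size=4*H))
xs = [rng.normal(size=D) for _ in range(T)]
feats = bilstm(xs, params(), params(), H)   # one 2H-dim feature vector per word
```

Each output vector thus combines a summary of the words before position t and a summary of the words after it.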
S103, constructing an information matrix, and carrying out convolutional encoding on the feature vector based on the information matrix to obtain different grid characterizations.
Illustratively, three information matrices are constructed via CLN (conditional layer normalization): a distance information matrix (Distance Embedding), a word pair information matrix (Word Embedding), and a region information matrix (Region Embedding). Distance Embedding represents the distance between the two words in a word pair, with distances bucketed into intervals; Word Embedding is the word pair matrix obtained by passing the word-pair embedding through the conditional layer norm; Region Embedding indicates whether the region in which the word pair lies is the upper or lower triangle.
S104, superposing Distance Embedding, Word Embedding, and Region Embedding, and further encoding the grid representation through three dilated convolutions to obtain different grid representations.
S105, predicting word pair relations from the grid representations through a predictor, and acquiring the entities of the target news text.
Specifically, the grid representations are fed to an MLP predictor to predict word pair relations. Since a biaffine predictor (Biaffine Predictor) can improve the performance of the MLP predictor in relation classification, two predictors are used for relation classification of word pairs, and the results output by the MLP predictor and the biaffine predictor are then combined as the final output.
The biaffine classifier relation score between a word pair (x_i, x_j) is calculated as follows:

s_i = MLP(h_i)
o_j = MLP(h_j)
y'_ij = s_i^T U o_j + W [s_i ; o_j] + b

The MLP predictor computes a relation score between word pairs from the feature result Q_ij produced by the convolutional layers:

y''_ij = MLP(Q_ij)

The final word pair relation probability score is:

Y_ij = Softmax(y'_ij + y''_ij)
Finally, the results of the biaffine classifier and the MLP classifier are merged to predict the entity boundaries.
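The merging of the two predictors' scores can be sketched as follows. All dimensions and weights here are illustrative random stand-ins (single-layer "MLPs"), not the trained W2NER parameters; the point is the score combination Y_ij = Softmax(y'_ij + y''_ij):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical sizes: word representation dim 6, MLP output dim d,
# convolution feature dim dq, R word-pair relation types.
d, dq, R = 4, 5, 3
rng = np.random.default_rng(7)
U = rng.normal(size=(R, d, d))       # biaffine tensor, one d x d slice per relation type
W = rng.normal(size=(R, 2 * d))      # linear term over the concatenation [s_i ; o_j]
b = rng.normal(size=R)
M_s, M_o = rng.normal(size=(d, 6)), rng.normal(size=(d, 6))  # subject/object "MLPs"
M_q = rng.normal(size=(R, dq))       # "MLP" over convolution features Q_ij

def word_pair_score(h_i, h_j, q_ij):
    s_i, o_j = np.tanh(M_s @ h_i), np.tanh(M_o @ h_j)
    y1 = np.einsum('i,rij,j->r', s_i, U, o_j) + W @ np.concatenate([s_i, o_j]) + b  # biaffine y'_ij
    y2 = M_q @ q_ij                                                                 # MLP y''_ij
    return softmax(y1 + y2)                                                         # Y_ij

Y = word_pair_score(rng.normal(size=6), rng.normal(size=6), rng.normal(size=dq))
```

The biaffine term scores the (subject, object) interaction directly, while the MLP term carries the convolutional grid features; summing the logits lets each compensate for the other's errors.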
S2, screening the entities through feature engineering to obtain the key entities. The specific steps are as follows:

S201, calculating the term frequency-inverse document frequency (TF-IDF) weight of each entity: w_1 = TF-IDF(w);

S202, calculating the word frequency of each entity and assigning a corresponding weight: w_2 = 0.1 × n (n is the number of occurrences);

S203, training a word2vec model and finding the entities that do not belong to the group via model.wv.doesnt_match(): w_3 = 0.2 for grouped words and w_3 = 0.1 for outlier words;

S204, encoding sentences and entities through the word2vec model and calculating their similarity: w_4 = cos(S, E), where S is the sentence vector and E is the entity vector;

S205, calculating the combined feature weight of each entity, w = w_1 + w_2 + w_3 + w_4, and ranking the entities; if the total number of entities is less than 3, all entities are kept as key entities; otherwise, the top 3 are taken as the key entities.
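Steps S201–S205 can be sketched in plain Python. The TF-IDF values, counts, grouping flags, and vectors below are made-up toy inputs (in the actual method they would come from the corpus and a trained word2vec model):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def key_entities(entities, tfidf, counts, is_grouped, sent_vec, ent_vecs):
    """Rank entities by the combined weight w = w1 + w2 + w3 + w4 and keep the top 3."""
    weights = {}
    for e in entities:
        w1 = tfidf[e]                         # S201: TF-IDF weight
        w2 = 0.1 * counts[e]                  # S202: 0.1 x number of occurrences
        w3 = 0.2 if is_grouped[e] else 0.1    # S203: grouped vs. outlier (word2vec doesnt_match)
        w4 = cosine(sent_vec, ent_vecs[e])    # S204: sentence-entity similarity
        weights[e] = w1 + w2 + w3 + w4
    ranked = sorted(entities, key=weights.get, reverse=True)
    return ranked if len(ranked) < 3 else ranked[:3]  # S205

ents = ["Alice", "Beijing", "Monday", "FooCorp"]   # hypothetical extracted entities
top = key_entities(
    ents,
    tfidf={"Alice": 0.5, "Beijing": 0.4, "Monday": 0.1, "FooCorp": 0.3},
    counts={"Alice": 3, "Beijing": 2, "Monday": 1, "FooCorp": 2},
    is_grouped={"Alice": True, "Beijing": True, "Monday": False, "FooCorp": True},
    sent_vec=[1.0, 0.0],
    ent_vecs={"Alice": [1.0, 0.1], "Beijing": [0.8, 0.2],
              "Monday": [0.0, 1.0], "FooCorp": [0.5, 0.5]},
)
```

With these toy numbers the outlier "Monday" scores lowest and is dropped, matching the intent of the screening step.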
S3, inputting the target news text into a preset pre-training language model for processing and obtaining a text abstract, where the pre-trained language model is a BART model. BART is a pre-trained sequence-to-sequence denoising autoencoder, trained as follows:

(1) Corrupting the text with an arbitrary noise function;

(2) Learning a model to reconstruct the original text. The main content of the text is then extracted by the text summarization task downstream of BART.
Specifically, in the training stage, the encoder encodes the corrupted text with a bidirectional model, and the decoder then reconstructs the original input autoregressively; in the fine-tuning stage, the inputs to the encoder and decoder are uncorrupted text, and a text abstract is obtained.
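A toy illustration of the noise-function idea in BART's pre-training is text infilling: a contiguous span of tokens is replaced by a single mask token, and the model must reconstruct the original sequence (this sketch shows only the corruption side, with an invented `text_infilling` helper, not BART's actual implementation):

```python
import random

def text_infilling(tokens, span_len=2, mask="[MASK]", seed=0):
    """Replace one contiguous span of `span_len` tokens with a single mask token."""
    rng = random.Random(seed)
    start = rng.randrange(0, len(tokens) - span_len + 1)
    return tokens[:start] + [mask] + tokens[start + span_len:]

original = ["the", "model", "learns", "to", "reconstruct", "text"]
corrupted = text_infilling(original)   # encoder input; decoder target is `original`
```

During pre-training the encoder sees `corrupted` and the decoder is trained to emit `original`; fine-tuning on summarization then swaps the reconstruction target for a human-written abstract.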
And S4, carrying out feature fusion on the text abstract and the key entity to obtain a deep semantic feature vector, inputting the deep semantic feature vector and the target news text into a preset text topic matching model, and obtaining a text topic matching result. The method comprises the following specific steps:
s401, inputting the target news text into the BERT model for processing, and obtaining a word vector of the target news text;
S402, adopting an average pooling strategy for all word vectors in the sentences obtained through BERT, calculating the average, and finally taking the average vectors as the sentence vectors of the whole sentences, thereby obtaining the feature vectors u and v of the target news text;
S403, performing element-wise subtraction on the feature vectors of the target news text and taking the absolute value to obtain the difference |u-v| between the feature vectors;
S404, splicing the feature vectors with their difference to obtain the text splicing vector (u, v, |u-v|) of the target news text;
s405, inputting the key entity into a BERT pre-training model to obtain a word vector of the key entity;
S406, adopting an average pooling strategy for all word vectors of the key entities, calculating the average, and finally taking the average vectors as the key entity feature vectors u_1 and v_1 of the sentences to be matched;
S407, inputting the text abstract into a BERT pre-training model to obtain a word vector of the text abstract;
S408, similarly, applying the average pooling strategy to all word vectors of the text abstract and taking the average vectors as the text abstract feature vectors u_2 and v_2 of the sentences to be matched;
S409, splicing the text abstract vector and the key entity vector of each sentence, and extracting features through the gating structure of an LSTM to achieve feature fusion, forming the deep semantic information feature vectors u' and v' of the texts to be matched (u' is the deep semantic information feature vector of the first sentence, v' that of the second);
S410, performing element-wise subtraction on the deep semantic information feature vectors and taking the absolute value to obtain the difference |u'-v'|, and splicing the deep semantic information feature vectors with their difference to obtain the deep semantic information splicing vector (u', v', |u'-v'|);
S411, splicing the text splicing vector of the target news text with the deep semantic information splicing vector again to obtain (u, v, |u-v|, u', v', |u'-v'|);
S412, reducing the dimension of the spliced vector (u, v, |u-v|, u', v', |u'-v'|) through the fully connected layer, obtaining the probabilities of labels 0 and 1 through the softmax layer (0 means no match, 1 means match), and selecting the label with the larger probability as the matching result.
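The pooling, concatenation, and classification steps of S401–S412 can be sketched end-to-end in NumPy. The vector sizes and the fully connected weights below are illustrative random placeholders (untrained), standing in for the BERT/LSTM outputs and the learned classifier:

```python
import numpy as np

def mean_pool(word_vecs):
    """Average-pooling strategy: the sentence vector is the mean of its token vectors."""
    return np.asarray(word_vecs).mean(axis=0)

def match(u, v, u2, v2, W, b):
    # S404/S410/S411: build the concatenation (u, v, |u-v|, u', v', |u'-v'|)
    feats = np.concatenate([u, v, np.abs(u - v), u2, v2, np.abs(u2 - v2)])
    logits = W @ feats + b                   # S412: fully connected layer, 2 logits
    e = np.exp(logits - logits.max())
    probs = e / e.sum()                      # softmax over {0: no match, 1: match}
    return int(probs.argmax()), probs

rng = np.random.default_rng(3)
u = mean_pool(rng.normal(size=(7, 4)))       # sentence vectors (toy dim 4)
v = mean_pool(rng.normal(size=(9, 4)))
u2, v2 = rng.normal(size=3), rng.normal(size=3)  # deep semantic vectors u', v' (toy dim 3)
W, b = rng.normal(size=(2, 3 * 4 + 3 * 3)), rng.normal(size=2)
label, probs = match(u, v, u2, v2, W, b)
```

The |u-v| term gives the classifier an explicit distance signal alongside the raw representations, which is the same trick used in sentence-pair models such as Sentence-BERT.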
Example two
The embodiment discloses a text topic matching system based on deep semantic information, comprising:
a key entity acquisition module configured to: acquiring a target news text, inputting the target news text into a preset named entity recognition model for processing, and acquiring an entity of the target news text; screening the entities through feature engineering to obtain key entities;
a text summary acquisition module configured to: inputting the target news text into a preset pre-training language model for processing to obtain a text abstract;
a text topic matching module configured to: perform feature fusion on the text abstract and the key entities to obtain a deep semantic information feature vector, and input the deep semantic information feature vector and the target news text into a preset text topic matching model to obtain a text topic matching result.
It should be noted that the key entity acquisition module, the text abstract acquisition module, and the text topic matching module correspond to the steps in the first embodiment; the examples and application scenarios implemented by the modules are the same as those of the corresponding steps, but are not limited to the disclosure of the first embodiment. The modules may be implemented as part of a computer system, for example as a set of computer-executable instructions.
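As a minimal illustrative sketch (the class names, toy stand-in functions, and wiring below are assumptions, not from the patent), the three modules can be composed end to end as follows; a real deployment would plug in the preset NER, summarization, and matching networks:

```python
class KeyEntityModule:
    """Runs the preset named entity recognition model, then screens the
    entities through feature engineering to keep only the key entities."""
    def __init__(self, ner_model, entity_filter):
        self.ner_model, self.entity_filter = ner_model, entity_filter
    def __call__(self, text):
        return self.entity_filter(self.ner_model(text))

class TextAbstractModule:
    """Runs the preset pre-trained language model to produce a text abstract."""
    def __init__(self, summarizer):
        self.summarizer = summarizer
    def __call__(self, text):
        return self.summarizer(text)

class TopicMatchModule:
    """Fuses the abstract and key entities with the target text and queries
    the preset text topic matching model."""
    def __init__(self, matcher):
        self.matcher = matcher
    def __call__(self, abstract, key_entities, text):
        return self.matcher(abstract, key_entities, text)

# Toy stand-ins: split words as "entities", truncate as "abstract",
# substring check as "matching".
entities = KeyEntityModule(lambda t: t.split(), lambda es: es[:1])
abstract = TextAbstractModule(lambda t: t[:10])
matcher = TopicMatchModule(lambda a, k, t: 1 if k and k[0] in t else 0)

text = "stock markets rallied today"
result = matcher(abstract(text), entities(text), text)
print(result)  # 1
```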
Embodiment Three
This embodiment provides an electronic device comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor; when the computer instructions are executed by the processor, the steps of the above text topic matching method based on deep semantic information are completed.
Embodiment Four
This embodiment provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, complete the steps of the above text topic matching method based on deep semantic information.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing embodiments are described with their respective emphases; for details not elaborated in one embodiment, reference may be made to the related description of another embodiment.
The above descriptions are merely preferred embodiments of the present invention and are not intended to limit it; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its protection scope.

Claims (10)

1. A text topic matching method based on deep semantic information, characterized by comprising the following steps:
acquiring a target news text, inputting the target news text into a preset named entity recognition model for processing, and acquiring an entity of the target news text;
screening the entities through feature engineering to obtain key entities;
inputting the target news text into a preset pre-training language model for processing to obtain a text abstract;
and carrying out feature fusion on the text abstract and the key entities to obtain a deep semantic information feature vector, and inputting the deep semantic information feature vector and the target news text into a preset text topic matching model to obtain a text topic matching result.
2. The text topic matching method based on deep semantic information as claimed in claim 1, wherein inputting the target news text into a preset named entity recognition model for processing and acquiring the entity of the target news text comprises:
vectorizing the target news text to obtain an initial representation vector of each word in the target news text;
extracting features of the initial representation vectors to obtain feature vectors of each sentence in the target news text;
constructing information matrices, and convolutionally encoding the feature vectors based on the information matrices to obtain different grid representations;
and predicting the word pair relations from the grid representations through a predictor to acquire the entities of the target news text.
3. The text topic matching method based on deep semantic information as claimed in claim 2, wherein the information matrices include a distance information matrix representing the distance between the words of a word pair, a word pair information matrix representing the word pair output through a normalization layer, and a region information matrix representing the region where the word pair is located.
4. The text topic matching method based on deep semantic information as claimed in claim 1, wherein screening the entities through feature engineering to obtain key entities comprises:
acquiring a first weight of each entity according to the entity; calculating word frequency of each entity, and acquiring a second weight of each entity according to the word frequency; screening out disaggregated words in the entities, and obtaining a third weight of each entity; calculating the similarity of sentences and entities in the target news text, and acquiring a fourth weight of each entity;
and acquiring the combined feature weight of each entity according to the first weight, the second weight, the third weight, and the fourth weight, and sorting the entities by the combined feature weight to obtain the key entities.
5. The text topic matching method based on deep semantic information as claimed in claim 1, wherein the pre-trained language model is a BART model.
6. The text topic matching method based on deep semantic information as claimed in claim 1, wherein performing feature fusion on the text abstract and the key entities to obtain a deep semantic information feature vector, and inputting the deep semantic information feature vector and the target news text into a preset text topic matching model to obtain a text topic matching result comprises:
performing feature fusion on the text abstract and the key entity through an LSTM network to obtain a deep semantic information feature vector;
acquiring a deep semantic information splicing vector according to the deep semantic information feature vector;
acquiring a text splicing vector according to the target news text;
splicing the deep semantic information splicing vector and the text splicing vector to obtain a splicing vector;
and inputting the spliced vector into a softmax layer for processing, and obtaining a text topic matching result.
7. The text topic matching method based on deep semantic information as claimed in claim 6, wherein acquiring the deep semantic information spliced vector according to the deep semantic information feature vectors comprises:
performing element-wise subtraction on the deep semantic information feature vectors and taking the absolute value to obtain the difference between the deep semantic information feature vectors, and concatenating the difference with the deep semantic information feature vectors to obtain the deep semantic information spliced vector.
8. A text topic matching system based on deep semantic information, characterized by comprising:
a key entity acquisition module configured to: acquire a target news text, input the target news text into a preset named entity recognition model for processing, and obtain the entities of the target news text; and screen the entities through feature engineering to obtain key entities;
a text abstract acquisition module configured to: input the target news text into a preset pre-trained language model for processing to obtain a text abstract;
a text topic matching module configured to: perform feature fusion on the text abstract and the key entities to obtain a deep semantic information feature vector, and input the deep semantic information feature vector and the target news text into a preset text topic matching model to obtain a text topic matching result.
9. An electronic device comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1-7.
CN202310324759.4A 2023-03-27 2023-03-27 Text topic matching method and system based on deep semantic information Active CN116304745B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310324759.4A CN116304745B (en) 2023-03-27 2023-03-27 Text topic matching method and system based on deep semantic information


Publications (2)

Publication Number Publication Date
CN116304745A true CN116304745A (en) 2023-06-23
CN116304745B CN116304745B (en) 2024-04-12

Family

ID=86779721

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310324759.4A Active CN116304745B (en) 2023-03-27 2023-03-27 Text topic matching method and system based on deep semantic information

Country Status (1)

Country Link
CN (1) CN116304745B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021164199A1 (en) * 2020-02-20 2021-08-26 齐鲁工业大学 Multi-granularity fusion model-based intelligent semantic chinese sentence matching method, and device
CN112232058A (en) * 2020-10-15 2021-01-15 济南大学 False news identification method and system based on deep learning three-layer semantic extraction framework
CN114386421A (en) * 2022-01-13 2022-04-22 平安科技(深圳)有限公司 Similar news detection method and device, computer equipment and storage medium
CN114443847A (en) * 2022-01-27 2022-05-06 北京字节跳动网络技术有限公司 Text classification method, text processing method, text classification device, text processing device, computer equipment and storage medium
CN115292447A (en) * 2022-07-14 2022-11-04 昆明理工大学 News matching method fusing theme and entity knowledge

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SUN Y et al.: "A hybrid approach to news recommendation based on knowledge graph and long short-term user preferences", IEEE International Conference on Services Computing, pages 165-173 *
JI Ke et al.: "Expert recommendation algorithm combining attention and recurrent neural networks", Journal of Frontiers of Computer Science and Technology, vol. 16, no. 09, pages 2068-2077 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116522165A (en) * 2023-06-27 2023-08-01 武汉爱科软件技术股份有限公司 Public opinion text matching system and method based on twin structure
CN116522165B (en) * 2023-06-27 2024-04-02 武汉爱科软件技术股份有限公司 Public opinion text matching system and method based on twin structure
CN117371440A (en) * 2023-12-05 2024-01-09 广州阿凡提电子科技有限公司 Topic text big data analysis method and system based on AIGC
CN117371440B (en) * 2023-12-05 2024-03-12 广州阿凡提电子科技有限公司 Topic text big data analysis method and system based on AIGC

Also Published As

Publication number Publication date
CN116304745B (en) 2024-04-12

Similar Documents

Publication Publication Date Title
CN110717017B (en) Method for processing corpus
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN116304745B (en) Text topic matching method and system based on deep semantic information
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN111190997B (en) Question-answering system implementation method using neural network and machine learning ordering algorithm
CN111914085A (en) Text fine-grained emotion classification method, system, device and storage medium
CN114064918A (en) Multi-modal event knowledge graph construction method
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN110619051A (en) Question and sentence classification method and device, electronic equipment and storage medium
Zhang et al. Relation classification via BiLSTM-CNN
CN111159485A (en) Tail entity linking method, device, server and storage medium
CN113743119B (en) Chinese named entity recognition module, method and device and electronic equipment
CN112434142B (en) Method for marking training sample, server, computing equipment and storage medium
CN113177412A (en) Named entity identification method and system based on bert, electronic equipment and storage medium
CN116049397A (en) Sensitive information discovery and automatic classification method based on multi-mode fusion
CN114091466A (en) Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN116975615A (en) Task prediction method and device based on video multi-mode information
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
CN117132923A (en) Video classification method, device, electronic equipment and storage medium
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
CN113486659A (en) Text matching method and device, computer equipment and storage medium
CN117197569A (en) Image auditing method, image auditing model training method, device and equipment
CN110287799B (en) Video UCL semantic indexing method and device based on deep learning
CN116701752A (en) News recommendation method and device based on artificial intelligence, electronic equipment and medium
CN110888983A (en) Positive and negative emotion analysis method, terminal device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant