CN114996455A - News title short text classification method based on double knowledge maps - Google Patents
News headline short text classification method based on dual knowledge graphs
- Publication number: CN114996455A (application CN202210643031.3A)
- Authority
- CN
- China
- Prior art keywords
- information
- entity
- short text
- news
- keyword
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/35 — Information retrieval of unstructured textual data; Clustering; Classification
- G06F16/3334 — Selection or weighting of terms from queries, including natural language queries
- G06F16/3335 — Syntactic pre-processing, e.g. stopword elimination, stemming
- G06F16/3344 — Query execution using natural language analysis
- G06F16/367 — Ontology
- G06F40/258 — Heading extraction; Automatic titling; Numbering
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30 — Semantic analysis
Abstract
A news headline short text classification method based on dual knowledge graphs comprises the following steps: preprocessing the news headline short text to remove special characters; extracting keywords from the news headline with the jieba word segmentation tool and removing stop words; linking the keywords to the CN-DBpedia external knowledge base through its API to obtain an entity set; disambiguating the entity set by cosine similarity to obtain a candidate entity set; constructing a domain knowledge graph from global keyword co-occurrence information to solve the out-of-vocabulary (OOV) problem; obtaining explanation information related to each entity by linking to the external knowledge base, thereby enriching contextual semantic information; obtaining character-level vector representations of the original news headline and of the explanation information of the linked entities with BERT, and fusing the two representations to compensate for the shortage of short text information; extracting N-gram features across several consecutive words with TextCNN to capture deep semantic information; and finally classifying with a Softmax function to obtain the final classification result.
Description
Technical Field
The invention relates to a news headline short text classification method based on dual knowledge graphs, and in particular to the classification of domain news headlines. The invention uses a Chinese word segmentation tool to extract several keywords from each news headline and links those keywords to entities in an external knowledge base through entity linking. If a link fails, a suitable node is queried from the domain knowledge graph to replace the original keyword; otherwise, the entity is added to the candidate entity set. By re-linking each candidate entity to the external knowledge base, the explanation information associated with the entity can be obtained. The method uses BERT to obtain vector representations of the news headline short text and of the explanation information, and finally classifies them through softmax. The invention touches on the fields of probabilistic models, language models, and deep learning, and in particular the field of natural language processing based on deep learning.
Background
With the popularization of online news platforms, the digitalization of the news industry has advanced rapidly: massive data is generated continuously, and the variety of news keeps growing. Compared with paragraphs and documents, news headlines contain few words and lack contextual semantic information; they are sparse and ambiguous. Correctly classifying news headlines allows the information to be better organized and used, so accurately assigning massive news headline data to the correct categories is of great significance.
Faced with news data of huge and growing scale, classifying headlines purely by hand is inefficient and costly. In recent years, with the rapid development of machine learning and deep learning, more and more such problems can be handled by computers, which are the defining techniques of the big data era. Using deep learning to take over the tedious task of news headline classification is therefore a current trend. Many methods for classifying news headline short texts have been proposed in recent years; they fall roughly into two types:
1) Classification methods based on machine learning: these preprocess the news headlines, extract features, vectorize the processed text, and model the training data with a common machine learning algorithm, such as a vector space model, a decision tree, or a support vector machine.
2) Classification methods based on deep learning: these vectorize each character of the news headline and then use a convolutional neural network (CNN) or a recurrent neural network (RNN) to capture deep local or sequential information of the text. In recent years, more and more methods enhance the semantic information of news headline short texts by retrieving related concepts from external knowledge bases.
Machine-learning-based classification methods are simple, but the quality of text feature extraction strongly affects classification accuracy; these methods depend heavily on hand-designed features and are costly. In addition, their feature representations are often very sparse and weakly expressive, so they cannot meet the requirements of news headline short text classification well.
Deep-learning-based classification methods avoid tedious manual feature engineering; their classification accuracy depends more on the amount of data and the number of training iterations. Through word vector techniques, text data can be converted into low-dimensional dense vectors without relying on hand-designed features, and deep semantic information can be learned.
However, using deep learning to implement news headline classification still faces several problems:
(1) Most news headlines are short texts: they contain few words, lack contextual semantic information, and are sparse and ambiguous, so current mainstream natural language processing methods cannot classify them well.
(2) Many current methods enhance the semantic information of short texts with external knowledge bases, but resolving a keyword can return entities from multiple domains. How to disambiguate and obtain the single correct, reasonable entity is a challenge.
(3) Since word boundaries in Chinese are not as well defined as in English, segmentation accuracy is limited by the segmentation tool, and the out-of-vocabulary (OOV) problem is unavoidable during entity linking. How to avoid the OOV problem is a challenge for news headline short text classification.
In natural language processing, to capture more semantic and syntactic information from short text, the current mainstream approach is to enrich its semantics with an external knowledge base. First, the short text is segmented with a common word segmentation tool. Then the keywords are linked to the external knowledge base through entity linking to obtain related entities, and concepts related to those entities are retrieved from the knowledge base. Finally, the original text information and the concept information are concatenated to enhance the semantic information of the short text.
These methods have made real progress in short text classification, but they overlook that word segmentation of news headlines may produce irregular keywords, and such keywords cause OOV problems. The OOV problem means that a keyword cannot be successfully linked to the external knowledge base during entity linking, which degrades classification performance. The invention therefore constructs a graph for a specific domain based on global keyword information: when an OOV problem arises, a suitable keyword can be retrieved from the domain knowledge graph to replace the OOV word.
Therefore, how to overcome the insufficient information of news headline short texts and the OOV problem is an urgent issue in the current big data era.
Disclosure of Invention
The invention provides a news headline short text classification method based on dual knowledge graphs, aiming to solve the problems in the prior art that short text information is insufficient and that OOV words arise during entity linking.
With the help of an external knowledge base, the invention can extract additional explanation information to enrich the semantics of the short text. When an entity link fails, the domain knowledge graph can be used to reselect suitable keywords, correcting the result of entity linking and solving the OOV problem. Finally, the character-level features of the short text are combined with the external knowledge features to extract semantic information and obtain the final classification result. Compared with existing methods, this method achieves superior performance.
In order to solve the problems, the technical scheme provided by the invention is as follows:
A news headline short text classification method based on dual knowledge graphs comprises the following steps:
Step 1: Preprocess the news headline short text, mainly removing special characters and stop words.
Step 2: Extract keywords from the news headline with the jieba word segmentation tool.
Step 3: Link the keywords to the external knowledge base through the API provided by CN-DBpedia, obtaining an entity set.
Step 4: Disambiguate the obtained entity set by cosine similarity, obtaining a candidate entity set.
Step 5: Construct a domain knowledge graph based on global keyword co-occurrence information to solve the OOV problem.
Step 6: For each entity in the candidate entity set, obtain explanation information related to the entity by linking to the external knowledge base, enriching the contextual semantic information.
Step 7: Use BERT to obtain character-level vector representations of the original news headline and of the explanation information of the linked entities, and fuse the two representations to compensate for the shortage of short text information.
Step 8: Use TextCNN to extract N-gram features across several consecutive words, capturing deep semantic information.
Step 9: Finally, classify through a Softmax function to obtain the final classification result.
The invention enhances the semantic information of the news headline short text by extracting its keywords, retrieving the entities related to those keywords from an external knowledge base, and obtaining the explanation of each entity through entity linking. Moreover, the method constructs a domain knowledge graph based on the local data set, which is used to solve the OOV problem occurring during entity linking and to enhance the domain information of the short news text. Finally, a TextCNN model captures the deep semantic features of the news headline short texts and completes the classification. Compared with prior methods, the method improves both accuracy and efficiency.
The invention has the advantages that:
1. The invention uses dual knowledge graphs to obtain more additional information, compensating for the shortage of short text information and improving the accuracy of news headline short text classification.
2. The invention constructs a domain knowledge graph to solve the OOV problem, providing a new idea for handling OOV in natural language processing.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention.
FIG. 2 is a core architecture diagram of the TextCNN of the present invention.
Detailed Description
The technical scheme of the invention is further explained below with reference to the drawings.
This embodiment takes the bougnews public data set from a news RSS subscription channel as an example:
a news headline short text classification method based on double knowledge maps comprises the following steps:
step 1: the short text of the news headline is preprocessed, and some special characters, such as Chinese and English punctuations, English characters, numbers and special symbols, are removed. In addition, stop words are removed from the Hadamard stop word list. Part of the news headline preprocessing results are shown in table one.
Table 1. Examples of news headline preprocessing results

| Original news headline | Preprocessed news headline |
| --- | --- |
| First money TD single-chip removes in commercial and beats cost tablet again | First money single-chip is removed in commercial and is made cost tablet again |
| Google buys thousands of IBM patents to deal with Apple litigation | Google buys thousands of patents to deal with Apple litigation |
| All-metal body Motorola releases new phone Klassic | All-metal body Motorola releases new phone |
| JD.com: eligible iPhone users spread compensation | JD.com eligible users spread compensation |
| Ultra-light ultra-thin Sanyo new linear PCM recording pen | Ultra-light ultra-thin Sanyo new linear recording pen |
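The step-1 preprocessing can be sketched as follows. The patent gives no code, so the regular expression and the tiny stop word set below are illustrative assumptions, not the actual implementation (the real HIT stop word list has over a thousand entries):

```python
import re

# Minimal sketch of step-1 preprocessing: strip everything except CJK
# characters, then drop stop words from an already-segmented token list.
# STOPWORDS is a tiny illustrative stand-in for the HIT stop word list.
STOPWORDS = {"的", "了", "在", "和"}

def remove_special_chars(title: str) -> str:
    """Keep only CJK ideographs, removing punctuation (Chinese and
    English), Latin letters, digits, and other special symbols."""
    return re.sub(r"[^\u4e00-\u9fff]", "", title)

def remove_stopwords(tokens):
    """Filter stop words out of a segmented token list."""
    return [t for t in tokens if t not in STOPWORDS]

cleaned = remove_special_chars("谷歌购IBM数千项专利, 应对苹果诉讼!")
print(cleaned)  # 谷歌购数千项专利应对苹果诉讼
```

Note how the English token "IBM" disappears, matching the second table row above, where the preprocessed headline no longer mentions IBM.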
Step 2: Keywords are extracted from the preprocessed news headline short text with the jieba word segmentation tool. For example, for the short text S1, "Google buys thousands of patents to deal with Apple litigation", the keyword set {"patents", "Apple"} can be obtained.
Step 3: The invention uses the CN-DBpedia external knowledge base developed by the Knowledge Works Laboratory of Fudan University, obtaining the entities related to each keyword through the provided ment2ent entity linking API.
Step 4: For the entity set obtained in the previous step, BERT is used to obtain vector representations of the entity set and of the news headline. Cosine similarity is then used to compute the similarity between each entity Ei and the news headline Si, and the entity with the highest similarity score is added to the candidate entity set. For the keyword "Apple", the entity set E = {"Apple Inc.", "Apple (movie)", "Apple (fruit)"} can be obtained; after the cosine similarity calculation the scores are E = {"Apple Inc.": 90.26, "Apple (movie)": 87.74, "Apple (fruit)": 87.88}, so "Apple Inc.", which has the highest score, is added to the candidate entity set.
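The step-4 disambiguation can be sketched as below. The 3-dimensional vectors are toy stand-ins for BERT representations and do not reproduce the 90.26/87.74/87.88 scores above:

```python
import math

# Pick the candidate entity whose vector is most cosine-similar to the
# headline vector, as in step 4. All vectors here are illustrative.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

headline = [0.9, 0.1, 0.2]
entities = {
    "Apple Inc.":    [0.8, 0.2, 0.1],
    "Apple (movie)": [0.1, 0.9, 0.3],
    "Apple (fruit)": [0.2, 0.3, 0.9],
}
best = max(entities, key=lambda e: cosine(headline, entities[e]))
print(best)  # Apple Inc.
```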
Step 5: Because short texts themselves may contain irregular expressions, not every keyword can be successfully linked to an entity in the external knowledge base; this is where the OOV problem arises.
For Chinese entity linking, the OOV problem has two main causes: (1) the entity is not covered by the external knowledge base; (2) the word segmentation of the short text is incorrect.
To solve the OOV problem, the method constructs a domain knowledge graph whose nodes are keywords. Specifically, the invention uses a fixed-size sliding window to collect keyword co-occurrence information, and the weight between two keyword nodes is computed with pointwise mutual information (PMI). The more often two keywords appear together in the text, the stronger the correlation between them. When the PMI value is less than 0, the relationship between the two keywords is considered weak; an edge is created between two keywords only when their PMI value is greater than 0. PMI is computed as follows:

PMI(i, j) = log( p(i, j) / (p(i) · p(j)) ),  where p(i, j) = #W(i, j) / #W and p(i) = #W(i) / #W

Here #W(i) denotes the number of sliding windows in the corpus that contain keyword i, #W(i, j) denotes the number of sliding windows that contain both keyword i and keyword j, and #W denotes the total number of sliding windows.
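The sliding-window PMI computation can be sketched as follows; the keyword sequences are illustrative toy data, not the patent's corpus:

```python
import math
from collections import Counter
from itertools import combinations

# Count sliding windows to estimate #W(i), #W(i, j), and #W, then keep an
# edge between two keywords only when their PMI is positive.
def pmi_edges(docs, window=3):
    win_count = Counter()    # #W(i): windows containing keyword i
    pair_count = Counter()   # #W(i, j): windows containing both i and j
    total = 0                # #W: total number of windows
    for doc in docs:
        for start in range(max(1, len(doc) - window + 1)):
            w = set(doc[start:start + window])
            total += 1
            win_count.update(w)
            pair_count.update(combinations(sorted(w), 2))
    edges = {}
    for (i, j), n_ij in pair_count.items():
        pmi = math.log(n_ij * total / (win_count[i] * win_count[j]))
        if pmi > 0:          # an edge is created only for PMI > 0
            edges[(i, j)] = pmi
    return edges

docs = [["infringement", "litigation", "court"],
        ["litigation", "court", "copyright"],
        ["sports", "football"]]
edges = pmi_edges(docs)
print(("court", "litigation") in edges)  # True
```

Here "litigation" and "court" co-occur in 2 of 3 windows, giving PMI(litigation, court) = log(2·3 / (2·2)) = log 1.5 > 0, so an edge is created.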
With the help of the domain knowledge graph, when entity linking hits an OOV problem, the neighborhood information of the keyword can be queried from the graph. The neighbors are sorted by the weights computed by PMI, and the top three are taken. The original keyword is first replaced by the lowest-ranked of these neighbors and the entity is re-linked from the external knowledge base; if the OOV problem still occurs, the next-ranked neighbor is taken out in turn for re-linking, until linking succeeds or the traversal ends.
For example, in a news headline short text S2 reporting the judgment of an infringement case, the keyword "infringement case" hits an OOV problem at the ment2ent step. The "infringement case" keyword is then used as a node to query its neighbor information in the domain knowledge graph, yielding the node set {"litigation request", "copyright law", "complaining valley"}. "Litigation request" replaces the "infringement case" keyword and is re-linked to the external knowledge base; no OOV problem occurs, so "litigation request" is added to the candidate entity set.
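The fallback just illustrated can be sketched as below. `KB`, `GRAPH`, and `link` are hypothetical stand-ins for the CN-DBpedia ment2ent call and the domain knowledge graph, with toy data; the neighbor order follows the "lowest-ranked of the top three first" rule described above:

```python
# Toy stand-ins: `link` models the ment2ent call and returns None on an
# OOV failure; GRAPH maps a keyword to (neighbor, PMI-weight) pairs.
KB = {"litigation request": "litigation request (entity)"}
GRAPH = {"infringement case": [("litigation request", 2.1),
                               ("copyright law", 1.7),
                               ("complaint", 0.9)]}

def link(keyword):
    return KB.get(keyword)

def link_with_fallback(keyword, top_k=3):
    entity = link(keyword)
    if entity is not None:
        return entity
    # take the top-k neighbors by PMI weight...
    ranked = sorted(GRAPH.get(keyword, []), key=lambda t: -t[1])[:top_k]
    # ...and try them starting from the lowest-ranked, as described above
    for name, _weight in reversed(ranked):
        entity = link(name)
        if entity is not None:
            return entity
    return None

print(link_with_fallback("infringement case"))  # litigation request (entity)
```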
Step 6: Semantic enhancement compensates for the lack of information in short texts. Each entity in the candidate entity set obtained in the previous step is linked in turn to the external knowledge base to obtain the explanation information related to the entity and enrich the semantic information of the short text.
For the candidate entity "litigation request", its explanation information K1 = {"The concept of a litigation request has broad and narrow senses in foreign civil litigation. In the broad sense, a litigation request is a request submitted to a court asking the court to make a decision"} can be obtained.
For the candidate entity "Apple Inc.", its explanation information K2 = {"Apple Inc. is an American high-tech company"} can be obtained.
Step 7: For Chinese short texts, words are not uniformly distributed, so a fine-tuned pre-trained BERT model is used to obtain character-level semantic information. There are two reasons for using character-level embeddings instead of word embeddings: (1) news headlines are short, so word embeddings suffer from data sparsity; (2) character-level input helps TextCNN extract N-gram information across several consecutive words.
Suppose the news headline short text S has length n, the explanation information K has length l, and the vector dimension is d. If a news headline or its explanation information is too short, it is padded with <PAD>; if too long, the excess is truncated. In this way we obtain the short text semantic matrix W_s ∈ R^(n×d) and the explanation information semantic matrix W_k ∈ R^(l×d).
Let w_i ∈ R^d denote the d-dimensional vector representation of the i-th character in the news headline short text S, and let ⊕ denote the vector concatenation operation. The semantically enhanced feature representation matrix is then W = W_s ⊕ W_k ∈ R^((n+l)×d).
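The padding, truncation, and concatenation just described can be sketched as follows; the plain nested lists are toy stand-ins for BERT character-embedding matrices:

```python
# Pad or truncate a character-embedding matrix to a fixed number of rows,
# then concatenate headline and explanation matrices row-wise into the
# (n+l)×d matrix W. The zero row stands in for the <PAD> embedding.
def pad_or_truncate(matrix, length, dim):
    pad_row = [0.0] * dim
    rows = matrix[:length]                          # truncate the excess
    return rows + [pad_row] * (length - len(rows))  # pad the shortfall

def fuse(w_s, w_k, n, l, d):
    return pad_or_truncate(w_s, n, d) + pad_or_truncate(w_k, l, d)

w_s = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 headline characters, d = 2
w_k = [[0.7, 0.8]]                           # 1 explanation character
fused = fuse(w_s, w_k, n=2, l=3, d=2)        # W has (n + l) = 5 rows
print(len(fused))  # 5
```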
Step 8: Although a CNN is not suited to learning long-distance semantic information, it learns the local information of news headline short texts well. The invention therefore adopts TextCNN to capture deep semantic information; it mainly consists of a convolutional layer, a pooling layer, and a fully connected layer.
In the convolutional layer, a convolution kernel w ∈ R^(h×d) with window size h acts on the semantic matrix of n + l characters to obtain deep semantic features:

c_i = f(w · x_(i:i+h−1) + b) (7)

Here b ∈ R is a bias term and f is a nonlinear activation function; finally a new feature vector c is obtained:

c = [c_1, c_2, …, c_(n+l−h+1)] (8)
The pooling layer captures the most important feature values, and dropout randomly sets feature values to 0; this is a regularization measure to avoid model overfitting. The feature vectors produced by convolution kernels of different sizes are then concatenated and fed into the fully connected layer for classification.
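Equations (7)–(8) plus max pooling can be sketched in plain Python as below; the kernel weights and inputs are toy values, not trained parameters:

```python
# A window-h convolution over the (n+l)×d semantic matrix (eq. 7), giving a
# feature vector of length (n+l)−h+1 (eq. 8), followed by max pooling.
# ReLU plays the role of the nonlinear activation f.
def conv1d(x, w, b, h):
    flat_w = [v for row in w for v in row]
    out = []
    for i in range(len(x) - h + 1):
        window = [v for row in x[i:i + h] for v in row]  # flatten h rows
        s = sum(a * k for a, k in zip(window, flat_w)) + b
        out.append(max(0.0, s))                          # ReLU activation
    return out

x = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # (n + l) = 4, d = 2
w = [[1.0, -1.0], [0.5, 0.5]]                         # kernel with h = 2
c = conv1d(x, w, b=0.0, h=2)
pooled = max(c)        # max pooling keeps the strongest feature
print(len(c), round(pooled, 2))  # 3 0.65
```

The output length 3 matches (n + l) − h + 1 = 4 − 2 + 1 from equation (8).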
Step 9: The softmax activation function outputs a probability value for each category, giving the final classification result.
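The step-9 classification head can be sketched as a plain softmax; the class scores below are illustrative stand-ins for the fully connected layer's output:

```python
import math

# Numerically stable softmax over class scores; the argmax index gives
# the predicted news category.
def softmax(scores):
    m = max(scores)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
predicted = probs.index(max(probs))
print(predicted)  # 0
```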
The invention mainly addresses the problems of insufficient short text information and OOV in natural language processing. A dual knowledge graph model based on an external knowledge base and a domain knowledge graph is proposed. The model obtains semantic enhancement information for the short text from the CN-DBpedia external knowledge base; when an entity link fails, a suitable keyword found in the domain knowledge graph is substituted. TextCNN then captures features across several consecutive words, and finally the fully connected layer performs classification.
The invention has been described through the above embodiment, but the embodiment is given for illustration only and is not intended to limit the invention to its scope. Those skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention; the invention is therefore not limited to the specific forms and details described above.
Claims (3)
1. A news headline short text classification method based on dual knowledge graphs, comprising the following steps:
step 1: preprocessing the news headline short text to remove special characters, and removing stop words according to the Harbin Institute of Technology (HIT) stop word list;
step 2: extracting keywords from the news headline with the jieba word segmentation tool;
and step 3: linking the keywords to an external knowledge base through an API (application programming interface) provided by a CNDBpedia external knowledge base to obtain an entity set;
step 4: disambiguating the obtained entity set by cosine similarity to obtain a candidate entity set: for the entity set obtained in step 3, BERT is used to obtain vector representations of the entity set and of the news headline; the similarity between each entity Ei and the news headline Si is then computed by cosine similarity, and the entity with the highest similarity score is selected and added to the candidate entity set;
step 5: constructing a domain knowledge graph based on global keyword co-occurrence information to solve the OOV problem;
5.1) constructing the domain knowledge graph with the keywords as nodes: specifically, a fixed-size sliding window is used to collect keyword co-occurrence information, and the weight between two keyword nodes is computed by pointwise mutual information (PMI); the more often two keywords appear together in the text, the stronger the correlation between them; when the PMI value is less than 0, the relationship between the two keywords is considered weak, and an edge is created between two keywords only when their PMI value is greater than 0; the PMI is computed as:

PMI(i, j) = log( p(i, j) / (p(i) · p(j)) ),  where p(i, j) = #W(i, j) / #W and p(i) = #W(i) / #W

here #W(i) denotes the number of sliding windows in the corpus containing keyword i, #W(i, j) denotes the number of sliding windows containing both keyword i and keyword j, and #W denotes the total number of sliding windows;
5.2) when entity linking hits an OOV problem, querying the neighbor nodes of the keyword from the domain knowledge graph; sorting the neighbor nodes by the weights computed by PMI and taking the top three; first replacing the original keyword with the lowest-ranked of these neighbors and re-linking the entity from the external knowledge base; if the OOV problem still occurs, taking out the next-ranked neighbor in turn for re-linking, until linking succeeds or the traversal ends;
step 6: for each entity in the candidate entity set, acquiring explanation information related to the entity by linking to a CNDBpedia external knowledge base, and enriching context semantic information;
step 7: obtaining character-level vector representations of the original news headline and of the explanation information of the linked entities by using BERT, and fusing the two representations to compensate for the shortage of short text information;
a fine-tuned pre-trained BERT model is adopted to obtain character-level semantic information, using character-level embeddings instead of word embeddings;
supposing the news headline short text S has length n, the explanation information K has length l, and the vector dimension is d; if the news headline or the explanation information is too short, it is padded with <PAD>, and any excess is truncated; in this way the short text semantic matrix W_s ∈ R^(n×d) and the explanation information semantic matrix W_k ∈ R^(l×d) are obtained;
letting w_i ∈ R^d denote the d-dimensional vector representation of the i-th character in the news headline short text S, and ⊕ denote the vector concatenation operation, the semantically enhanced feature representation matrix is W = W_s ⊕ W_k ∈ R^((n+l)×d);
step 8: extracting N-gram features across several consecutive words with TextCNN to capture deep semantic information;
in the convolutional layer, a convolution kernel w ∈ R^(h×d) with window size h acts on the semantic matrix of n + l characters to obtain deep semantic features:

c_i = f(w · x_(i:i+h−1) + b) (7)

where b ∈ R is a bias term and f is a nonlinear activation function, and finally a new feature vector c is obtained:

c = [c_1, c_2, …, c_(n+l−h+1)] (8)
the pooling layer captures the most important feature values, and dropout randomly sets feature values to 0; dropout is a regularization means to avoid model overfitting; the feature vectors obtained from convolution kernels of different sizes are concatenated and fed into the fully connected layer for classification;
step 9: outputting a probability value for each category through the softmax activation function to obtain the final classification result.
2. The news headline short text classification method based on dual knowledge graphs of claim 1, wherein the special characters in step 1 comprise Chinese and English punctuation marks, English characters, digits, and special symbols.
3. The news headline short text classification method based on dual knowledge graphs of claim 1, wherein the TextCNN in step 8 comprises a convolutional layer, a pooling layer, and a fully connected layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210643031.3A CN114996455A (en) | 2022-06-08 | 2022-06-08 | News title short text classification method based on double knowledge maps |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114996455A true CN114996455A (en) | 2022-09-02 |
Family
ID=83033607
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210643031.3A Pending CN114996455A (en) | 2022-06-08 | 2022-06-08 | News title short text classification method based on double knowledge maps |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116051132A (en) * | 2023-04-03 | 2023-05-02 | 之江实验室 | Illegal commodity identification method and device, computer equipment and storage medium |
CN116051132B (en) * | 2023-04-03 | 2023-06-30 | 之江实验室 | Illegal commodity identification method and device, computer equipment and storage medium |
Legal Events

| Code | Title |
| --- | --- |
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |