CN112685549B

CN112685549B - Method and system for identifying case-related news element entities by integrating discourse semantics

Info

Publication number: CN112685549B
Application number: CN202110023176.9A
Authority: CN
Inventors: 线岩团; 王佳雯; 王剑; 余正涛; 郭军军; 相艳
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2021-01-08
Filing date: 2021-01-08
Publication date: 2022-07-29
Anticipated expiration: 2041-01-08
Also published as: CN112685549A

Abstract

The invention relates to a method and a system for identifying case-related news element entities by integrating discourse semantics, and belongs to the technical field of natural language processing. The method crawls a corpus from the major-cases module of the Chinese news network, condenses the first paragraph of each news text to obtain a news central sentence, and builds a database that pairs each news central sentence with its corresponding news text sentences. A multi-head attention mechanism learns a discourse semantic representation from the news central sentence and fuses it with the news text sentence; a Bi-LSTM then extracts context information from the fused representation, and a conditional random field finally identifies the element entities in the sentence. Because constituent reference and constituent omission are ubiquitous in the text sentences of case-related news, the invention integrates discourse semantics into entity identification and thereby effectively alleviates the problem of missing context and semantics. It provides strong support for subsequent public opinion analysis of case-related news.

Description

Method and system for identifying case-related news element entities by integrating discourse semantics

Technical Field

The invention relates to a method and a system for identifying case-related news element entities by integrating discourse semantics, and belongs to the technical field of natural language processing.

Background

In public opinion analysis of case-related news, high-quality case-related news element entities are the foundation and prerequisite of follow-up work and can be applied widely, for example to extracting relations between case-related news elements, building a knowledge graph of case-related news, and tracking sensitive words in case-related news. Because constituent reference and constituent omission are common in the text sentences of case-related news, missing semantics is the key difficulty in identifying case-related news element entities and directly affects identification accuracy, as shown in fig. 3. The invention therefore studies case-related news element entity identification under missing semantics and realizes a method and a system for identifying case-related news element entities by integrating discourse semantics.

Disclosure of Invention

The invention provides a method and a system for identifying case-related news element entities by integrating discourse semantics, which alleviate the problem of missing semantics and learn a multi-level, multi-angle semantic understanding, thereby improving the identification effect.

The technical scheme of the invention is as follows: in a first aspect, the invention provides a method for identifying case-related news element entities by integrating discourse semantics, the method comprising the following specific steps:

Step1, first condensing the first paragraph of the case-related news to obtain a news central sentence, then segmenting the news text sentence and the news central sentence into characters and labeling them, and finally constructing a dictionary in which each labeled news text sentence corresponds one-to-one to its news central sentence;

Step2, converting the news central sentence and the news text sentence into character vectors by using a Skip-gram model;

Step3, constructing a case-related news element entity identification model that integrates discourse semantics, so as to extract case-related news element entities effectively.

As a further scheme of the present invention, Step1 specifically comprises the following steps:

Step1.1, crawling a case-related news corpus from the major-cases module of the Chinese news network by using a web crawler program;

Step1.2, filtering and denoising the crawled case-related news corpus to construct a document-level case-related news corpus, and storing the document-level corpus in a database;

Step1.3, retrieving the document-level case-related news corpus from the database of Step1.2, splitting it into sentences to form a sentence-level corpus, manually condensing the first paragraph of each case-related news text to obtain a news central sentence, putting the news central sentence in one-to-one correspondence with the news text sentences, segmenting the news central sentence and the news text sentences into characters, and storing the resulting sentence-level corpus in the database;

Step1.4, retrieving the sentence-level case-related news corpus from the database of Step1.3, manually labeling it with BIEOS tags, and classifying the case-related news element entities by category, so as to form a labeled case-related news corpus that contains the news central sentences.
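By way of illustration only, the pairing of a news central sentence with the news text sentences in Step1.3 might be sketched as follows. The function names, the newline-separated paragraph layout and the sample article are hypothetical; in particular, the sketch uses the first paragraph unchanged as the central sentence, whereas the method condenses it manually.

```python
import re

def split_sentences(text):
    """Split text into sentences after Chinese or English end punctuation."""
    parts = re.split(r"(?<=[。！？!?.])", text)
    return [p.strip() for p in parts if p.strip()]

def build_pairs(article):
    """Pair every body sentence with the article's central sentence.

    Assumption: paragraphs are separated by newlines, and the first paragraph
    stands in for the manually condensed news central sentence.
    """
    paragraphs = [p.strip() for p in article.split("\n") if p.strip()]
    central = paragraphs[0]
    body = " ".join(paragraphs[1:])
    return [{"central": central, "sentence": s} for s in split_sentences(body)]

# Hypothetical two-paragraph article.
article = (
    "Police in City A arrested Zhang for robbery.\n"
    "He was caught on Monday. The victim, Li, was unhurt!"
)
pairs = build_pairs(article)
for pair in pairs:
    print(pair["sentence"], "<-", pair["central"])
```

Each record keeps the central sentence next to the text sentence, so the two model inputs can later be read from the same entry.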

As a further scheme of the present invention, Step2 specifically comprises the following step:

Step2.1, first training a Skip-gram model on the case-related news corpus to form a character vector table, and then converting each character in the news text sentence and the news central sentence into a character vector by looking it up in the character vector table, which yields a character vector sequence for each sentence.
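The following is a minimal sketch of the two ideas in Step2.1: the (center, context) character pairs on which a Skip-gram model is trained, and the table lookup that turns a sentence into a character vector sequence. The window size, the `<UNK>` entry, the sample strings and the random table (a stand-in for trained vectors) are assumptions made for illustration; only the dimension of 128 comes from the description.

```python
import numpy as np

def skipgram_pairs(chars, window=2):
    """Return the (center, context) pairs a Skip-gram model would train on."""
    pairs = []
    for i, center in enumerate(chars):
        lo, hi = max(0, i - window), min(len(chars), i + window + 1)
        pairs.extend((center, chars[j]) for j in range(lo, hi) if j != i)
    return pairs

def lookup(sentence, char2id, table):
    """Map each character to its row in the character vector table."""
    unk = char2id["<UNK>"]
    ids = [char2id.get(ch, unk) for ch in sentence]
    return table[ids]

corpus = "警方抓获嫌疑人"  # hypothetical one-sentence corpus
char2id = {"<UNK>": 0}
for ch in corpus:
    char2id.setdefault(ch, len(char2id))

rng = np.random.default_rng(0)
table = rng.normal(size=(len(char2id), 128))  # stand-in for trained vectors

pairs = skipgram_pairs(list(corpus), window=1)
vectors = lookup("警方抓人", char2id, table)
print(len(pairs), vectors.shape)
```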

As a further scheme of the present invention, Step3 specifically comprises the following steps:

Step3.1, the case-related news element entity identification model that integrates discourse semantics has two inputs: a news text sentence and a news central sentence; Multi-Head Attention learns the discourse semantic representation and integrates the news central sentence into the news text sentence from different dimensions, yielding multi-level semantic features that incorporate discourse semantics;

Step3.2, after the multi-level semantic features that incorporate discourse semantics are obtained, a Bi-LSTM extracts contextual semantic features from them;

Step3.3, a conditional random field performs constrained decoding on the Bi-LSTM output that incorporates the discourse semantic features and identifies the element entities in the sentence, completing the case-related news element entity identification model that integrates discourse semantics.
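For orientation, Steps 3.1 to 3.3 may be written compactly in standard notation. This is a sketch under assumptions rather than a formula taken from the description: C and X denote the character vector sequences of the news central sentence and the news text sentence, the projection matrices W, the scaling by the square root of d_k, the output layer (W_o, b_o) and the transition matrix A are the usual textbook forms, and h is the number of heads.

```latex
\begin{aligned}
\mathrm{head}_i &= \mathrm{softmax}\!\left(\frac{(C W_i^{Q})(X W_i^{K})^{\top}}{\sqrt{d_k}}\right) X W_i^{V},
  \qquad i = 1,\dots,h \\
E &= [\mathrm{head}_1; \dots; \mathrm{head}_h] \\
H_t &= [\overrightarrow{h}_t; \overleftarrow{h}_t],
  \qquad \overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(E_t, \overrightarrow{h}_{t-1}),
  \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(E_t, \overleftarrow{h}_{t+1}) \\
s(y) &= \sum_{t=1}^{n} \left( P_{t,y_t} + A_{y_{t-1},y_t} \right),
  \qquad P_t = W_o H_t + b_o \\
\hat{y} &= \arg\max_{y} \; s(y)
\end{aligned}
```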

In a second aspect, an embodiment of the present invention further provides a system for identifying case-related news element entities by integrating discourse semantics, where the system includes modules for performing the method of the first aspect.

The beneficial effects of the invention are:

The invention provides a Multi-Head Attention-Bi-LSTM-CRF method that integrates discourse semantics for the task of case-related news element entity identification. To address constituent reference and constituent omission in the text sentences of case-related news, the method integrates the news central sentence, which carries discourse semantics, into the news text sentence, thereby alleviating the loss of contextual semantics. The model learns the discourse semantic representation from the news central sentence, fuses it with the news text sentence, acquires the context information that incorporates the discourse semantics with a Bi-LSTM, and identifies the element entities in the sentence with a conditional random field. The model can therefore learn a multi-level, multi-angle semantic understanding, which improves the identification effect and provides strong support for subsequent public opinion analysis of case-related news.

Drawings

FIG. 1 is a general flow diagram of the present invention;

FIG. 2 is a general model architecture diagram of the present invention;

FIG. 3 is a sample illustration;

FIG. 4 is a comparison of the experimental results for each category.

Detailed Description

Example 1: as shown in figs. 1-4, the method for identifying case-related news element entities by integrating discourse semantics includes the following steps:

Step2, converting the news central sentence and the news text sentence into character vectors by using a Skip-gram model;

Step1.4, retrieving the sentence-level case-related news corpus from the database of Step1.3, manually labeling it with BIEOS tags, and dividing the case-related news element entities into 6 categories: victims, criminal suspects, places of crime, police investigating the case, courts trying the case, and other non-element entities. This forms a labeled case-related news corpus that contains the news central sentences.

As a further aspect of the present invention, the Step3 specifically includes:

The computation that integrates the news central sentence into the news text sentence can be divided into the following three parts:

(1) First, the news text sentence is taken as the key and value and the news central sentence as the query, and both are projected into four different representation subspaces through linear transformations.

(2) Then, scaled dot-product attention is computed between the news central sentence and the news text sentence in each representation subspace: the dot product of the news central sentence query and the news text sentence key gives a mapping score from the news central sentence to the news text sentence, and softmax compresses the score to between 0 and 1. The mapping score is then multiplied by the news text sentence value, which integrates the news central sentence of the case-related news into the news text sentence and yields, in that representation subspace, news text sentence features that incorporate the discourse semantic features.

(3) Finally, the feature results obtained from the 4 different representation subspaces are concatenated to obtain the multi-level semantic feature E that incorporates discourse semantics.
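The three parts above can be sketched in NumPy as follows. This is a minimal illustration, not the TensorFlow implementation used in the experiments: the weight names, the random initialization, the per-head dimension of 32 and the square-root scaling factor are assumptions; the 4 heads, the sentence length of 120 and the vector dimension of 128 follow the description.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax; outputs lie in (0, 1) and sum to 1."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_discourse(central, text, params, num_heads=4):
    """Fuse the central sentence (query) with the text sentence (key, value).

    central: (Lc, d) character vectors of the news central sentence
    text:    (Lt, d) character vectors of the news text sentence
    Returns the concatenated per-head features, shape (Lc, num_heads * d_head).
    """
    heads = []
    for h in range(num_heads):
        q = central @ params["Wq"][h]  # (Lc, d_head) projected query
        k = text @ params["Wk"][h]     # (Lt, d_head) projected key
        v = text @ params["Wv"][h]     # (Lt, d_head) projected value
        scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (Lc, Lt)
        heads.append(scores @ v)       # weight the values by the scores
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 120, 128, 4
d_head = d_model // num_heads          # assumed per-head size
params = {
    name: rng.normal(scale=0.1, size=(num_heads, d_model, d_head))
    for name in ("Wq", "Wk", "Wv")
}
central = rng.normal(size=(seq_len, d_model))
text = rng.normal(size=(seq_len, d_model))

E = fuse_discourse(central, text, params, num_heads)
print(E.shape)
```

Because both sentences are padded to the same length, the fused feature E has one row per character position and can be passed directly to the Bi-LSTM.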

After the multi-level semantic feature E that incorporates discourse semantics is obtained, a Bi-LSTM extracts contextual semantic features from it. The forward and backward LSTM hidden states are concatenated to obtain multi-level and more comprehensive semantic features.
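The concatenation of forward and backward hidden states can be illustrated with a small NumPy Bi-LSTM. This is a sketch for exposition only: the gate ordering, the weight initialization and the helper names are assumptions, and the random input stands in for the feature E; the 128 hidden units per direction follow the experimental settings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(inputs, W, U, b, hidden):
    """Run one LSTM direction over inputs (T, d); return states (T, hidden)."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    states = []
    for x in inputs:
        z = W @ x + U @ h + b              # stacked gate pre-activations
        i, f, o, g = np.split(z, 4)        # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        states.append(h)
    return np.stack(states)

def bilstm(inputs, fwd, bwd, hidden):
    """Concatenate forward and backward hidden states at every position."""
    forward = lstm_pass(inputs, *fwd, hidden)
    # Run on the reversed sequence, then flip back to the original order.
    backward = lstm_pass(inputs[::-1], *bwd, hidden)[::-1]
    return np.concatenate([forward, backward], axis=-1)

def init_weights(rng, d, hidden):
    """Hypothetical small random initialization for one direction."""
    return (
        rng.normal(scale=0.1, size=(4 * hidden, d)),
        rng.normal(scale=0.1, size=(4 * hidden, hidden)),
        np.zeros(4 * hidden),
    )

rng = np.random.default_rng(0)
T, d, hidden = 120, 128, 128
E = rng.normal(size=(T, d))                # stand-in for the fused feature E
H = bilstm(E, init_weights(rng, d, hidden), init_weights(rng, d, hidden), hidden)
print(H.shape)
```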

A conditional random field then performs constrained decoding on the Bi-LSTM output that incorporates the discourse semantic features, completing the case-related news element entity identification model that integrates discourse semantics.
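To show what constrained decoding adds, the sketch below runs Viterbi decoding over hypothetical emission scores with BIEOS transition constraints. It simplifies the conditional random field in one important respect: a trained CRF learns its transition scores, whereas here a transition is simply 0 if the tag pair is legal and a large negative value otherwise. The tag suffix `SUS` and the emission values are invented for the example.

```python
import numpy as np

NEG = -1e9  # score used to forbid a transition

def allowed(prev, cur):
    """BIEOS rule: B/I is followed by I/E of the same type; O/E/S by O/B/S."""
    p_pos, _, p_type = prev.partition("_")
    c_pos, _, c_type = cur.partition("_")
    if p_pos in ("B", "I"):
        return c_pos in ("I", "E") and c_type == p_type
    return c_pos in ("O", "B", "S")

def viterbi(emissions, tags):
    """Return the highest-scoring tag sequence that satisfies the constraints."""
    trans = np.array([[0.0 if allowed(p, c) else NEG for c in tags] for p in tags])
    start = np.array([NEG if t[0] in "IE" else 0.0 for t in tags])
    end = np.array([NEG if t[0] in "BI" else 0.0 for t in tags])
    score = start + emissions[0]
    back = []
    for emit in emissions[1:]:
        cand = score[:, None] + trans      # cand[prev, cur]
        back.append(cand.argmax(axis=0))   # best previous tag for each cur
        score = cand.max(axis=0) + emit
    score = score + end
    path = [int(score.argmax())]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return [tags[k] for k in reversed(path)]

tags = ["O", "B_SUS", "I_SUS", "E_SUS", "S_SUS"]
emissions = np.array([
    [0.1, 2.0, 0.0, 0.0, 0.5],
    [1.5, 0.0, 1.0, 0.2, 0.0],   # the per-position best tag here is O
    [0.3, 0.0, 0.0, 2.0, 0.0],
])
greedy = [tags[k] for k in emissions.argmax(axis=1)]
decoded = viterbi(emissions, tags)
print(greedy)   # ['B_SUS', 'O', 'E_SUS'], an illegal sequence
print(decoded)  # ['B_SUS', 'I_SUS', 'E_SUS']
```

Picking the best tag at each position independently yields an entity that starts and ends but is interrupted by O; decoding over whole sequences under the constraints repairs it.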

The experiments use the TensorFlow 1.13.2 framework. The sentence length is set to 120 characters for both the news central sentence and the news text sentence. Training uses the Adam optimization algorithm with a learning rate of 0.004; the character vector dimension is 128; each LSTM layer has 128 neurons; and the number of iterations and the batch size are 31 and 16, respectively.
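For reference, the settings above can be collected in one configuration object, sketched here as a Python dataclass. The field names are hypothetical, and reading "iterations" as training epochs and "batches" as the batch size is an interpretation of the text rather than something it states outright. The count of 1400 training samples assumes the 7:1:2 split of the 2000 pieces of data.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    max_sentence_len: int = 120   # characters, for both input sentences
    optimizer: str = "adam"
    learning_rate: float = 0.004
    embedding_dim: int = 128      # character vector dimension
    lstm_units: int = 128         # neurons per LSTM layer
    num_heads: int = 4            # representation subspaces
    epochs: int = 31              # assumed meaning of "iterations"
    batch_size: int = 16          # assumed meaning of "batches"

    def steps_per_epoch(self, num_samples):
        """Number of batches needed to cover the training set once."""
        return math.ceil(num_samples / self.batch_size)

cfg = TrainConfig()
steps = cfg.steps_per_epoch(1400)
print(cfg, steps)
```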

Precision P, recall R, and the F1 value are used as the evaluation indexes of the element entity identification results. The 3 evaluation indexes are calculated as follows:

$$P = \frac{CE}{IE}, \qquad R = \frac{CE}{SE}, \qquad F1 = \frac{2 \times P \times R}{P + R}$$

where CE is the number of correctly identified element entities, IE is the number of identified element entities, and SE is the number of element entities in the sample.
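A short sketch of how the three indexes can be computed from CE, IE and SE follows. Representing entities as (start, end, type) spans and counting only exact matches as correct are assumptions made for the example, as are the sample spans.

```python
def prf(gold, pred):
    """Compute precision, recall and F1 from sets of (start, end, type) spans."""
    ce = len(gold & pred)   # CE: correctly identified element entities
    ie = len(pred)          # IE: identified element entities
    se = len(gold)          # SE: element entities in the sample
    p = ce / ie if ie else 0.0
    r = ce / se if se else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical gold and predicted spans; the second prediction has a wrong end.
gold = {(0, 2, "suspect"), (5, 6, "victim"), (9, 12, "court")}
pred = {(0, 2, "suspect"), (5, 7, "victim")}
p, r, f1 = prf(gold, pred)
print(p, r, f1)
```

Here CE = 1, IE = 2 and SE = 3, so P = 1/2, R = 1/3 and F1 = 0.4.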

Unlike common named entities, the element entity categories are: criminal suspects, victims, places of crime, police investigating the case, and courts trying the case. The corpus for these 5 element entity categories was crawled from the major-cases module of the Chinese news network; it covers 97 cases and 2000 pieces of data, and is divided into a training set, a validation set and a test set in the ratio 7:1:2. The distribution of sentences and element entities in the corpus is shown in table 1.

Table 1. Case-related news corpus statistics
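The 7:1:2 division of the 2000 pieces of data can be sketched as below. Shuffling individual samples with a fixed seed is an assumption; the description does not say whether the split was made per sample or per case.

```python
import random

def split_712(items, seed=0):
    """Shuffle and split items into train, validation and test at 7:1:2."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test

samples = list(range(2000))  # stand-ins for the 2000 pieces of data
train, dev, test = split_712(samples)
print(len(train), len(dev), len(test))
```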

The central sentence of case-related news, which serves as external knowledge, is obtained by condensing the first paragraph of the news document. After the original corpus is obtained, a dictionary that puts each news central sentence in one-to-one correspondence with the news text sentences is built first, the sentences are then segmented into characters, and the segmented corpus is finally labeled. Labeling samples are shown in table 2.

Table 2. Sample annotation of the case-related news corpus

In the experiment, BIEOS tags are used to label each character of the news central sentence and the news text sentence: the B_ prefix marks the first character of an element entity, the I_ prefix marks a middle character, the E_ prefix marks the last character, S marks a single-character element entity, and O marks a non-element entity; a labeling sample is shown in table 2. Finally, a Skip-gram model is pre-trained on the labeled corpus to generate character vectors with a dimension of 128. After pre-training, each character is numbered to generate a unique id, and the characters and their ids are stored in the word2id and id2word dictionaries.
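A minimal sketch of BIEOS labeling and of the word2id and id2word dictionaries is given below. The sample sentence, the entity spans and the type suffixes `SUS` and `LOC` are hypothetical, as is the `S_` plus type form of the single-character tag; only the dictionary names come from the description.

```python
def bieos_tags(sentence, entities):
    """Tag each character; entities are (start, end_exclusive, type) spans."""
    tags = ["O"] * len(sentence)
    for start, end, etype in entities:
        if end - start == 1:
            tags[start] = f"S_{etype}"          # single-character entity
        else:
            tags[start] = f"B_{etype}"          # first character
            for i in range(start + 1, end - 1):
                tags[i] = f"I_{etype}"          # middle characters
            tags[end - 1] = f"E_{etype}"        # last character
    return tags

def build_vocab(sentences):
    """Give every character a unique id and build both lookup dictionaries."""
    word2id = {}
    for sentence in sentences:
        for ch in sentence:
            word2id.setdefault(ch, len(word2id))
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word

sentence = "张某某在昆明被捕"  # hypothetical news text sentence
tags = bieos_tags(sentence, [(0, 3, "SUS"), (4, 6, "LOC")])
word2id, id2word = build_vocab([sentence])
for ch, tag in zip(sentence, tags):
    print(ch, tag)
```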

The experiments are divided into three parts: a comparison experiment, an ablation experiment and an output test.

The baseline models used in the comparison experiment are:

Bi-LSTM-CRF: a Bi-LSTM network acquires the context information of the news text, and a CRF layer produces the label information of the news text.

Bi-LSTM-Self-Attention-CRF: a Self-Attention mechanism gives each character in the sentence global semantic information; the Bi-LSTM first acquires the contextual semantics of the news text sentence, Self-Attention then acquires the global semantics, and a CRF finally decodes.

Multi-Head Attention-Bi-LSTM-CRF: Multi-Head Attention allows the model to understand the input sequence from different angles. In the experiment, 4 attention heads obtain multi-angle semantic information from the news text sentence, a Bi-LSTM then acquires the global semantics, and a CRF finally decodes. The results of the comparison experiment are shown in table 3.

Table 3. Comparison of case-related news element extraction methods

The experimental results show that the Bi-LSTM-CRF model scores lowest on all three indexes. The Bi-LSTM-Self-Attention-CRF model is higher than the Bi-LSTM-CRF model on all indexes: its P, R and F1 values are improved by 4%, 14% and 10%, respectively. The Multi-Head Attention-Bi-LSTM-CRF model in turn improves greatly on the Bi-LSTM-Self-Attention-CRF model. Because Multi-Head Attention can obtain multi-dimensional important semantic features of a sentence from different semantic spaces, the invention uses Multi-Head Attention to integrate discourse semantic knowledge, so that discourse semantics supplement the semantics missing from the sentence in different semantic spaces; a Bi-LSTM then captures the global semantic information that incorporates the discourse semantics. This yields an overall improvement: compared with the Multi-Head Attention-Bi-LSTM-CRF model without discourse semantics, the three index values are improved by 1%, 4% and 3%, respectively.

These results show that the proposed Multi-Head Attention-Bi-LSTM-CRF model that integrates discourse semantics can supplement the semantic information missing from a sentence and thereby improve the performance of element entity identification.

The case-related news corpus contains 5 element entity categories. The experimental results for each category under the different models are shown in fig. 4.

As can be seen from fig. 4, the category that the 4 models identify best is "criminal suspect", and the category they identify worst is "place of crime". For "criminal suspect", the Multi-Head Attention-Bi-LSTM-CRF model performs best at 84%, followed by the Multi-Head Attention-Bi-LSTM-CRF model that integrates discourse semantics at 83%. For "victim", the Multi-Head Attention-Bi-LSTM-CRF model that integrates discourse semantics performs best at 78%, followed by the Multi-Head Attention-Bi-LSTM-CRF model at 77%. For "place of crime", the Multi-Head Attention-Bi-LSTM-CRF model that integrates discourse semantics performs best at 42%, followed by the Bi-LSTM-CRF model at 38%. For "police investigating the case", the Multi-Head Attention-Bi-LSTM-CRF model that integrates discourse semantics achieves the highest F value, 58%. For "court trying the case", the Multi-Head Attention-Bi-LSTM-CRF model that integrates discourse semantics performs best at 79%.

In summary, the Multi-Head Attention-Bi-LSTM-CRF model that integrates discourse semantics does not show a clear advantage over the other models on "criminal suspect" and "victim", but is far superior to the other three models on the three categories "place of crime", "police investigating the case" and "court trying the case".

Overall, the Multi-Head Attention-Bi-LSTM-CRF model that integrates discourse semantics achieves the best identification effect.

To further verify the effectiveness of the Bi-LSTM-CRF model that uses Multi-Head Attention to integrate discourse semantics, each component is removed in turn and the results are compared, in order to analyze whether each component contributes to the extraction of element entities.

Table 4. Ablation experiment results

As can be seen from Table 4, integrating discourse semantics plays a real role in the element entity identification task. The R and F1 values of the Multi-Head Attention-CRF model are low; after discourse semantics are integrated, the P value drops by 5%, while the R and F1 values rise by 3% and 2%, respectively. The reason is that Multi-Head Attention integrates discourse semantics into the news text sentence from different dimensions, which alleviates the problem of missing semantics but neglects the semantic information carried by the news text itself, so the P value is lower than that of the Multi-Head Attention-CRF model. In the Multi-Head Attention-Bi-LSTM-CRF model that integrates discourse semantics, adding the Bi-LSTM improves precision, recall and the F1 value by 7%, 29% and 23%, respectively. Thus, once the Bi-LSTM is added, the model acquires discourse semantics through Multi-Head Attention and the Bi-LSTM acquires the contextual semantic information that incorporates them, which truly realizes multi-level and multi-angle semantic understanding.

The following is a system embodiment of the present invention. The embodiment further provides a system for identifying case-related news element entities by integrating discourse semantics, and the system includes modules for executing the method of the first aspect:

a dictionary construction module: used for condensing the first paragraph of the case-related news to obtain a news central sentence, segmenting the news text sentence and the news central sentence into characters and labeling them, and finally constructing a dictionary in which each labeled news text sentence corresponds one-to-one to its news central sentence;

a character vector conversion module: used for converting the news central sentence and the news text sentence into character vectors by using a Skip-gram model;

a model construction and entity extraction module: used for constructing a case-related news element entity identification model that integrates discourse semantics, so as to extract case-related news element entities effectively.

In a possible implementation manner, the dictionary construction module is specifically configured to:

crawl a case-related news corpus from the major-cases module of the Chinese news network by using a web crawler program;

filter and denoise the crawled case-related news corpus to construct a document-level case-related news corpus, and store the document-level corpus in a database;

retrieve the document-level case-related news corpus from the database, split it into sentences to form a sentence-level corpus, manually condense the first paragraph of each case-related news text to obtain a news central sentence, put the news central sentence in one-to-one correspondence with the news text sentences, segment the news central sentence and the news text sentences into characters, and store the resulting sentence-level corpus in the database;

retrieve the sentence-level case-related news corpus from the database, manually label it with BIEOS tags, and classify the case-related news element entities by category, so as to form a labeled case-related news corpus that contains the news central sentences.

In a possible implementation manner, the character vector conversion module is specifically configured to:

first train a Skip-gram model on the case-related news corpus to form a character vector table, and then convert each character in the news text sentence and the news central sentence into a character vector by looking it up in the character vector table, which yields a character vector sequence for each sentence.

In a possible implementation manner, the model construction and entity extraction module is specifically configured as follows:

the case-related news element entity identification model that integrates discourse semantics has two inputs: a news text sentence and a news central sentence; Multi-Head Attention learns the discourse semantic representation and integrates the news central sentence into the news text sentence from different dimensions, yielding multi-level semantic features that incorporate discourse semantics;

after the multi-level semantic features that incorporate discourse semantics are obtained, a Bi-LSTM extracts contextual semantic features from them;

a conditional random field performs constrained decoding on the Bi-LSTM output that incorporates the discourse semantic features and identifies the element entities in the sentence, completing the case-related news element entity identification model that integrates discourse semantics.

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims

1. A method for identifying case-related news element entities by integrating discourse semantics, characterized in that the method comprises the following specific steps:

Step3, constructing a case-related news element entity identification model that integrates discourse semantics, so as to extract case-related news element entities effectively;

the specific steps of Step3 include:

2. The method for identifying case-related news element entities by integrating discourse semantics according to claim 1, characterized in that the specific steps of Step1 are as follows:

Step1.3, retrieving the document-level case-related news corpus from the database of Step1.2, splitting it into sentences to form a sentence-level corpus, manually condensing the first paragraph of each case-related news text to obtain a news central sentence, segmenting the news central sentence and the news text sentences into characters, and storing the resulting sentence-level corpus in the database;

Step1.4, retrieving the sentence-level case-related news corpus from the database of Step1.3, manually labeling it with BIEOS tags, classifying the case-related news element entities by category to form a labeled case-related news corpus that contains the news central sentences, and putting the news central sentences in one-to-one correspondence with the news text sentences.

3. The method for identifying case-related news element entities by integrating discourse semantics according to claim 1, characterized in that the specific steps of Step2 are as follows:

4. A system for identifying case-related news element entities by integrating discourse semantics, comprising modules for performing the method of any one of claims 1 to 3.