CN108897857B

CN108897857B - Domain-oriented method for generating topic sentences for Chinese text

Info

Publication number: CN108897857B
Application number: CN201810696452.6A
Authority: CN
Inventors: 宋晖; 刘栩彤; 戴龙其; 叶长晖; 岳万琛
Original assignee: Donghua University
Current assignee: Donghua University
Priority date: 2018-06-28
Filing date: 2018-06-28
Publication date: 2021-08-27
Anticipated expiration: 2038-06-28
Also published as: CN108897857A

Abstract

The invention provides a domain-oriented method for generating topic sentences for Chinese text, characterized by comprising the following steps: building a domain knowledge graph for a domain text data set, extracting semantic information from the text with a deep neural network model, classifying the text by topic sentence pattern, and finally generating a topic sentence for the text. By building the domain knowledge graph, the method obtains the concept model of the data set and the characteristics of how its content is described; it then labels and classifies the text data with deep learning models, so as to generate the topic sentence of each text and support knowledge-based query and statistics. The method is widely applicable and generates topic sentences well on data sets from a restricted domain.

Description

Domain-oriented method for generating topic sentences for Chinese text

Technical Field

The invention relates to a method for extracting the topic of a Chinese text, and in particular to a method that summarizes the descriptive characteristics of domain text from a domain data set and generates a topic sentence for each text.

Background

In recent years, with the development of artificial intelligence technology, computers have achieved many practically valuable results in natural language understanding. Topic extraction is an important branch of text mining and plays an important role in search engines, text classification, information statistics and the like. How to extract topic information from text concisely and accurately is the key to understanding what the language expresses, and is a research hotspot in the field.

Because of the diversity and complexity of Chinese semantics and sentence structure, it is difficult to directly extract the subject from the text. In order to obtain main information of a text, the existing methods mainly extract topic keywords from the text, and are mainly classified into methods based on statistical analysis and semantic analysis.

Statistics-based methods typically find topic keywords in text by computing statistics such as word frequency, word co-occurrence, or word weight. These methods ignore the semantic features of the text, so the extracted results easily contain noise and the accuracy is low. Semantics-based methods usually rely on human prior knowledge, extracting key information from text with predefined semantic templates or by introducing external knowledge bases. Compared with statistical methods, semantics-based methods are considerably more accurate, but they are complicated to implement and port poorly to other domains.

Representing a text by topic keywords alone ignores the associations between those keywords and cannot accurately capture the factual knowledge that the text states.

With the introduction of the knowledge graph concept and the development of neural network models, many researchers have tried to represent knowledge as triples of the form (entity, relationship, entity) or (entity, attribute, attribute value), to construct graph representation models, and to extract knowledge instances from text using supervised or semi-supervised learning methods. For example, entities, relationships, attributes and so on are expressed as vectors, and a neural network model is trained on them to obtain the corresponding classification or other related information. These techniques have been widely used in knowledge-based question answering systems, intelligent robots and the like.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: existing topic extraction methods describe a text mainly through topic keywords and cannot produce a complete statement of its topic. For strongly domain-specific text data, knowledge graph structures designed for the open domain can hardly reflect how knowledge is described in different domains, or summarize the topic information contained in the text.

In order to solve the above technical problem, the technical solution of the invention is to provide a method that automatically constructs a domain-oriented knowledge graph and generates topic sentences for Chinese text, and that supports knowledge-based query and statistics. For clarity of discussion, the domain of the preferred embodiment is text describing urban management case events. The method is characterized by comprising the following steps:

Step 1: creating the domain knowledge graph

Each record in the urban management case event data set describes the specifics of one case as a Chinese sentence. The data set is first processed by part-of-speech tagging, word frequency statistics and weight ranking. The LDA topic clustering algorithm is then applied to the processed data set to perform iterative, layer-by-layer topic clustering, which finds the entity categories, the descriptions and their hierarchical relationships layer by layer, yielding a series of topic entries that contain instances and their corresponding descriptors, together with the hierarchy among topic entries of different content. The K-means algorithm is then applied to cluster all the words obtained by LDA topic clustering, and entity concepts are abstracted from the clustering result, forming a domain knowledge graph whose basic units are (entity, state description) and (entity, behavior/action description);

Step 2: semantic information extraction

Defining a semantic label for each type of entity and description according to the domain knowledge graph, annotating a training set with these semantic labels, and training a BLSTM-CRF model on the training set to predict the semantic labels, wherein the BLSTM-CRF model comprises an input layer, a BLSTM layer, a CRF layer and an output layer, and wherein:

in an input layer, a sentence is represented as a vector list, and each vector in the vector list is a word vector corresponding to each word in the sentence;

the BLSTM layer is a bidirectional LSTM neural network and consists of a forward LSTM part and a backward LSTM part, the output of the BLSTM layer is a probability matrix, and each value in the probability matrix represents the probability that a corresponding word in a sentence is marked as a corresponding semantic label;

the CRF layer is an undirected graph model;

for the sentence, the output of the output layer is the serial number of the semantic label corresponding to each word in the sentence;

Step 3: topic sentence generation

Training a Bi-LSTM-based classification model for topic sentence patterns and using it to classify the text with predicted semantic labels at the level of topic sentence pattern, thereby determining the topic sentence pattern to which each record belongs; then, from the content of the domain knowledge graph, the semantic information extraction result and the classification result, determining the word sequence to be extracted from the text and the sentence pattern in which to arrange it, and generating a complete topic sentence for the text.

Preferably, in step 1, applying the LDA topic clustering algorithm to the preprocessed city management case event information data set includes the following steps:

step 101, performing an LDA operation on the urban management case event information data set to generate n topic entries, where 2 ≤ n ≤ 10, each topic entry comprising 10 topic words sorted in descending order of TF-IDF weight;

step 102, according to the topic entries obtained in step 101, selecting from the urban management case event information data set the events that contain the word combination of a topic entry;

step 103, performing the LDA topic clustering operation again on the event information corresponding to each topic entry, to mine the specific event content types under each broad category;

step 104, removing the events selected in step 102 from the current urban management case event information data set, and repeating step 101 to find the topic entries still hidden in the current data set;

step 105, after new topic entries are obtained, repeating steps 102, 103 and 104 until the LDA topic clustering operation yields no new topic entry.

Preferably, in step 1, applying the K-means algorithm comprises the following steps: combining all the topic words pairwise and calculating their degree of co-occurrence, where a high degree of co-occurrence between two topic words indicates that they are associated; determining, in combination with the results of word frequency statistics and part-of-speech tagging, which topic words are instance words and which are descriptors; and finally determining the connection structure between the basic units of the graph.

Preferably, in step 2, the method for forming the training set includes:

Manually labeling N records and collecting the words that carry semantic labels to form a labeled word set ws; then searching the unlabeled training data set for words contained in ws and automatically labeling them with the corresponding semantic labels; and, after manual correction, combining the large amount of automatically labeled data with the manually labeled data to form the final training set.

The method overcomes the shortcomings of existing text topic extraction methods. By building a domain knowledge graph, it obtains the concept model of the data set and the characteristics of how its content is described, and it trains deep learning models to label and classify the text data so as to generate the topic sentence of each text. The method is widely applicable, generates topic sentences well on data sets from a restricted domain, and supports knowledge-graph-based query and statistics over the text collection.

Drawings

FIG. 1 is a schematic flow diagram of a method provided by the present invention;

FIG. 2 is a diagram of the domain knowledge graph structure provided in this embodiment;

FIG. 3 is a schematic diagram of a BLSTM-CRF model of a semantic information extraction part in the invention;

FIG. 4 is a line graph comparing the performance of the models obtained when the semantic information extraction part of this embodiment is trained with neural network models of different depths;

FIG. 5 is a line graph comparing the performance of the models obtained when the topic sentence classification part of this embodiment is trained with neural network models of different depths.

Detailed Description

In order to make the invention more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.

Extracting the topic sentence of a text requires not only extracting the keywords in the text but also organizing those keywords into a short sentence with the correct sentence pattern. For example, for a sentence in the urban community management domain, "The lawn in the Happiness Community has white garbage.", the generated topic sentence should be: "The lawn has white garbage."

To this end, the domain-oriented method for generating topic sentences for Chinese text provided by the invention divides the whole generation process into 3 steps: (1) creating a domain knowledge graph, (2) extracting semantic information, and (3) classifying sentence patterns and generating the topic sentence. Fig. 1 is a flow chart of this process.

Step 1: creating domain knowledge graphs

Each record in the urban management case event information data set describes the specifics of a case as a Chinese sentence, and one record represents one case. The data set contains 63890 case records covering many categories of urban community management events, including public facility safety, public environment maintenance, public affairs consultation, urban security inspection and the like.

Urban management case event descriptions are strongly restricted to their domain, and in terms of concept distribution the descriptions of certain common categories are highly repetitive and similar. In this example, therefore, the entity information and the relationship and attribute information in the data are extracted by combining statistical analysis with a topic probability model.

Generally, text data has different domain characteristics depending on the content it expresses. For a particular data set, the entities, relationships, attributes and other descriptions it contains generally belong to the same domain and show strong domain characteristics. In order to accurately estimate the distribution of concepts and topic sentence patterns over the texts in a data set, the invention builds on the original concept of the knowledge graph and proposes a knowledge graph structure for describing knowledge within a domain, using statistics, text topic clustering, vocabulary clustering and similar methods. During construction, the texts in the original data set are divided into layers according to the results of iterative text topic clustering, and topic clustering is performed repeatedly on the data subsets of the different layers so as to discover the topic content, instances and descriptions hidden in the texts.

As a structured semantic knowledge base, a knowledge graph describes concepts in the physical world and their interrelations in symbolic form. Open-domain knowledge graphs generally take (entity, relationship, entity) or (entity, attribute, attribute value) as their basic units. To describe the data better, the (entity, relationship, entity) and (entity, attribute, attribute value) triples of the general graph are adapted to the domain characteristics of how knowledge is stated in the data set, and adapted graph knowledge units and an associated structure are proposed, summarized as (entity, behavior description) and (entity, state description). The corresponding concepts are extracted by abstracting the entity, relationship and attribute instances in the data set, which forms the domain knowledge graph.

After data preprocessing is finished, the data set first undergoes part-of-speech tagging, word frequency statistics and weight ranking. The LDA topic clustering algorithm is then applied to perform iterative, layer-by-layer topic clustering on the data set, finding the entity categories, the descriptions and their hierarchical relationships layer by layer. The specific process is as follows.

1) An LDA operation is performed on the urban management case event information data set to generate n topic entries (2 ≤ n ≤ 10); each topic entry comprises 10 topic words, sorted in descending order of TF-IDF weight.

2) According to the topic entries obtained in step 1), the events in the urban management case event information data set that contain the word combination of a topic entry are selected.

3) The LDA topic clustering operation is performed again on the event information corresponding to each topic entry, to mine the specific event content types under each broad category.

4) The events selected in step 2) are removed from the current data set, and step 1) is repeated to find the topic entries still hidden in the current urban management case event information data set.

5) After new topic entries are obtained, steps 2), 3) and 4) are repeated until the LDA operation result contains no new topic entry.

This hierarchical topic clustering yields a series of topic entries containing instances and their corresponding descriptors, together with the hierarchy among topic entries of different content.
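For illustration only, the control flow of steps 1) to 5) can be sketched as below. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the function names, the `min_hits` threshold and the dictionary layout are hypothetical, and `frequent_words_stub` is a simple document-frequency stand-in that is NOT LDA; a real LDA routine would be passed in through the `extract` parameter.

```python
from collections import Counter


def frequent_words_stub(docs, n_topics, n_words):
    """Hypothetical stand-in for an LDA call (NOT LDA itself).

    Returns at most one "topic entry": the n_words words with the highest
    document frequency, keeping only words seen in two or more documents.
    n_topics is accepted only so a real LDA wrapper can be swapped in.
    """
    df = Counter(w for doc in docs for w in set(doc))
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))[:n_words]
    words = [w for w, c in ranked if c >= 2]
    return [words] if words else []


def iterative_topics(docs, extract=frequent_words_stub, n_topics=2,
                     n_words=10, min_hits=2):
    """Layered topic discovery following the loop of steps 1) to 5)."""
    remaining = list(docs)
    found = []
    while remaining:
        entries = extract(remaining, n_topics, n_words)        # step 1
        covered, new_entries = set(), []
        for entry in entries:
            vocab = set(entry)
            hits = [i for i, doc in enumerate(remaining)
                    if len(vocab & set(doc)) >= min_hits]       # step 2
            if not hits:
                continue
            subset = [remaining[i] for i in hits]
            sub_entries = extract(subset, n_topics, n_words)   # step 3
            new_entries.append({"words": entry, "sub": sub_entries,
                                "size": len(hits)})
            covered.update(hits)
        if not new_entries:                                     # step 5: stop
            break
        found.extend(new_entries)
        remaining = [doc for i, doc in enumerate(remaining)
                     if i not in covered]                       # step 4
    return found


docs = [
    ["lawn", "garbage", "white"],
    ["lawn", "garbage", "pile"],
    ["lawn", "garbage", "smell"],
    ["lamp", "broken", "street"],
    ["lamp", "broken", "pole"],
]
found = iterative_topics(docs, n_words=2)
```

With the toy data above, the dominant "lawn/garbage" entry is found first and its events are removed, after which the smaller "lamp/broken" entry surfaces in the second pass, which is the effect step 4) is meant to achieve.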

K-means is then used to cluster all the words obtained by LDA, and entity concepts are abstracted from the clustering result.
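The word clustering step can be illustrated with a small, dependency-free K-means. This is only a sketch: the text does not say how words are turned into vectors, so the two-dimensional vectors below are made-up placeholders, and the initialization (first k points) and function name are assumptions.

```python
def kmeans(points, k, iters=20):
    """Plain K-means on lists of floats; returns one cluster index per point.

    Assumption: centroids are initialized with the first k points, which keeps
    the sketch deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assign


# Hypothetical word vectors, for illustration only.
words = ["lawn", "grass", "lamp", "light"]
vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
clusters = dict(zip(words, kmeans(vectors, k=2)))
```

Words that land in the same cluster would then be candidates for abstraction into one entity concept.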

To find the relationships between concepts, the words within each entry are combined pairwise and their degree of co-occurrence is calculated. A high degree of co-occurrence between two words indicates that they are associated, and the results of word frequency statistics and part-of-speech tagging are then used to determine which of the two is the instance word and which is the descriptor. Finally, the connection structure between the basic units of the graph is determined.
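A pairwise co-occurrence computation of this kind might look like the sketch below. The text does not define the "degree of co-occurrence", so the Jaccard-style ratio used here (documents containing both words divided by documents containing either) is an assumption, as are the function and variable names.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs, entry_words):
    """Score every pair of entry words that appears together in a document.

    Score (assumed, not from the source): |docs with both| / |docs with either|.
    Pair keys are alphabetically ordered tuples.
    """
    vocab = set(entry_words)
    word_df = Counter()   # documents containing each word
    pair_df = Counter()   # documents containing both words of a pair
    for doc in docs:
        present = sorted(vocab & set(doc))
        word_df.update(present)
        pair_df.update(combinations(present, 2))
    return {
        (a, b): n / (word_df[a] + word_df[b] - n)
        for (a, b), n in pair_df.items()
    }


docs = [
    ["lawn", "garbage"],
    ["lawn", "garbage", "white"],
    ["lawn", "lamp"],
]
scores = cooccurrence(docs, ["lawn", "garbage", "lamp"])
```

Pairs whose score exceeds a chosen threshold would be treated as associated; the threshold itself is not given in the text.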

Fig. 2 shows the knowledge graph of the urban community management domain, which includes 13 entity categories such as public facilities, general articles, certificates, activities, organizations and public staff, and more than ten description categories of behaviors or states such as "occupation", "damage" and "inspection". Unlike the general triple forms (entity, relationship, entity) and (entity, attribute, attribute value), knowledge in this graph is mainly represented as (entity, state description) and (entity, behavior/action description).

Step 2: semantic information extraction

Semantic labels are created for the different types of text according to the units of the knowledge graph, and the training data set is annotated with them. To obtain a large number of labeled samples, distant supervision is applied on top of manual labeling. Once the labeled data are obtained, a BLSTM-CRF neural network model is trained for sequence labeling so that it can predict the semantic label of each word in the unlabeled data set, which achieves the goal of semantic information extraction. In this embodiment, the specific steps are as follows:

In order to extract from the text data the entities that correspond to the domain knowledge graph, and the behavior descriptions and state descriptions related to those entities, a semantic tag is defined for each type of entity and description according to the domain knowledge graph, and the training set is annotated with these tags. The tags are shown in Table 1:

TABLE 1 semantic tags and their meanings

Semantic information extraction is treated as a sequence labeling task, and a BLSTM-CRF model is trained to predict the tags. The BLSTM-CRF model was implemented by combining the models proposed by Collobert in 2011 and Huang in 2015.

The BLSTM-CRF model needs a large amount of labeled data, and relying on manual labeling alone is time-consuming, so distant supervision is used. The distant supervision process is as follows:

First, 5000 records are labeled manually, and the words that carry semantic labels are collected to form a labeled word set ws. The unlabeled training data set is then searched, and the words contained in ws are found and automatically labeled with the corresponding labels. After manual correction, the large amount of automatically labeled data is combined with the manually labeled data to form the final training set.
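The lexicon-based auto-labeling described above can be sketched as follows. This is an illustrative sketch with hypothetical names (`build_lexicon`, `auto_label`); it works on bare semantic labels before IOBE prefixes are applied, and it resolves a word seen with several labels by keeping the first one, which is an assumption the text does not address. The manual correction pass is outside the sketch.

```python
def build_lexicon(labeled_sentences):
    """Collect the labeled word set ws as a word -> label mapping.

    labeled_sentences: list of sentences, each a list of (word, label) pairs.
    Words labeled "O" carry no semantic label and are skipped.
    """
    ws = {}
    for sentence in labeled_sentences:
        for word, label in sentence:
            if label != "O":
                ws.setdefault(word, label)  # first label seen wins (assumption)
    return ws


def auto_label(sentences, ws):
    """Label every word found in ws; everything else gets "O"."""
    return [[(w, ws.get(w, "O")) for w in sentence] for sentence in sentences]


manual = [
    [("lawn", "PUBLIC"), ("has", "OCCUPY"), ("garbage", "OBJECTS"), (".", "O")],
]
ws = build_lexicon(manual)
auto = auto_label([["garbage", "near", "lawn"]], ws)
```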

Before training the model, each sentence s in the training set is segmented into words $w_1, w_2, \ldots, w_n$, and these words are labeled in the form $\{w_1/tag_1, w_2/tag_2, \ldots, w_n/tag_n\}$, where $w_i$ denotes a word in the sentence and $tag_i$ denotes the semantic tag corresponding to word $w_i$. An entity or descriptive phrase may consist of several words, so the text is labeled in IOBE mode, where:

'B-': indicates the beginning of an entity or descriptor;

'I-': represents the middle of an entity or descriptor;

'E-': represents the end of an entity or descriptor;

'O': indicates that the current word does not belong to any of the other tags.

The model tags are composed of the IOBE prefixes and the semantic tags in Table 1; for example, "B-OCCUPY" indicates that the current word is the beginning of a phrase of type "OCCUPY".
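As a small illustration of how IOBE prefixes combine with semantic tags, the sketch below turns phrase-level annotations into word-level model tags. The function name and input layout are hypothetical. A single-word phrase is tagged with "B-" only, which matches the tagged example sentence given later in this description ("has/B-OCCUPY").

```python
def iobe_encode(chunks):
    """Convert phrase-level annotations into word-level IOBE tags.

    chunks: list of (words, label) pairs, where label is a semantic tag such
    as "OCCUPY", or None for words outside any entity or descriptor.
    """
    tagged = []
    for words, label in chunks:
        for i, word in enumerate(words):
            if label is None:
                tag = "O"
            elif i == 0:
                tag = "B-" + label          # beginning of the phrase
            elif i == len(words) - 1:
                tag = "E-" + label          # end of the phrase
            else:
                tag = "I-" + label          # middle of the phrase
            tagged.append((word, tag))
    return tagged


tagged = iobe_encode([
    (["Happiness", "Community"], "ORG"),
    (["lawn", "on"], "PUBLIC"),
    (["has"], "OCCUPY"),
    (["white", "garbage"], "OBJECTS"),
    (["."], None),
])
```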

The BLSTM-CRF model consists of an input layer, a BLSTM layer, a CRF layer and an output layer, as shown in fig. 3.

In the input layer, the sentence s is represented as a list of vectors $s = (w_1, w_2, w_3, \ldots, w_n)$, where each vector in the list is the word vector of the corresponding word in s.

The BLSTM, a bidirectional LSTM neural network, is composed of a forward LSTM and a backward LSTM. The LSTM unit adds a long-term memory module to the RNN structure; the module comprises an input gate, an output gate and a forget gate. By propagating in both directions, the BLSTM network captures the context of each word in a sentence well and expresses the semantic features of the sentence.
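For reference, the gates named above are commonly written as follows. This is the standard textbook LSTM formulation, given here as an assumption about the variant used; the source does not spell out its equations. Here $x_t$ is the input word vector at position $t$, $h_t$ the hidden state, $c_t$ the memory cell, $\sigma$ the logistic sigmoid and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t) \\
h_t^{\mathrm{BLSTM}} &= \left[\overrightarrow{h}_t \, ; \, \overleftarrow{h}_t\right]
\end{aligned}
```

The last line concatenates the forward and backward hidden states at each position, which is how a bidirectional network exposes both left and right context to the following layer.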

The output of the BLSTM is a probability matrix $A_{n \times k}$, where each value $A_{ij}$ in the matrix represents the probability that the i-th word in the sentence s is labeled with the j-th semantic tag.

The conditional random field (CRF) model is an undirected graph model that combines the characteristics of the maximum entropy Markov model (MEMM) and the hidden Markov model (HMM), and it performs well on sequence labeling problems such as part-of-speech tagging and named entity recognition. It computes the conditional state transition probabilities over the whole sequence and finds the label sequence that best fits the current sentence.

For a sentence s, the final output of the model is the sequence number of the semantic tag corresponding to each word $w_i$ in s.
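How a CRF layer turns the per-word score matrix into one globally best tag sequence can be illustrated with Viterbi decoding. The sketch below is a generic decoder, not the embodiment's code: it assumes additive (log-space) scores, and the emission and transition values are made-up numbers chosen so that the transition scores change the result.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[i][j]: score of tag j for word i (an n x k matrix).
    transitions[p][j]: score of moving from tag p to tag j (k x k).
    Scores are added, i.e. treated as log-space values (assumption).
    """
    k = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for row in emissions[1:]:
        prev, score, ptr = score, [], []
        for j in range(k):
            best = max(range(k), key=lambda p: prev[p] + transitions[p][j])
            ptr.append(best)
            score.append(prev[best] + transitions[best][j] + row[j])
        backpointers.append(ptr)
    # Trace back from the best final tag.
    tag = max(range(k), key=lambda j: score[j])
    path = [tag]
    for ptr in reversed(backpointers):
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]


# Tags: 0 = "O", 1 = "B-X", 2 = "E-X" (illustrative values only).
emissions = [[0.1, 2.0, 0.0], [1.0, 0.0, 0.9], [2.0, 0.0, 0.0]]
transitions = [
    [0.0, 0.0, -10.0],   # O -> E is heavily penalized
    [-1.0, -10.0, 1.0],  # B -> E is encouraged
    [0.0, 0.0, -10.0],
]
best_path = viterbi(emissions, transitions)
greedy = [max(range(3), key=lambda j: row[j]) for row in emissions]
```

In this toy case the per-word argmax would pick "B-X, O, O", leaving the phrase unclosed, while decoding with transitions picks "B-X, E-X, O". This is the kind of sequence-level consistency that the CRF layer adds over a per-word SoftMax.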

Step 3: topic sentence generation

A Bi-LSTM-based classification model for topic sentence patterns is trained and used to classify the text with predicted semantic labels at the level of topic sentence pattern, thereby determining the topic sentence pattern to which each record belongs. From the content of the domain knowledge graph, the semantic information extraction result and the classification result, the word sequence to be extracted from the text and the sentence pattern in which to arrange it are then determined, and a complete topic sentence is generated for the text. In this embodiment, the specific steps are as follows:

A BLSTM model is built to classify the sentences in the data set by topic sentence structure, so as to determine which semantically tagged phrases need to be extracted from each sentence and in what form they should be organized into a sentence. The topic sentence structures are shown in Table 2:

TABLE 2 Topic sentence structures

A given sentence s is denoted as $s = \{w_1/tag_1, w_2/tag_2, \ldots, w_n/tag_n\}$, where $w_i$ stands for a word in s and $tag_i$ represents the semantic tag of $w_i$. Each sentence $s = \{w_1/tag_1, w_2/tag_2, \ldots, w_n/tag_n\}$ has a corresponding $t_i$, which represents the topic sentence structure to which the sentence belongs.

In the input layer of the BLSTM model, the sentence s is represented as a list of vectors $s = (w_1, w_2, w_3, \ldots, w_n)$, where each vector in the list is composed of two parts: one part is the word vector, and the other part is a vector representation of the semantic tag corresponding to the word.

The output of the model is the sequence number $t_i$ of the topic sentence structure corresponding to each sentence s.
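The two-part input vector can be illustrated as a simple concatenation. In this sketch the tag part is a one-hot vector, which is an assumption (the text only says "a vector representation of the semantic tag"), and the word vectors and names are made-up placeholders.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_inputs(tagged_sentence, word_vectors, tag_index):
    """Concatenate each word vector with the vector of its semantic tag."""
    size = len(tag_index)
    return [
        list(word_vectors[word]) + one_hot(tag_index[tag], size)
        for word, tag in tagged_sentence
    ]


# Hypothetical 2-dimensional word vectors and a tiny tag inventory.
word_vectors = {"lawn": [0.2, 0.7], "has": [0.5, 0.1]}
tag_index = {"O": 0, "B-PUBLIC": 1, "B-OCCUPY": 2}
inputs = build_inputs([("lawn", "B-PUBLIC"), ("has", "B-OCCUPY")],
                      word_vectors, tag_index)
```

Each row of `inputs` would be fed to the classifier at one time step, and the classifier emits a single structure index for the whole sentence.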

For example, the sentence "Happiness/B-ORG Community/E-ORG lawn/B-PUBLIC on/E-PUBLIC has/B-OCCUPY white/B-OBJECTS garbage/E-OBJECTS." belongs to topic sentence structure 1, "OBJECTS/AD_CER OCCUPY PUBLIC", and the topic sentence of this text is: "The lawn has white garbage."
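The final assembly step for this example can be sketched as below, using English glosses of the tagged words. The sketch assumes that a topic sentence structure is used as the set of semantic tags to keep and that the kept phrases are emitted in their original text order; both points are assumptions, since Table 2 is not reproduced here, and the function names are hypothetical.

```python
def extract_phrases(tagged):
    """Group IOBE-tagged words into (label, words) phrases in text order."""
    phrases = []
    open_label = None
    for word, tag in tagged:
        if tag == "O":
            open_label = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "B" or open_label != label:
            phrases.append((label, [word]))   # start a new phrase
        else:
            phrases[-1][1].append(word)       # continue the open phrase
        open_label = None if prefix == "E" else label
    return phrases


def topic_sentence(tagged, wanted_labels, joiner=" "):
    """Keep only phrases whose label is in the structure; join their words."""
    kept = [w for label, words in extract_phrases(tagged)
            if label in wanted_labels for w in words]
    return joiner.join(kept)


tagged = [
    ("Happiness", "B-ORG"), ("Community", "E-ORG"),
    ("lawn", "B-PUBLIC"), ("on", "E-PUBLIC"),
    ("has", "B-OCCUPY"),
    ("white", "B-OBJECTS"), ("garbage", "E-OBJECTS"),
    (".", "O"),
]
structure_1 = {"OBJECTS", "AD_CER", "OCCUPY", "PUBLIC"}
sentence = topic_sentence(tagged, structure_1)
```

The organization phrase is dropped because its tag is not part of the structure. For Chinese text the words would be joined without spaces by passing `joiner=""`.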

While the invention has been described with reference to a preferred embodiment, those skilled in the art will understand that changes, omissions and deviations in form and detail may be made without departing from the scope of the invention. Changes, modifications and equivalent arrangements made by those skilled in the art using the techniques disclosed above, without departing from the spirit and scope of the invention, are equivalent embodiments of the invention; likewise, any equivalent change, modification or variation of the above embodiments made according to the technical substance of the invention falls within the scope of the technical solution of the invention.

Verification experiment: to evaluate the effectiveness of the method, the model is verified in 3 respects: model structure, parameter adjustment and topic sentence generation accuracy.

Model structure: the BLSTM, LSTM-CRF and BLSTM-CRF models are compared in the semantic information extraction part, and the LSTM and BLSTM models are compared in the topic sentence structure classification part. The results are shown in Table 3:

TABLE 3 Comparison of F1 values for different models

Compared with the BLSTM-CRF model, the LSTM-CRF model removes the backward LSTM part, while the BLSTM model connects a SoftMax layer directly after the probability matrix to obtain the final sequence labeling result. Of the three, the BLSTM-CRF model obtained the best result, with an F1 value of 0.913275.

In the topic sentence structure classification model, BLSTM performs better than LSTM, with an F1 value of 0.916465.
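For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes a token-level F1 over non-"O" tags; the source does not state whether its F1 values are token-level, phrase-level or averaged per class, so this particular variant and the function name are assumptions.

```python
def token_f1(gold, pred, outside="O"):
    """Token-level F1 over tags other than `outside`.

    A token counts as a true positive when the predicted tag equals the gold
    tag and that tag is not the outside tag.
    """
    true_pos = sum(1 for g, p in zip(gold, pred) if g == p and g != outside)
    pred_pos = sum(1 for p in pred if p != outside)
    gold_pos = sum(1 for g in gold if g != outside)
    if true_pos == 0:
        return 0.0
    precision = true_pos / pred_pos
    recall = true_pos / gold_pos
    return 2 * precision * recall / (precision + recall)


gold = ["B-ORG", "E-ORG", "O", "B-OBJECTS"]
pred = ["B-ORG", "O", "O", "B-OBJECTS"]
score = token_f1(gold, pred)  # precision 1.0, recall 2/3
```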

Parameter adjustment: in the experiment, several important parameters in the model are adjusted to achieve the best performance of the model.

In the semantic information labeling model, the Adam optimizer is used, and the Keep prob, the learning rate (Learning rate) and the number of hidden nodes (Hidden nodes) of the model are each adjusted in turn by the control variable method, that is, varying one parameter while holding the others fixed. The resulting data are shown in Table 4:

TABLE 4 different parameter values and corresponding model F1 values

According to the experimental data, the Keep prob, Learning rate and Hidden nodes values of the semantic information extraction model are finally set to 0.6, 0.003 and 320, respectively.

The same method is applied to adjust the parameters of the topic sentence structure classification model, and its Keep prob, Learning rate and Hidden nodes values are finally set to 0.7, 0.002 and 128, respectively.

The statistics of the results obtained by applying different parameter combinations to different model structures are shown in fig. 4 and 5, wherein the parameters on the horizontal axis represent the Keep prob, the Learning rate and the Hidden nodes of the model respectively.

Accuracy of the generated topic sentences: the method is applied to generate topic sentences automatically on the test set, and a portion of the test set is selected automatically for manual evaluation, giving the topic sentence accuracy for the different event types and the overall topic sentence accuracy shown in Table 5.

TABLE 5 Topic sentence generation accuracy for different types of data

As the statistics in the table show, the topic sentence generation accuracy of the invention is highest for the event types "article stacking" and "article damage", where it reaches 85%. It is somewhat lower on data of event type "other", but the overall accuracy still reaches 70.5%. The method therefore performs well at generating topic sentences.

Claims

1. A domain-oriented method for generating topic sentences for Chinese text, characterized by comprising the following steps:

step 1: creating domain knowledge graphs

step 2: semantic information extraction

the CRF layer is an undirected graph model;

step 3: topic sentence generation

2. The method for generating a domain-oriented Chinese text topic sentence of claim 1, wherein in the step 1, the step of applying LDA topic clustering algorithm to the preprocessed city management case event information data set comprises the following steps:

3. The domain-oriented method for generating topic sentences for Chinese text according to claim 1, wherein in step 1, applying the K-means algorithm comprises the following steps: combining all the topic words pairwise and calculating their degree of co-occurrence, where a high degree of co-occurrence between two topic words indicates that they are associated; determining, in combination with the results of word frequency statistics and part-of-speech tagging, which topic words are instance words and which are descriptors; and finally determining the connection structure between the basic units of the graph.

4. The method for generating a domain-oriented Chinese text topic sentence of claim 1, wherein in the step 2, the method for forming the training set comprises: