CN108897857B - Domain-oriented method for generating topic sentences for Chinese text

Publication number
CN108897857B
Authority
CN
China
Prior art keywords
sentence
subject
text
data set
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810696452.6A
Other languages
Chinese (zh)
Other versions
CN108897857A (en)
Inventor
宋晖
刘栩彤
戴龙其
叶长晖
岳万琛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University
Priority to CN201810696452.6A
Publication of CN108897857A
Application granted
Publication of CN108897857B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/258 - Heading extraction; Automatic titling; Numbering

Abstract

The invention provides a domain-oriented method for generating topic sentences for Chinese text. The method builds a domain knowledge graph for a domain text data set, extracts semantic information from the text with a deep neural network model, classifies the text by topic sentence pattern, and finally generates a topic sentence for the text. Building the domain knowledge graph yields the concept model of the data set and the characteristics of how its content is described; a deep learning model then labels and classifies the text data so that topic sentences can be generated and knowledge-based query and statistics can be performed. The method is widely applicable and generates topic sentences well on data sets from a restricted domain.

Description

Domain-oriented method for generating topic sentences for Chinese text
Technical Field
The invention relates to a method for extracting the topic of a Chinese text, and in particular to a method that summarizes the descriptive characteristics of domain text from a domain data set and generates a topic sentence for each text.
Background
In recent years, with the development of artificial intelligence technology, computers have produced many practically valuable results in natural language understanding. Topic extraction is an important branch of text mining and plays an important role in search engines, text classification, information statistics and similar applications. How to extract topic information from text concisely and accurately is key to understanding what the language expresses, and is a research hotspot in the field.
Because of the diversity and complexity of Chinese semantics and sentence structure, it is difficult to extract the topic directly from the text. To obtain the main information of a text, existing methods mainly extract topic keywords from it, and fall into two groups: methods based on statistical analysis and methods based on semantic analysis.
Statistics-based methods typically find topic keywords by computing statistics such as word frequency, word co-occurrence, or word weight. They ignore the semantic features of the text, so the extracted results easily contain noise and accuracy is low. Semantics-based methods usually rely on human prior knowledge, extracting key information from text with predefined semantic templates or with external knowledge bases. Compared with statistical methods they are considerably more accurate, but they are complicated to implement and transfer poorly to other domains.
Representing textual information by topic words ignores the associations between those words and cannot accurately capture the factual knowledge stated in the text.
With the introduction of the knowledge graph concept and the development of neural network models, many researchers have tried to represent knowledge as (entity, relationship, entity) or (entity, attribute, attribute value) triples, construct graph models, and extract knowledge instances from text with supervised or semi-supervised learning. For example, entities, relationships and attributes are expressed as vectors, and a neural network model is trained on them to obtain the corresponding classification or other related information. These techniques are widely used in knowledge-based question answering systems, intelligent robots and the like.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: existing topic extraction methods describe a text mainly through topic keywords and cannot produce a complete statement of its topic. For text data with strong domain characteristics, knowledge graph structures designed for the open domain can hardly reflect accurately how knowledge is described in different domains, or summarize the topic information contained in the text.
To solve this technical problem, the invention provides a method that automatically constructs a domain-oriented knowledge graph and generates topic sentences for Chinese text, enabling knowledge-based query and statistics. For clarity of discussion, the preferred embodiment is set in the domain of urban management case event descriptions. The method comprises the following steps:
Step 1: Creating the domain knowledge graph
Each record in the urban management case event data set describes the specifics of one case as a Chinese sentence. The data set is first processed by part-of-speech tagging, word frequency statistics and weight ranking. The LDA topic clustering algorithm is then applied to the processed data set, clustering topics iteratively level by level and discovering entity categories, descriptions and hierarchical relationships layer by layer. This yields a series of topic entries containing instances and their corresponding descriptors, together with the hierarchy among topic entries of different content. The K-means algorithm is then applied to cluster all the words obtained by LDA topic clustering, and entity concepts are abstracted from the clustering result, forming a domain knowledge graph whose basic units are (entity, state description) and (entity, behavior/action description);
Step 2: Semantic information extraction
Semantic tags are defined for each type of entity and description according to the domain knowledge graph, and the training set is labeled with these tags. A BLSTM-CRF model is trained on the training set to predict the semantic tags. The BLSTM-CRF model comprises an input layer, a BLSTM layer, a CRF layer and an output layer, wherein:
in the input layer, a sentence is represented as a list of vectors, each vector being the word vector of the corresponding word in the sentence;
the BLSTM layer is a bidirectional LSTM neural network consisting of a forward LSTM and a backward LSTM; its output is a probability matrix in which each value is the probability that the corresponding word in the sentence is labeled with the corresponding semantic tag;
the CRF layer is an undirected graph model;
for a sentence, the output of the output layer is the sequence number of the semantic tag corresponding to each word in the sentence;
Step 3: Topic sentence generation
A classification model for topic sentence patterns is trained based on Bi-LSTM and used to classify the text, with its predicted semantic tags, at the level of the topic sentence pattern, thereby determining the topic sentence pattern to which each record belongs. Based on the content of the domain knowledge graph, the semantic information extraction results and the classification results, the word sequence to be extracted from the text and the sentence pattern in which it is arranged are determined, and a complete topic sentence is generated for the text.
Preferably, in step 1, applying the LDA topic clustering algorithm to the preprocessed urban management case event data set comprises the following steps:
step 101, performing LDA on the urban management case event data set to generate n topic entries, where 2 ≤ n ≤ 10; each topic entry contains 10 topic words, sorted in descending order of TF-IDF weight;
step 102, according to the topic entries obtained in step 101, selecting from the data set the events that contain the combination of words in a topic entry;
step 103, performing LDA topic clustering again on the event information corresponding to each topic entry, to mine the specific types of event content under each broad category;
step 104, removing the events selected in step 102 from the current data set, and repeating step 101 to find the topic entries still hidden in the current data set;
step 105, after obtaining a new topic entry, repeating steps 102, 103 and 104 until LDA topic clustering no longer produces a new topic entry.
Preferably, in step 1, applying the K-means algorithm comprises the following steps: combining all the topic words pairwise and calculating their degree of co-occurrence, where a high degree of co-occurrence between two topic words indicates that they are associated; determining, together with the results of word frequency statistics and part-of-speech tagging, which topic words are instance words and which are descriptors; and finally determining the connection structure between the basic units of the graph.
Preferably, in step 2, the training set is formed as follows:
N records are labeled manually, and the words that carry semantic tags are collected into a labeled word set ws. The unlabeled training data set is then searched for words contained in ws, and these words are automatically labeled with the corresponding semantic tags. The large amount of automatically labeled data, after manual correction, is combined with the manually labeled data to form the final training set.
The method overcomes the shortcomings of existing text topic extraction methods. It obtains the concept model of the data set and the characteristics of how its content is described by creating a domain knowledge graph, and trains deep learning models to label and classify the text data so as to generate topic sentences for the text. The method is widely applicable, generates topic sentences well on data sets from a restricted domain, and supports knowledge-graph-based query and statistics over the text collection.
Drawings
FIG. 1 is a schematic flow diagram of the method provided by the invention;
FIG. 2 is a diagram of the structure of the domain knowledge graph provided in this embodiment;
FIG. 3 is a schematic diagram of the BLSTM-CRF model used in the semantic information extraction part of the invention;
FIG. 4 is a line graph comparing the performance of the models obtained in this embodiment by training neural network models of different depths for the semantic information extraction part;
FIG. 5 is a graph comparing the performance of the models obtained in this embodiment by training neural network models of different depths for the topic sentence classification part.
Detailed Description
To make the invention easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Generating the topic sentence of a text requires not only extracting the keywords in the text but also organizing them into a short sentence with the correct sentence pattern. For example, for the sentence from the urban community management domain "There is white garbage on the lawn of Happiness Community.", the generated topic sentence should be "There is white garbage on the lawn."
To accomplish this, the domain-oriented method for generating topic sentences for Chinese text provided by the invention divides the whole generation process into 3 steps: (1) establishing a domain knowledge graph, (2) extracting semantic information, and (3) classifying sentence patterns and generating the topic sentence. FIG. 1 is a flow chart of this process.
Step 1: creating domain knowledge graphs
Each record in the urban management case event data set describes the specifics of one case as a Chinese sentence, and one record represents one case. The data set contains 63890 case records covering various categories of urban community management case events, including public facility security, public environment maintenance, public affairs consultation, city security inspection and the like.
The urban management case event descriptions are strongly domain-restricted, and in terms of concept distribution, case descriptions in certain popular categories are highly repetitive and similar. In this example, the entity information and the relationship and attribute information in the data are therefore extracted by combining statistical analysis with a topic probability model.
Generally, text data has different domain characteristics depending on the content it expresses. For a particular data set, the entities, relationships, attributes and other descriptions it contains usually belong to the same domain and show strong domain characteristics. To compute the distribution of concepts and topic sentence patterns of the texts in a data set accurately, the invention builds on the original concept of the knowledge graph and proposes a knowledge graph structure for describing knowledge within a domain, using statistics, text topic clustering, vocabulary clustering and similar methods. During creation, the texts of the original data set are divided into layers according to the results of iterative text topic clustering, and topic clustering is repeated on the data subsets of the different layers to discover the topic content, instances and descriptions hidden in the texts.
As a structured semantic knowledge base, a knowledge graph describes concepts in the physical world and their interrelations in symbolic form. Open-domain knowledge graphs generally use (entity, relationship, entity) or (entity, attribute, attribute value) triples as basic building blocks. To describe the data better, the invention adapts these general triple forms to the domain characteristics of how knowledge is stated in the data set, and proposes graph knowledge units and an associated structure suited to the domain, summarized as (entity, behavior description) and (entity, state description). The corresponding concepts are extracted by abstracting the entity, relationship and attribute instances in the data set, forming the domain knowledge graph.
After data preprocessing, part-of-speech tagging, word frequency statistics and weight ranking are first performed on the data set. The LDA topic clustering algorithm is then applied to cluster the data set iteratively level by level, discovering entity categories, descriptions and hierarchical relationships layer by layer. The specific process is as follows.
1) LDA is performed on the urban management case event data set to generate n (2 ≤ n ≤ 10) topic entries. Each topic entry contains 10 topic words, sorted in descending order of TF-IDF weight.
2) According to the topic entries obtained in step 1), the events that contain the combination of words in a topic entry are selected from the urban management case event data set.
3) LDA topic clustering is performed again on the event information corresponding to each topic entry, to mine the specific types of event content under each broad category.
4) The events selected in step 2) are removed from the current data set, and step 1) is repeated to find the topic entries still hidden in the current data set.
5) After a new topic entry is obtained, steps 2), 3) and 4) are repeated until the LDA result no longer contains a new topic entry.
This hierarchical topic clustering yields a series of topic entries containing instances and their corresponding descriptors, together with the hierarchy among topic entries of different content.
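The control flow of steps 1) to 5) can be sketched as follows. This is a minimal illustration, not the patented implementation: `frequent_words_stub` is a hypothetical stand-in for a real LDA model, and the rule that a record "contains" an entry when it shares at least `min_hits` words with it is an assumption, since the text does not define the matching rule.

```python
from collections import Counter
from typing import List

Doc = List[str]      # one case record, already segmented into words
Entry = List[str]    # one topic entry, i.e. its top-ranked words


def frequent_words_stub(docs: List[Doc], n_topics: int, n_words: int) -> List[Entry]:
    """Hypothetical stand-in for LDA: returns one entry made of the words
    that occur in at least two records (n_topics is ignored here)."""
    counts = Counter(w for d in docs for w in dict.fromkeys(d))
    ranked = [w for w, c in counts.most_common() if c >= 2]
    return [ranked[:n_words]] if ranked else []


def hierarchical_topics(docs, fit_topics, n_topics=2, n_words=10, min_hits=2):
    remaining, seen, levels = list(docs), set(), []
    while remaining:
        entries = fit_topics(remaining, n_topics, n_words)            # step 1)
        fresh = [e for e in entries if e and frozenset(e) not in seen]
        if not fresh:                                                  # step 5): nothing new
            break
        covered = set()
        for entry in fresh:
            seen.add(frozenset(entry))
            hits = [i for i, d in enumerate(remaining)
                    if len(set(entry) & set(d)) >= min_hits]           # step 2)
            subset = [remaining[i] for i in hits]
            sub = fit_topics(subset, n_topics, n_words) if subset else []  # step 3)
            levels.append({"entry": entry, "records": len(subset), "sub_entries": sub})
            covered.update(hits)
        if not covered:            # guard against looping without progress
            break
        remaining = [d for i, d in enumerate(remaining) if i not in covered]  # step 4)
    return levels


demo_docs = [
    ["lawn", "garbage", "white"],
    ["lawn", "garbage", "pile"],
    ["lamp", "broken", "street"],
    ["lamp", "broken", "road"],
]
demo_levels = hierarchical_topics(demo_docs, frequent_words_stub, n_words=2)
```

Passing the topic model in as `fit_topics` keeps the loop independent of any particular LDA library; a real LDA wrapper with the same signature could be substituted for the stub.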
All the words obtained by LDA are then clustered with K-means, and entity concepts are abstracted from the clustering result.
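As a rough illustration of the word clustering step, the sketch below runs a plain K-means over word vectors. The text does not say which word features are clustered, so the two-dimensional vectors in `word_vectors` are made-up values, and initializing the centroids from the first k points is an assumption chosen only to keep the example deterministic.

```python
from typing import List, Sequence


def kmeans(points: Sequence[Sequence[float]], k: int, iters: int = 20) -> List[int]:
    """Plain K-means; returns the cluster index of each point."""
    centroids = [list(p) for p in points[:k]]   # deterministic initialization (assumption)
    labels = [-1] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:                # converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical word vectors, for illustration only.
word_vectors = {
    "lawn": [0.9, 0.1],
    "garbage": [0.1, 0.9],
    "road": [0.8, 0.2],
    "rubble": [0.2, 0.8],
}
cluster_of = dict(zip(word_vectors, kmeans(list(word_vectors.values()), k=2)))
```

Words that end up in the same cluster are candidates for one abstract entity concept; in this toy data "lawn" and "road" fall together, as do "garbage" and "rubble".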
To find the relationships between concepts, the words in each entry are combined pairwise and their degree of co-occurrence is calculated. A high degree of co-occurrence between two words indicates that they are associated; which of them is the instance word and which is the descriptor can then be determined from the results of word frequency statistics and part-of-speech tagging. Finally, the connection structure between the basic units of the graph is determined.
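The pairwise co-occurrence calculation can be sketched as below. The text does not define "degree of co-occurrence", so this sketch assumes the simplest measure, the number of records in which both words appear; the function name and sample records are illustrative.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List


def cooccurrence(docs: Iterable[List[str]], vocabulary: Iterable[str]) -> Counter:
    """Count, for every pair of vocabulary words, the records containing both.
    Pairs are stored as alphabetically sorted tuples."""
    vocab = set(vocabulary)
    pair_counts: Counter = Counter()
    for doc in docs:
        present = sorted(vocab & set(doc))
        pair_counts.update(combinations(present, 2))
    return pair_counts


records = [
    ["lawn", "has", "garbage"],
    ["lawn", "garbage"],
    ["lamp", "damaged"],
]
pairs = cooccurrence(records, ["lawn", "garbage", "lamp", "damaged"])
```

A threshold on these counts (or on a normalized score) would then decide which word pairs are treated as associated.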
FIG. 2 shows the knowledge graph for the urban community management domain. It contains 13 entity categories, such as public facilities, general articles, certificates, activities, organizations and public staff, and more than ten description categories for behaviors or states, such as "occupation", "damage" and "inspection". Unlike the general (entity, relationship, entity) and (entity, attribute, attribute value) triples, knowledge in this graph is mainly represented as (entity, state description) and (entity, behavior/action description).
Step 2: Semantic information extraction
Semantic tags are created for the different types of text according to the units of the knowledge graph, and the training data set is labeled with them. To obtain a large number of labeled samples, distant supervision is applied on top of manual labeling. Once the labeled data are available, a BLSTM-CRF neural network model is trained for sequence labeling so that it can predict the semantic tag of each word in the unlabeled data set, which achieves semantic information extraction. In this embodiment, the specific steps are as follows:
To extract from the text data the entities corresponding to the domain knowledge graph, together with the behavior descriptions and state descriptions related to them, semantic tags are defined for each type of entity and description according to the domain knowledge graph, and the training set is labeled with these tags. The tags are shown in Table 1:
TABLE 1 Semantic tags and their meanings (table image not reproduced)
Semantic information extraction is treated as a sequence labeling task, and a BLSTM-CRF model is trained to predict the tags. The BLSTM-CRF model was implemented by combining the models proposed by Collobert in 2011 and Huang in 2015.
The BLSTM-CRF model needs a large amount of labeled data, and relying on manual labeling alone is time-consuming. Distant supervision is therefore applied, as follows:
First, 5000 records are labeled manually, and the words that carry semantic tags are collected into a labeled word set ws. The unlabeled training data set is then searched for words contained in ws, and these words are automatically labeled with the corresponding tags. The large amount of automatically labeled data, after manual correction, is combined with the manually labeled data to form the final training set.
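The distant supervision procedure above can be sketched as a dictionary lookup. This is a minimal sketch: the function names are illustrative, the choice to keep the first tag seen for a word is an assumption (the text does not say how conflicting tags are resolved), and the manual correction pass is not modeled.

```python
from typing import Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]   # (word, semantic tag) pairs; "O" means untagged


def build_word_set(manually_labeled: List[TaggedSentence]) -> Dict[str, str]:
    """Collect the labeled word set ws: every word that carries a semantic tag."""
    ws: Dict[str, str] = {}
    for sentence in manually_labeled:
        for word, tag in sentence:
            if tag != "O":
                ws.setdefault(word, tag)   # keep the first tag seen (assumption)
    return ws


def auto_label(sentence: List[str], ws: Dict[str, str]) -> TaggedSentence:
    """Label each word found in ws with its tag; everything else gets 'O'."""
    return [(word, ws.get(word, "O")) for word in sentence]


manual = [[("lawn", "PUBLIC"), ("has", "OCCUPY"), ("garbage", "OBJECTS"), (".", "O")]]
ws = build_word_set(manual)
auto = auto_label(["road", "has", "garbage"], ws)
```

Here "road" stays unlabeled because it never appeared in the manually labeled data, which is the kind of gap the manual correction step is meant to catch.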
Before training the model, each sentence $s$ in the training set is segmented into words $w_1, w_2, \ldots, w_n$, which are labeled in the form $\{w_1/tag_1, w_2/tag_2, \ldots, w_n/tag_n\}$, where $w_i$ is a word in the sentence and $tag_i$ is the semantic tag of $w_i$. Because an entity or descriptive phrase may consist of several words, the text is labeled in IOBE mode, where:
'B-': marks the beginning of an entity or descriptor;
'I-': marks the middle of an entity or descriptor;
'E-': marks the end of an entity or descriptor;
'O': indicates that the current word does not belong to any of the other tags.
Each model tag combines an IOBE prefix with a semantic tag from Table 1; for example, "B-OCCUPY" indicates that the current word begins a phrase of type "OCCUPY".
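The combination of IOBE prefixes and semantic tags can be sketched as follows. The helper names are illustrative. A single-word phrase receives only a 'B-' tag here, which matches the tagged example given later in the description ("has/B-OCCUPY") but is otherwise an assumption about the scheme.

```python
from typing import List, Optional, Tuple


def iobe_tags(n_words: int, label: str) -> List[str]:
    """Tags for one phrase of n_words words carrying the semantic tag `label`."""
    if n_words < 1:
        raise ValueError("a phrase needs at least one word")
    if n_words == 1:
        return [f"B-{label}"]
    return [f"B-{label}"] + [f"I-{label}"] * (n_words - 2) + [f"E-{label}"]


def tag_sentence(phrases: List[Tuple[List[str], Optional[str]]]) -> List[Tuple[str, str]]:
    """Expand (words, label) phrases into (word, tag) pairs; label None means 'O'."""
    tagged = []
    for words, label in phrases:
        tags = ["O"] * len(words) if label is None else iobe_tags(len(words), label)
        tagged.extend(zip(words, tags))
    return tagged


example = tag_sentence([
    (["Happiness", "Community"], "ORG"),
    (["has"], "OCCUPY"),
    (["."], None),
])
```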
The BLSTM-CRF model consists of an input layer, a BLSTM layer, a CRF layer and an output layer, as shown in FIG. 3.
In the input layer, the sentence $s$ is represented as a list of vectors $s = (w_1, w_2, w_3, \ldots, w_n)$, where each vector in the list is the word vector of the corresponding word in $s$.
The BLSTM, a bidirectional LSTM neural network, consists of a forward LSTM and a backward LSTM. The LSTM unit adds a long-term memory module to the RNN structure; the module comprises an input gate, an output gate and a forget gate. Because information propagates in both directions, the BLSTM network captures the context of a sentence well and expresses its semantic features.
The output of the BLSTM is a probability matrix $A_{n \times k}$, where each entry $A_{ij}$ is the probability that the $i$-th word of sentence $s$ is labeled with the $j$-th semantic tag.
The conditional random field (CRF) is an undirected graph model that combines characteristics of maximum entropy Markov models (MEMMs) and hidden Markov models (HMMs), and performs well on sequence labeling problems such as part-of-speech tagging and named entity recognition. It computes the conditional state-transition probabilities globally over the whole sequence and finds the label sequence that best fits the current sentence.
For sentence $s$, the final output of the model is the sequence number of the semantic tag corresponding to each word $w_i$.
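How a CRF layer picks a globally best label sequence from per-word scores can be illustrated with Viterbi decoding, the standard decoder for linear-chain CRFs. The text does not describe its decoding procedure, so this is a sketch under assumptions: `emissions` plays the role of the matrix $A$ as log-domain scores, `transitions[p][j]` is an assumed score for moving from tag p to tag j, and the numbers are made up.

```python
from typing import List


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag index sequence.

    emissions[i][j]   : score of tag j for word i (n words, k tags)
    transitions[p][j] : score of tag p being followed by tag j
    """
    k = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for row in emissions[1:]:
        prev, score, pointers = score, [], []
        for j in range(k):
            best = max(range(k), key=lambda p: prev[p] + transitions[p][j])
            pointers.append(best)
            score.append(prev[best] + transitions[best][j] + row[j])
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(range(k), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Illustrative scores: picking the best tag per word independently gives [0, 1, 1],
# but the 0 -> 1 transition is heavily penalized.
demo_emissions = [[2.0, 0.0], [0.9, 1.0], [0.0, 2.0]]
demo_transitions = [[0.0, -5.0], [0.0, 0.0]]
best_path = viterbi(demo_emissions, demo_transitions)
```

The example shows why the CRF layer matters: the transition penalty makes the decoder prefer a sequence that is consistent as a whole over the per-word maxima.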
Step 3: Topic sentence generation
A classification model for topic sentence patterns is trained based on Bi-LSTM and used to classify the text, with its predicted semantic tags, at the level of the topic sentence pattern, thereby determining the topic sentence pattern to which each record belongs. Based on the content of the domain knowledge graph, the semantic information extraction results and the classification results, the word sequence to be extracted from the text and the sentence pattern in which it is arranged are determined, and a complete topic sentence is generated for the text. In this embodiment, the specific steps are as follows:
A BLSTM model is built to classify the sentences in the data set at the level of topic sentence structure, so as to determine which semantically tagged phrases must be extracted from each sentence and in what form they are organized into a sentence. The topic sentence structures are shown in Table 2:
TABLE 2 Topic sentence structures (table image not reproduced)
A given sentence $s$ is written as $s = \{w_1/tag_1, w_2/tag_2, \ldots, w_n/tag_n\}$, where $w_i$ is a word in $s$ and $tag_i$ is the semantic tag of $w_i$. Each such sentence has a corresponding label $t_i$ that identifies the topic sentence structure to which the sentence belongs.
In the input layer of the BLSTM model, the sentence $s$ is represented as a list of vectors $s = (w_1, w_2, w_3, \ldots, w_n)$, where each vector in the list consists of two parts: the word vector, and a vector representation of the semantic tag corresponding to the word.
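The two-part input vector can be sketched as a simple concatenation. The text does not specify how the semantic tag is turned into a vector, so the one-hot encoding below is an assumption, and the tag inventory and word vector values are made up for illustration.

```python
from typing import Dict, List, Sequence


def one_hot(index: int, size: int) -> List[float]:
    return [1.0 if i == index else 0.0 for i in range(size)]


def encode_token(word_vector: Sequence[float], tag: str,
                 tag_index: Dict[str, int]) -> List[float]:
    """Concatenate the word vector with a one-hot vector for the word's tag."""
    return list(word_vector) + one_hot(tag_index[tag], len(tag_index))


# Hypothetical tag inventory and word vector.
tag_index = {"O": 0, "B-PUBLIC": 1, "E-PUBLIC": 2}
token_vector = encode_token([0.25, -0.5], "B-PUBLIC", tag_index)
```

A learned tag embedding would serve the same purpose; one-hot is used here only because it needs no training.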
The output of the model is the sequence number $t_i$ of the topic sentence structure corresponding to each sentence $s$.
For example, the sentence "Happiness/B-ORG Community/E-ORG lawn/B-PUBLIC on/E-PUBLIC has/B-OCCUPY white/B-OBJECTS garbage/E-OBJECTS." belongs to topic sentence structure 1, "OBJECTS/AD_CER OCCUPY PUBLIC", and the topic sentence of this text is "There is white garbage on the lawn."
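The final assembly step can be sketched as grouping the tagged words into phrases and emitting the phrases named by the chosen structure. This is an illustration only: the words are English glosses of the example, the function names are hypothetical, and the slot order `["PUBLIC", "OCCUPY", "OBJECTS"]` is an assumption chosen to follow the surface order of the example sentence; the actual structures are those defined in Table 2.

```python
from typing import Dict, List, Tuple


def collect_phrases(tagged: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Group IOBE-tagged words into (semantic type, words) phrases."""
    phrases: List[Tuple[str, List[str]]] = []
    current = None
    for word, tag in tagged:
        if tag == "O":
            current = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "B" or current is None or current[0] != label:
            current = (label, [word])       # start a new phrase
            phrases.append(current)
        else:
            current[1].append(word)         # continue the open phrase
        if prefix == "E":
            current = None                  # phrase closed
    return phrases


def topic_sentence(tagged: List[Tuple[str, str]], slots: List[str]) -> str:
    """Emit the first phrase of each slot type, in slot order."""
    by_type: Dict[str, List[str]] = {}
    for label, words in collect_phrases(tagged):
        by_type.setdefault(label, words)
    return " ".join(w for slot in slots if slot in by_type for w in by_type[slot])


tagged_example = [
    ("Happiness", "B-ORG"), ("Community", "E-ORG"),
    ("lawn", "B-PUBLIC"), ("on", "E-PUBLIC"),
    ("has", "B-OCCUPY"),
    ("white", "B-OBJECTS"), ("garbage", "E-OBJECTS"),
]
generated = topic_sentence(tagged_example, ["PUBLIC", "OCCUPY", "OBJECTS"])
```

The result is a word-for-word gloss in Chinese word order; the ORG phrase is dropped because the chosen structure has no slot for it.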
While the invention has been described with respect to a preferred embodiment, it will be understood by those skilled in the art that the foregoing and other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention. Those skilled in the art can make various changes, modifications and equivalent arrangements, which are equivalent to the embodiments of the present invention, without departing from the spirit and scope of the present invention, and which may be made by utilizing the techniques disclosed above; meanwhile, any changes, modifications and variations of the above-described embodiments, which are equivalent to those of the technical spirit of the present invention, are within the scope of the technical solution of the present invention.
Verification experiment: to evaluate the effectiveness of the method, the model is verified in 3 respects: model structure, parameter adjustment and topic sentence generation accuracy.
Model structure: the BLSTM, LSTM-CRF and BLSTM-CRF models are compared for the semantic information extraction part, and the LSTM and BLSTM models are compared for the topic sentence structure classification part. The results are shown in Table 3:
TABLE 3 Comparison of F1 values for different models (table image not reproduced)
Compared with the BLSTM-CRF model, the LSTM-CRF model removes the backward LSTM, while the BLSTM model feeds the probability matrix directly into a SoftMax layer to obtain the final sequence labeling result. The BLSTM-CRF model achieves the best result of the three, with an F1 value of 0.913275.
In the topic sentence structure classification model, BLSTM performs better than LSTM, with an F1 value of 0.916465.
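For reference, the F1 value reported here is conventionally the harmonic mean of precision $P$ and recall $R$. The text does not say whether these are counted per word or per phrase, so the counts $TP$ (true positives), $FP$ (false positives) and $FN$ (false negatives) below are left generic:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}
```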
Parameter adjustment: in the experiment, several important parameters of the model are tuned to bring the model to its best performance.
In the semantic information labeling model, the Adam optimizer is used, and the keep probability (Keep prob), learning rate (Learning rate) and number of hidden nodes (Hidden nodes) are each adjusted in turn while the others are held fixed. The resulting data are shown in Table 4:
TABLE 4 Parameter values and the corresponding model F1 values (table images not reproduced)
According to the experimental data, the Keep prob, Learning rate and Hidden nodes values in the semantic information extraction model are finally set to 0.6, 0.003 and 320 respectively.
The same method is applied to tune the parameters of the topic sentence structure classification model, whose Keep prob, Learning rate and Hidden nodes values are finally set to 0.7, 0.002 and 128 respectively.
The results obtained by applying different parameter combinations to the different model structures are shown in FIG. 4 and FIG. 5, where the parameters on the horizontal axis are the Keep prob, Learning rate and Hidden nodes of the model respectively.
Accuracy of generated topic sentences: the method is applied to generate topic sentences automatically on the test set, and a portion of the test set data is selected for manual evaluation, giving the topic sentence accuracy for the different event types and the overall topic sentence accuracy shown in Table 5.
TABLE 5 Topic sentence generation accuracy for different types of data (table image not reproduced)
As the statistics in the table show, the topic sentence generation accuracy of the invention is highest for the event types "article stacking" and "article damage", where it reaches 85%. Performance is slightly worse on data of event type "other", but the overall topic sentence generation accuracy still reaches 70.5%. The method therefore achieves good results in topic sentence generation.

Claims (4)

1. A Chinese text subject sentence generation method facing to the field is characterized by comprising the following steps:
step 1: creating domain knowledge graphs
Each piece of data in the city management case event data set describes case specific information in the form of a Chinese sentence, the data set of the event information of the city management case is processed by part of speech tagging, word frequency statistics and weight sequencing, applying LDA theme clustering algorithm to the processed city management case event information data set, carrying out iterative theme clustering according to layers, finding entity category, description and layer relationship layer by layer to obtain a series of theme entries containing examples and corresponding descriptors, and obtains the hierarchical structure among the subject terms of different contents, then applies the K-means algorithm to cluster all the words obtained by the LDA subject clustering algorithm, abstracting an entity concept according to the clustering result to form a domain knowledge graph taking (entity, state description) and (entity, behavior and action description) as basic composition units;
step 2: semantic information extraction
Semantic labels are defined for each type of entity and description according to the domain knowledge graph and annotated in a training set, and a BLSTM-CRF model is trained on the training set to predict the semantic labels, the BLSTM-CRF model comprising an input layer, a BLSTM layer, a CRF layer and an output layer, wherein:
in the input layer, a sentence is represented as a list of vectors, each vector being the word vector of the corresponding word in the sentence;
the BLSTM layer is a bidirectional LSTM neural network consisting of a forward LSTM and a backward LSTM; the output of the BLSTM layer is a probability matrix in which each value represents the probability that the corresponding word in the sentence is tagged with the corresponding semantic label;
the CRF layer is an undirected graphical model;
for a sentence, the output of the output layer is the index of the semantic label assigned to each word in the sentence;
step 3: topic sentence generation
A classification model of topic sentence patterns is trained based on Bi-LSTM and used to classify the text with predicted semantic labels at the level of topic sentence patterns, thereby determining the topic sentence pattern to which each record belongs; based on the content of the domain knowledge graph, the semantic information extraction results and the classification results, the word sequence to be extracted from the text and the sentence pattern in which it is arranged are determined, and a complete text topic sentence is generated.
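The final assembly in step 3 can be illustrated with a minimal sketch. It assumes that the semantic labels and the sentence pattern have already been predicted, and it only shows how labeled words might be arranged into a topic sentence. The label names (`ENTITY`, `STATE`, `ACTION`), the `PATTERNS` table, the function name and the sample sentence are all hypothetical and do not come from the patent.

```python
from typing import Dict, List, Tuple

# Hypothetical sentence patterns: each pattern id maps to the ordered list of
# semantic labels whose words make up the topic sentence.
PATTERNS: Dict[str, List[str]] = {
    "entity_state": ["ENTITY", "STATE"],
    "entity_action": ["ENTITY", "ACTION"],
}


def generate_topic_sentence(tagged: List[Tuple[str, str]], pattern_id: str) -> str:
    """Take the first word carrying each label the pattern needs and join them."""
    first_word_for_label: Dict[str, str] = {}
    for word, label in tagged:
        first_word_for_label.setdefault(label, word)
    slots = PATTERNS[pattern_id]
    missing = [label for label in slots if label not in first_word_for_label]
    if missing:
        raise ValueError(f"no word tagged with {missing}")
    # Chinese is written without spaces, so the slot words are concatenated.
    return "".join(first_word_for_label[label] for label in slots)


# Made-up record: (word, predicted semantic label); "O" marks unlabeled words.
tagged_sentence = [("路边", "O"), ("井盖", "ENTITY"), ("已经", "O"), ("破损", "STATE")]
print(generate_topic_sentence(tagged_sentence, "entity_state"))  # 井盖破损
```

In the claimed method the pattern is chosen by the Bi-LSTM classifier and the labels by the BLSTM-CRF model; here both are supplied by hand.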
2. The domain-oriented method for generating Chinese text topic sentences according to claim 1, wherein in step 1, applying the LDA topic clustering algorithm to the preprocessed city management case event information data set comprises the following steps:
step 101, performing an LDA operation on the city management case event information data set to generate n topic entries, where 2 ≤ n ≤ 10, each topic entry comprising 10 topic words sorted in descending order of TF-IDF weight;
step 102, according to the topic entries obtained in step 101, screening out the events in the city management case event information data set that contain the topic entry combination;
step 103, performing the LDA topic clustering operation again on the event information corresponding to each topic entry to mine the specific event content types under each broad category;
step 104, removing the events screened out in step 102 from the current city management case event information data set, and repeating step 101 to find the topic terms hidden in the current data set;
step 105, after new topic entries are obtained, repeating steps 102, 103 and 104 until the LDA topic clustering operation yields no new topic entries.
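The control flow of steps 101 to 105 can be sketched as follows. This is only an illustration of the loop, not of LDA itself: the `frequent_word_entries` function is a hypothetical stand-in that ranks words by document frequency and returns one-word entries, whereas the claim calls for LDA entries of 10 words ranked by TF-IDF weight. The English tokens stand in for segmented Chinese words, and all names and data are assumptions.

```python
from collections import Counter
from typing import Callable, Dict, List, Set, Tuple

Event = List[str]        # one case record, already segmented into words
Entry = Tuple[str, ...]  # one topic entry, i.e. a tuple of topic words


def frequent_word_entries(events: List[Event], n: int = 2, min_df: int = 2) -> List[Entry]:
    """Hypothetical stand-in for LDA (this is NOT LDA): return up to n
    one-word entries, ranked by how many records contain the word."""
    df = Counter(word for event in events for word in set(event))
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [(word,) for word, count in ranked[:n] if count >= min_df]


def iterative_topic_mining(
    events: List[Event],
    extract: Callable[[List[Event]], List[Entry]] = frequent_word_entries,
) -> Dict[Entry, List[Entry]]:
    """Map each top-level topic entry to the sub-entries found beneath it."""
    remaining = list(events)
    hierarchy: Dict[Entry, List[Entry]] = {}
    while remaining:
        # Step 101: cluster the current data set; keep only unseen entries.
        new_entries = [e for e in extract(remaining) if e not in hierarchy]
        if not new_entries:
            break  # Step 105: stop once no new topic entry appears.
        screened: Set[int] = set()
        for entry in new_entries:
            # Step 102: screen events containing every word of the entry
            # (one possible reading of "topic entry combination").
            hits = [i for i, ev in enumerate(remaining) if all(w in ev for w in entry)]
            screened.update(hits)
            # Step 103: cluster again inside the screened events, with the
            # entry's own words removed, to find more specific content types.
            inner = [[w for w in remaining[i] if w not in entry] for i in hits]
            hierarchy[entry] = extract(inner)
        # Step 104: drop the screened events, then loop back to step 101.
        remaining = [ev for i, ev in enumerate(remaining) if i not in screened]
    return hierarchy


events = [
    ["manhole_cover", "damaged"], ["manhole_cover", "damaged"],
    ["manhole_cover", "missing"], ["manhole_cover", "missing"],
    ["streetlight", "broken"], ["streetlight", "broken"], ["streetlight", "off"],
    ["billboard", "tilted"], ["billboard", "faded"],
    ["graffiti", "wall"],
]
hierarchy = iterative_topic_mining(events)
for entry, sub_entries in hierarchy.items():
    print(entry, "->", sub_entries)
```

On this toy data the first pass finds the two most frequent categories, the second pass finds a category that was hidden behind them, and the third pass finds nothing new, which ends the loop. A real implementation would pass an LDA-based function as `extract`.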
3. The domain-oriented method for generating Chinese text topic sentences according to claim 1, wherein applying the K-means algorithm in step 1 comprises the following steps: combining all topic terms pairwise and calculating their degree of co-occurrence, a high degree of co-occurrence between two topic terms indicating that the two are associated; determining the instance words and descriptors among the topic terms by combining the results of word frequency statistics and part-of-speech tagging; and finally determining the connection structure between the basic units of the graph.
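The pairwise co-occurrence check can be sketched as below. The claim does not define the degree of co-occurrence or say what counts as high, so this sketch assumes a Jaccard-style ratio (records containing both terms divided by records containing either) and an arbitrary threshold of 0.5. The function names, threshold and sample data are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def cooccurrence_degree(docs: List[Set[str]], a: str, b: str) -> float:
    """Assumed measure: records containing both terms / records containing either."""
    both = sum(1 for d in docs if a in d and b in d)
    either = sum(1 for d in docs if a in d or b in d)
    return both / either if either else 0.0


def associated_pairs(
    docs: List[Set[str]], terms: List[str], threshold: float = 0.5
) -> Dict[Tuple[str, str], float]:
    """Return every pair of terms whose degree reaches the assumed threshold."""
    result: Dict[Tuple[str, str], float] = {}
    for a, b in combinations(sorted(terms), 2):
        degree = cooccurrence_degree(docs, a, b)
        if degree >= threshold:
            result[(a, b)] = degree
    return result


# English tokens stand in for segmented Chinese words.
docs = [
    {"manhole_cover", "damaged"},
    {"manhole_cover", "damaged"},
    {"manhole_cover", "missing"},
    {"streetlight", "damaged"},
]
pairs = associated_pairs(docs, ["manhole_cover", "damaged", "streetlight"])
print(pairs)  # {('damaged', 'manhole_cover'): 0.5}
```

Pairs that pass the threshold would then be candidates for (entity, description) links in the graph, once part-of-speech tags decide which term is the instance word and which is the descriptor.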
4. The domain-oriented method for generating Chinese text topic sentences according to claim 1, wherein in step 2, the training set is formed as follows:
manually labeling N records and collecting the words that carry semantic labels to form a labeled word set ws; then searching the unlabeled training data set for words contained in ws and automatically assigning them the corresponding semantic labels; and, after manual correction, combining the large amount of automatically labeled data with the manually labeled data to form the final training set.
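The dictionary-based expansion of the training set can be sketched as follows. The sketch assumes each record is already segmented into words and that `O` marks words without a semantic label; the label names, function names and sample records are hypothetical, and the manual correction pass is a human step that the code does not model.

```python
from typing import Dict, List, Tuple

Tagged = List[Tuple[str, str]]  # one record as (word, semantic label) pairs


def build_labeled_word_set(manual: List[Tagged], outside: str = "O") -> Dict[str, str]:
    """Collect the labeled word set ws from the manually labeled records."""
    ws: Dict[str, str] = {}
    for sentence in manual:
        for word, label in sentence:
            if label != outside:
                ws.setdefault(word, label)  # keep the first label seen for a word
    return ws


def auto_label(sentences: List[List[str]], ws: Dict[str, str], outside: str = "O") -> List[Tagged]:
    """Tag every word found in ws with its label; all other words get `outside`."""
    return [[(word, ws.get(word, outside)) for word in sentence] for sentence in sentences]


# Made-up data: one manually labeled record and two unlabeled records.
manual = [[("井盖", "ENTITY"), ("破损", "STATE")]]
unlabeled = [["路灯", "破损"], ["井盖", "丢失"]]

ws = build_labeled_word_set(manual)
auto = auto_label(unlabeled, ws)
# In the claimed method, `auto` is corrected by hand before this merge.
training_set = manual + auto
print(auto)
```

Words missing from ws, such as 路灯 and 丢失 here, stay unlabeled after the automatic pass, which is why the manual correction step precedes the merge.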
CN201810696452.6A 2018-06-28 2018-06-28 Chinese text subject sentence generating method facing field Active CN108897857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810696452.6A CN108897857B (en) 2018-06-28 2018-06-28 Chinese text subject sentence generating method facing field

Publications (2)

Publication Number Publication Date
CN108897857A CN108897857A (en) 2018-11-27
CN108897857B true CN108897857B (en) 2021-08-27

Family

ID=64347150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810696452.6A Active CN108897857B (en) 2018-06-28 2018-06-28 Chinese text subject sentence generating method facing field

Country Status (1)

Country Link
CN (1) CN108897857B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101599071A (en) * 2009-07-10 2009-12-09 华中科技大学 The extraction method of conversation text topic
CN102096633A (en) * 2010-12-10 2011-06-15 东华大学 Application field oriented software quality standard evaluating method
CN106919674A (en) * 2017-02-20 2017-07-04 广东省中医院 A kind of knowledge Q-A system and intelligent search method built based on Wiki semantic networks
CN107463607A (en) * 2017-06-23 2017-12-12 昆明理工大学 The domain entities hyponymy of bluebeard compound vector sum bootstrapping study obtains and method for organizing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9424524B2 (en) * 2013-12-02 2016-08-23 Qbase, LLC Extracting facts from unstructured text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
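Relation Extraction from Complex Text in the Open Domain (开放领域下复杂文本的关系抽取); Sheng Meilun (盛美伦); China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库信息科技辑》); 2015-06-15; full text *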

Also Published As

Publication number Publication date
CN108897857A (en) 2018-11-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant