CN109508459A - A method for extracting topics and key information from news

Info

Publication number: CN109508459A
Application number: CN201811313654.4A
Authority: CN
Inventors: 杨红飞
Original assignee: Hangzhou Firestone Technology Co Ltd
Current assignee: Huoshi Creation Technology Co ltd
Priority date: 2018-11-06
Filing date: 2018-11-06
Publication date: 2019-03-22
Anticipated expiration: 2038-11-06
Also published as: CN109508459B

Abstract

The invention discloses a method for extracting topics and key information from news, comprising the following steps: performing HTML tag processing on news content; performing topic labeling and sequence labeling on the processed news content to obtain the topic label of the news content and the sequence label of each word in the news content; creating a topic and key information extraction model that comprises a seq2seq network and a fully connected network, where the input of the fully connected network comes from the state output of the encoding stage of the seq2seq network, and training the model to obtain the optimal parameters; and performing HTML tag processing on unlabeled news content and feeding it into the extraction model to obtain the optimal topic label and sequence labels, with the news category obtained from the topic label and the slot values of the news content obtained from the sequence labels. The method uses seq2seq+attention+crf, which strengthens the dependency between the classification model and the slot filling model, reduces the complexity of text annotation, and reduces project development complexity.

Description

A method for extracting topics and key information from news

Technical field

The present invention relates to the fields of text classification and information extraction, and in particular to a method for extracting topics and key information from news.

Background art

Topic extraction from news falls within text classification, and the slot filling used in key information extraction falls within information extraction; both are major components of natural language processing. Research on text classification dates back to the 1950s, when classification was done with expert rules (patterns). By the early 1980s this had developed into expert systems built with knowledge engineering. The advantage of this approach is that it solves the most pressing problems quickly, but its ceiling is clearly low: it is time-consuming and labor-intensive, and both its coverage and its accuracy are very limited. Later, with the development of statistical learning methods, and especially with the growth of online text after the rise of the internet in the 1990s and the emergence of machine learning as a discipline, a classic approach to large-scale text classification gradually took shape. The main pattern at this stage was manual feature engineering plus a shallow classification model, which splits the text classification problem into two parts: feature engineering and the classifier.

Feature engineering is often the most time-consuming and labor-intensive part of machine learning, yet it is extremely important. In the abstract, a machine learning problem is the process of converting data into information and then refining information into knowledge. Features are the "data --> information" step and determine the upper limit of the result, while the classifier is the "information --> knowledge" step and approaches that upper limit. Unlike classifier models, however, feature engineering is not very general and usually requires an understanding of the specific task. Text classification belongs to the natural language domain and therefore has its own feature-processing logic, which is where most of the work in traditional classification tasks is spent. Text feature engineering consists of three parts: text preprocessing, feature extraction, and text representation. Its final purpose is to convert text into a format a computer can process while encapsulating enough information for classification, that is, strong feature representation ability. Classifiers are essentially statistical classification methods, and most machine learning methods have been applied to text classification, such as the naive Bayes classifier, KNN, SVM, maximum entropy, and neural networks.

Unstructured data such as natural language sentences can be converted into structured data and then queried with powerful tools such as SQL. This way of obtaining meaning from text is called information extraction. An information extraction system searches large amounts of unstructured text for entities and relations of specific types and uses them to populate an organized database, which can then be used to find answers to specific questions. Information extraction is mainly divided into named entity recognition and relation extraction.

Named entity recognition (NER) is a classic problem in natural language processing with very wide application, for example identifying person names and place names in a sentence, product names in e-commerce search queries, or drug names. The traditional algorithm generally regarded as performing well is the conditional random field (CRF), a discriminative probabilistic model and a type of random field that is commonly used to label or analyze sequence data such as natural language text or biological sequences. Applied to NER, it simply predicts the label of each word from a given set of features.

Relation extraction mainly classifies the semantic relation between entities. Mainstream relation extraction techniques fall into three kinds: supervised learning, semi-supervised learning, and unsupervised learning:

1. Supervised learning methods treat relation extraction as a classification problem: effective features are designed from the training data to learn various classification models, and the trained classifier is then used to predict relations. The problem with this approach is that it needs a large manually annotated training corpus, and corpus annotation is usually very time-consuming and labor-intensive.

2. Semi-supervised learning methods mainly use bootstrapping for relation extraction. For the relation to be extracted, the method first sets several seed instances by hand and then iteratively extracts from the data the relation templates corresponding to that relation, along with more instances.

3. Unsupervised learning methods assume that entity pairs with the same semantic relation have similar context information. The context information of each entity pair can therefore be used to represent the semantic relation of that pair, and the semantic relations of all entity pairs are then clustered.

Compared with the other two kinds of method, supervised learning methods can extract more effective features and achieve higher precision and recall. They have therefore attracted the attention of more and more researchers.

In most applications today, named entity recognition and relation extraction are performed as separate tasks, and they are even less often combined with text classification. The entity and relation extraction methods in common use are pipelined: given an input sentence, named entity recognition is performed first, the recognized entities are then paired, relation classification is performed on each pair, and finally the triples in which an entity relation exists are kept as the result. The pipeline approach has the following drawbacks: 1) error propagation: errors in the entity recognition module degrade the performance of the subsequent relation classification; 2) the relationship between the two subtasks is ignored; 3) unnecessary redundant information is produced: because the recognized entities are paired exhaustively before relation classification, pairs of unrelated entities introduce redundant information and raise the error rate.

Existing text classification and slot filling models are each trained as standalone models. This not only ignores the dependency between the tasks but also lengthens the development cycle of the whole project and increases the annotation workload. Text classification and information extraction are both usually implemented with supervised learning, which requires sufficient sample data; annotating samples is time-consuming and labor-intensive, and annotation quality varies from person to person. In this situation, the more tasks there are, the more complex the annotation becomes and the harder its quality is to guarantee. Deep learning is currently the most popular way to solve natural language processing problems, but deep learning models generally take a long time to train, and this becomes more pronounced as the number of tasks grows, seriously constraining project iteration.

Summary of the invention

Slot filling is an important task in natural language understanding, used to extract the role information and attribute information related to an event. News classification and slot filling are usually trained as two independent, unrelated models. From a business perspective, however, slot filling depends on the news category: different categories require different slot types to be filled. The technical solution provided by the invention trains news classification and slot filling as a single model, merging multiple tasks into one and fully taking the correlation between the tasks into account. This largely avoids mismatches between news category and slot type, shortens the development cycle, and improves the accuracy of the results.

The method of the invention mainly uses a seq2seq+attention+crf scheme and specifically comprises the following steps:

(1) perform HTML tag processing on news content crawled from web pages;

(2) perform topic labeling and sequence labeling on the processed news content to obtain the topic label of the news content and the sequence label of each word in the news content; topic labeling marks the category the news belongs to; sequence labeling, given the labeled topic, determines the role or attribute information related to that topic;

(3) create a topic and key information extraction model comprising a seq2seq network and a fully connected network, where the input of the fully connected network comes from the state output of the encoding stage of the seq2seq network;

(4) feed the news data labeled in step (2) into the seq2seq network of the extraction model and encode the words in the news content as follows: first apply embedding vectorization to each word in the news content to obtain a vectorized matrix, then feed the vectorized matrix into the encoding BiLstm (bidirectional recurrent neural network) to obtain the output matrix outputs and the final state matrix finalState;

(5) for the topic label, feed the finalState matrix into the fully connected network of the extraction model to obtain the intermediate result matrix logic, and compute the cross entropy between the logic matrix and the actual topic label to obtain the loss value category_loss;

(6) for the sequence labels, apply an attention mechanism transformation to the outputs matrix to obtain the attention matrix;

(7) feed the attention matrix together with the outputs matrix into the decoding BiLstm (bidirectional recurrent neural network) of the seq2seq network to obtain the decoded output matrix decode_outputs, and use the crf loss function to compute the loss value solt_loss between the decode_outputs matrix and the sequence labels;

(8) add category_loss and solt_loss to obtain the overall loss value loss of the extraction network, then backpropagate loss using gradient descent to obtain the optimal parameters of the extraction model;

(9) perform HTML tag processing on unlabeled news content and feed it into the topic and key information extraction model to obtain the optimal topic label and sequence labels; obtain the news category from the topic label, and obtain the slot values of the news content, i.e. the role or attribute information, from the sequence labels.
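
The loss computation in steps (5), (7) and (8) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names, the explicit `transitions` matrix, and the use of nested Python lists in place of network tensors are all introduced here for readability. `topic_logits` stands in for the logic matrix and `emissions` for the decode_outputs matrix.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def cross_entropy(topic_logits, gold_topic):
    """Softmax cross entropy between topic logits and the gold topic index."""
    return log_sum_exp(topic_logits) - topic_logits[gold_topic]


def crf_negative_log_likelihood(emissions, transitions, gold_tags):
    """Linear-chain CRF loss: log partition function minus gold path score.

    emissions[t][k]   score of tag k at position t (assumed decoder output)
    transitions[i][j] score of moving from tag i to tag j (assumed parameter)
    """
    num_tags = len(transitions)

    # Score of the annotated tag sequence.
    gold_score = emissions[0][gold_tags[0]]
    for t in range(1, len(emissions)):
        gold_score += transitions[gold_tags[t - 1]][gold_tags[t]]
        gold_score += emissions[t][gold_tags[t]]

    # Forward algorithm: alpha[j] is the log-sum of all path scores ending in j.
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            log_sum_exp([alpha[i] + transitions[i][j] for i in range(num_tags)])
            + emissions[t][j]
            for j in range(num_tags)
        ]
    return log_sum_exp(alpha) - gold_score


def joint_loss(topic_logits, gold_topic, emissions, transitions, gold_tags):
    """Step (8): overall loss = category_loss + solt_loss."""
    category_loss = cross_entropy(topic_logits, gold_topic)
    solt_loss = crf_negative_log_likelihood(emissions, transitions, gold_tags)
    return category_loss + solt_loss
```

Because both terms are summed into one scalar, a single backward pass updates the shared encoder with gradients from the classification task and the slot filling task at the same time, which is how the two tasks come to depend on each other.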

Further, in step (4), the embedding vectorization of each word in the news content is performed as follows: using transfer learning, pre-trained embedding word vectors are fed directly into the seq2seq network, and the parameters of the embedding word vectors do not need to be updated during training.
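
The idea of reusing pre-trained word vectors without updating them can be illustrated with the short sketch below. The `Parameter` class, the toy two-dimensional vectors, and the `sgd_step` helper are hypothetical and exist only for this example; the source does not say which framework or which pre-trained vectors are used.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Parameter:
    values: List[float]
    trainable: bool = True


# Hypothetical pre-trained vectors; real ones would be loaded from a file.
PRETRAINED_VECTORS = {"融": [0.1, 0.3], "资": [0.2, -0.1]}
UNKNOWN_VECTOR = [0.0, 0.0]


def build_embedding_table(vectors):
    """Wrap each pre-trained vector as a frozen (non-trainable) parameter."""
    return {tok: Parameter(list(vec), trainable=False) for tok, vec in vectors.items()}


def embed(tokens, table):
    """Look up one vector per token; unseen tokens get the unknown vector."""
    return [
        list(table[tok].values) if tok in table else list(UNKNOWN_VECTOR)
        for tok in tokens
    ]


def sgd_step(params: Dict[str, Parameter], grads: Dict[str, List[float]], lr: float = 0.1):
    """Plain gradient descent that skips every frozen parameter."""
    for name, param in params.items():
        if not param.trainable or name not in grads:
            continue
        param.values = [v - lr * g for v, g in zip(param.values, grads[name])]
```

Freezing the embedding table removes its entries from the set of parameters to optimize, so each training step touches only the encoder, decoder, and fully connected weights.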

Further, in step (6), the attention mechanism transformation of the outputs matrix into the attention matrix uses self-attention and multi-head attention, which overcomes the inability of traditional attention models to be parallelized and improves both effectiveness and performance.
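
A minimal sketch of multi-head self-attention over the encoder outputs is given below. It is a simplification made for illustration: the learned query, key, and value projections of a full multi-head layer are omitted (each head simply attends over its own slice of the feature dimension), and all function names are assumptions rather than names from the source.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(rows):
    """Scaled dot-product attention where queries, keys and values are all `rows`."""
    dim = len(rows[0])
    attended = []
    for query in rows:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in rows
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[d] for w, value in zip(weights, rows)) for d in range(dim)]
        )
    return attended


def multi_head_self_attention(outputs, num_heads):
    """Split features into heads, attend per head, then concatenate the heads."""
    dim = len(outputs[0])
    if dim % num_heads != 0:
        raise ValueError("feature size must be divisible by num_heads")
    head_dim = dim // num_heads
    heads = []
    for h in range(num_heads):
        part = slice(h * head_dim, (h + 1) * head_dim)
        heads.append(self_attention([row[part] for row in outputs]))
    return [[x for head in heads for x in head[t]] for t in range(len(outputs))]
```

Each output position depends only on the full input sequence, not on the result for the previous position, so all positions and all heads can be computed in parallel. This is the property the text contrasts with traditional attention inside a step-by-step recurrent decoder.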

Further, in step (9), the topic and key information extraction model outputs a topic label matrix and a sequence label matrix. For the topic label matrix, softmax is used as the activation function and the topic label with the highest probability is taken as the optimal topic label; for the sequence label matrix, conditional random field (crf) decoding is applied to the decode_outputs decoded output matrix to obtain the optimal sequence labels.
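
The two decoding rules in step (9) can be sketched as follows: a softmax followed by an argmax for the topic, and Viterbi decoding for the crf output. The function names and the explicit `transitions` matrix are assumptions for this example; `emissions` plays the role of decode_outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_topic(topic_logits):
    """Return (index, probability) of the most probable topic label."""
    probs = softmax(topic_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def viterbi_decode(emissions, transitions):
    """Highest-scoring tag path under a linear-chain CRF.

    emissions[t][k]   score of tag k at position t
    transitions[i][j] score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_scores, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for arriving at tag j at position t.
            prev = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            pointers.append(prev)
            new_scores.append(scores[prev] + transitions[prev][j] + emissions[t][j])
        scores = new_scores
        backpointers.append(pointers)

    # Follow the back-pointers from the best final tag.
    path = [max(range(num_tags), key=scores.__getitem__)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

Unlike a per-position argmax, Viterbi decoding takes the transition scores into account, so a locally attractive tag can be rejected when the transition into it is penalized.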

The beneficial effects of the invention are as follows: the invention proposes a method that extracts news topics and key information in a single pass. It uses a seq2seq+attention+crf scheme, which strengthens the dependency between the classification model and the slot filling model, reduces the complexity of text annotation, and reduces project development complexity.

Brief description of the drawings

Fig. 1 is a schematic flow diagram of one embodiment of the invention.

Detailed description of the embodiments

The invention is described in further detail below with reference to the drawings and specific embodiments.

The method for extracting topics and key information from news provided by the invention comprises the following steps:

(1) perform HTML tag processing on news content crawled from web pages;

(2) perform topic labeling and sequence labeling on the processed news content to obtain the topic label of the news content and the sequence label of each word in the news content;

Topic labeling mainly marks the category the news belongs to. For a financial institution, for example, news related to investment and financing information is labeled 1 and all other news is labeled 0;

Sequence labeling, given the labeled topic, determines the role or attribute information related to that topic. For a financing event related to investment and financing information, for example, the corresponding roles are the investor, the investee, and so on, and the corresponding attributes are the financing amount, the financing round, and so on; these roles and attributes are the slots;

In the embedding vectorization of each word in the news content, pre-trained embedding word vectors are fed directly into the seq2seq network using transfer learning, and the parameters of the embedding word vectors do not need to be updated during training.

(6) for the sequence labels, apply an attention mechanism transformation to the outputs matrix to obtain the attention matrix; this transformation uses self-attention and multi-head attention, which overcomes the inability of traditional attention models to be parallelized and improves both effectiveness and performance;

The topic and key information extraction model outputs a topic label matrix and a sequence label matrix. For the topic label matrix, softmax is used as the activation function and the topic label with the highest probability is taken as the optimal topic label; for the sequence label matrix, conditional random field (crf) decoding is applied to the decode_outputs decoded output matrix to obtain the optimal sequence labels.
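
The preprocessing and labeling of steps (1) and (2) can be sketched as below. Several details are assumptions made only for this illustration: the use of Python's standard `html.parser`, the decision to drop `script` and `style` content, the BIO tag scheme (the source does not name a tag scheme), and the English sample sentence with its slot names.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text and ignore tags, scripts, and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(raw):
    """Step (1): reduce crawled HTML to plain news text."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()


def bio_tags(tokens, slots):
    """Step (2): one tag per token from (start, end, slot_name) spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, name in slots:
        tags[start] = "B-" + name
        for i in range(start + 1, end):
            tags[i] = "I-" + name
    return tags


# Hypothetical annotated sample: topic 1 marks financing news.
raw = "<p>Acme raised <b>$5M</b> in Series A</p><script>x()</script>"
tokens = strip_html(raw).split()
sample = {
    "tokens": tokens,
    "topic": 1,
    "tags": bio_tags(tokens, [(0, 1, "investee"), (2, 3, "amount"), (4, 6, "round")]),
}
```

A single annotated sample thus carries both supervision signals, one topic label for the whole text and one sequence label per token, which is what allows the two losses to be trained jointly.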

For example, the method of the invention is applied to the following news item:

"Mindray has released Resona 8 (Nüwa), a high-end intelligent obstetrics and gynecology ultrasound system based on big data algorithms. It includes multiple intelligent applications such as automatic fetal brain volume navigation, automatic fetal face navigation, automatic fetal heart volume navigation, and intelligent pelvic floor ultrasound, and will bring attentive care to women's prenatal diagnosis, postpartum recovery, and reproductive health";

As shown in Fig. 1, the news item is input into the topic and key information extraction model. In the encoding stage, the model takes the word as the basic unit and applies embedding, f-lstm (forward LSTM), and b-lstm (backward LSTM) in turn to obtain the output matrix outputs and the final state matrix finalState. The finalState matrix is passed through the fully connected layer to obtain the final topic label. In the decoding stage, outputs and the attention corresponding to outputs are fed together into the decoding network, where lstm and crf decode processing are applied in turn to obtain the final sequence labels, which are finally converted into the corresponding slot values.
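
The last conversion, from sequence labels to slot values, might look like the following sketch. It assumes BIO-style tags, which the source does not specify, and the function name, the `joiner` parameter, and the sample data are hypothetical.

```python
def tags_to_slots(tokens, tags, joiner=" "):
    """Group consecutive B-/I- tagged tokens into slot values.

    Returns a dict mapping each slot name to the list of values found.
    Use joiner="" for character-level Chinese tokens.
    """
    slots = {}
    current_name, current_tokens = None, []

    def flush():
        nonlocal current_name, current_tokens
        if current_name is not None:
            slots.setdefault(current_name, []).append(joiner.join(current_tokens))
        current_name, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_name, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_name == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open slot.
            flush()
    flush()
    return slots


tokens = ["Acme", "raised", "$5M", "in", "Series", "A"]
tags = ["B-investee", "O", "B-amount", "O", "B-round", "I-round"]
slot_values = tags_to_slots(tokens, tags)
```

For the sample above, `slot_values` maps `investee` to `["Acme"]`, `amount` to `["$5M"]`, and `round` to `["Series A"]`.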

The foregoing is merely a preferred embodiment of the invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A method for extracting topics and key information from news, characterized by comprising the following steps:

(1) performing HTML tag processing on news content crawled from web pages;

(2) performing topic labeling and sequence labeling on the processed news content to obtain the topic label of the news content and the sequence label of each word in the news content; the topic labeling marks the category the news belongs to; the sequence labeling, given the labeled topic, determines the role or attribute information related to that topic;

(3) creating a topic and key information extraction model comprising a seq2seq network and a fully connected network, where the input of the fully connected network comes from the state output of the encoding stage of the seq2seq network;

(4) feeding the news data labeled in step (2) into the seq2seq network of the extraction model and encoding the words in the news content as follows: first applying embedding vectorization to each word in the news content to obtain a vectorized matrix, then feeding the vectorized matrix into the encoding BiLstm (bidirectional recurrent neural network) to obtain the output matrix outputs and the final state matrix finalState;

(5) for the topic label, feeding the finalState matrix into the fully connected network of the extraction model to obtain the intermediate result matrix logic, and computing the cross entropy between the logic matrix and the actual topic label to obtain the loss value category_loss;

(6) for the sequence labels, applying an attention mechanism transformation to the outputs matrix to obtain the attention matrix;

(7) feeding the attention matrix together with the outputs matrix into the decoding BiLstm (bidirectional recurrent neural network) of the seq2seq network to obtain the decoded output matrix decode_outputs, and using the crf loss function to compute the loss value solt_loss between the decode_outputs matrix and the sequence labels;

(8) adding category_loss and solt_loss to obtain the overall loss value loss of the extraction network, then backpropagating loss using gradient descent to obtain the optimal parameters of the extraction model;

(9) performing HTML tag processing on unlabeled news content and feeding it into the topic and key information extraction model to obtain the optimal topic label and sequence labels; obtaining the news category from the topic label, and obtaining the slot values of the news content, i.e. the role or attribute information, from the sequence labels.

2. The method for extracting topics and key information from news according to claim 1, characterized in that, in step (4), the embedding vectorization of each word in the news content is performed as follows: using transfer learning, pre-trained embedding word vectors are fed directly into the seq2seq network, and the parameters of the embedding word vectors do not need to be updated during training.

3. The method for extracting topics and key information from news according to claim 1, characterized in that, in step (6), the attention mechanism transformation of the outputs matrix into the attention matrix uses self-attention and multi-head attention, which overcomes the inability of traditional attention models to be parallelized and improves both effectiveness and performance.

4. The method for extracting topics and key information from news according to claim 1, characterized in that, in step (9), the topic and key information extraction model outputs a topic label matrix and a sequence label matrix; for the topic label matrix, softmax is used as the activation function and the topic label with the highest probability is taken as the optimal topic label; for the sequence label matrix, conditional random field (crf) decoding is applied to the decode_outputs decoded output matrix to obtain the optimal sequence labels.

5. The method for extracting topics and key information from news according to claim 1, characterized in that the method uses seq2seq+attention+crf, which strengthens the dependency between the classification model and the slot filling model, reduces the complexity of text annotation, and reduces project development complexity.