CN109947894A - Text label extraction system - Google Patents

Text label extraction system

Info

Publication number
CN109947894A
CN109947894A
Authority
CN
China
Prior art keywords
model
label
text
tag library
decoder model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910008718.8A
Other languages
Chinese (zh)
Other versions
CN109947894B (en)
Inventor
孔洋洋
钱程
朱劲松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Chehui Technology Co Ltd
Original Assignee
Beijing Chehui Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Chehui Technology Co Ltd filed Critical Beijing Chehui Technology Co Ltd
Priority to CN201910008718.8A priority Critical patent/CN109947894B/en
Publication of CN109947894A publication Critical patent/CN109947894A/en
Application granted granted Critical
Publication of CN109947894B publication Critical patent/CN109947894B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

This application discloses a text label extraction system. The system includes: an acquisition module, for obtaining a tag library; a training module, for training an encoder-decoder model using the tag library; and an extraction module, for extracting the labels of a text using the encoder-decoder model. According to the technical solution of this application, labels can be generated for texts such as articles and posts, making it easier for users to find the information they need through labels.

Description

Text label extraction system
Technical field
This application relates to the field of machine learning, and in particular to a text label extraction system.
Background technique
Portal websites oriented toward fields such as automobiles, tourism, and films host a large number of articles, and in the forums of such websites users publish many posts. To make it easier to classify these articles and posts, they need to be labeled.
Currently, the labels of a portal website's articles or posts are drafted by web editors according to the content of the articles; that is, the labels are produced by manual processing.
With this prior-art way of handling labels, the labels are highly repetitive, because editors tend to reuse previously used labels whenever possible, or the choice of labels depends too much on the editor's personal preferences and therefore does not match the content of the article or post. The scope of a label may be too broad or too narrow, accurate labels cannot be obtained, and searching is inconvenient for users. As a result, it is difficult to make further use of the labels, for example for user profiling or for pushing content or advertisements to users.
Summary of the invention
In view of this, the applicant proposes a text label extraction system to reduce the difficulty of label extraction.
According to one aspect of this application, a text label extraction system is proposed. The system includes:
an acquisition module, for obtaining a tag library;
a training module, for training an encoder-decoder model using the tag library;
an extraction module, for extracting the labels of a text using the encoder-decoder model.
Preferably, the acquisition module is further configured to obtain the tag library by an unsupervised learning method; preferably, the unsupervised learning method is PositionRank.
Preferably, the encoder-decoder model includes an encoder model and a decoder model;
wherein the encoder model and/or the decoder model are implemented with a neural network.
Preferably, the neural network is a recurrent neural network.
Preferably, the encoder model is h_t = f(h_{t-1}, x_t), where f(·) is a tanh activation function, x_t is the input of the current step, h_t is the output of the current step, and h_{t-1} is the output of the previous step.
Preferably, the decoder model is
P(y_t | y_{t-1}, y_{t-2}, …, y_1, C_t) = g(S_t, y_{t-1}, C_t)
where S_{t-1} is the hidden state of the previous step, y_{t-1} is the output of the previous step, C_t is the output of the encoder model, i.e. the semantic encoding obtained through the encoder, y_t is the output of the current step, S_t is the hidden state of the current step, and g(·) is a SoftMax function giving the probability that, given the word sequence y_{t-1}, y_{t-2}, …, y_1 and C_t, the next output is y_t.
Preferably, the decoder model uses an attention mechanism.
Preferably, the decoder model calculates the received output of the encoder model according to the following formula:
C_t = Σ_{j=1}^{T_x} a_tj · h_j
where
a_tj = exp(e_tj) / Σ_{k=1}^{T_x} exp(e_tk),  e_tj = V_a · tanh(W_a · [S_{t-1}; h_j])
T_x is the length of the sentence input to the encoder model, i.e. the number of words; a_tj is the attention allocation coefficient for the j-th word when the t-th word is output; h_j is the semantic encoding output of the j-th word. The attention allocation coefficient a_tj is obtained by normalizing e_tj with a SoftMax function, and represents how important each word is to the current decoding step, i.e. which words should receive more attention. e_tj is calculated as above; W_a and V_a are weight matrices whose optimal values are obtained by training. S_{t-1} is the state fed into the current decoding step, and h_j is the output for the j-th word in the encoding stage.
Preferably, the tag library is a tag library of the automotive field, and the text is a text of the automotive field.
Preferably, the system further includes a user profiling module, for building a user profile according to the labels. According to the technical solution of this application, labels can be generated for texts such as articles and posts, making it easier for users to find the information they need through labels; the labels in turn enrich the user's profile, which makes it convenient to classify users and push targeted information. Furthermore, by learning from popular labels, less popular labels can be improved, and popular labels can reflect how much attention users are paying to a topic, which helps the website's operator formulate strategy.
Other features and advantages of this application will be described in detail in the detailed description below.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are used to provide a further understanding of the application; the illustrative embodiments of this application and their descriptions serve to explain the application. In the drawings:
Fig. 1 is a schematic diagram of the text label extraction system provided by an embodiment of this application;
Fig. 2 is an example of an undirected graph composed of words provided by an embodiment of this application;
Fig. 3 is a schematic diagram of the encoder-decoder model provided by an embodiment of this application;
Fig. 4 is a schematic diagram of the encoder-decoder model based on the attention mechanism of an embodiment of this application.
Specific embodiment
It should be noted that in the absence of conflict, the spy in embodiment and each embodiment in the application Sign can be combined with each other.
The application is described in detail below with reference to the accompanying drawings and in conjunction with embodiment.
Fig. 1 shows the text label extraction system provided by an embodiment of this application, which specifically includes:
an acquisition module, for obtaining a tag library. The way in which labels are obtained is not limited in the embodiments of this application; for example, the tag library may be determined after manually reading a certain number of articles or posts, obtained by unsupervised learning, or obtained in other ways.
a training module, for training an encoder-decoder model using the tag library. After the tag library is obtained, it can be used to train the encoder-decoder model. The encoder-decoder model can be a model implemented with a neural network, for example a model implemented with a recurrent neural network (RNN).
an extraction module, for extracting the labels of a text using the encoder-decoder model. After the training of the encoder-decoder model with the tag library is completed, the encoder-decoder model can be used to process texts and thereby extract labels.
The text label extraction system further includes a user profiling module, for building a user profile according to the labels.
Preferably, the unsupervised learning method can be PositionRank. PositionRank is based on PageRank: it uses PageRank to calculate the importance score of each word in an article while also taking the position and frequency of the word into account. In the PositionRank method, the text is first part-of-speech tagged, and the resulting nouns and adjectives are extracted as candidate words. The candidate words form an undirected word graph, as shown in Fig. 2, in which the nouns and adjectives obtained by part-of-speech tagging are the nodes. In addition, the text is segmented with a fixed-size window; if two candidate words fall in the same window, an edge is drawn between them. After the undirected graph is obtained, the score of each node in the graph is calculated according to the PageRank principle:
S(t+1) = M̃ · S(t)    (1)
In formula (1), S denotes the PageRank score matrix, M̃ denotes the normalized adjacency matrix of the undirected graph, and S(t+1) is the score matrix at time t+1, obtained by multiplying the adjacency matrix M̃ with the score matrix at time t.
The adjacency matrix needs to be normalized before the calculation; the values m̃_ij in the matrix are calculated as follows. m_ij denotes the weight of the edge from the j-th node to the i-th node in the undirected graph, V denotes the node set of the undirected graph, and |V| denotes the number of nodes; the normalization is:
m̃_ij = m_ij / Σ_{k=1}^{|V|} m_kj
Meanwhile, to ensure that the computation on the undirected graph does not get trapped in cycles, a damping factor α is added, together with a position bias vector p̃ for the words, turning formula (1) into formula (2):
S(t+1) = (1 − α) · p̃ + α · M̃ · S(t)    (2)
The position bias vector p̃ is calculated as follows:
p̃ = [ p_1 / (p_1 + p_2 + … + p_|V|), p_2 / (p_1 + p_2 + … + p_|V|), …, p_|V| / (p_1 + p_2 + … + p_|V|) ]
p_1, p_2, … represent the initial scores of the words, which are inversely proportional to the positions at which a word occurs in the text and directly proportional to the word's frequency. For example, if the first word occurs at the 5th, 6th, and 7th positions in the text, then p_1 = 1/5 + 1/6 + 1/7. p_1 + p_2 + … + p_|V| is the total score of all words, so dividing p_1 by the total score gives the proportion of the first word among all words, which finally yields the position bias vector p̃.
Combining formulas (1) and (2), the following formula is obtained:
S(v_i) = (1 − α) · p̃_i + α · Σ_{v_j ∈ Adj(v_i)} (w_ji / O(v_j)) · S(v_j)    (3)
where S(v_i) is the PositionRank score of node v_i, α is the damping factor, p̃_i is the position bias of v_i, S(v_j) is the PositionRank score of node v_j, w_ji is the weight of the edge from node v_j to node v_i, Adj(v_i) denotes the set of nodes adjacent to v_i, and O(v_j) is the sum of the weights of all outgoing edges of node v_j.
Through the above PositionRank method, the label set of a text can be obtained. Preferably, the top 10 words in the ranking are included in the label list. Further, it is checked whether any pairwise combination of these words occurs three or more times in the original text; if so, the combined label is also added to the keyword label list, with a PositionRank value equal to the sum of the PositionRank values of the two individual labels. The final label list is sorted from high to low by PositionRank value, and the top five or top three labels can be used as the labels of the article or post. By processing a large number of articles and posts in this way, many labels are obtained, forming the tag library. Preferably, the obtained tag library can also be screened manually to remove incorrect labels and improve its quality.
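As an illustration only, and not part of the original disclosure, the PositionRank scoring described above can be sketched roughly as follows in Python. The part-of-speech filtering is assumed to have been done elsewhere, and the window size, damping factor, and iteration count are assumed values.

```python
from collections import defaultdict

def position_rank(words, candidates, window=3, alpha=0.85, iters=100):
    """Rough sketch of the PositionRank scoring described above.

    words: the tokenized text; candidates: the set of candidate words
    (nouns and adjectives selected by part-of-speech tagging elsewhere).
    """
    # Position bias p~: inversely proportional to position, proportional to frequency,
    # e.g. a word at positions 5, 6, 7 gets 1/5 + 1/6 + 1/7 before normalization.
    p = defaultdict(float)
    for pos, w in enumerate(words, start=1):
        if w in candidates:
            p[w] += 1.0 / pos
    total = sum(p.values())
    p_tilde = {w: v / total for w, v in p.items()}

    # Undirected word graph: connect candidates co-occurring within the window.
    weight = defaultdict(float)            # weight[(v_j, v_i)] = w_ji
    for i, a in enumerate(words):
        for b in words[i + 1:i + window]:
            if a in candidates and b in candidates and a != b:
                weight[(a, b)] += 1.0
                weight[(b, a)] += 1.0

    out_weight = defaultdict(float)        # O(v_j): sum of v_j's outgoing edge weights
    neighbors = defaultdict(set)           # Adj(v_i)
    for (vj, vi), w in weight.items():
        out_weight[vj] += w
        neighbors[vi].add(vj)

    # Iterate S(v_i) = (1 - alpha) * p~_i + alpha * sum_{v_j in Adj(v_i)} (w_ji / O(v_j)) * S(v_j).
    scores = dict(p_tilde)
    for _ in range(iters):
        scores = {
            vi: (1 - alpha) * p_tilde[vi]
                + alpha * sum(weight[(vj, vi)] / out_weight[vj] * scores[vj]
                              for vj in neighbors[vi])
            for vi in p_tilde
        }
    # Top-10 words by score, as in the label-list construction described above.
    return sorted(scores.items(), key=lambda kv: -kv[1])[:10]

# Toy usage with an assumed pre-filtered candidate set:
text = "new model battery range battery charging model range".split()
print(position_rank(text, candidates=set(text)))
```

The combination of frequent word pairs into composite labels and the manual screening step described above would be applied on top of the returned ranking.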
In the embodiments of this application, the obtained tag library is used to derive the labels of a text by end-to-end deep learning, for example by training an encoder-decoder model with the tag library. The encoder-decoder model includes a decoder model and an encoder model, which are generally implemented with neural networks; for example, at least one of the decoder model and the encoder model can be implemented with a recurrent neural network (RNN).
Fig. 3 shows a schematic diagram of the encoder-decoder structure, in which both the encoder model and the decoder model are implemented with recurrent neural networks. In the encoder model, the initial hidden layer receives a text input; after the weight computation, the result is combined with the next input token as the input of the next hidden layer, and so on until encoding is finished. In other words, each state depends on the current input and the previous state, which can be expressed as:
h_t = f(h_{t-1}, x_t)
After all the state information has been obtained, the generated semantic encoding is h_t, and h_t is then used as the input of the decoding RNN.
In the decoder model, the hidden state S_t at a given step is determined by the previous state S_{t-1}, the previous output y_{t-1}, and the semantic encoding C_t:
S_t = f(S_{t-1}, y_{t-1}, C_t)
and the output y_t is determined by the previous output y_{t-1}, the semantic encoding C_t, and the hidden state S_t:
P(y_t | y_{t-1}, y_{t-2}, …, y_1, C_t) = g(S_t, y_{t-1}, C_t)
where f(·) and g(·) are activation functions: f(·) is a tanh function and g(·) is a softmax function.
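The following is a minimal numerical sketch of the encoder and decoder steps described above, not the claimed implementation; the hidden size, vocabulary size, one-hot inputs, and random parameter initialization are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 20                                     # assumed hidden size and vocabulary size
W_hh = rng.normal(0, 0.1, (d, d))                    # encoder recurrence weights
W_hx = rng.normal(0, 0.1, (d, vocab))                # encoder input weights
W_sh = rng.normal(0, 0.1, (d, d))                    # decoder state weights
W_sy = rng.normal(0, 0.1, (d, vocab))                # decoder previous-output weights
W_sc = rng.normal(0, 0.1, (d, d))                    # decoder context weights
W_out = rng.normal(0, 0.1, (vocab, d))               # projection feeding the softmax g(.)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(x_ids):
    """Encoder recurrence h_t = f(h_{t-1}, x_t) with f = tanh."""
    h = np.zeros(d)
    for t in x_ids:
        h = np.tanh(W_hh @ h + W_hx @ np.eye(vocab)[t])
    return h                                         # final semantic encoding C

def decode_step(S_prev, y_prev_id, C):
    """Decoder step S_t = f(S_{t-1}, y_{t-1}, C); P(y_t | ...) = g(S_t, y_{t-1}, C)."""
    S_t = np.tanh(W_sh @ S_prev + W_sy @ np.eye(vocab)[y_prev_id] + W_sc @ C)
    return S_t, softmax(W_out @ S_t)                 # next state and distribution over label tokens

C = encode([3, 7, 1, 12])                            # toy input token ids
S, probs = decode_step(np.zeros(d), 0, C)            # one decoding step from a start token
print(int(probs.argmax()))                           # id of the most probable next label token
```

In practice the weights would of course be learned from the tag library rather than drawn at random; the sketch only shows the data flow of the two recurrences.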
If a single semantic encoding h_t is used at every hidden layer of the decoder model, then when the text is sufficiently long the semantic encoding h_t may not be able to retain all of the information. Therefore, in the decoding stage, the semantic encodings used at different steps should be distinguished, which requires adding an attention mechanism.
In the decoding stage, the semantic encoding input of each decoding step is recomputed as a weighted sum of the encoder-stage outputs h_1, h_2, h_3, …, h_t. A multi-layer perceptron (MLP) model is used to compute, for output step t, the weight a_tj corresponding to the hidden state of each input position j; the weighted average over all hidden states is then the semantic encoding C_t for the current state, with the formula:
C_t = Σ_{j=1}^{T_x} a_tj · h_j
The weights a_tj are calculated by the following formula:
a_tj = exp(e_tj) / Σ_{k=1}^{T_x} exp(e_tk)
where e_tj is the output of the MLP model:
e_tj = V_a · tanh(W_a · [S_{t-1}; h_j])
Using the above attention mechanism, the model in Fig. 3 can be modified into the model shown in Fig. 4, where X1-X6 denote the input text, Y1-Y7 denote the output text, and the part where the arrows converge in the middle is the encoding vector.
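A minimal sketch of the attention-weighted context computation described above, assuming the additive MLP scoring form e_tj = V_a · tanh(W_a · [S_{t-1}; h_j]); the dimensions and random values are illustrative only.

```python
import numpy as np

def attention_context(S_prev, H, W_a, V_a):
    """Compute C_t = sum_j a_tj * h_j for one decoding step.

    S_prev: decoder state S_{t-1}, shape (d,);
    H: encoder outputs h_1..h_Tx stacked as rows, shape (Tx, d);
    W_a, V_a: assumed weight matrices of the small MLP producing e_tj.
    """
    # e_tj = V_a . tanh(W_a . [S_{t-1}; h_j])  (assumed additive-attention form)
    e = np.array([V_a @ np.tanh(W_a @ np.concatenate([S_prev, h_j])) for h_j in H])
    a = np.exp(e - e.max())
    a = a / a.sum()                 # softmax over j: attention allocation coefficients a_tj
    return a @ H, a                 # context vector C_t and the weights

rng = np.random.default_rng(0)
d, Tx = 8, 5                        # assumed hidden size and input length
H = rng.normal(size=(Tx, d))
W_a = rng.normal(size=(d, 2 * d))
V_a = rng.normal(size=d)
C_t, a_t = attention_context(np.zeros(d), H, W_a, V_a)
print(a_t.round(3), C_t.shape)      # weights sum to 1; C_t has the encoder output dimension
```

Because the weights a_tj are recomputed at every decoding step, each output word can focus on the most relevant encoder positions instead of relying on a single fixed encoding.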
After the encoder-decoder model has been obtained by training in the above way, it can be used to process a large amount of text and extract the labels in the text. Preferably, after a label is extracted, it can also be added to the tag library and used to further train the encoder-decoder model, thereby further improving the accuracy of the encoder-decoder.
Taking the automotive field as an example, after the encoder-decoder model is trained with a tag library of the automotive field, label extraction can be performed on texts of the automotive field. For example, car company A launches a new model B with the advertising slogan: "On XX (date), Company A makes a heavyweight release of the new model B." The trained encoder-decoder model can perform label extraction on this slogan. A may already exist in the tag library as a label, and the trained encoder-decoder model can also extract the new label B, so that both A and B can be used as labels of the slogan. Further, the extracted label B can also be added to the tag library for further training of the encoder-decoder model.
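Purely as an illustration of the feedback loop just described, and not the patented system itself, a hypothetical extraction-and-update step might look like this; extract_labels and the model object are assumed names.

```python
def extract_and_update(model, tag_library, text):
    """Hypothetical feedback loop: extract labels from a text with the trained
    encoder-decoder model, then add any new labels to the tag library."""
    labels = model.extract_labels(text)          # assumed interface of the trained model
    new_labels = [lab for lab in labels if lab not in tag_library]
    tag_library.update(new_labels)               # e.g. new label "B" joins existing "A"
    return labels, new_labels

# Example usage with a toy stand-in model:
class ToyModel:
    def extract_labels(self, text):
        return ["A", "B"]                        # placeholder for the real decoder output

tags = {"A"}
print(extract_and_update(ToyModel(), tags, "Company A releases new model B"))
```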
For a registered user who has published posts or articles on portal websites in fields such as automobiles, tourism, or catering, the extracted labels can also be used to profile that user and obtain the user's profile.
The above are merely preferred embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of this application shall fall within the scope of protection of this application.

Claims (9)

1. A text label extraction system, characterized in that the system comprises:
an acquisition module, for obtaining a tag library;
a training module, for training an encoder-decoder model using the tag library;
an extraction module, for extracting the labels of a text using the encoder-decoder model.
2. The system according to claim 1, characterized in that the acquisition module is further configured to obtain the tag library by an unsupervised learning method; preferably, the unsupervised learning method is PositionRank.
3. The system according to claim 1, characterized in that the encoder-decoder model comprises an encoder model and a decoder model;
wherein the encoder model and/or the decoder model are implemented with a neural network;
preferably, the neural network is a recurrent neural network.
4. The system according to claim 3, characterized in that the encoder model is h_t = f(h_{t-1}, x_t), where f(·) is a tanh activation function, x_t is the input of the current step, h_t is the output of the current step, and h_{t-1} is the output of the previous step.
5. The system according to claim 3 or 4, characterized in that the decoder model is
P(y_t | y_{t-1}, y_{t-2}, …, y_1, C_t) = g(S_t, y_{t-1}, C_t)
where S_{t-1} is the hidden state of the previous step, y_{t-1} is the output of the previous step, C_t is the output of the encoder model, y_t is the output of the current step, S_t is the hidden state of the current step, and g(·) is a SoftMax function.
6. The system according to claim 5, characterized in that the decoder model uses an attention mechanism.
7. The system according to claim 5, characterized in that the decoder model calculates the received output of the encoder model according to the following formula:
C_t = Σ_{j=1}^{T_x} a_tj · h_j
where T_x is the length of the sentence input to the encoder model, a_tj is the attention allocation coefficient for the j-th word when the t-th word is output, h_j is the semantic encoding output of the j-th word, and W_a and V_a are weight matrices.
8. The system according to claim 1, characterized in that the tag library is a tag library of the automotive field, and the text is a text of the automotive field.
9. The system according to claim 1, characterized in that the system further comprises a user profiling module, for building a user profile according to the labels.
CN201910008718.8A 2019-01-04 2019-01-04 Text label extraction system Active CN109947894B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910008718.8A CN109947894B (en) 2019-01-04 2019-01-04 Text label extraction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910008718.8A CN109947894B (en) 2019-01-04 2019-01-04 Text label extraction system

Publications (2)

Publication Number Publication Date
CN109947894A true CN109947894A (en) 2019-06-28
CN109947894B CN109947894B (en) 2020-04-14

Family

ID=67007904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910008718.8A Active CN109947894B (en) 2019-01-04 2019-01-04 Text label extraction system

Country Status (1)

Country Link
CN (1) CN109947894B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717338A (en) * 2019-07-29 2020-01-21 北京车慧科技有限公司 Typical opinion generation device based on user comments
CN111324738A (en) * 2020-05-15 2020-06-23 支付宝(杭州)信息技术有限公司 Method and system for determining text label
CN113836443A (en) * 2021-09-28 2021-12-24 土巴兔集团股份有限公司 Article auditing method and related equipment thereof

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930314A (en) * 2016-04-14 2016-09-07 清华大学 Text summarization generation system and method based on coding-decoding deep neural networks
CN108763284A (en) * 2018-04-13 2018-11-06 华南理工大学 A kind of question answering system implementation method based on deep learning and topic model
US20180329884A1 (en) * 2017-05-12 2018-11-15 Rsvp Technologies Inc. Neural contextual conversation learning
CN109086357A (en) * 2018-07-18 2018-12-25 深圳大学 Sensibility classification method, device, equipment and medium based on variation autocoder
CN109117483A (en) * 2018-07-27 2019-01-01 清华大学 The training method and device of neural network machine translation model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930314A (en) * 2016-04-14 2016-09-07 清华大学 Text summarization generation system and method based on coding-decoding deep neural networks
US20180329884A1 (en) * 2017-05-12 2018-11-15 Rsvp Technologies Inc. Neural contextual conversation learning
CN108763284A (en) * 2018-04-13 2018-11-06 华南理工大学 A kind of question answering system implementation method based on deep learning and topic model
CN109086357A (en) * 2018-07-18 2018-12-25 深圳大学 Sensibility classification method, device, equipment and medium based on variation autocoder
CN109117483A (en) * 2018-07-27 2019-01-01 清华大学 The training method and device of neural network machine translation model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CORINA FLORESCU et al.: "PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents", Association for Computational Linguistics *
LI Yachao et al.: "A Survey of Neural Machine Translation" (神经机器翻译综述), Chinese Journal of Computers (计算机学报) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717338A (en) * 2019-07-29 2020-01-21 北京车慧科技有限公司 Typical opinion generation device based on user comments
CN111324738A (en) * 2020-05-15 2020-06-23 支付宝(杭州)信息技术有限公司 Method and system for determining text label
CN113836443A (en) * 2021-09-28 2021-12-24 土巴兔集团股份有限公司 Article auditing method and related equipment thereof

Also Published As

Publication number Publication date
CN109947894B (en) 2020-04-14

Similar Documents

Publication Publication Date Title
CN108021616B (en) Community question-answer expert recommendation method based on recurrent neural network
CN111966917B (en) Event detection and summarization method based on pre-training language model
CN109933664B (en) Fine-grained emotion analysis improvement method based on emotion word embedding
CN110019839B (en) Medical knowledge graph construction method and system based on neural network and remote supervision
CN108804612B (en) Text emotion classification method based on dual neural network model
CN109492229B (en) Cross-domain emotion classification method and related device
CN110598000A (en) Relationship extraction and knowledge graph construction method based on deep learning model
CN107729309A (en) A kind of method and device of the Chinese semantic analysis based on deep learning
CN110909164A (en) Text enhancement semantic classification method and system based on convolutional neural network
CN110188362A (en) Text handling method and device
CN109960728A (en) A kind of open field conferencing information name entity recognition method and system
CN102119385A (en) Method and subsystem for searching media content within a content-search-service system
CN111046941A (en) Target comment detection method and device, electronic equipment and storage medium
CN109947894A (en) A kind of text label extraction system
CN111414561B (en) Method and device for presenting information
CN112183056A (en) Context-dependent multi-classification emotion analysis method and system based on CNN-BilSTM framework
CN112579870A (en) Training method, device and equipment for searching matching model and storage medium
CN103886020A (en) Quick search method of real estate information
CN109472022A (en) New word identification method and terminal device based on machine learning
CN115659008B (en) Information pushing system, method, electronic equipment and medium for big data information feedback
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115630145A (en) Multi-granularity emotion-based conversation recommendation method and system
CN112784602A (en) News emotion entity extraction method based on remote supervision
CN114298157A (en) Short text sentiment classification method, medium and system based on public sentiment big data analysis
CN115438674A (en) Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant