CN109947894A - Text label extraction system - Google Patents

Text label extraction system

Info

Publication number
CN109947894A
CN109947894A
Authority
CN
China
Prior art keywords
model
label
text
tag library
decoder model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910008718.8A
Other languages
Chinese (zh)
Other versions
CN109947894B (en)
Inventor
孔洋洋
钱程
朱劲松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Chehui Technology Co Ltd
Original Assignee
Beijing Chehui Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Chehui Technology Co Ltd filed Critical Beijing Chehui Technology Co Ltd
Priority to CN201910008718.8A priority Critical patent/CN109947894B/en
Publication of CN109947894A publication Critical patent/CN109947894A/en
Application granted granted Critical
Publication of CN109947894B publication Critical patent/CN109947894B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

This application discloses a text label extraction system. The system includes: an acquisition module, for obtaining a tag library; a training module, for training an encoder-decoder model using the tag library; and an extraction module, for extracting the labels of a text using the encoder-decoder model. According to the technical solution of this application, labels can be generated for texts such as articles and posts, making it easier for users to find the information they need through labels.

Description

Text label extraction system
Technical field
This application relates to the field of machine learning, and in particular to a text label extraction system.
Background technique
Portal websites oriented toward fields such as automobiles, tourism, and films host a large number of articles, and in the forums of such websites users publish many posts. To make it easier to classify these articles and posts, they need to be labeled.
Currently, the labels of a portal website's articles or posts are drafted by web editors according to the content of the articles; that is, the labels are produced by manual processing.
With this prior-art way of handling labels, the labels are highly repetitive, because editors tend to reuse previously used labels whenever possible, or the choice of labels depends too much on the editor's personal preferences and therefore does not match the content of the article or post. The scope of a label may be too broad or too narrow, accurate labels cannot be obtained, and searching is inconvenient for users. As a result, it is difficult to make further use of the labels, for example for user profiling or for pushing content or advertisements to users.
Summary of the invention
In view of this, the applicant proposes a text label extraction system to reduce the difficulty of label extraction.
According to one aspect of this application, a text label extraction system is proposed. The system includes:
an acquisition module, for obtaining a tag library;
a training module, for training an encoder-decoder model using the tag library;
an extraction module, for extracting the labels of a text using the encoder-decoder model.
Preferably, the acquisition module is further configured to obtain the tag library by an unsupervised learning method; preferably, the unsupervised learning method is PositionRank.
Preferably, the encoder-decoder model includes an encoder model and a decoder model;
wherein the encoder model and/or the decoder model are implemented with a neural network.
Preferably, the neural network is a recurrent neural network.
Preferably, the encoder model is h_t = f(h_{t-1}, x_t), where f(·) is a tanh activation function, x_t is the input of the current step, h_t is the output of the current step, and h_{t-1} is the output of the previous step.
Preferably, the decoder model is
P(y_t | y_{t-1}, y_{t-2}, …, y_1, C_t) = g(S_t, y_{t-1}, C_t)
where S_{t-1} is the hidden state of the previous step, y_{t-1} is the output of the previous step, C_t is the output of the encoder model, i.e. the semantic encoding obtained through the encoder, y_t is the output of the current step, S_t is the hidden state of the current step, and g(·) is a SoftMax function giving the probability that, given the word sequence y_{t-1}, y_{t-2}, …, y_1 and C_t, the next output is y_t.
Preferably, the decoder model uses an attention mechanism.
Preferably, the decoder model calculates the received output of the encoder model according to the following formula:
C_t = Σ_{j=1}^{T_x} a_tj · h_j
where
a_tj = exp(e_tj) / Σ_{k=1}^{T_x} exp(e_tk),  e_tj = V_a · tanh(W_a · [S_{t-1}; h_j])
T_x is the length of the sentence input to the encoder model, i.e. the number of words; a_tj is the attention allocation coefficient for the j-th word when the t-th word is output; h_j is the semantic encoding output of the j-th word. The attention allocation coefficient a_tj is obtained by normalizing e_tj with a SoftMax function, and represents how important each word is to the current decoding step, i.e. which words should receive more attention. e_tj is calculated as above; W_a and V_a are weight matrices whose optimal values are obtained by training. S_{t-1} is the state fed into the current decoding step, and h_j is the output for the j-th word in the encoding stage.
Preferably, the tag library is a tag library of the automotive field, and the text is a text of the automotive field.
Preferably, the system further includes a user profiling module, for building a user profile according to the labels. According to the technical solution of this application, labels can be generated for texts such as articles and posts, making it easier for users to find the information they need through labels; the labels in turn enrich the user's profile, which makes it convenient to classify users and push targeted information. Furthermore, by learning from popular labels, less popular labels can be improved, and popular labels can reflect how much attention users are paying to a topic, which helps the website's operator formulate strategy.
Other features and advantages of this application will be described in detail in the detailed description below.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are used to provide a further understanding of the application; the illustrative embodiments of this application and their descriptions serve to explain the application. In the drawings:
Fig. 1 is a schematic diagram of the text label extraction system provided by an embodiment of this application;
Fig. 2 is an example of an undirected graph composed of words provided by an embodiment of this application;
Fig. 3 is a schematic diagram of the encoder-decoder model provided by an embodiment of this application;
Fig. 4 is a schematic diagram of the encoder-decoder model based on the attention mechanism of an embodiment of this application.
Specific embodiment
It should be noted that in the absence of conflict, the spy in embodiment and each embodiment in the application Sign can be combined with each other.
The application is described in detail below with reference to the accompanying drawings and in conjunction with embodiment.
Fig. 1 shows the text label extraction system provided by an embodiment of this application, which specifically includes:
an acquisition module, for obtaining a tag library. The way in which labels are obtained is not limited in the embodiments of this application; for example, the tag library may be determined after manually reading a certain number of articles or posts, obtained by unsupervised learning, or obtained in other ways.
a training module, for training an encoder-decoder model using the tag library. After the tag library is obtained, it can be used to train the encoder-decoder model. The encoder-decoder model can be a model implemented with a neural network, for example a model implemented with a recurrent neural network (RNN).
an extraction module, for extracting the labels of a text using the encoder-decoder model. After the training of the encoder-decoder model with the tag library is completed, the encoder-decoder model can be used to process texts and thereby extract labels.
The text label extraction system further includes a user profiling module, for building a user profile according to the labels.
Preferably, the unsupervised learning method can be PositionRank. PositionRank is based on PageRank: it uses PageRank to calculate the importance score of each word in an article while also taking the position and frequency of the word into account. In the PositionRank method, the text is first part-of-speech tagged, and the resulting nouns and adjectives are extracted as candidate words. The candidate words form an undirected word graph, as shown in Fig. 2, in which the nouns and adjectives obtained by part-of-speech tagging are the nodes. In addition, the text is segmented with a fixed-size window; if two candidate words fall in the same window, an edge is drawn between them. After the undirected graph is obtained, the score of each node in the graph is calculated according to the PageRank principle:
S(t+1) = M̃ · S(t)    (1)
In formula (1), S denotes the PageRank score matrix, M̃ denotes the normalized adjacency matrix of the undirected graph, and S(t+1) is the score matrix at time t+1, obtained by multiplying the adjacency matrix M̃ with the score matrix at time t.
The adjacency matrix needs to be normalized before the calculation; the values m̃_ij in the matrix are calculated as follows. m_ij denotes the weight of the edge from the j-th node to the i-th node in the undirected graph, V denotes the node set of the undirected graph, and |V| denotes the number of nodes; the normalization is:
m̃_ij = m_ij / Σ_{k=1}^{|V|} m_kj
Meanwhile, to ensure that the computation on the undirected graph does not get trapped in cycles, a damping factor α is added, together with a position bias vector p̃ for the words, turning formula (1) into formula (2):
S(t+1) = (1 − α) · p̃ + α · M̃ · S(t)    (2)
The position bias vector p̃ is calculated as follows:
p̃ = [ p_1 / (p_1 + p_2 + … + p_|V|), p_2 / (p_1 + p_2 + … + p_|V|), …, p_|V| / (p_1 + p_2 + … + p_|V|) ]
p_1, p_2, … represent the initial scores of the words, which are inversely proportional to the positions at which a word occurs in the text and directly proportional to the word's frequency. For example, if the first word occurs at the 5th, 6th, and 7th positions in the text, then p_1 = 1/5 + 1/6 + 1/7. p_1 + p_2 + … + p_|V| is the total score of all words, so dividing p_1 by the total score gives the proportion of the first word among all words, which finally yields the position bias vector p̃.
Combining formulas (1) and (2), the following formula is obtained:
S(v_i) = (1 − α) · p̃_i + α · Σ_{v_j ∈ Adj(v_i)} (w_ji / O(v_j)) · S(v_j)    (3)
where S(v_i) is the PositionRank score of node v_i, α is the damping factor, p̃_i is the position bias of v_i, S(v_j) is the PositionRank score of node v_j, w_ji is the weight of the edge from node v_j to node v_i, Adj(v_i) denotes the set of nodes adjacent to v_i, and O(v_j) is the sum of the weights of all outgoing edges of node v_j.
Through the above PositionRank method, the label set of a text can be obtained. Preferably, the top 10 words in the ranking are included in the label list. Further, it is checked whether any pairwise combination of these words occurs three or more times in the original text; if so, the combined label is also added to the keyword label list, with a PositionRank value equal to the sum of the PositionRank values of the two individual labels. The final label list is sorted from high to low by PositionRank value, and the top five or top three labels can be used as the labels of the article or post. By processing a large number of articles and posts in this way, many labels are obtained, forming the tag library. Preferably, the obtained tag library can also be screened manually to remove incorrect labels and improve its quality.
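As an illustration only, and not part of the original disclosure, the PositionRank scoring described above can be sketched roughly as follows in Python. The part-of-speech filtering is assumed to have been done elsewhere, and the window size, damping factor, and iteration count are assumed values.

```python
from collections import defaultdict

def position_rank(words, candidates, window=3, alpha=0.85, iters=100):
    """Rough sketch of the PositionRank scoring described above.

    words: the tokenized text; candidates: the set of candidate words
    (nouns and adjectives selected by part-of-speech tagging elsewhere).
    """
    # Position bias p~: inversely proportional to position, proportional to frequency,
    # e.g. a word at positions 5, 6, 7 gets 1/5 + 1/6 + 1/7 before normalization.
    p = defaultdict(float)
    for pos, w in enumerate(words, start=1):
        if w in candidates:
            p[w] += 1.0 / pos
    total = sum(p.values())
    p_tilde = {w: v / total for w, v in p.items()}

    # Undirected word graph: connect candidates co-occurring within the window.
    weight = defaultdict(float)            # weight[(v_j, v_i)] = w_ji
    for i, a in enumerate(words):
        for b in words[i + 1:i + window]:
            if a in candidates and b in candidates and a != b:
                weight[(a, b)] += 1.0
                weight[(b, a)] += 1.0

    out_weight = defaultdict(float)        # O(v_j): sum of v_j's outgoing edge weights
    neighbors = defaultdict(set)           # Adj(v_i)
    for (vj, vi), w in weight.items():
        out_weight[vj] += w
        neighbors[vi].add(vj)

    # Iterate S(v_i) = (1 - alpha) * p~_i + alpha * sum_{v_j in Adj(v_i)} (w_ji / O(v_j)) * S(v_j).
    scores = dict(p_tilde)
    for _ in range(iters):
        scores = {
            vi: (1 - alpha) * p_tilde[vi]
                + alpha * sum(weight[(vj, vi)] / out_weight[vj] * scores[vj]
                              for vj in neighbors[vi])
            for vi in p_tilde
        }
    # Top-10 words by score, as in the label-list construction described above.
    return sorted(scores.items(), key=lambda kv: -kv[1])[:10]

# Toy usage with an assumed pre-filtered candidate set:
text = "new model battery range battery charging model range".split()
print(position_rank(text, candidates=set(text)))
```

The combination of frequent word pairs into composite labels and the manual screening step described above would be applied on top of the returned ranking.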
In the embodiments of this application, the obtained tag library is used to derive the labels of a text by end-to-end deep learning, for example by training an encoder-decoder model with the tag library. The encoder-decoder model includes a decoder model and an encoder model, which are generally implemented with neural networks; for example, at least one of the decoder model and the encoder model can be implemented with a recurrent neural network (RNN).
Fig. 3 shows a schematic diagram of the encoder-decoder structure, in which both the encoder model and the decoder model are implemented with recurrent neural networks. In the encoder model, the initial hidden layer receives a text input; after the weight computation, the result is combined with the next input token as the input of the next hidden layer, and so on until encoding is finished. In other words, each state depends on the current input and the previous state, which can be expressed as:
h_t = f(h_{t-1}, x_t)
After all the state information has been obtained, the generated semantic encoding is h_t, and h_t is then used as the input of the decoding RNN.
In the decoder model, the hidden state S_t at a given step is determined by the previous state S_{t-1}, the previous output y_{t-1}, and the semantic encoding C_t:
S_t = f(S_{t-1}, y_{t-1}, C_t)
and the output y_t is determined by the previous output y_{t-1}, the semantic encoding C_t, and the hidden state S_t:
P(y_t | y_{t-1}, y_{t-2}, …, y_1, C_t) = g(S_t, y_{t-1}, C_t)
where f(·) and g(·) are activation functions: f(·) is a tanh function and g(·) is a softmax function.
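The following is a minimal numerical sketch of the encoder and decoder steps described above, not the claimed implementation; the hidden size, vocabulary size, one-hot inputs, and random parameter initialization are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 20                                     # assumed hidden size and vocabulary size
W_hh = rng.normal(0, 0.1, (d, d))                    # encoder recurrence weights
W_hx = rng.normal(0, 0.1, (d, vocab))                # encoder input weights
W_sh = rng.normal(0, 0.1, (d, d))                    # decoder state weights
W_sy = rng.normal(0, 0.1, (d, vocab))                # decoder previous-output weights
W_sc = rng.normal(0, 0.1, (d, d))                    # decoder context weights
W_out = rng.normal(0, 0.1, (vocab, d))               # projection feeding the softmax g(.)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(x_ids):
    """Encoder recurrence h_t = f(h_{t-1}, x_t) with f = tanh."""
    h = np.zeros(d)
    for t in x_ids:
        h = np.tanh(W_hh @ h + W_hx @ np.eye(vocab)[t])
    return h                                         # final semantic encoding C

def decode_step(S_prev, y_prev_id, C):
    """Decoder step S_t = f(S_{t-1}, y_{t-1}, C); P(y_t | ...) = g(S_t, y_{t-1}, C)."""
    S_t = np.tanh(W_sh @ S_prev + W_sy @ np.eye(vocab)[y_prev_id] + W_sc @ C)
    return S_t, softmax(W_out @ S_t)                 # next state and distribution over label tokens

C = encode([3, 7, 1, 12])                            # toy input token ids
S, probs = decode_step(np.zeros(d), 0, C)            # one decoding step from a start token
print(int(probs.argmax()))                           # id of the most probable next label token
```

In practice the weights would of course be learned from the tag library rather than drawn at random; the sketch only shows the data flow of the two recurrences.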
If a single semantic encoding h_t is used at every hidden layer of the decoder model, then when the text is sufficiently long the semantic encoding h_t may not be able to retain all of the information. Therefore, in the decoding stage, the semantic encodings used at different steps should be distinguished, which requires adding an attention mechanism.
In the decoding stage, the semantic encoding input of each decoding step is recomputed as a weighted sum of the encoder-stage outputs h_1, h_2, h_3, …, h_t. A multi-layer perceptron (MLP) model is used to compute, for output step t, the weight a_tj corresponding to the hidden state of each input position j; the weighted average over all hidden states is then the semantic encoding C_t for the current state, with the formula:
C_t = Σ_{j=1}^{T_x} a_tj · h_j
The weights a_tj are calculated by the following formula:
a_tj = exp(e_tj) / Σ_{k=1}^{T_x} exp(e_tk)
where e_tj is the output of the MLP model:
e_tj = V_a · tanh(W_a · [S_{t-1}; h_j])
Using the above attention mechanism, the model in Fig. 3 can be modified into the model shown in Fig. 4, where X1-X6 denote the input text, Y1-Y7 denote the output text, and the part where the arrows converge in the middle is the encoding vector.
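A minimal sketch of the attention-weighted context computation described above, assuming the additive MLP scoring form e_tj = V_a · tanh(W_a · [S_{t-1}; h_j]); the dimensions and random values are illustrative only.

```python
import numpy as np

def attention_context(S_prev, H, W_a, V_a):
    """Compute C_t = sum_j a_tj * h_j for one decoding step.

    S_prev: decoder state S_{t-1}, shape (d,);
    H: encoder outputs h_1..h_Tx stacked as rows, shape (Tx, d);
    W_a, V_a: assumed weight matrices of the small MLP producing e_tj.
    """
    # e_tj = V_a . tanh(W_a . [S_{t-1}; h_j])  (assumed additive-attention form)
    e = np.array([V_a @ np.tanh(W_a @ np.concatenate([S_prev, h_j])) for h_j in H])
    a = np.exp(e - e.max())
    a = a / a.sum()                 # softmax over j: attention allocation coefficients a_tj
    return a @ H, a                 # context vector C_t and the weights

rng = np.random.default_rng(0)
d, Tx = 8, 5                        # assumed hidden size and input length
H = rng.normal(size=(Tx, d))
W_a = rng.normal(size=(d, 2 * d))
V_a = rng.normal(size=d)
C_t, a_t = attention_context(np.zeros(d), H, W_a, V_a)
print(a_t.round(3), C_t.shape)      # weights sum to 1; C_t has the encoder output dimension
```

Because the weights a_tj are recomputed at every decoding step, each output word can focus on the most relevant encoder positions instead of relying on a single fixed encoding.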
After the encoder-decoder model has been obtained by training in the above way, it can be used to process a large amount of text and extract the labels in the text. Preferably, after a label is extracted, it can also be added to the tag library and used to further train the encoder-decoder model, thereby further improving the accuracy of the encoder-decoder.
Taking the automotive field as an example, after the encoder-decoder model is trained with a tag library of the automotive field, label extraction can be performed on texts of the automotive field. For example, car company A launches a new model B with the advertising slogan: "On XX (date), Company A makes a heavyweight release of the new model B." The trained encoder-decoder model can perform label extraction on this slogan. A may already exist in the tag library as a label, and the trained encoder-decoder model can also extract the new label B, so that both A and B can be used as labels of the slogan. Further, the extracted label B can also be added to the tag library for further training of the encoder-decoder model.
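Purely as an illustration of the feedback loop just described, and not the patented system itself, a hypothetical extraction-and-update step might look like this; extract_labels and the model object are assumed names.

```python
def extract_and_update(model, tag_library, text):
    """Hypothetical feedback loop: extract labels from a text with the trained
    encoder-decoder model, then add any new labels to the tag library."""
    labels = model.extract_labels(text)          # assumed interface of the trained model
    new_labels = [lab for lab in labels if lab not in tag_library]
    tag_library.update(new_labels)               # e.g. new label "B" joins existing "A"
    return labels, new_labels

# Example usage with a toy stand-in model:
class ToyModel:
    def extract_labels(self, text):
        return ["A", "B"]                        # placeholder for the real decoder output

tags = {"A"}
print(extract_and_update(ToyModel(), tags, "Company A releases new model B"))
```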
For a registered user who has published posts or articles on portal websites in fields such as automobiles, tourism, or catering, the extracted labels can also be used to profile that user and obtain the user's profile.
The above are merely preferred embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of this application shall fall within the scope of protection of this application.

Claims (9)

1. A text label extraction system, characterized in that the system comprises:
an acquisition module, for obtaining a tag library;
a training module, for training an encoder-decoder model using the tag library;
an extraction module, for extracting the labels of a text using the encoder-decoder model.
2. The system according to claim 1, characterized in that the acquisition module is further configured to obtain the tag library by an unsupervised learning method; preferably, the unsupervised learning method is PositionRank.
3. The system according to claim 1, characterized in that the encoder-decoder model comprises an encoder model and a decoder model;
wherein the encoder model and/or the decoder model are implemented with a neural network;
preferably, the neural network is a recurrent neural network.
4. The system according to claim 3, characterized in that the encoder model is h_t = f(h_{t-1}, x_t), where f(·) is a tanh activation function, x_t is the input of the current step, h_t is the output of the current step, and h_{t-1} is the output of the previous step.
5. The system according to claim 3 or 4, characterized in that the decoder model is
P(y_t | y_{t-1}, y_{t-2}, …, y_1, C_t) = g(S_t, y_{t-1}, C_t)
where S_{t-1} is the hidden state of the previous step, y_{t-1} is the output of the previous step, C_t is the output of the encoder model, y_t is the output of the current step, S_t is the hidden state of the current step, and g(·) is a SoftMax function.
6. The system according to claim 5, characterized in that the decoder model uses an attention mechanism.
7. The system according to claim 5, characterized in that the decoder model calculates the received output of the encoder model according to the following formula:
C_t = Σ_{j=1}^{T_x} a_tj · h_j
where T_x is the length of the sentence input to the encoder model, a_tj is the attention allocation coefficient for the j-th word when the t-th word is output, h_j is the semantic encoding output of the j-th word, and W_a and V_a are weight matrices.
8. The system according to claim 1, characterized in that the tag library is a tag library of the automotive field, and the text is a text of the automotive field.
9. The system according to claim 1, characterized in that the system further comprises a user profiling module, for building a user profile according to the labels.
CN201910008718.8A 2019-01-04 2019-01-04 Text label extraction system Active CN109947894B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910008718.8A CN109947894B (en) 2019-01-04 2019-01-04 Text label extraction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910008718.8A CN109947894B (en) 2019-01-04 2019-01-04 Text label extraction system

Publications (2)

Publication Number Publication Date
CN109947894A true CN109947894A (en) 2019-06-28
CN109947894B CN109947894B (en) 2020-04-14

Family

ID=67007904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910008718.8A Active CN109947894B (en) 2019-01-04 2019-01-04 Text label extraction system

Country Status (1)

Country Link
CN (1) CN109947894B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717338A (en) * 2019-07-29 2020-01-21 北京车慧科技有限公司 Typical opinion generation device based on user comments
CN111324738A (en) * 2020-05-15 2020-06-23 支付宝(杭州)信息技术有限公司 Method and system for determining text label
CN113836443A (en) * 2021-09-28 2021-12-24 土巴兔集团股份有限公司 Article auditing method and related equipment thereof

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930314A (en) * 2016-04-14 2016-09-07 清华大学 Text summarization generation system and method based on coding-decoding deep neural networks
CN108763284A (en) * 2018-04-13 2018-11-06 华南理工大学 A kind of question answering system implementation method based on deep learning and topic model
US20180329884A1 (en) * 2017-05-12 2018-11-15 Rsvp Technologies Inc. Neural contextual conversation learning
CN109086357A (en) * 2018-07-18 2018-12-25 深圳大学 Sensibility classification method, device, equipment and medium based on variation autocoder
CN109117483A (en) * 2018-07-27 2019-01-01 清华大学 The training method and device of neural network machine translation model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930314A (en) * 2016-04-14 2016-09-07 清华大学 Text summarization generation system and method based on coding-decoding deep neural networks
US20180329884A1 (en) * 2017-05-12 2018-11-15 Rsvp Technologies Inc. Neural contextual conversation learning
CN108763284A (en) * 2018-04-13 2018-11-06 华南理工大学 A kind of question answering system implementation method based on deep learning and topic model
CN109086357A (en) * 2018-07-18 2018-12-25 深圳大学 Sensibility classification method, device, equipment and medium based on variation autocoder
CN109117483A (en) * 2018-07-27 2019-01-01 清华大学 The training method and device of neural network machine translation model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CORINA FLORESCU et al.: "PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents", Association for Computational Linguistics *
LI Yachao et al.: "A Survey of Neural Machine Translation" (神经机器翻译综述), Chinese Journal of Computers (计算机学报) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717338A (en) * 2019-07-29 2020-01-21 北京车慧科技有限公司 Typical opinion generation device based on user comments
CN111324738A (en) * 2020-05-15 2020-06-23 支付宝(杭州)信息技术有限公司 Method and system for determining text label
CN113836443A (en) * 2021-09-28 2021-12-24 土巴兔集团股份有限公司 Article auditing method and related equipment thereof

Also Published As

Publication number Publication date
CN109947894B (en) 2020-04-14

Similar Documents

Publication Publication Date Title
CN108021616B (en) Community question-answer expert recommendation method based on recurrent neural network
CN111966917B (en) Event detection and summarization method based on pre-training language model
CN109933664B (en) Fine-grained emotion analysis improvement method based on emotion word embedding
CN110019839B (en) Medical knowledge graph construction method and system based on neural network and remote supervision
CN108804612B (en) Text emotion classification method based on dual neural network model
CN109492229B (en) Cross-domain emotion classification method and related device
CN110598000A (en) Relationship extraction and knowledge graph construction method based on deep learning model
CN107729309A (en) A kind of method and device of the Chinese semantic analysis based on deep learning
CN110909164A (en) Text enhancement semantic classification method and system based on convolutional neural network
CN110188362A (en) Text handling method and device
CN109960728A (en) A kind of open field conferencing information name entity recognition method and system
CN102119385A (en) Method and subsystem for searching media content within a content-search-service system
CN111046941A (en) Target comment detection method and device, electronic equipment and storage medium
CN109947894A (en) A kind of text label extraction system
CN111414561B (en) Method and device for presenting information
CN112183056A (en) Context-dependent multi-classification emotion analysis method and system based on CNN-BilSTM framework
CN112579870A (en) Training method, device and equipment for searching matching model and storage medium
CN103886020A (en) Quick search method of real estate information
CN109472022A (en) New word identification method and terminal device based on machine learning
CN115659008B (en) Information pushing system, method, electronic equipment and medium for big data information feedback
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115630145A (en) Multi-granularity emotion-based conversation recommendation method and system
CN112784602A (en) News emotion entity extraction method based on remote supervision
CN114298157A (en) Short text sentiment classification method, medium and system based on public sentiment big data analysis
CN115438674A (en) Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant