CN109947894A - Text label extraction system - Google Patents
Text label extraction system
- Publication number
- CN109947894A CN109947894A CN201910008718.8A CN201910008718A CN109947894A CN 109947894 A CN109947894 A CN 109947894A CN 201910008718 A CN201910008718 A CN 201910008718A CN 109947894 A CN109947894 A CN 109947894A
- Authority
- CN
- China
- Prior art keywords
- model
- label
- text
- tag library
- decoder model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
This application discloses a text label extraction system. The system includes: an acquisition module, for obtaining a tag library; a training module, for training an encoder-decoder model using the tag library; and an extraction module, for extracting labels of a text using the encoder-decoder model. According to the technical solution of the application, labels can be generated for texts such as articles and forum posts, making it convenient for users to find the information they need through labels.
Description
Technical field
This application relates to the field of machine learning, and more particularly to a text label extraction system.
Background art
Portal websites oriented to fields such as automobiles, tourism, and films host a large number of articles, and in the forums of such websites, network users publish many posts. For ease of classifying these articles and posts, they need to be labelled.
Currently, the labels of the articles or posts of a portal website are drafted by web editors according to the content of the article; that is, labels are generated by manual processing.
With this prior-art way of handling labels, the repetition rate of labels is high, because editors reuse previously used labels wherever possible; alternatively, the choice of labels is overly tied to the editor's personal feelings and mismatches the content of the article or post. The scope of a label may also be too broad or too narrow, so accurate labels cannot be obtained and users cannot search conveniently, which makes it difficult to build further applications on the labels, such as user profiling or pushing content or advertisements to users.
Summary of the invention
In view of this, the present application proposes a text label extraction system, so as to reduce the difficulty of label extraction.
According to one aspect of the application, a text label extraction system is proposed, which includes:
an acquisition module, for obtaining a tag library;
a training module, for training an encoder-decoder model using the tag library;
an extraction module, for extracting labels of a text using the encoder-decoder model.
Preferably, the acquisition module is also used to obtain the tag library by a method of unsupervised learning; preferably, the unsupervised learning method is PositionRank.
Preferably, the encoder-decoder model includes an encoder model and a decoder model;
wherein the encoder model and/or the decoder model are implemented with a neural network;
preferably, the neural network is a recurrent neural network.
Preferably, the encoder model is h_t = f(h_{t-1}, x_t), where f(·) is the tanh activation function, x_t is the input of the current layer, h_t is the output of the current layer, and h_{t-1} is the output of the previous layer.
Preferably, the decoder model is
P(y_t | y_{t-1}, y_{t-2}, …, y_1, C_t) = g(S_t, y_{t-1}, C_t)
where S_{t-1} is the input of this layer, y_{t-1} is the output of the previous layer, C_t is the output of the encoder model, i.e., the semantic coding obtained through the encoder, y_t is the output of the current layer, S_t is the state of the current layer, and g(·) is the SoftMax function, which gives the probability that the next output is y_t given the word sequence y_{t-1}, y_{t-2}, …, y_1 and C_t.
Preferably, the decoder model uses an attention mechanism.
Preferably, the decoder model calculates from the received encoder model output according to the following formulas:
C_t = Σ_{j=1..T_x} a_tj · h_j
a_tj = exp(e_tj) / Σ_{k=1..T_x} exp(e_tk)
e_tj = V_a · tanh(W_a · [S_{t-1}; h_j])
where T_x is the length of the sentence a input to the encoder model, i.e., the number of words; a_tj is the attention allocation coefficient of the j-th word when the t-th word is output; and h_j is the semantic coding output of the j-th word. The attention allocation coefficient a_tj is obtained by normalizing e_tj with the SoftMax function, and gives the importance of each word at a given decoding step, i.e., which words should receive more attention. e_tj is calculated as in the formula above, where W_a and V_a are weight matrices whose optimal values are obtained through training; S_{t-1} is the input of this layer at the decoding stage, and h_j is the output of the coding stage for the j-th word.
Preferably, the tag library is a tag library of the automotive field, and the text is a text of the automotive field.
Preferably, the system further includes a user profiling module, for building user profiles according to the labels. According to the technical solution of the present application, labels can be generated for texts such as articles and posts, making it convenient for users to find the information they need through labels; in turn, the labels enrich the users' profiles, which makes it convenient to classify users and push targeted information. Furthermore, by learning from popular labels, unpopular labels can be improved, and popular labels can reflect the degree of user attention, helping the website's network operator formulate strategy.
Other features and advantages of the application will be described in detail in the following detailed description section.
Brief description of the drawings
The accompanying drawings, which constitute a part of this application, are used to provide further understanding of the present application; the schematic embodiments of the application and their description are used to explain the application. In the accompanying drawings:
Fig. 1 is a schematic diagram of the text label extraction system provided by the embodiments of the present application;
Fig. 2 is an example of an undirected graph composed of words, provided by the embodiments of the present application;
Fig. 3 is a schematic diagram of the encoder-decoder model provided by the embodiments of the present application;
Fig. 4 is a schematic diagram of the attention-based encoder-decoder model of the embodiments of the present application.
Detailed description of the embodiments
It should be noted that, in the absence of conflict, the embodiments in the application and the features in each embodiment can be combined with each other. The application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows the text label extraction system provided by the embodiments of the present application, which specifically includes:
An acquisition module, for obtaining a tag library. The way of obtaining labels is not limited in the embodiments of the present application; for example, the tag library can be determined after manually reading a certain number of articles or posts, obtained by unsupervised learning, or obtained by other means.
A training module, for training an encoder-decoder model using the tag library. After the tag library is obtained, the encoder-decoder model can be trained with it. The encoder-decoder model can be a model implemented with a neural network, for example a model implemented with a recurrent neural network (RNN).
An extraction module, for extracting labels of a text using the encoder-decoder model. After the training of the encoder-decoder model with the tag library is completed, the encoder-decoder model can be used to process texts, thereby extracting their labels.
The text label extraction system further includes a user profiling module, for building user profiles according to the labels.
Preferably, the method of unsupervised learning can be PositionRank. PositionRank is based on PageRank: it uses PageRank to calculate the importance score of the words in an article, while also taking the position and the frequency of words into account. In the PositionRank method, the text is first part-of-speech tagged; the nouns and adjectives obtained are extracted as candidate words, and the candidate words form an undirected graph of words, as shown in Fig. 2, where the nouns and adjectives obtained by part-of-speech tagging are the nodes of the undirected graph. In addition, the text is segmented using a fixed-size window; if two candidate words fall within the same window, the two words are connected with an edge. After the undirected graph is obtained, the score of each node in the undirected graph is calculated according to the PageRank principle:
S(t+1) = M̃ · S(t)  (1)
In formula (1), S denotes the PageRank score matrix and M̃ denotes the adjacency matrix of the undirected graph; S(t+1), the score matrix at time t+1, is obtained by multiplying the adjacency matrix M̃ with the score matrix at time t.
The adjacency matrix M̃ needs to be normalized before the calculation. The values m̃_ij in the matrix are calculated as follows:
m̃_ij = m_ij / Σ_{k=1..|V|} m_kj  (taken as 0 when the denominator is 0)
where m_ij denotes the weight from the j-th node to the i-th node in the undirected graph, V denotes the node set of the undirected graph, and |V| denotes the number of nodes; the normalization divides each weight by the total outgoing weight of its source node.
Meanwhile, in order to ensure that the walk on the undirected graph does not get trapped in cycles, a damping factor α is added, together with a position bias vector p̃ for the words, transforming formula (1) into formula (2):
S(t+1) = (1 − α) · p̃ + α · M̃ · S(t)  (2)
The specific formula for the position bias p̃ is as follows:
p̃ = [ p_1 / (p_1 + p_2 + … + p_|V|), p_2 / (p_1 + p_2 + … + p_|V|), …, p_|V| / (p_1 + p_2 + … + p_|V|) ]
Each p_i represents the initial score of a word: it is inversely proportional to the positions at which the word occurs in the text and directly proportional to the word frequency. For example, if the first word occurs at the 5th, 6th, and 7th positions in the text, then p_1 = 1/5 + 1/6 + 1/7. The sum p_1 + p_2 + … + p_|V| represents the total score of all words; dividing p_1 by the total gives the proportion of the first word among all words, which finally yields the position bias p̃.
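As a small worked illustration of the position bias just described (the word names here are hypothetical, chosen only for the example):

```python
# Illustrative only: "word1" occurs at positions 5, 6 and 7 (1-indexed),
# so its raw score is 1/5 + 1/6 + 1/7; "word2" occurs once, at position 2.
raw = {
    "word1": 1/5 + 1/6 + 1/7,
    "word2": 1/2,
}
total = sum(raw.values())                           # p1 + p2 + ... + p|V|
p_tilde = {w: s / total for w, s in raw.items()}    # each word's share of the total
```

The entries of p̃ sum to 1, and words that occur early and often receive the larger share.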
Combining formulas (1) and (2), the following formula is obtained:
S(v_i) = (1 − α) · p̃_i + α · Σ_{v_j ∈ Adj(v_i)} ( w_ji / O(v_j) ) · S(v_j)
where S(v_i) is the PositionRank score of node v_i, α is the damping factor, and p̃_i is the position bias of v_i; S(v_j) is the PositionRank score of node v_j, w_ji is the weight from node v_j to node v_i, Adj(v_i) denotes the set of nodes adjacent to v_i, and O(v_j) denotes the sum of the weights of all outgoing edges of node v_j.
Through the above PositionRank method, a label set for a text can be obtained. Preferably, the top 10 words in the ranking are included in the label list. Further, it is checked whether any two of them occur together three or more times in the original text; if so, the combined label is also added to the keyword label list, and its PositionRank value is the sum of the respective PositionRank values of the two labels. The resulting label list is sorted from high to low by PositionRank value, and the top five or top three labels can be used as the labels of the article or post. By processing a large number of articles and posts, many labels are obtained, thereby yielding the tag library. Preferably, the obtained tag library can also be screened manually to remove incorrect labels and improve the quality of the tag library.
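The scoring procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: it assumes part-of-speech filtering has already produced the candidate word list, and the helper name `position_rank` is hypothetical.

```python
from collections import defaultdict

def position_rank(words, window=3, alpha=0.85, iters=50):
    """Sketch of PositionRank scoring; `words` is the POS-filtered
    candidate word list in document order."""
    vocab = sorted(set(words))
    # Position bias: each word accumulates the reciprocal of every
    # (1-indexed) position it occupies; scores are then normalized.
    bias = defaultdict(float)
    for pos, w in enumerate(words, start=1):
        bias[w] += 1.0 / pos
    total = sum(bias.values())
    p = {w: bias[w] / total for w in vocab}

    # Undirected co-occurrence graph: connect candidates that fall
    # within the same sliding window of `window` words.
    weight = defaultdict(float)
    for i, u in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != u:
                weight[(u, words[j])] += 1.0
                weight[(words[j], u)] += 1.0
    out = defaultdict(float)
    for (u, v), wt in weight.items():
        out[u] += wt

    # Power iteration of S(vi) = (1-a)*p_i + a * sum_j w_ji/O(vj) * S(vj).
    s = {w: 1.0 / len(vocab) for w in vocab}
    for _ in range(iters):
        s = {
            w: (1 - alpha) * p[w]
            + alpha * sum(
                weight[(v, w)] / out[v] * s[v]
                for v in vocab
                if out[v] > 0 and (v, w) in weight
            )
            for w in vocab
        }
    return sorted(s.items(), key=lambda kv: -kv[1])
```

Applied to a document's candidate words, the top-ranked entries (e.g., the top 10) would form the label list described above.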
In the embodiments of the present application, the obtained tag library is used to obtain the labels of a text by way of end-to-end deep learning, for example by training an encoder-decoder model with the tag library. The encoder-decoder model includes an encoder model and a decoder model, generally implemented with neural networks; for example, at least one of the encoder model and the decoder model can be implemented with a recurrent neural network (RNN).
Fig. 3 shows a schematic diagram of the encoder-decoder structure, in which both the encoder model and the decoder model are implemented with recurrent neural networks. For the encoder model, the initial hidden layer receives a text input; the obtained result, after a weight calculation, is combined with the next layer's input text to serve as the input of the next hidden layer, until encoding is completed. That is, a given state is related to the current input and the previous state, which can be expressed as:
h_t = f(h_{t-1}, x_t)
After all the state information is obtained, the generated semantic coding is h_t, and h_t then serves as the input of the decoding RNN.
For the decoder model, the hidden layer state S_t at a given moment is determined by the previous state S_{t-1}, the previous output y_{t-1}, and the semantic coding C_t:
S_t = f(S_{t-1}, y_{t-1}, C_t)
The output y_t is determined by the previous output y_{t-1}, the semantic coding C_t, and the hidden layer state S_t:
P(y_t | y_{t-1}, y_{t-2}, …, y_1, C_t) = g(S_t, y_{t-1}, C_t)
where f(·) and g(·) are activation functions: f(·) is the tanh function and g(·) is the softmax function.
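One encoder pass and one decoder step can be sketched in numpy as below. The sizes and randomly initialized weights are illustrative assumptions, not the patent's parameters, and folding C_t additively into the decoder state is one simple choice among several:

```python
import numpy as np

rng = np.random.default_rng(0)
E, H, V = 5, 8, 12                         # embedding, hidden, vocab sizes (assumed)
W_xh = 0.1 * rng.standard_normal((H, E))   # input  -> hidden
W_hh = 0.1 * rng.standard_normal((H, H))   # hidden -> hidden
W_hy = 0.1 * rng.standard_normal((V, H))   # hidden -> output logits

def encode(xs):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t); the final h_t is the semantic coding."""
    h = np.zeros(H)
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x)
    return h

def decode_step(s_prev, y_prev, c):
    """One decoder step: S_t = f(S_{t-1}, y_{t-1}, C_t) with f = tanh,
    then g = softmax over the output vocabulary."""
    s = np.tanh(W_hh @ s_prev + W_xh @ y_prev + c)  # fold C_t into the state
    logits = W_hy @ s
    e = np.exp(logits - logits.max())
    return s, e / e.sum()

# Encode a 4-word input, then take one decoding step.
xs = rng.standard_normal((4, E))
c = encode(xs)
s1, probs = decode_step(np.zeros(H), np.zeros(E), c)
```

The softmax output `probs` is the distribution P(y_t | …, C_t) over candidate output words at step t.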
If a single semantic coding h_t is used in every hidden layer of the decoder model, then when the text is sufficiently long, the semantic coding h_t may not retain all the information. Therefore, at the decoding stage, the semantic codings used at different steps should be distinguished, which requires adding an attention mechanism.
At the decoding stage, the semantic coding input of each decoding step is recalculated: the coding-stage outputs h_1, h_2, h_3, …, h_t are weighted and summed. A multi-layer perceptron (MLP) model is used to calculate, for output step t, the corresponding weight a_tj for the hidden layer of each input position j; the hidden layers are then weighted and averaged, and the final result is the semantic coding C_t under the current state. The formula is as follows:
C_t = Σ_{j=1..T_x} a_tj · h_j
The corresponding weights a_tj are calculated by the following formula:
a_tj = exp(e_tj) / Σ_{k=1..T_x} exp(e_tk)
where e_tj is the output of the MLP model:
e_tj = V_a · tanh(W_a · [S_{t-1}; h_j])
With the above attention mechanism, the model in Fig. 3 can be modified into the model shown in Fig. 4, where X1-X6 denote the input text, Y1-Y7 denote the output text, and the part where the arrows converge is the coding vector.
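The attention computation described above can be sketched as follows. The concatenation form of the MLP input is an assumption, since the source only names the weight matrices W_a and V_a; the sizes and random weights are likewise illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
H = 8                                        # hidden size (assumed)
W_a = 0.1 * rng.standard_normal((H, 2 * H))  # MLP weight matrix (assumed shape)
V_a = 0.1 * rng.standard_normal(H)           # MLP projection vector

def attention(s_prev, hs):
    """Return (a_t, C_t): e_tj = V_a . tanh(W_a [S_{t-1}; h_j]),
    a_tj = softmax_j(e_tj), and C_t = sum_j a_tj * h_j."""
    e = np.array([V_a @ np.tanh(W_a @ np.concatenate([s_prev, h])) for h in hs])
    a = np.exp(e - e.max())
    a /= a.sum()
    return a, a @ hs   # weighted sum of the encoder outputs

hs = rng.standard_normal((5, H))   # encoder outputs h_1 .. h_5
a_t, c_t = attention(np.zeros(H), hs)
```

Each decoding step recomputes `a_t` from the current decoder state, so different steps attend to different input words.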
After the encoder-decoder model is obtained by training in the above manner, the encoder-decoder model can be used to process large amounts of text, so as to extract the labels in the texts. Preferably, after a label is extracted, it can also be added to the tag library, and the encoder-decoder model can be trained further, so as to further improve the accuracy of the encoder-decoder.
Taking the automotive field as an example, after the encoder-decoder model has been trained with a tag library of the automotive field, label extraction can be performed on texts of the automotive field. For example, motor corporation A launches a new model B with the advertising slogan: "On XX XX, XX, Company A releases blockbuster new model B." The trained encoder-decoder model can perform label extraction on this slogan. As for the tag library, A may already exist in the tag library as a label, and the trained encoder-decoder model can also extract the new label B, so that both A and B can serve as labels of the slogan. Further, the extracted label B can also be added to the tag library for further training of the encoder-decoder model.
For a registered user who has published posts or articles on portal websites in fields such as automobiles, tourism, or catering, the extracted labels can also be used to profile that user, obtaining the user's profile.
The above are only preferred embodiments of the application and are not intended to limit the application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the application shall be included within the protection scope of the application.
Claims (9)
1. A text label extraction system, characterized in that the system includes:
an acquisition module, for obtaining a tag library;
a training module, for training an encoder-decoder model using the tag library;
an extraction module, for extracting labels of a text using the encoder-decoder model.
2. The system according to claim 1, characterized in that the acquisition module is also used to obtain the tag library by a method of unsupervised learning; preferably, the method of unsupervised learning is PositionRank.
3. The system according to claim 1, characterized in that the encoder-decoder model includes an encoder model and a decoder model;
wherein the encoder model and/or the decoder model are implemented with a neural network;
preferably, the neural network is a recurrent neural network.
4. The system according to claim 3, characterized in that the encoder model is h_t = f(h_{t-1}, x_t), where f(·) is the tanh activation function, x_t is the input of the current layer, h_t is the output of the current layer, and h_{t-1} is the output of the previous layer.
5. The system according to claim 3 or 4, characterized in that the decoder model is
P(y_t | y_{t-1}, y_{t-2}, …, y_1, C_t) = g(S_t, y_{t-1}, C_t)
where S_{t-1} is the input of the current layer, y_{t-1} is the output of the previous layer, C_t is the output of the encoder model, y_t is the output of the current layer, S_t is the state of the current layer, and g(·) is the SoftMax function.
6. The system according to claim 5, characterized in that the decoder model uses an attention mechanism.
7. The system according to claim 5, characterized in that the decoder model calculates from the received encoder model output according to the following formulas:
C_t = Σ_{j=1..T_x} a_tj · h_j
a_tj = exp(e_tj) / Σ_{k=1..T_x} exp(e_tk)
e_tj = V_a · tanh(W_a · [S_{t-1}; h_j])
where T_x is the length of the sentence a input to the encoder model, a_tj is the attention allocation coefficient of the j-th word when the t-th word is output, h_j is the semantic coding output of the j-th word, and W_a and V_a are weight matrices.
8. The system according to claim 1, characterized in that the tag library is a tag library of the automotive field, and the text is a text of the automotive field.
9. The system according to claim 1, characterized in that the system further includes a user profiling module, for building user profiles according to the labels.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910008718.8A CN109947894B (en) | 2019-01-04 | 2019-01-04 | Text label extraction system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109947894A true CN109947894A (en) | 2019-06-28 |
CN109947894B CN109947894B (en) | 2020-04-14 |
Family
ID=67007904
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910008718.8A Active CN109947894B (en) | 2019-01-04 | 2019-01-04 | Text label extraction system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109947894B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110717338A (en) * | 2019-07-29 | 2020-01-21 | 北京车慧科技有限公司 | Typical opinion generation device based on user comments |
CN111324738A (en) * | 2020-05-15 | 2020-06-23 | 支付宝(杭州)信息技术有限公司 | Method and system for determining text label |
CN113836443A (en) * | 2021-09-28 | 2021-12-24 | 土巴兔集团股份有限公司 | Article auditing method and related equipment thereof |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105930314A (en) * | 2016-04-14 | 2016-09-07 | 清华大学 | Text summarization generation system and method based on coding-decoding deep neural networks |
CN108763284A (en) * | 2018-04-13 | 2018-11-06 | 华南理工大学 | A kind of question answering system implementation method based on deep learning and topic model |
US20180329884A1 (en) * | 2017-05-12 | 2018-11-15 | Rsvp Technologies Inc. | Neural contextual conversation learning |
CN109086357A (en) * | 2018-07-18 | 2018-12-25 | 深圳大学 | Sensibility classification method, device, equipment and medium based on variation autocoder |
CN109117483A (en) * | 2018-07-27 | 2019-01-01 | 清华大学 | The training method and device of neural network machine translation model |
Non-Patent Citations (2)
Title |
---|
CORINA FLORESCU et al.: "PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents", Association for Computational Linguistics *
LI Yachao et al.: "A Survey of Neural Machine Translation", Chinese Journal of Computers *
Also Published As
Publication number | Publication date |
---|---|
CN109947894B (en) | 2020-04-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108021616B (en) | Community question-answer expert recommendation method based on recurrent neural network | |
CN111966917B (en) | Event detection and summarization method based on pre-training language model | |
CN109933664B (en) | Fine-grained emotion analysis improvement method based on emotion word embedding | |
CN110019839B (en) | Medical knowledge graph construction method and system based on neural network and remote supervision | |
CN108804612B (en) | Text emotion classification method based on dual neural network model | |
CN109492229B (en) | Cross-domain emotion classification method and related device | |
CN110598000A (en) | Relationship extraction and knowledge graph construction method based on deep learning model | |
CN107729309A (en) | Method and device for Chinese semantic analysis based on deep learning | |
CN110909164A (en) | Text enhancement semantic classification method and system based on convolutional neural network | |
CN110188362A (en) | Text handling method and device | |
CN109960728A (en) | Open-domain conference information named entity recognition method and system | |
CN102119385A (en) | Method and subsystem for searching media content within a content-search-service system | |
CN111046941A (en) | Target comment detection method and device, electronic equipment and storage medium | |
CN109947894A (en) | Text label extraction system | |
CN111414561B (en) | Method and device for presenting information | |
CN112183056A (en) | Context-dependent multi-classification emotion analysis method and system based on CNN-BilSTM framework | |
CN112579870A (en) | Training method, device and equipment for searching matching model and storage medium | |
CN103886020A (en) | Quick search method of real estate information | |
CN109472022A (en) | New word identification method and terminal device based on machine learning | |
CN115659008B (en) | Information pushing system, method, electronic equipment and medium for big data information feedback | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN115630145A (en) | Multi-granularity emotion-based conversation recommendation method and system | |
CN112784602A (en) | News emotion entity extraction method based on remote supervision | |
CN114298157A (en) | Short text sentiment classification method, medium and system based on public sentiment big data analysis | |
CN115438674A (en) | Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||