CN108170678A - Text entity extraction method and system - Google Patents
Text entity extraction method and system
- Publication number: CN108170678A
- Application number: CN201711450896.3A
- Authority
- CN
- China
- Prior art keywords
- word
- preset
- entity
- entities
- original text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G — PHYSICS
- G06 — COMPUTING; CALCULATING OR COUNTING
- G06F — ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00 — Handling natural language data
- G06F40/20 — Natural language analysis
- G06F40/279 — Recognition of textual entities
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295 — Named entity recognition
Abstract
The invention discloses a text entity extraction method and system. The text entity extraction method includes: acquiring original text; according to a preset entity dictionary, searching the original text for entity words not included in the entity dictionary to form a test corpus; training a preset two-layer neural network extraction model on the test corpus; and, according to the preset two-layer neural network extraction model and the test corpus, predicting new entity words and updating the new entity words into the preset entity dictionary. The method can identify words and network neologisms not covered by the dictionary, improving the accuracy and efficiency of text entity extraction.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a text entity extraction method and system.
Background technology
With the continuous development of science and technology, and of information technology in particular, interpersonal communication has evolved from simple face-to-face exchange toward ever greater use of "text" as the linguistic carrier of information; the most obvious examples are digital libraries and web page text. Unquestionably, effective management of these language resources greatly facilitates users' access to information. With the growth of network communication, however, the amount of text available online has expanded drastically, one might even say exponentially, and classifying these texts by hand as before is not only time-consuming and laborious but also cannot guarantee accuracy. Text entity extraction methods based on natural language processing technology therefore came into being.
At present, text entity extraction methods make the classification of massive text collections feasible, and they are also an important foundation for application fields such as information extraction, question answering systems, syntactic analysis, machine translation, and metadata annotation for the Semantic Web. Existing text entity extraction methods rely primarily on a dictionary: words in the text are matched and recognized against the dictionary, yielding the entities the dictionary contains; the entities in a text usually include person names, place names, organization names, proper nouns, and the like. But because existing text entity extraction methods depend excessively on the dictionary, words and network neologisms the dictionary does not cover cannot be recognized, which reduces the accuracy and efficiency of text entity extraction.
Summary of the invention
The object of the present invention is to provide a text entity extraction method and system that can identify words and network neologisms not covered by a dictionary, improving the accuracy and efficiency of text entity extraction.
To solve the above technical problem, an embodiment of the present invention provides a text entity extraction method, including:
acquiring original text;
according to a preset entity dictionary, searching the original text for entity words not included in the entity dictionary to form a test corpus;
training a preset two-layer neural network extraction model on the test corpus;
according to the preset two-layer neural network extraction model and the test corpus, predicting new entity words and updating the new entity words into the preset entity dictionary.
Preferably, the text entity extraction method further includes:
establishing an entity word classification model according to an SVM composite kernel composed of a convolution tree kernel and an entity-feature kernel;
performing classification annotation on the new entity words according to the entity word classification model;
verifying the new entity words according to a preset loss function.
Preferably, training the preset two-layer neural network extraction model on the test corpus specifically includes:
establishing the preset two-layer neural network extraction model according to the Skip-gram algorithm and the Bag-of-words algorithm;
generating joint word vectors according to the test corpus, the characteristic parameters of the Skip-gram algorithm, and the characteristic parameters of the Bag-of-words algorithm;
training the preset two-layer neural network extraction model on the joint word vectors.
Preferably, the text entity extraction method further includes:
performing noise-reduction processing on the original text;
performing word segmentation on the noise-reduced original text according to a preset word segmentation model.
Preferably, performing word segmentation on the noise-reduced original text according to the preset word segmentation model specifically includes:
establishing the preset word segmentation model according to the MMseg segmentation algorithm and a CRF discrimination algorithm;
performing discriminant analysis on ambiguous words in the noise-reduced original text according to the CRF discrimination algorithm of the preset word segmentation model;
performing cutting on the noise-reduced original text according to the MMseg segmentation algorithm of the preset word segmentation model.
Preferably, searching the original text for entity words not included in the entity dictionary according to the preset entity dictionary to form the test corpus specifically includes:
according to the preset entity dictionary, identifying the primary entity words in the original text that the entity dictionary contains;
according to the primary entity words, performing syntactic analysis, contextual analysis, and probability statistics on the original text to obtain the entity words not included in the entity dictionary, and forming the test corpus.
Preferably, in the preset two-layer neural network extraction model, X_n is the joint word vector, y_n is the predicted new entity word, and N is the size of the test corpus; C is the parameter of the softmax function, and A is the pre-trained word-vector matrix.
Preferably, in the entity word classification model, λ is a weight coefficient with 0 < λ < 1; E1 and E2 are two new entity words; SFT is the shortest-path inclusion tree; CTK is the convolution tree kernel; Equal is the entity-feature kernel; E1.C_i is the i-th category feature of entity word E1 and E2.C_i is the i-th category feature of entity word E2 (when E1 belongs to the i-th category, E1.C_i is 1, and otherwise 0); when E1.C_i and E2.C_i are both 1, the corresponding value of Equal is 1, and otherwise 0; M is the number of categories.
An embodiment of the present invention further provides a text entity extraction system, including:
a text acquisition module for acquiring original text;
a test corpus generation module for searching the original text, according to a preset entity dictionary, for entity words not included in the entity dictionary to form a test corpus;
a model training module for training a preset two-layer neural network extraction model on the test corpus;
an entity word prediction module for predicting new entity words according to the preset two-layer neural network extraction model and the test corpus, and updating the new entity words into the preset entity dictionary.
Preferably, the text entity extraction system further includes:
a classification model establishment module for establishing an entity word classification model according to an SVM composite kernel composed of a convolution tree kernel and an entity-feature kernel;
a classification annotation module for performing classification annotation on the new entity words according to the entity word classification model;
an entity word verification module for verifying the new entity words according to a preset loss function.
Compared with the prior art, the text entity extraction method provided by an embodiment of the present invention has the following advantageous effects. The method includes acquiring original text; searching the original text, according to a preset entity dictionary, for entity words not included in the entity dictionary to form a test corpus; training a preset two-layer neural network extraction model on the test corpus; and, according to the preset two-layer neural network extraction model and the test corpus, predicting new entity words and updating them into the preset entity dictionary. The method can identify words and network neologisms not covered by the dictionary, improving the accuracy and efficiency of text entity extraction. An embodiment of the present invention further provides a text entity extraction system.
Description of the drawings
Fig. 1 is a flowchart of a text entity extraction method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of a text entity extraction system provided by an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention, without creative effort, shall fall within the protection scope of the present invention.
Referring to Fig. 1, a flowchart of a text entity extraction method provided by an embodiment of the present invention, the text entity extraction method includes:
S1: acquiring original text;
S2: according to a preset entity dictionary, searching the original text for entity words not included in the entity dictionary to form a test corpus;
S3: training a preset two-layer neural network extraction model on the test corpus;
S4: according to the preset two-layer neural network extraction model and the test corpus, predicting new entity words and updating the new entity words into the preset entity dictionary.
Taking the analysis of a news website as an example, the text published on the network platform is crawled and then pre-processed to form the test corpus. The test corpus is input into the preset two-layer neural network extraction model, which learns text features from the test corpus automatically: the first hidden layer extracts the features of each word, the second hidden layer extracts features from the word window and treats them as a series of local and global structures, and the parameters of the model are trained by the back-propagation algorithm. With the trained model, words and network neologisms not covered by the preset entity dictionary can be identified, improving the accuracy and efficiency of text entity extraction. At the same time, the entity dictionary can be expanded automatically, ensuring the accuracy and comprehensiveness of big-data analysis based on the text entity extraction method.
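The two-hidden-layer structure just described (per-word features in the first hidden layer, word-window features in the second, a softmax output trained by back-propagation) can be sketched as a forward pass. The layer sizes, the tanh activations, and the random parameters below are illustrative assumptions, not values from the patent:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, h1, h2, n_labels, window = 50, 64, 32, 4, 3  # assumed sizes

W1 = rng.standard_normal((h1, d))             # first hidden layer: per-word features
W2 = rng.standard_normal((h2, h1 * window))   # second hidden layer: word-window features
C = rng.standard_normal((n_labels, h2))       # softmax output parameters

def predict(word_vectors):
    """word_vectors: list of `window` embeddings covering one word window."""
    local = [np.tanh(W1 @ x) for x in word_vectors]  # local (per-word) structure
    z = np.tanh(W2 @ np.concatenate(local))          # global (window) structure
    return softmax(C @ z)                            # distribution over entity labels

probs = predict([rng.standard_normal(d) for _ in range(window)])
```

In training, the gradient of a loss on `probs` would be back-propagated through `C`, `W2`, and `W1`; only the forward pass is shown here.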
In an optional embodiment, the text entity extraction method further includes:
establishing an entity word classification model according to an SVM composite kernel composed of a convolution tree kernel and an entity-feature kernel;
performing classification annotation on the new entity words according to the entity word classification model;
verifying the new entity words according to a preset loss function.
In this embodiment, the preset loss function is constructed to verify the new entity words extracted by the preset two-layer neural network extraction model, avoiding the problem of over-fitting.
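The patent does not specify the preset loss function; one common construction consistent with the stated goal of avoiding over-fitting is cross-entropy with an L2 penalty on the model parameters, sketched here under that assumption:

```python
import numpy as np

def regularized_loss(probs, true_label, params, lam=1e-3):
    """Cross-entropy on the predicted label distribution plus an L2
    penalty on the parameters; the penalty term is what curbs over-fitting."""
    cross_entropy = -np.log(probs[true_label])
    l2_penalty = lam * sum(np.sum(p ** 2) for p in params)
    return cross_entropy + l2_penalty

probs = np.array([0.7, 0.2, 0.1])  # model output for one candidate entity word
loss = regularized_loss(probs, 0, [np.ones((2, 2))])
```

A candidate new entity word whose loss stays high across the corpus would be rejected rather than added to the dictionary; the threshold and `lam` value are design choices, not taken from the patent.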
In an optional embodiment, S3, training the preset two-layer neural network extraction model on the test corpus, specifically includes:
establishing the preset two-layer neural network extraction model according to the Skip-gram algorithm and the Bag-of-words algorithm;
generating joint word vectors according to the test corpus, the characteristic parameters of the Skip-gram algorithm, and the characteristic parameters of the Bag-of-words algorithm;
training the preset two-layer neural network extraction model on the joint word vectors.
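The patent does not say how the Skip-gram and Bag-of-words characteristic parameters are combined into a joint word vector. The simplest assumption, sketched below with random stand-ins for the two trained embedding tables, is per-word concatenation of the two representations:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["entity", "word", "corpus"]

# Stand-ins for vectors trained by the two algorithms (assumed 50-dim each):
# Skip-gram predicts context from the word; Bag-of-words/CBOW the reverse.
skipgram = {w: rng.standard_normal(50) for w in vocab}
cbow = {w: rng.standard_normal(50) for w in vocab}

# Joint word vector: concatenation of both representations.
joint = {w: np.concatenate([skipgram[w], cbow[w]]) for w in vocab}
```

Concatenation lets the downstream network weight the two views of each word itself; averaging or a learned projection would be equally plausible readings of "joint word vector".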
In an optional embodiment, the text entity extraction method further includes:
performing noise-reduction processing on the original text;
performing word segmentation on the noise-reduced original text according to a preset word segmentation model.
In an optional embodiment, performing word segmentation on the noise-reduced original text according to the preset word segmentation model specifically includes:
establishing the preset word segmentation model according to the MMseg segmentation algorithm and a CRF discrimination algorithm;
performing discriminant analysis on ambiguous words in the noise-reduced original text according to the CRF discrimination algorithm of the preset word segmentation model;
performing cutting on the noise-reduced original text according to the MMseg segmentation algorithm of the preset word segmentation model.
In this embodiment, establishing a preset word segmentation model that combines the MMseg segmentation algorithm with a CRF discrimination algorithm resolves the ambiguity problems that arise during text segmentation, reduces the training time of the preset word segmentation model, and improves the segmentation rate.
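The core rule of the MMseg segmentation algorithm is maximum (longest) matching against a dictionary. The toy forward-maximum-matching segmenter below illustrates only the cutting step; the dictionary and sentence are made up, MMseg's additional tie-breaking rules are omitted, and the CRF pass for ambiguous words is not shown:

```python
def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the
    longest dictionary word; fall back to a single character."""
    out, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary or length == 1:
                out.append(text[i:i + length])
                i += length
                break
    return out

dictionary = {"研究", "研究生", "生命", "起源"}
tokens = max_match("研究生命起源", dictionary)  # greedy: 研究生 / 命 / 起源
```

The greedy result 研究生 / 命 / 起源 ("graduate student / life / origin" instead of the intended "research / life / origin") is exactly the kind of segmentation ambiguity the CRF discrimination step is meant to resolve.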
In an optional embodiment, searching the original text for entity words not included in the entity dictionary according to the preset entity dictionary to form the test corpus specifically includes:
according to the preset entity dictionary, identifying the primary entity words in the original text that the entity dictionary contains;
according to the primary entity words, performing syntactic analysis, contextual analysis, and probability statistics on the original text to obtain the entity words not included in the entity dictionary, and forming the test corpus.
For example, from the primary entity word "mansion", syntactic analysis, contextual analysis, and probability statistics can extract "xx mansion".
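A minimal sketch of how a known primary entity word can seed new candidates: scan for modifier-plus-known-entity-word bigrams that are not yet in the dictionary. The capitalization heuristic, the example sentence, and the name "Jinmao" are all made up for illustration; the patent's actual syntactic, contextual, and statistical analysis is far richer:

```python
import re

entity_dict = {"mansion"}  # toy dictionary containing one primary entity word

def candidates(text, entity_dict):
    """Collect capitalized-modifier + known-entity-word bigrams that are
    not yet in the dictionary (a crude stand-in for the syntactic,
    contextual, and statistical analysis described in the patent)."""
    found = set()
    words = re.findall(r"\w+", text)
    for prev, cur in zip(words, words[1:]):
        if cur in entity_dict and prev[0].isupper():
            phrase = f"{prev} {cur}"
            if phrase not in entity_dict:
                found.add(phrase)
    return found

found = candidates("The Jinmao mansion opened near another mansion.", entity_dict)
```

In a real pipeline, each candidate would then be scored by frequency and context statistics before joining the test corpus.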
In an optional embodiment, in the preset two-layer neural network extraction model, X_n is the joint word vector, y_n is the predicted new entity word, and N is the size of the test corpus; C is the parameter of the softmax function, and A is the pre-trained word-vector matrix.
In an optional embodiment, in the entity word classification model, λ is a weight coefficient with 0 < λ < 1; E1 and E2 are two new entity words; SFT is the shortest-path inclusion tree; CTK is the convolution tree kernel; Equal is the entity-feature kernel; E1.C_i is the i-th category feature of entity word E1 and E2.C_i is the i-th category feature of entity word E2 (when E1 belongs to the i-th category, E1.C_i is 1, and otherwise 0); when E1.C_i and E2.C_i are both 1, the corresponding value of Equal is 1, and otherwise 0; M is the number of categories.
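The classification model's composite kernel weighs a convolution tree kernel (CTK, computed over the shortest-path inclusion trees) against the entity-feature kernel Equal, which counts the categories shared by the two entity words. Below is a sketch under the assumption that Equal sums the per-category indicator products over the M categories and that the tree-kernel value is supplied externally (stubbed as a plain number here):

```python
def equal_kernel(c1, c2):
    """Entity-feature kernel: number of categories i where both
    E1.C_i and E2.C_i are 1 (c1, c2 are binary category vectors)."""
    return sum(a * b for a, b in zip(c1, c2))

def composite_kernel(tree_sim, c1, c2, lam=0.6):
    """K = lam * CTK + (1 - lam) * Equal, with weight 0 < lam < 1.
    `tree_sim` stands in for the convolution tree kernel value."""
    return lam * tree_sim + (1 - lam) * equal_kernel(c1, c2)

# Two entity words sharing 2 of 3 categories, with an assumed tree similarity:
k = composite_kernel(tree_sim=0.5, c1=[1, 0, 1], c2=[1, 1, 1])
```

Such a combined kernel could be plugged into any kernel SVM trainer that accepts a user-supplied kernel function; λ trades structural (tree) similarity against category overlap.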
Referring to Fig. 2, a schematic diagram of a text entity extraction system provided by an embodiment of the present invention, the text entity extraction system includes:
a text acquisition module 1 for acquiring original text;
a test corpus generation module 2 for searching the original text, according to a preset entity dictionary, for entity words not included in the entity dictionary to form a test corpus;
a model training module 3 for training a preset two-layer neural network extraction model on the test corpus;
an entity word prediction module 4 for predicting new entity words according to the preset two-layer neural network extraction model and the test corpus, and updating the new entity words into the preset entity dictionary.
In an optional embodiment, the text entity extraction system further includes:
a classification model establishment module for establishing an entity word classification model according to an SVM composite kernel composed of a convolution tree kernel and an entity-feature kernel;
a classification annotation module for performing classification annotation on the new entity words according to the entity word classification model;
an entity word verification module for verifying the new entity words according to a preset loss function.
In an optional embodiment, the model training module includes:
a neural network model establishment module for establishing the preset two-layer neural network extraction model according to the Skip-gram algorithm and the Bag-of-words algorithm;
a training word vector generation module for generating joint word vectors according to the test corpus, the characteristic parameters of the Skip-gram algorithm, and the characteristic parameters of the Bag-of-words algorithm;
a neural network model training module for training the preset two-layer neural network extraction model on the joint word vectors.
In an optional embodiment, the text entity extraction system further includes:
a text noise-reduction module for performing noise-reduction processing on the original text;
a text word segmentation module for performing word segmentation on the noise-reduced original text according to a preset word segmentation model.
In an optional embodiment, the text word segmentation module includes:
a word segmentation model establishment module for establishing the preset word segmentation model according to the MMseg segmentation algorithm and a CRF discrimination algorithm;
an ambiguity analysis module for performing discriminant analysis on ambiguous words in the noise-reduced original text according to the CRF discrimination algorithm of the preset word segmentation model;
a text cutting module for performing cutting on the noise-reduced original text according to the MMseg segmentation algorithm of the preset word segmentation model.
In an optional embodiment, the test corpus generation module includes:
a primary entity word identification module for identifying, according to the preset entity dictionary, the primary entity words in the original text that the entity dictionary contains;
a test corpus analysis module for performing, according to the primary entity words, syntactic analysis, contextual scene analysis, and probability statistics on the original text to obtain the entity words not included in the entity dictionary and form the test corpus.
The above are preferred embodiments of the present invention. It should be noted that those skilled in the art may make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to fall within the protection scope of the present invention.
Claims (10)
1. A text entity extraction method, characterized by including:
acquiring original text;
according to a preset entity dictionary, searching the original text for entity words not included in the entity dictionary to form a test corpus;
training a preset two-layer neural network extraction model on the test corpus;
according to the preset two-layer neural network extraction model and the test corpus, predicting new entity words and updating the new entity words into the preset entity dictionary.
2. The text entity extraction method of claim 1, characterized by further including:
establishing an entity word classification model according to an SVM composite kernel composed of a convolution tree kernel and an entity-feature kernel;
performing classification annotation on the new entity words according to the entity word classification model;
verifying the new entity words according to a preset loss function.
3. The text entity extraction method of claim 1, characterized in that training the preset two-layer neural network extraction model on the test corpus specifically includes:
establishing the preset two-layer neural network extraction model according to the Skip-gram algorithm and the Bag-of-words algorithm;
generating joint word vectors according to the test corpus, the characteristic parameters of the Skip-gram algorithm, and the characteristic parameters of the Bag-of-words algorithm;
training the preset two-layer neural network extraction model on the joint word vectors.
4. The text entity extraction method of claim 1, characterized by further including:
performing noise-reduction processing on the original text;
performing word segmentation on the noise-reduced original text according to a preset word segmentation model.
5. The text entity extraction method of claim 4, characterized in that performing word segmentation on the noise-reduced original text according to the preset word segmentation model specifically includes:
establishing the preset word segmentation model according to the MMseg segmentation algorithm and a CRF discrimination algorithm;
performing discriminant analysis on ambiguous words in the noise-reduced original text according to the CRF discrimination algorithm of the preset word segmentation model;
performing cutting on the noise-reduced original text according to the MMseg segmentation algorithm of the preset word segmentation model.
6. The text entity extraction method of claim 1, characterized in that searching the original text for entity words not included in the entity dictionary according to the preset entity dictionary to form the test corpus specifically includes:
according to the preset entity dictionary, identifying the primary entity words in the original text that the entity dictionary contains;
according to the primary entity words, performing syntactic analysis, contextual analysis, and probability statistics on the original text to obtain the entity words not included in the entity dictionary and form the test corpus.
7. The text entity extraction method of claim 3, characterized in that, in the preset two-layer neural network extraction model, X_n is the joint word vector, y_n is the predicted new entity word, and N is the size of the test corpus; C is the parameter of the softmax function, and A is the pre-trained word-vector matrix.
8. The text entity extraction method of claim 2, characterized in that, in the entity word classification model, λ is a weight coefficient with 0 < λ < 1; E1 and E2 are two new entity words; SFT is the shortest-path inclusion tree; CTK is the convolution tree kernel; Equal is the entity-feature kernel; E1.C_i is the i-th category feature of entity word E1 and E2.C_i is the i-th category feature of entity word E2 (when E1 belongs to the i-th category, E1.C_i is 1, and otherwise 0); when E1.C_i and E2.C_i are both 1, the corresponding value of Equal is 1, and otherwise 0; M is the number of categories.
9. A text entity extraction system, characterized by including:
a text acquisition module for acquiring original text;
a test corpus generation module for searching the original text, according to a preset entity dictionary, for entity words not included in the entity dictionary to form a test corpus;
a model training module for training a preset two-layer neural network extraction model on the test corpus;
an entity word prediction module for predicting new entity words according to the preset two-layer neural network extraction model and the test corpus, and updating the new entity words into the preset entity dictionary.
10. The text entity extraction system of claim 9, characterized by further including:
a classification model establishment module for establishing an entity word classification model according to an SVM composite kernel composed of a convolution tree kernel and an entity-feature kernel;
a classification annotation module for performing classification annotation on the new entity words according to the entity word classification model;
an entity word verification module for verifying the new entity words according to a preset loss function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711450896.3A CN108170678A (en) | 2017-12-27 | 2017-12-27 | A kind of text entities abstracting method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711450896.3A CN108170678A (en) | 2017-12-27 | 2017-12-27 | A kind of text entities abstracting method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108170678A true CN108170678A (en) | 2018-06-15 |
Family
ID=62518844
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711450896.3A Pending CN108170678A (en) | 2017-12-27 | 2017-12-27 | A kind of text entities abstracting method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108170678A (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103544165A (en) * | 2012-07-12 | 2014-01-29 | 腾讯科技(深圳)有限公司 | Neologism mining method and system |
CN104361010A (en) * | 2014-10-11 | 2015-02-18 | 北京中搜网络技术股份有限公司 | Automatic classification method for correcting news classification |
CN105447206A (en) * | 2016-01-05 | 2016-03-30 | 深圳市中易科技有限责任公司 | New comment object identifying method and system based on word2vec algorithm |
CN106033462A (en) * | 2015-03-19 | 2016-10-19 | 科大讯飞股份有限公司 | Neologism discovering method and system |
CN106570179A (en) * | 2016-11-10 | 2017-04-19 | 中国科学院信息工程研究所 | Evaluative text-oriented kernel entity identification method and apparatus |
CN106649250A (en) * | 2015-10-29 | 2017-05-10 | 北京国双科技有限公司 | Method and device for identifying emotional new words |
US20170147910A1 (en) * | 2015-10-02 | 2017-05-25 | Baidu Usa Llc | Systems and methods for fast novel visual concept learning from sentence descriptions of images |
CN107092596A (en) * | 2017-04-24 | 2017-08-25 | 重庆邮电大学 | Text emotion analysis method based on attention CNNs and CCR |
CN107301246A (en) * | 2017-07-14 | 2017-10-27 | 河北工业大学 | Chinese Text Categorization based on ultra-deep convolutional neural networks structural model |
CN107480128A (en) * | 2017-07-17 | 2017-12-15 | 广州特道信息科技有限公司 | The segmenting method and device of Chinese text |
CN107480197A (en) * | 2017-07-17 | 2017-12-15 | 广州特道信息科技有限公司 | Entity word recognition method and device |
Non-Patent Citations (2)
Title |
---|
李慧: "词典与统计相结合的傣文分词方法与实现", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
陈鹏: "基于多核融合的中文领域实体关系抽取研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110134952A (en) * | 2019-04-29 | 2019-08-16 | 华南师范大学 | A kind of Error Text rejection method for identifying, device and storage medium |
CN110134952B (en) * | 2019-04-29 | 2020-03-31 | 华南师范大学 | Error text rejection method, device and storage medium |
CN110941697A (en) * | 2019-11-12 | 2020-03-31 | 清华大学 | Method and system for detecting unrecorded terms |
CN110941697B (en) * | 2019-11-12 | 2023-08-08 | 清华大学 | Method and system for detecting unrecorded terms |
CN111324745A (en) * | 2020-02-18 | 2020-06-23 | 深圳市一面网络技术有限公司 | Word stock generation method and device |
CN111611799A (en) * | 2020-05-07 | 2020-09-01 | 北京智通云联科技有限公司 | Dictionary and sequence labeling model based entity attribute extraction method, system and equipment |
CN111611799B (en) * | 2020-05-07 | 2023-06-02 | 北京智通云联科技有限公司 | Entity attribute extraction method, system and equipment based on dictionary and sequence labeling model |
CN111950283A (en) * | 2020-07-31 | 2020-11-17 | 合肥工业大学 | Chinese word segmentation and named entity recognition system for large-scale medical text mining |
CN112487807A (en) * | 2020-12-09 | 2021-03-12 | 重庆邮电大学 | Text relation extraction method based on expansion gate convolution neural network |
CN112487807B (en) * | 2020-12-09 | 2023-07-28 | 重庆邮电大学 | Text relation extraction method based on expansion gate convolutional neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108170678A (en) | A kind of text entities abstracting method and system | |
CN108595708A (en) | A kind of exception information file classification method of knowledge based collection of illustrative plates | |
CN108376131A (en) | Keyword abstraction method based on seq2seq deep neural network models | |
CN110909164A (en) | Text enhancement semantic classification method and system based on convolutional neural network | |
CN106951438A (en) | A kind of event extraction system and method towards open field | |
CN110347894A (en) | Knowledge mapping processing method, device, computer equipment and storage medium based on crawler | |
Rios-Alvarado et al. | Learning concept hierarchies from textual resources for ontologies construction | |
CN104809176A (en) | Entity relationship extracting method of Zang language | |
CN106599032A (en) | Text event extraction method in combination of sparse coding and structural perceptron | |
CN107679110A (en) | The method and device of knowledge mapping is improved with reference to text classification and picture attribute extraction | |
CN110457404A (en) | Social media account-classification method based on complex heterogeneous network | |
CN103886020B (en) | A kind of real estate information method for fast searching | |
CN106919652A (en) | Short-sighted frequency automatic marking method and system based on multi-source various visual angles transductive learning | |
Kausar et al. | ProSOUL: a framework to identify propaganda from online Urdu content | |
CN108304373A (en) | Construction method, device, storage medium and the electronic device of semantic dictionary | |
CN105843796A (en) | Microblog emotional tendency analysis method and device | |
Hassan et al. | Sentiment analysis from images of natural disasters | |
Sherkat et al. | Vector embedding of wikipedia concepts and entities | |
CN106503256B (en) | A kind of hot information method for digging based on social networks document | |
CN105869058B (en) | A kind of method that multilayer latent variable model user portrait extracts | |
CN110287341A (en) | A kind of data processing method, device and readable storage medium storing program for executing | |
CN109472022A (en) | New word identification method and terminal device based on machine learning | |
CN109472008A (en) | A kind of Text similarity computing method, apparatus and electronic equipment | |
Amina et al. | SCANCPECLENS: A framework for automatic lexicon generation and sentiment analysis of micro blogging data on China Pakistan economic corridor | |
CN114997288A (en) | Design resource association method |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
2022-08-09 | AD01 | Patent right deemed abandoned | Effective date of abandoning: 2022-08-09 |