CN107807917A

CN107807917A - Method for extracting content of text, device, system and storage medium

Info

Publication number: CN107807917A
Application number: CN201710896296.3A
Authority: CN
Inventors: 刘克亮
Original assignee: Wind Change Technology (shenzhen) Co Ltd
Current assignee: Wind Change Technology (shenzhen) Co Ltd
Priority date: 2017-09-27
Filing date: 2017-09-27
Publication date: 2018-03-16

Abstract

The invention discloses a text content extraction method, device, system and storage medium. The method comprises: receiving a text content extraction request sent by an editor terminal, and sending a text content extraction page to the editor terminal; receiving book information sent by the editor terminal through the text content extraction page, the book information comprising a book category, a book title and an author; and, according to the book information, querying a book database, extracting the target text content of the book using semantic analysis and preset content extraction rules, and transmitting it to the editor terminal. Through the interaction between an intelligent terminal and a server, the invention semi-automates the extraction of target text content: while ensuring that the extracted target text content is accurate, it improves extraction efficiency and saves time and labor costs.

Description

Method for extracting content of text, device, system and storage medium

Technical field

The present invention relates to the field of natural language processing, and in particular to a text content extraction method, device, system and storage medium.

Background technology

As teaching platforms grow in number and mature, people are increasingly willing to pay for online education. With the rapid development of mobile terminals, mobile phones, computers and the like have become necessities of daily life, and online reading has become a hobby and a habit. To provide users with high-quality reading resources, platforms, readers, apps and the like rely largely on manual work to screen and identify the resources offered to users, so as to present the best and most valuable content. In a commercial setting, however, selecting the essential content of a book solely by manually reading, or even closely reading, the full text is accurate but inefficient, and incurs huge time and labor costs.

Summary of the invention

The technical problem to be solved by an embodiment of the present invention is to provide a text content extraction method, device, system and storage medium that semi-automate the extraction of target text content, improving extraction efficiency and saving time and labor costs while ensuring that the extracted target text content is accurate.

To solve the above technical problem, an embodiment of the present invention provides a text content extraction method, comprising the following steps:

receiving a text content extraction request sent by an editor terminal, and sending a text content extraction page to the editor terminal;

receiving book information sent by the editor terminal through the text content extraction page, the book information comprising a book category, a book title and an author;

according to the book information, querying a book database, extracting the target text content of the book using semantic analysis and preset content extraction rules, and transmitting the target text content to the editor terminal.

Preferably, querying the book database according to the book information, extracting the target text content of the book using semantic analysis and preset content extraction rules, and transmitting it to the editor terminal specifically comprises:

querying the book database according to the book category, book title and author to obtain the text content of the book;

performing semantic analysis on the text content data of the book to be extracted, and matching the semantic analysis result against the corresponding content extraction rules in a rule base;

if the match succeeds, extracting the target text content from the text content of the book using the matched content extraction rule, and transmitting the extracted target text content to the editor terminal;

if the match fails, recording the semantic analysis result, establishing a new content extraction rule, and adding the newly established content extraction rule to the rule base.

Preferably, performing semantic analysis on the text content data of the book to be extracted comprises: performing word segmentation and part-of-speech tagging on the text content data of the book; performing entity annotation on the word segmentation result; and building the association relations between the words in the data. The entity annotation includes person-name annotation, time annotation and number annotation.

Preferably, performing entity annotation on the word segmentation result is specifically:

using a conditional random field model, on the basis of the word segmentation and part-of-speech tagging obtained for the text content of the book through machine learning, and also using the context of the text content, the parts of speech of the preceding and following words, and the length of the words, to further perform entity annotation on the text content of the book.

Preferably, the content extraction rules are obtained by training and analysis on selected samples of book text content, keywords, and the grammatical relations associated with the keywords; the rule base is established according to the text content of books and the semantic analysis.

An embodiment of the present invention further provides a text content extraction device, comprising:

a text content extraction request receiving unit, configured to receive a text content extraction request sent by an editor terminal and to send a text content extraction page to the editor terminal;

a text content extraction unit, configured to receive book information sent by the editor terminal through the text content extraction page and, according to the book information, to query a book database, extract the target text content of the book using semantic analysis and preset content extraction rules, and transmit the target text content to the editor terminal; the book information comprises a book category, a book title and an author.

An embodiment of the present invention further provides a text content extraction device, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor; when the processor executes the computer program, the text content extraction method described above is implemented.

An embodiment of the present invention further provides a storage medium comprising a stored computer program, wherein, when the computer program runs, the device on which the storage medium resides is controlled to perform the text content extraction method described above.

An embodiment of the present invention further provides a text content extraction system, comprising an editor terminal and a server;

the editor terminal is configured to send a text content extraction request to the server;

the server is configured to send a text content extraction page to the editor terminal in response to the text content extraction request;

the editor terminal is further configured to obtain the book information selected by the user on the text content extraction page and to send it to the server; the book information comprises a book category, a book title and an author;

the server is further configured, according to the book information, to query a book database, extract the target text content of the book using semantic analysis and preset content extraction rules, and transmit the target text content to the editor terminal.

Implementing the embodiments of the present invention has the following beneficial effects:

With the text content extraction method, device, system and storage medium of the present invention, a text content extraction request sent by an editor terminal is received and a text content extraction page is sent to the editor terminal; book information sent by the editor terminal through the text content extraction page is received, the book information comprising a book category, a book title and an author; and, according to the book information, a book database is queried, the target text content of the book is extracted using semantic analysis and preset content extraction rules, and the result is transmitted to the editor terminal. The editor in charge can browse the preliminarily extracted text content that the server sends to the editor terminal and judge whether the book merits close reading. Through the interaction between an intelligent terminal and a server, the invention semi-automates the extraction of target text content: while ensuring that the extracted target text content is accurate, it improves extraction efficiency and saves time and labor costs.

Brief description of the drawings

To illustrate the technical solutions of the present invention more clearly, the drawings needed for the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a text content extraction method provided by an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a text content extraction device provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Referring to Fig. 1, Fig. 1 is a schematic flowchart of a text content extraction method provided by an embodiment of the present invention.

The text content extraction method provided by an embodiment of the present invention may be performed by a server, and the description below takes the server as the executing entity.

The text content extraction method comprises the following steps:

S101: receive the text content extraction request sent by the editor terminal, and send the text content extraction page to the editor terminal;

In an embodiment of the present invention, the editor terminal may be an intelligent terminal such as a smartphone or a PC, and the text content extraction page may be a reader app page, a WeChat mini program page, a WeChat official account page, or the like. Taking a WeChat official account as an example, the data exchange between the editor terminal and the server uses the official account page, the official account editing page, or the editing page of another platform as the presentation layer. After entering the editing page, the editor in charge clicks the text editing option; the editor terminal immediately sends a text content extraction request to the server, and the server responds to the request by returning the text content extraction page to the editor terminal.

S102: receive the book information sent by the editor terminal through the text content extraction page; the book information comprises a book category, a book title and an author;

In an embodiment of the present invention, the editor in charge can carry out the text content extraction operation on the text content extraction page that the server returns to the editor terminal, for example by choosing, from a massive collection of books, the scope or category of books to browse (such as the finance and economics, financial, or investment categories) and a specific book. The book information of the selected book to be extracted, including the book category, book title and author, is then sent to the server, which carries out the next extraction step.

S103: according to the book information, query the book database, extract the target text content of the book using semantic analysis and preset content extraction rules, and transmit it to the editor terminal.

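In an embodiment of the present invention, preferably, querying the book database according to the book information, extracting the target text content of the book using semantic analysis and preset content extraction rules, and transmitting it to the editor terminal specifically comprises: querying the book database according to the book category, book title and author to obtain the text content of the book; performing semantic analysis on the text content data of the book to be extracted, and matching the semantic analysis result against the corresponding content extraction rules in a rule base; if the match succeeds, extracting the target text content with the matched rule and transmitting it to the editor terminal; if the match fails, recording the semantic analysis result, establishing a new content extraction rule, and adding it to the rule base.

In an embodiment of the present invention, preferably, performing semantic analysis on the text content data of the book to be extracted comprises: performing word segmentation and part-of-speech tagging on the text content data of the book; performing entity annotation on the word segmentation result; and building the association relations between the words in the data. The entity annotation includes person-name annotation, time annotation and number annotation.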

Specifically, the processing in an embodiment of the present invention is as follows.

Content extraction rules are obtained by training and analysis on selected samples of book text content, keywords, and the grammatical relations associated with the keywords, and a rule base is established according to the text content of books and the semantic analysis:

First step: the text content of the book is first segmented into words and tagged with parts of speech, which supports the subsequent entity annotation and the building of association relations between the words in the data. This step requires only common natural language processing techniques; models based on statistics or on machine learning can perform the word segmentation and part-of-speech tagging of the text content. For example, segmenting and tagging the sentence "Kenichi Ohmae proposes 3 main points that can touch people's hearts ..." gives "Kenichi Ohmae/n, proposes/v, 3/num, (measure word)/uj, can touch/v, people's hearts/adj, main points/n ...", where /x is the part-of-speech tag: for example, n marks a noun and v marks a verb.
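The first step can be illustrated with dictionary-based forward maximum matching, one of the simplest segmentation techniques. This is only a minimal sketch and not the method the patent prescribes: the lexicon, the function name `segment_and_tag` and the Chinese input sentence (an assumed reconstruction of the example sentence above) are all hypothetical, and the tags simply follow the example in the text.

```python
# Hypothetical toy lexicon: word -> part-of-speech tag.
# A real system would use a large dictionary or a trained statistical model.
LEXICON = {
    "大前研一": "n",
    "提出": "v",
    "个": "uj",
    "能打动": "v",
    "人心": "adj",
    "的": "uj",
    "要点": "n",
}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment_and_tag(text):
    """Segment text by forward maximum matching and attach a POS tag to each word."""
    tokens = []
    i = 0
    while i < len(text):
        # Runs of digits are treated as a single numeral token.
        if text[i].isdigit():
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append((text[i:j], "num"))
            i = j
            continue
        # Try the longest candidate first, then shrink until a lexicon entry matches.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in LEXICON:
                tokens.append((candidate, LEXICON[candidate]))
                i += size
                break
        else:
            # Unknown character: emit it alone with a placeholder tag.
            tokens.append((text[i], "x"))
            i += 1
    return tokens


for word, tag in segment_and_tag("大前研一提出3个能打动人心的要点"):
    print(f"{word}/{tag}")
```

Forward maximum matching is greedy and cannot resolve ambiguous segmentations, which is one reason statistical or machine-learning models are normally preferred for this step.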

Second step: entity annotation is performed on the word segmentation result, for example person-name annotation, time annotation, number annotation and verb annotation. Time annotation and number annotation are simpler than the others: a slightly more elaborate regular expression is enough to detect times and numbers and annotate them as entities. Person-name annotation and verb annotation preferably use a conditional random field model. Specifically, the conditional random field model is trained on a large corpus using the word segmentation and part-of-speech tagging obtained for the text content of the book through machine learning, together with the context of the text content, the parts of speech of the preceding and following words, and the length of the words; the words in the text content are then given the various entity annotations according to the training result.
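The two halves of the second step can be sketched as follows. The first function annotates times and numbers with regular expressions; the second builds, for one token, the kind of features the text names (context, part of speech of the neighboring words, word length) that a conditional random field toolkit could be trained on. The patterns, label names and function names are assumptions made for illustration, and no CRF is actually trained here.

```python
import re

# Hypothetical patterns: ISO-style dates or clock times, and plain numbers or percentages.
TIME_PATTERN = re.compile(r"\d{4}-\d{1,2}-\d{1,2}|\d{1,2}:\d{2}")
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?%?")


def annotate_time_and_number(tokens):
    """Label each token as TIME, NUMBER or O (no entity) using regular expressions."""
    annotated = []
    for token in tokens:
        if TIME_PATTERN.fullmatch(token):
            label = "TIME"
        elif NUMBER_PATTERN.fullmatch(token):
            label = "NUMBER"
        else:
            label = "O"
        annotated.append((token, label))
    return annotated


def token_features(tagged_tokens, i):
    """Build a feature dict for token i of a list of (word, pos) pairs."""
    word, pos = tagged_tokens[i]
    features = {"word": word, "pos": pos, "length": len(word)}
    if i > 0:
        # Context to the left: previous word and its part of speech.
        features["prev_word"], features["prev_pos"] = tagged_tokens[i - 1]
    else:
        features["BOS"] = True  # token starts the sentence
    if i < len(tagged_tokens) - 1:
        # Context to the right: next word and its part of speech.
        features["next_word"], features["next_pos"] = tagged_tokens[i + 1]
    else:
        features["EOS"] = True  # token ends the sentence
    return features


print(annotate_time_and_number(["proposes", "3", "2017-09-27", "10:30", "12.5%"]))
print(token_features([("Ohmae", "n"), ("proposes", "v"), ("3", "num")], 1))
```

Per-token feature dictionaries of this kind are a common input format for CRF toolkits; the person-name and verb labels would come from a model trained on an annotated corpus rather than from hand-written rules.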

It should be noted that a conditional random field is a discriminative probabilistic model and a kind of random field, commonly used to label or analyze sequence data such as natural-language text or biological sequences. Like a Markov random field, a conditional random field is an undirected graphical model: the vertices of the graph represent random variables, and the edges between vertices represent dependencies between those variables. In a conditional random field, the distribution of the random variable Y is a conditional probability, conditioned on the observed random variable X. In principle the graph structure of a conditional random field can be arbitrary; the commonly used structure is a linear chain, for which efficient algorithms exist for training, inference and decoding.

Conditional random fields are used for lexical analysis tasks such as Chinese word segmentation and part-of-speech tagging. General sequence classification often uses the hidden Markov model (HMM), for example in class-based Chinese word segmentation. The hidden Markov model, however, rests on two assumptions: the output independence assumption and the Markov assumption. The output independence assumption requires the sequence data to be strictly mutually independent for the derivation to be correct, whereas in fact most sequence data cannot be represented as a series of independent events. A conditional random field instead uses a probabilistic graphical model that can express long-distance dependencies and overlapping features, handles problems such as label (classification) bias better, and normalizes over all features globally, so that a globally optimal solution can be obtained.

Third step: the association relations between the words in the data are then built, that is, the dependencies and associations between the words in the text content. Commonly used and relatively mature models for this include neural networks, maximum entropy models and conditional random fields. This step builds the grammatical relations that hold between words or keywords, such as verb-object relations and modifier relations.
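To make the output of the third step concrete, the sketch below represents relations as labeled links between token positions. The text names trained models (neural networks, maximum entropy, conditional random fields) for this step; the hand-written heuristics here are a deliberately simple stand-in used only to show the data structure, and the `Relation` class, `build_relations` function and sample tokens are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    head: int       # index of the governing word
    dependent: int  # index of the dependent word
    kind: str       # e.g. "verb-object" or "modifier"


def _next_index(tagged_tokens, start, wanted_pos):
    """Return the index of the first token after `start` whose POS is in wanted_pos."""
    for j in range(start + 1, len(tagged_tokens)):
        if tagged_tokens[j][1] in wanted_pos:
            return j
    return None


def build_relations(tagged_tokens):
    """Toy heuristic: verbs govern the next numeral or noun; adjectives modify the next noun."""
    relations = []
    for i, (_, pos) in enumerate(tagged_tokens):
        if pos == "v":
            j = _next_index(tagged_tokens, i, {"num", "n"})
            if j is not None:
                relations.append(Relation(head=i, dependent=j, kind="verb-object"))
        elif pos == "adj":
            j = _next_index(tagged_tokens, i, {"n"})
            if j is not None:
                relations.append(Relation(head=j, dependent=i, kind="modifier"))
    return relations


tokens = [("Ohmae", "n"), ("proposes", "v"), ("3", "num"), ("moving", "adj"), ("points", "n")]
for relation in build_relations(tokens):
    print(tokens[relation.head][0], relation.kind, tokens[relation.dependent][0])
```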

Fourth step: various text content extraction rules are established from the grammatical results of the third step and saved to the rule base. For example, a text content extraction rule is established as follows. In "Kenichi Ohmae proposes 3 main points that can touch people's hearts ...", the keyword "Kenichi Ohmae" carries a person-name annotation; the keyword "proposes" is a verb, and the numeral associated with it through the verb-object relation is "3"; the keyword "touch people's hearts" is associated through the modifier relation with the noun "main points" ...; the sentence "Kenichi Ohmae proposes 3 main points that can touch people's hearts ..." can therefore be extracted. By analogy, various content extraction rules are extracted from a large number of data samples to establish the rule base.

It should be noted that once the rule base has been established, key content can be extracted from the text content of books. Semantic analysis is performed on the text content data of the book to be extracted, and the semantic analysis result is matched against the corresponding content extraction rules in the rule base. If the match succeeds, the target text content is extracted from the text content of the book using the matched content extraction rule and transmitted to the editor terminal. If the match fails, the semantic analysis result is recorded, a new content extraction rule is established, and the newly established rule is added to the rule base.
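The match-or-learn loop described above might look like the following sketch. It assumes, purely for illustration, that a semantic analysis result can be reduced to a tuple "signature" (here a sequence of part-of-speech tags) and that a rule is keyed by that signature. The text does not say how a new rule is derived after a failed match, so the sketch simply registers the unmatched signature as a new rule; `RuleBase` and `extract_sentences` are hypothetical names.

```python
class RuleBase:
    """Hypothetical rule base: maps an analysis signature to a rule name."""

    def __init__(self):
        self._rules = {}
        self.unmatched_log = []  # semantic analysis results that matched no rule

    def add_rule(self, signature, name):
        self._rules[signature] = name

    def match(self, signature):
        """Return the name of the matching rule, or None if there is none."""
        return self._rules.get(signature)


def extract_sentences(analysed_sentences, rule_base):
    """Return the sentences whose analysis matches a rule; learn from the rest.

    analysed_sentences is a list of (sentence_text, signature) pairs.
    """
    extracted = []
    for text, signature in analysed_sentences:
        if rule_base.match(signature) is not None:
            # Match succeeded: the sentence is target text content.
            extracted.append(text)
        else:
            # Match failed: record the analysis result and register a new rule
            # (assumption: the signature itself becomes the new rule's pattern).
            rule_base.unmatched_log.append(signature)
            rule_base.add_rule(signature, f"auto-{len(rule_base.unmatched_log)}")
    return extracted


rule_base = RuleBase()
rule_base.add_rule(("n", "v", "num", "n"), "person-proposes-n-points")
sentences = [
    ("Ohmae proposes 3 points", ("n", "v", "num", "n")),
    ("Ohmae went home", ("n", "v", "n")),
]
print(extract_sentences(sentences, rule_base))
```

In practice a newly recorded analysis result would more plausibly be reviewed or used as training data before becoming a rule; otherwise every sentence pattern would eventually match.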

In the text content extraction method provided by an embodiment of the present invention, a text content extraction request sent by an editor terminal is received and a text content extraction page is sent to the editor terminal; book information sent by the editor terminal through the text content extraction page is received, the book information comprising a book category, a book title and an author; and, according to the book information, a book database is queried, the target text content of the book is extracted using semantic analysis and preset content extraction rules, and the result is transmitted to the editor terminal. The editor in charge can browse the preliminarily extracted text content that the server sends to the editor terminal and judge whether the book merits close reading. Through the interaction between an intelligent terminal and a server, the invention semi-automates the extraction of target text content: while ensuring that the extracted target text content is accurate, it improves extraction efficiency and saves time and labor costs.
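The server-side flow of steps S101 to S103 can be sketched as below. Everything here is a simplified assumption: the book database is an in-memory dictionary, the response shapes are invented, and the extractor is a placeholder standing in for semantic analysis plus rule matching. `BookInfo`, `ExtractionServer` and `sentences_with_digits` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookInfo:
    category: str
    title: str
    author: str


class ExtractionServer:
    def __init__(self, book_database, extract_rules):
        # book_database: dict mapping BookInfo -> full text (stand-in for a real database)
        # extract_rules: callable taking the full text and returning extracted passages
        self.book_database = book_database
        self.extract_rules = extract_rules

    def handle_extraction_request(self):
        """S101: answer an extraction request with a description of the extraction page."""
        return {"page": "text-content-extraction", "fields": ["category", "title", "author"]}

    def handle_book_info(self, info):
        """S102 and S103: look up the book, extract target content, return it."""
        text = self.book_database.get(info)
        if text is None:
            return {"status": "not_found", "content": []}
        return {"status": "ok", "content": self.extract_rules(text)}


def sentences_with_digits(text):
    """Placeholder extractor: keep sentences that contain a digit."""
    return [s.strip() for s in text.split(".") if any(ch.isdigit() for ch in s)]


book = BookInfo("investment", "Sample Title", "Sample Author")
server = ExtractionServer(
    {book: "Intro chapter. The author proposes 3 key points. The end."},
    sentences_with_digits,
)
print(server.handle_extraction_request())
print(server.handle_book_info(book))
```

Keeping the extractor as an injected callable mirrors the split in the text between looking up the book and applying the rule-based extraction to its content.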

Referring to Fig. 2, Fig. 2 is a schematic structural diagram of a text content extraction device provided by an embodiment of the present invention.

A text content extraction request receiving unit 201, configured to receive a text content extraction request sent by an editor terminal and to send a text content extraction page to the editor terminal;

a text content extraction unit 202, configured to receive book information sent by the editor terminal through the text content extraction page and, according to the book information, to query a book database, extract the target text content of the book using semantic analysis and preset content extraction rules, and transmit the target text content to the editor terminal; the book information comprises a book category, a book title and an author.

In the text content extraction device provided by an embodiment of the present invention, the text content extraction request receiving unit 201 receives the text content extraction request sent by the editor terminal and sends the text content extraction page to the editor terminal. The text content extraction unit 202 then receives the book information sent by the editor terminal through the text content extraction page, the book information comprising a book category, a book title and an author. According to the book information, the text content extraction unit 202 queries the book database, extracts the target text content of the book using semantic analysis and preset content extraction rules, and transmits it to the editor terminal. The editor in charge can browse the preliminarily extracted text content that the server sends to the editor terminal and judge whether the book merits close reading. Through the interaction between an intelligent terminal and a server, the invention semi-automates the extraction of target text content: while ensuring that the extracted target text content is accurate, it improves extraction efficiency and saves time and labor costs.

The editor terminal is configured to send a text content extraction request to the server;

In the text content extraction system provided by an embodiment of the present invention, a text content extraction request sent by the editor terminal is received and a text content extraction page is sent to the editor terminal; book information sent by the editor terminal through the text content extraction page is received, the book information comprising a book category, a book title and an author; and, according to the book information, a book database is queried, the target text content of the book is extracted using semantic analysis and preset content extraction rules, and the result is transmitted to the editor terminal. The editor in charge can browse the preliminarily extracted text content that the server sends to the editor terminal and judge whether the book merits close reading. Through the interaction between an intelligent terminal and a server, the invention semi-automates the extraction of target text content: while ensuring that the extracted target text content is accurate, it improves extraction efficiency and saves time and labor costs.

The above is a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications are also regarded as falling within the scope of protection of the present invention.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

Claims

1. A text content extraction method, characterized by comprising the following steps:

receiving a text content extraction request sent by an editor terminal, and sending a text content extraction page to the editor terminal;

receiving book information sent by the editor terminal through the text content extraction page, the book information comprising a book category, a book title and an author;

according to the book information, querying a book database, extracting the target text content of the book using semantic analysis and preset content extraction rules, and transmitting the target text content to the editor terminal.

2. The text content extraction method according to claim 1, characterized in that querying the book database according to the book information, extracting the target text content of the book using semantic analysis and preset content extraction rules, and transmitting it to the editor terminal specifically comprises:

querying the book database according to the book category, book title and author to obtain the text content of the book;

performing semantic analysis on the text content data of the book to be extracted, and matching the semantic analysis result against the corresponding content extraction rules in a rule base;

if the match succeeds, extracting the target text content from the text content of the book using the matched content extraction rule, and transmitting the extracted target text content to the editor terminal;

if the match fails, recording the semantic analysis result, establishing a new content extraction rule, and adding the newly established content extraction rule to the rule base.

3. The text content extraction method according to claim 2, characterized in that performing semantic analysis on the text content data of the book to be extracted comprises: performing word segmentation and part-of-speech tagging on the text content data of the book; performing entity annotation on the word segmentation result; and building the association relations between the words in the data; the entity annotation includes person-name annotation, time annotation and number annotation.

4. The text content extraction method according to claim 3, characterized in that performing entity annotation on the word segmentation result is specifically:

using a conditional random field model, on the basis of the word segmentation and part-of-speech tagging obtained for the text content of the book through machine learning, and also using the context of the text content, the parts of speech of the preceding and following words, and the length of the words, to further perform entity annotation on the text content of the book.

5. The text content extraction method according to any one of claims 1 to 4, characterized in that the content extraction rules are obtained by training and analysis on selected samples of book text content, keywords, and the grammatical relations associated with the keywords; the rule base is established according to the text content of books and the semantic analysis.

6. A text content extraction device, characterized by comprising:

a text content extraction request receiving unit, configured to receive a text content extraction request sent by an editor terminal and to send a text content extraction page to the editor terminal;

a text content extraction unit, configured to receive book information sent by the editor terminal through the text content extraction page and, according to the book information, to query a book database, extract the target text content of the book using semantic analysis and preset content extraction rules, and transmit the target text content to the editor terminal; the book information comprises a book category, a book title and an author.

7. A text content extraction device, characterized by comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the text content extraction method as claimed in claims 1 to 4.

8. A storage medium, characterized in that the storage medium comprises a stored computer program, wherein, when the computer program runs, the device on which the storage medium resides is controlled to perform the text content extraction method according to any one of claims 1 to 4.

9. A text content extraction system, characterized by comprising an editor terminal and a server;

the editor terminal is configured to send a text content extraction request to the server;

the server is configured to send a text content extraction page to the editor terminal in response to the text content extraction request;

the editor terminal is further configured to obtain the book information selected by the user on the text content extraction page and to send it to the server; the book information comprises a book category, a book title and an author;

the server is further configured, according to the book information, to query a book database, extract the target text content of the book using semantic analysis and preset content extraction rules, and transmit the target text content to the editor terminal.