CN109918490A - A kind of content extraction method and device - Google Patents

A content extraction method and device

Info

Publication number
CN109918490A
Authority
CN
China
Prior art keywords
classification
text
extraction
expression formula
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910155040.6A
Other languages
Chinese (zh)
Other versions
CN109918490B (en)
Inventor
任宁
晋耀红
李德彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Tai Yue Xiang Sheng Software Co Ltd
Original Assignee
Anhui Tai Yue Xiang Sheng Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Tai Yue Xiang Sheng Software Co Ltd
Priority to CN201910155040.6A
Publication of CN109918490A
Application granted
Publication of CN109918490B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application provide a content extraction method and device. The method comprises: obtaining the target category of a question according to the classification expressions contained in a question tree; obtaining the target extraction node corresponding to the target category in an extraction tree, and using the extraction expressions contained in the target extraction node to extract target content from a reading text; and post-processing the target content according to the post-processing rule corresponding to the target category to obtain the answer to the question. When the method is applied to machine reading comprehension, it is only necessary to construct the question tree and the extraction tree according to the categories of questions. Once the categories are determined, the question tree and the extraction tree are also relatively fixed and can be used to extract answers from different reading texts, so the method is general and can improve the accuracy of machine reading comprehension.

Description

A content extraction method and device
Technical field
The present application relates to the technical field of natural language processing, and in particular to a content extraction method and device.
Background
Machine reading comprehension is a technical topic that emerged with the development of deep learning. Its research goal is to let a machine read text as a human does and then answer questions based on its understanding of that text. Specifically, given a corpus and a question, the machine finds the correct answer to the question in the corpus.
Current machine reading comprehension is usually implemented with algorithm models based on deep learning: a manually constructed data set annotated with questions and answers is used to train a deep learning model, giving the model the ability to extract answers to some simple questions from simple texts. However, because of the limitations of deep learning algorithms and of data set scale, the accuracy of current deep-learning-based machine reading comprehension methods is not high. For example, in some practical open-domain applications, when the answer to a given question is extracted from an article, a deep learning model achieves an extraction accuracy of only about 60%, far below what a production environment requires. Machine reading comprehension therefore still has considerable room for improvement in accuracy.
Summary of the invention
The embodiments of the present application provide a content extraction method and device to solve the problem that prior-art machine reading comprehension methods extract answers from articles with low accuracy.
In a first aspect, an embodiment of the present application provides a content extraction method, comprising:
obtaining the target category of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node corresponds to one category of questions, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions;
obtaining the target extraction node corresponding to the target category in an extraction tree, and using the extraction expressions contained in the target extraction node to extract target content from a reading text, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one category of questions, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions;
post-processing the target content according to the post-processing rule corresponding to the target category to obtain the answer to the question.
In a second aspect, an embodiment of the present application provides a content extraction device, comprising:
a question matching module, configured to obtain the target category of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node corresponds to one category of questions, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions;
a content extraction module, configured to obtain the target extraction node corresponding to the target category in an extraction tree, and to use the extraction expressions contained in the target extraction node to extract target content from a reading text, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one category of questions, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions;
a post-processing module, configured to post-process the target content according to the post-processing rule corresponding to the target category to obtain the answer to the question.
As can be seen from the above technical solutions, the embodiments of the present application provide a content extraction method and device, comprising: obtaining the target category of a question according to the classification expressions contained in a question tree, where the question tree contains classification nodes, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions; obtaining the target extraction node corresponding to the target category in an extraction tree, and using the extraction expressions contained in the target extraction node to extract target content from a reading text, where the extraction tree contains extraction nodes, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions; and post-processing the target content according to the post-processing rule corresponding to the target category to obtain the answer to the question. When the technical solutions provided by the embodiments of the present application are applied to machine reading comprehension, it is only necessary to construct the question tree and the extraction tree according to the categories of questions. Once the categories are determined, the question tree and the extraction tree are also relatively fixed and can be used to extract answers from different reading texts, so the solution is general and can improve the accuracy of machine reading comprehension.
Brief description of the drawings
To describe the technical solutions of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a flow chart of a content extraction method provided by an embodiment of the present application;
Fig. 2 is a flow chart of a pre-processing method provided by an embodiment of the present application;
Fig. 3 is a flow chart of a post-processing rule provided by an embodiment of the present application;
Fig. 4 is a flow chart of a post-processing rule provided by an embodiment of the present application;
Fig. 5 is a structural schematic diagram of a content extraction device provided by an embodiment of the present application.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative effort shall fall within the scope of protection of the present application.
Machine reading comprehension is a technical topic that emerged with the development of deep learning. Its research goal is to let a machine read text as a human does and then answer questions based on its understanding of that text. Specifically, given a corpus and a question, the machine finds the correct answer to the question in the corpus.
For example, take the following article as the reading text for machine reading comprehension:
The information bureau of the Russian Ministry of Defense announced on the 13th that "Friendship Bridge 2015", the first joint naval exercise between Russia and Egypt, ended that day in the Mediterranean. ... An Egyptian military spokesman had earlier said that this was the largest military exercise ever held between the two countries, aimed at strengthening Egyptian-Russian strategic military and security cooperation and promoting the exchange of military technology between the two sides. According to reports, the Russian vessels taking part in the exercise included the guided-missile cruiser "Moskva", the guided-missile hovercraft "Samum", the large landing ship "Alexander Shabalin" and the deep-sea tug "MB-31". The Egyptian navy sent two frigates, two missile boats and other vessels to take part in the exercise. (Reporter Gao Shan XX)
A question is then given: Who is the author of this article?
The purpose of machine reading comprehension is to find the author of the article in the reading text. For this question, the correct answer is: Gao Shan XX.
Current machine reading comprehension is usually implemented with algorithm models based on deep learning, that is, a manually constructed data set annotated with questions and answers is used to train a deep learning model. As a result, the limitations of the algorithms and of data set scale create an accuracy bottleneck for deep-learning-based machine reading comprehension methods. In practical open-domain applications in particular, the data set scale falls far short of what training a deep learning model requires, so the accuracy of machine reading comprehension is low.
The embodiments of the present application provide a content extraction method and device to solve the problem that prior-art machine reading comprehension methods extract answers from articles with low accuracy.
The following is a method embodiment of the present application, which provides a content extraction method. The method can be applied to a variety of devices, such as servers, personal computers (PCs), tablet computers, mobile phones, smart TVs, smart speakers, virtual reality devices and smart wearable devices.
Fig. 1 is a flow chart of a content extraction method provided by an embodiment of the present application. As shown in Fig. 1, the content extraction method comprises the following steps:
Step S101: obtain the target category of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node corresponds to one category of questions, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions.
Specifically, in the embodiments of the present application, content corresponding to a specified question is extracted from the reading text as the answer. The reading text may be, for example, an article, including a news release, a self-media article, a popular science article, a novel, an essay, a monograph or an article in a professional field. Correspondingly, depending on what the reader mainly needs to understand, questions can be divided into multiple categories, such as: article theme, author, article source, article time, who holds a title, and so on.
Based on the above categories of questions, a question tree can be constructed in the embodiments of the present application. The question tree contains at least one parent node and multiple child nodes one level below the parent node, and the child nodes serve as classification nodes. Each classification node corresponds to one category of questions and contains at least one classification expression list; each row of the classification expression list contains, as a pair, a category name and a corresponding classification expression.
Illustratively, the question tree can take the following form:
Question classification (parent node)
  Author (child node)
  Article theme
  Article source
  Who holds the title
  Article time
Thus, by matching the content of the question against the classification expressions in the question tree, it can be determined which classification expression in the question tree the question matches, and the classification node where that expression is located determines the target category of the question.
Illustratively, if the question is "Please summarize the theme of the article", the classification expression matched in the question tree is [c_summarize+{0,0}c_theme]. Since this expression is located in the child node "article theme", the target category of the question is: article theme.
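The lookup described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: plain Python regular expressions stand in for the classification expression syntax, and the tree contents, row names and the `classify` function are hypothetical.

```python
import re

# Hypothetical question tree: classification node -> rows of (row name, pattern).
# Regular expressions are an assumed stand-in for classification expressions.
QUESTION_TREE = {
    "author": [("author-1", r"who.*(wrote|author)")],
    "article theme": [
        ("theme-1", r"summari[sz]e.*(theme|topic)"),
        ("theme-2", r"what.*about"),
    ],
    "article time": [("time-1", r"when.*(published|written)")],
}

def classify(question):
    """Return the classification node whose expression list matches, else None."""
    text = question.lower()
    for category, rows in QUESTION_TREE.items():
        for _name, pattern in rows:
            if re.search(pattern, text):
                return category
    return None
```

Calling `classify("Please summarize the theme of the article")` walks the nodes in order and stops at the first row that matches, which here belongs to the "article theme" node.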
In addition, as one possible implementation, each row of the classification expression list may also include a check box, which the user can select or clear for a classification expression in order to further modify or delete that expression.
In addition, as one possible implementation, a weight value can be set for each classification expression. The weight value may be, for example, a natural number. When a question matches two or more classification expressions at the same time, the target category of the question is determined by the classification node containing the classification expression with the highest weight value (the largest number).
In addition, as one possible implementation, a recognition state can be set for each classification expression; the recognition state may include, for example, "identify" and "exclude". Specifically, when the recognition state is "identify", the classification expression performs positive matching on the question: if the question matches a classification expression whose recognition state is "identify", the classification node where that expression is located can determine the target category of the question. When the recognition state is "exclude", the classification expression performs negative matching on the question: if the question matches a classification expression whose recognition state is "exclude", the category corresponding to that expression is definitely not the target category of the question.
In addition, as one possible implementation, an enabled state can be set for each classification expression; the enabled state may include, for example, "valid" and "invalid". Specifically, when the enabled state is "valid", the classification expression takes part in matching the question; when the enabled state is "invalid", the classification expression does not take part in matching the question.
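One way the weight, recognition state and enabled state could interact is sketched below. The `ClassExpr` record, its field names and the tie-breaking behaviour (the first node wins on equal weights) are assumptions made for illustration; the source does not specify them.

```python
import re
from dataclasses import dataclass

@dataclass
class ClassExpr:
    """Hypothetical row of a classification expression list."""
    pattern: str           # regex standing in for a classification expression
    weight: int = 1        # natural number; larger wins
    exclude: bool = False  # recognition state: False = identify, True = exclude
    enabled: bool = True   # enabled state: False = skipped entirely

def pick_category(question, tree):
    """Return the category of the highest-weight matching 'identify' expression."""
    best = None  # (weight, category)
    for category, exprs in tree.items():
        hits = [e for e in exprs
                if e.enabled and re.search(e.pattern, question)]
        if any(e.exclude for e in hits):
            continue  # a matching 'exclude' expression rules this category out
        if hits:
            top = max(e.weight for e in hits)
            if best is None or top > best[0]:
                best = (top, category)
    return best[1] if best else None
```

With a tree in which the "author" node has an exclude expression for "title", a question such as "who holds the title of chairman" is kept out of the "author" category even though it also matches the node's "who" expression.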
Step S102: obtain the target extraction node corresponding to the target category in an extraction tree, and use the extraction expressions contained in the target extraction node to extract target content from the reading text, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one category of questions, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions.
Corresponding to the question tree above, an extraction tree can be constructed in the embodiments of the present application. The extraction tree contains at least one parent node and multiple child nodes one level below the parent node, and the child nodes serve as extraction nodes. The extraction nodes correspond one-to-one to the classification nodes of the question tree, so each extraction node also corresponds to one category of questions. Each extraction node contains at least one extraction expression list; each row of the extraction expression list contains, as a pair, an extraction name (corresponding to a category name in the classification expression list) and a corresponding extraction expression.
Illustratively, the extraction tree can take the following form:
Answer extraction (parent node)
  Article theme (child node)
  Author
  Article source
  Who holds the title
  Article time
Thus, after the target category of the question has been determined in step S101, the target extraction node corresponding to the target category can be obtained in the extraction tree, and the extraction expressions contained in the target extraction node are used to extract target content from the reading text.
Illustratively, for the question "Who is the author of the article?", the classification expression [c_article+{0,0}c_reporter+{0,0}c_who is] in the question tree can be matched, so the target category is determined to be: author. Therefore, in step S102, the extraction expressions contained in the "author" node of the extraction tree are matched against the reading text. For the reading text shown above, for example, "k_reporter{0,1}@c_name@" matches "reporter Gao Shan XX" in the reading text.
In addition, as one possible implementation, each row of the extraction expression list may also include a check box, which the user can select or clear for an extraction expression in order to further modify or delete that expression.
In addition, as one possible implementation, a weight value can be set for each extraction expression. The weight value may be, for example, a natural number. When multiple extraction expressions match different content in the reading text at the same time, the content matched by the extraction expression with the highest weight value (the largest number) can be used as the target content.
In addition, as one possible implementation, an enabled state can be set for each extraction expression; the enabled state may include, for example, "valid" and "invalid". Specifically, when the enabled state is "valid", the extraction expression takes part in matching the reading text; when the enabled state is "invalid", the extraction expression does not take part in matching the reading text.
In addition, as one possible implementation, an extraction range can be set for each extraction expression; the extraction range may include, for example, matching only within clauses and matching across clauses. Commas, semicolons and full stops in the reading text can be used as clause boundaries, and the content between two boundaries is one clause. Specifically, when the extraction range is matching only within clauses, the extraction expression matches target content in each clause of the reading text separately and does not match across clauses; when the extraction range is matching across clauses, the extraction expression can match across clauses of the reading text.
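The two extraction ranges can be illustrated with a short sketch. It assumes that regular expressions stand in for extraction expressions and that both ASCII and full-width punctuation mark clause boundaries; the function names are hypothetical.

```python
import re

# Comma, semicolon and full stop (ASCII and full-width) as clause boundaries.
CLAUSE_BOUNDARIES = r"[,;.，；。]"

def search_within_clauses(pattern, text):
    """Match the pattern inside each clause separately, never across a boundary."""
    hits = []
    for clause in re.split(CLAUSE_BOUNDARIES, text):
        hits.extend(m.group(0) for m in re.finditer(pattern, clause))
    return hits

def search_across_clauses(pattern, text):
    """Match the pattern against the whole text, so a hit may span clauses."""
    return [m.group(0) for m in re.finditer(pattern, text)]
```

For the text "Reported by Xinhua, reporter Gao XX." and the pattern `Xinhua.*Gao`, the within-clause search finds nothing because the comma separates the two terms, while the cross-clause search returns the span that crosses it.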
Step S103: post-process the target content according to the post-processing rule corresponding to the target category to obtain the answer to the question.
Besides the answer to the question, the target content may also contain other content. For example, when the question is "Who is the author of the article" and the extracted target content is "reporter Xing XX", only "Xing XX" is the answer. Alternatively, when multiple pieces of target content are extracted in step S102, only one piece needs to be selected to ensure that the answer is unique, and the answer is generated from the selected target content.
Thus, in the embodiments of the present application, different post-processing rules can be set for the different categories of questions, and the target content is filtered, screened and refined according to the post-processing rules to obtain the answer to the question.
As can be seen from the above technical solutions, the embodiments of the present application provide a content extraction method, comprising: obtaining the target category of a question according to the classification expressions contained in a question tree, where the question tree contains classification nodes, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions; obtaining the target extraction node corresponding to the target category in an extraction tree, and using the extraction expressions contained in the target extraction node to extract target content from a reading text, where the extraction tree contains extraction nodes, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions; and post-processing the target content according to the post-processing rule corresponding to the target category to obtain the answer to the question. When the method provided by the embodiments of the present application is applied to machine reading comprehension, it is only necessary to construct the question tree and the extraction tree according to the categories of questions. Once the categories are determined, the question tree and the extraction tree are also relatively fixed and can be used to extract answers from different reading texts, so the method is general and can improve the accuracy of machine reading comprehension.
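Steps S101 to S103 can be strung together in a few lines. The sketch below is only an illustration under simplifying assumptions: regular expressions replace both kinds of expression, each tree holds a single hypothetical "author" node, and the post-processing rule is a plain function.

```python
import re

# Hypothetical, single-category trees; regexes stand in for the expressions.
QUESTION_TREE = {"author": [r"who.*(author|wrote)"]}
EXTRACTION_TREE = {"author": [r"reporter [A-Z][a-z]+ XX"]}
POST_RULES = {"author": lambda s: s.replace("reporter", "").strip()}

def answer(question, text):
    """S101: classify the question; S102: extract; S103: post-process."""
    q = question.lower()
    category = next(
        (c for c, patterns in QUESTION_TREE.items()
         if any(re.search(p, q) for p in patterns)),
        None,
    )
    if category is None:
        return None
    for pattern in EXTRACTION_TREE[category]:
        match = re.search(pattern, text)
        if match:
            return POST_RULES[category](match.group(0))
    return None
```

Because the trees depend only on the question categories, the same `answer` function can be applied to different reading texts without retraining anything, which is the generality the paragraph above refers to.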
In one embodiment, classification expressions and extraction expressions can be composed of text concepts, keywords, operators and the like. A text concept contains at least one concept value, each concept value being one way of expressing the text concept; operators combine text concepts and keywords to form the matching rules of an expression.
The composition of classification expressions and extraction expressions is explained below with some examples.
Illustratively, consider the following classification expression:
[c_article+{0,0}c_who is+{0,0}k_report]
Here "c_article" and "c_who is" are text concepts: "c" is the marker of a text concept and "article" is the name of the text concept. "article" can have several different concept values, such as: article, news, text, report. When one of these concept values appears in the text being matched, it is matched by "c_article" in the classification expression.
"k_report" is the expression form of a keyword: "k" is the marker of a keyword and "report" is the keyword. When the text being matched contains "report", it is matched by "k_report" in the classification expression.
"+", "{0,0}" and "[]" are operators. "+" is the AND operator; its matching rule is that the text concepts or keywords before and after "+" are both present. "{0,0}" is a distance operator of the form {x,y}, where x and y are non-negative integers and x is less than or equal to y. The two values express a distance interval, and the matching rule is that the distance between the text concepts or keywords is between x and y character positions. "[]" is the sequence operator, indicating that the text concepts and keywords inside "[]" must be matched in the defined order.
Illustratively, consider the following extraction expression:
c_news media{0,1}@c_name@
Here "c_news media" and "c_name" are text concepts, and "{0,1}" is a distance operator. The two "@" symbols are the front boundary marker and the rear boundary marker; the content in the reading text matched by the part between the two "@" symbols is the target content that the extraction expression extracts. The full meaning of this extraction expression is therefore: match a "news media" concept and a "name" concept that are 0 to 1 characters apart, and extract the "name" concept.
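The operator semantics described above map naturally onto regular expressions. The sketch below is an assumed translation rather than the patent's engine: a text concept becomes an alternation of its concept values, the distance operator {x,y} becomes `.{x,y}`, the sequence operator is implied by the left-to-right order of the pattern, and the "@" boundary markers become a capture group. All function names and the name pattern are hypothetical.

```python
import re

def alternatives(values):
    """Turn a list of concept values or keywords into a regex alternation."""
    return "(?:" + "|".join(re.escape(v) for v in values) + ")"

def ordered_distance_match(text, first, second, x, y):
    """[first +{x,y} second]: both present, in order, x to y characters apart."""
    pattern = f"{alternatives(first)}.{{{x},{y}}}{alternatives(second)}"
    return re.search(pattern, text) is not None

def extract_after(text, prefix_values, target_pattern, x, y):
    """prefix{x,y}@target@: return only the part between the boundary markers."""
    pattern = f"{alternatives(prefix_values)}.{{{x},{y}}}({target_pattern})"
    match = re.search(pattern, text)
    return match.group(1) if match else None
```

For instance, `extract_after("(reporter Gao XX)", ["reporter"], r"[A-Z][a-z]+ XX", 0, 1)` matches the prefix and the name one character apart and returns only the name, mirroring the way the part between the two "@" symbols is extracted.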
It should be added that the expression forms of text concepts, keywords and operators shown in the above examples are only one optional way of composing classification expressions and extraction expressions, not the only way. On the basis of the disclosure of the embodiments of the present application, those skilled in the art can design other expression forms for text concepts, keywords and operators as needed, and build other classification expressions and extraction expressions on that basis; such designs and concepts do not depart from the scope of protection of the embodiments of the present application.
In one embodiment, to make the concept values of text concepts easier to maintain, the embodiments of the present application also construct a concept tree. The concept tree contains multiple text concepts, each text concept contains multiple concept nodes, and each concept node corresponds to one concept value.
Specifically, the concept tree may include a question concept node and an answer concept node. The question concept node and the answer concept node each contain multiple classification nodes, each corresponding to one category of questions. Each classification node also contains multiple child nodes; each child node corresponds to one text concept and contains a concept value list, which records the concept name and all concept values of the text concept.
Illustratively, the concept tree can take the following form:
Question concepts
  Article theme (classification node)
    Article (child node)
    Roughly
    Theme
    What
    Introduce
    Summarize
  Who holds the title
  Article time
Answer concepts
As an optional implementation, concept values may use syntax rules such as "()", "|" and "?" to extend their expression forms and range. Those skilled in the art can design the syntax rules of concept values according to the syntax rules of regular expressions, for example: "|" denotes alternation, "()" delimits the scope of a syntax rule, "?" means that the preceding character occurs at most once, and so on.
Thus, when a classification expression or an extraction expression contains a text concept, the concept values of that text concept in the concept tree can be used for matching, which extends the matching range of the text concept.
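Expanding a text concept into its concept values can be sketched as follows, assuming the concept values are written directly in regular expression syntax as suggested above. The `CONCEPT_VALUES` table and its entries are hypothetical.

```python
import re

# Hypothetical concept value lists; values may use "()", "|" and "?".
CONCEPT_VALUES = {
    "article": ["article", "news", "text", "reports?"],
    "summarize": ["summari[sz]e", "introduce", "(main )?idea"],
}

def concept_matches(concept, text):
    """True if any concept value of the named text concept occurs in the text."""
    pattern = "|".join(f"(?:{value})" for value in CONCEPT_VALUES[concept])
    return re.search(pattern, text) is not None
```

Because the expressions refer to concepts by name, adding a new concept value to the table widens every expression that uses that concept without editing the expressions themselves.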
Fig. 2 is a flow chart of a pre-processing method provided by an embodiment of the present application.
In one possible implementation, an embodiment of the present application provides a method for pre-processing the question and the reading text. The pre-processing method is applied before the target category of the question is obtained according to the classification expressions contained in the question tree. As shown in Fig. 2, the pre-processing method may specifically include the following steps:
Step S201: remove the space characters in the question.
A space character also occupies a character position, so it can affect the matching of classification expressions. For example, if there are two space characters between two text concepts in the question, the distance between the two text concepts is at least two character positions. If the distance operator between these two text concepts in the classification expression is {0,1}, the question and the classification expression will fail to match because of the space characters. Removing the space characters in the question therefore improves the accuracy with which classification expressions match questions.
Step S202: remove the specific content contained at the start position or end position of the reading text.
In a production environment, some reading texts are obtained from the network, so they may contain specific content at the start or end position. For example, the text of a news report generally ends with words such as "comment" or "message", and when a web page loads only part of the body of a news report, the text may also end with words such as "load more". These words do not contain the answer to the question but may interfere with the matching of extraction expressions. Therefore, in step S202, such text content at the start position or end position of the reading text is removed.
Illustratively, at the start position of the reading text, the following content can be removed: any whitespace characters (including spaces, tabs, form feeds and so on), carriage returns, line feeds, content in a specific format (such as the format "editor: SN + number") and specific text content (such as "load more", "load more news", "load more comments", "authorized" and so on). At the end position of the reading text, the following content can be removed: specific text content (such as "video loading, please wait", "autoplay", "play"), any whitespace characters (including spaces, tabs, form feeds and so on), and so on.
Step S203: find the whitespace characters that occur consecutively in the reading text, and replace each run of consecutive whitespace characters with a single space character.
Consecutive whitespace characters in the reading text may include spaces, tabs, form feeds and so on. Such runs can interfere with matching the extraction expressions against the reading text, so in step S203 the embodiment of the present application replaces each run of consecutive whitespace characters with a single space character to reduce the interference.
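Steps S201 to S203 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the noise patterns (`Load more`, `Editor: SN` plus digits, `Comments`, `Autoplay`) and the function names are hypothetical stand-ins for whatever lists an actual deployment would maintain, and removing every space from the question assumes a language such as Chinese in which spaces do not separate words.

```python
import re

# Hypothetical noise lists; a real deployment would tune these per data source.
LEADING_NOISE = re.compile(r"^(?:\s|Load more|Editor: SN\d+)*")
TRAILING_NOISE = re.compile(r"(?:\s|Load more|Comments|Autoplay)+$")


def preprocess_question(question: str) -> str:
    """S201: remove space characters (ASCII and full-width) from the question."""
    return re.sub(r"[ \u3000]", "", question)


def preprocess_text(text: str) -> str:
    """S202 and S203 applied to the reading text."""
    text = LEADING_NOISE.sub("", text, count=1)   # S202: strip noise at the start
    text = TRAILING_NOISE.sub("", text, count=1)  # S202: strip noise at the end
    return re.sub(r"\s+", " ", text)              # S203: collapse whitespace runs
```

The noise alternatives are matched one whitespace character at a time rather than with a nested `\s+`, which avoids heavy backtracking on long whitespace runs.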
In some possible embodiments, the present application further provides post-processing rules for the target content according to the different classifications of the question.
In one embodiment, for the "author" classification, the post-processing rule may include: removing the character strings contained in the target content that are noise; and removing the space characters located before and after the target content, to obtain the answer.
Illustratively, the reading text is:
On April 5, 2016, the Ministry of Transport held a commissioning ceremony for the Zhubi Lighthouse on Zhubi Reef in the South China Sea. Putting the Zhubi Lighthouse into use will effectively improve navigation aid, navigation dispatch and emergency rescue capabilities in the surrounding waters. Photo by Xinhua News Agency reporter Xing XX
The question is: who is the reporter of the article?
The question can then be matched to the "question classification -- author" node of the question tree. Using the expression "c_ news media { 0,1 } the@c_ name@" of its "article author" node, the extraction tree can extract "reporter Xing XX" from the reading text. Because "reporter" is not a person's name, it is noise; in post-processing, the present application removes "reporter" and also removes any space characters that may appear before or after "Xing XX", which yields the answer to the question.
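The "author" post-processing rule can be sketched as below. The noise list and the function name are assumptions made for illustration; the source does not say how noise strings are enumerated.

```python
from typing import Iterable

# Hypothetical noise strings for the "author" classification.
AUTHOR_NOISE = ("reporter", "correspondent")


def postprocess_author(target: str, noise: Iterable[str] = AUTHOR_NOISE) -> str:
    """Remove noise strings, then strip spaces before and after the remainder."""
    for word in noise:
        target = target.replace(word, "")
    return target.strip(" ")
```

Applied to the example above, `postprocess_author("reporter Xing XX")` leaves only the name.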
Fig. 3 is a flowchart of a post-processing rule provided by an embodiment of the present application.
In one embodiment, for the "article source" classification, when multiple pieces of target content are extracted from the reading text, the post-processing rule shown in Fig. 3 may be used to select one piece of target content as the answer to the question. Specifically, the post-processing rule shown in Fig. 3 includes the following steps:
Step S301: when the extraction expression extracts multiple pieces of target content, take as the answer to the question the target content that lies within a preset range of the end of the reading text and is nearest to the end of the reading text.
Step S302: if no target content lies within the preset range of the end of the reading text, take as the answer to the question the target content that lies within a preset range of the beginning of the reading text and is nearest to the beginning of the reading text.
Thus, for the "article source" classification, when multiple pieces of target content are extracted from the reading text, the present application provides at least two priorities for determining the answer during post-processing. The highest priority, step S301, focuses the selection range on the end of the reading text and obtains the answer from there; the second priority, step S302, focuses the selection range on the beginning of the reading text and obtains the answer from there.
Illustratively, with the preset range set to 30 characters, for the news item shown below, the target content that the extraction expression can extract is the part shown in bold, and the range from which the answer is obtained in step S301 is the underlined part:
According to a June 29, 2009 report by the U.S. "Daily Space Flight" ..., the British space sector had previously lobbied members of parliament, pointing out that Britain's dependence on U.S. imaging satellites is a matter that bears on national capability. ... The report also suggests that cyberspace, as a national security field, will be of growing importance, and that it will continue to be of growing importance in almost all forms of human activity. China Engineering Technology Information Network
Because the extraction expression extracts only "China Engineering Technology Information Network" within that range, "China Engineering Technology Information Network" is the answer to the "article source" question.
Illustratively, with the preset range set to 30 characters, for the news item shown below, the target content that the extraction expression can extract is the part shown in bold, the range from which the answer is obtained in step S301 is the underlined part at the end of the text, and the range from which the answer is obtained in step S302 is the underlined part at the beginning of the text:
Reference News Network, December 18 report: Western media say that China claims its construction on disputed South China Sea islands is normal activity. ... According to a December 17 report on the website of the "Philippine Daily Inquirer", the website of the Asia Maritime Transparency Initiative said that China is building reefs in the Nansha Islands and the Xisha Islands into islands while deploying military facilities and equipment. In response, Chinese Foreign Ministry spokesman Lu Kang said at a regular press conference on the 15th: "It is perfectly normal for China to carry out peaceful construction activities and deploy necessary defensive facilities on its own territory; this is a matter within the scope of China's sovereignty."
Because the extraction expression extracts no target content within the range at the end of the news item, the answer is obtained from the beginning of the news item according to the second priority, where "Reference News Network" is extracted. "Reference News Network" is therefore the answer to the "article source" question.
Thus, when the extraction expression extracts multiple pieces of target content from the reading text for a question of the "article source" classification, a single piece of target content is uniquely determined as the answer to the question by selecting according to the set priorities.
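The two-priority selection of steps S301 and S302 can be sketched as follows. Several details are assumptions rather than statements from the source: candidates are represented as `(start, end)` character offsets with an exclusive end, a candidate counts as "within the range" only if it lies wholly inside the window, and `None` is returned when neither window contains a candidate.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def pick_source(text_len: int, spans: List[Span], window: int = 30) -> Optional[Span]:
    """Choose one candidate span for the "article source" classification."""
    # S301: candidates lying wholly inside the last `window` characters.
    tail = [s for s in spans if s[0] >= text_len - window]
    if tail:
        return max(tail, key=lambda s: s[1])  # nearest to the end of the text
    # S302: otherwise candidates lying wholly inside the first `window` characters.
    head = [s for s in spans if s[1] <= window]
    if head:
        return min(head, key=lambda s: s[0])  # nearest to the beginning of the text
    return None  # assumption: no answer when neither window holds a candidate
```

Checking the tail window first mirrors the priority order described above: a source credit at the end of the article wins over one at the beginning.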
Fig. 4 is a flowchart of a post-processing rule provided by an embodiment of the present application.
In one embodiment, for the "article time" classification, when multiple pieces of target content are extracted from the reading text, the post-processing rule shown in Fig. 4 may be used to select one piece of target content as the answer to the question. Specifically, the post-processing rule shown in Fig. 4 includes the following steps:
Step S401: when the extraction expression extracts multiple pieces of target content, obtain the end position of each piece of target content in the reading text, the end position being the position in the reading text of the last character of the target content.
Step S402: for each piece of target content, calculate the difference obtained by subtracting its end position from the character length of the reading text.
Step S403: take the target content with the smallest difference as the answer to the question.
Illustratively, consider the following news item:
Xinhua News Agency, February 12 report: XXXXXXXXXXXXXXX, XXXXXXXXXXXXX. (Xinhua News Agency, February 13 news)
Its character length is 54 characters, and the two dates (underlined in the original) are the target content extracted by the extraction expression. The start position and end position of each piece of target content in the reading text, and the difference obtained by subtracting the end position from the character length, are listed in the following table (the position of the first character of the reading text is taken as 0, and positions are counted in characters of the original Chinese text):
Target content (time)    Start position    End position    Character length - end position
February 12              3                 7               47
February 13              47                51              3
The minimum difference between the character length and the end position is thus 3, which corresponds to the time "February 13", so "February 13" is the answer to the "article time" question.
Thus, when the extraction expression extracts multiple pieces of target content from the reading text for a question of the "article time" classification, a single piece of target content is uniquely determined as the answer to the question according to the positions of the target content in the reading text.
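Steps S401 to S403 reduce to picking the candidate whose end position is closest to the end of the text. The sketch below uses the numbers from the table above; the function name and the `(start, end)` tuple representation (inclusive end position, as in the table) are illustrative assumptions.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start position, end position), end inclusive


def pick_time(text_len: int, spans: List[Span]) -> Span:
    """Return the span whose (text length - end position) difference is smallest."""
    return min(spans, key=lambda s: text_len - s[1])


# Values from the table: text length 54, candidates ending at positions 7 and 51.
print(pick_time(54, [(3, 7), (47, 51)]))  # (47, 51): difference 3 beats 47
```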
In one embodiment, if the target extraction node is "whom title is", then when the extraction expression contained in the target extraction node is used in step S102 to extract target content from the reading text, if a text concept in the extraction expression has multiple concept values and the concept value with the most characters contains the other concept values, the concept value with the most characters is used to extract the target content.
Illustratively, consider the following news item:
China Net, December 29: Foreign Ministry spokesman XXX answered reporters' questions regarding China's participation in the search and rescue work for the AirAsia passenger plane that lost contact ......
The extraction expressions contained in the extraction node "whom title is" of the extraction tree include:
(c_ title) [^,.?!...] { 0,4 } (c_ name)
If the text concept "c_ title" has multiple concept values in the concept tree, for example "spokesman" and "Foreign Ministry spokesman", where "Foreign Ministry spokesman" contains "spokesman", the extraction expression uses "Foreign Ministry spokesman" to match the reading text and extracts the matched target content. In the above news item, the extracted target content is therefore "Foreign Ministry spokesman XXX" rather than "spokesman XXX". Extracting more characters as the target content in this way makes the final answer more complete and accurate.
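The preference for the longest concept value can be sketched with ordinary regular expressions standing in for the patent's expression language, which is an assumption: the source does not specify how its expressions are compiled. The helper names and the `name_pattern` argument are hypothetical.

```python
import re
from typing import List


def select_concept_values(values: List[str]) -> List[str]:
    """Keep only the longest value when it contains every other value."""
    longest = max(values, key=len)
    if all(v in longest for v in values):
        return [longest]
    return values


def build_title_pattern(values: List[str], name_pattern: str) -> "re.Pattern[str]":
    """Regex stand-in for: (c_ title) [^,.?!]{0,4} (c_ name)."""
    chosen = sorted(select_concept_values(values), key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in chosen)
    return re.compile(rf"(?:{alternation})[^,.?!]{{0,4}}(?:{name_pattern})")
```

With the values "spokesman" and "Foreign Ministry spokesman", only the longer value is kept, so a match on the example news item starts at "Foreign Ministry" rather than at "spokesman".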
In addition, as an optional embodiment, if the target classification of the question is not obtained according to the classification expressions, or no target content is extracted according to the extraction expressions, a pre-trained machine learning model is used to obtain the answer to the question from the reading text. The technical solution of the embodiment of the present application and the reading comprehension method based on a machine learning model thus serve as alternatives to each other, so that an answer can be extracted from the reading text for the question in any case.
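The fallback arrangement can be sketched as a simple chain. Both answerers are passed in as callables because the source specifies neither the rule-based interface nor the machine learning model; the names and the convention that `None` means "no answer" are assumptions.

```python
from typing import Callable, Optional

Answerer = Callable[[str, str], Optional[str]]


def answer_with_fallback(question: str, text: str,
                         rule_based: Answerer, model_based: Answerer) -> Optional[str]:
    """Try the expression-based pipeline first; fall back to the trained model."""
    answer = rule_based(question, text)
    if answer is None:  # no target classification, or nothing extracted
        answer = model_based(question, text)
    return answer
```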
The following is the device embodiment of the present application, which provides a content extraction device. The device can be applied to a variety of equipment such as servers, personal computers (PCs), tablet computers, mobile phones, smart TVs, smart speakers, virtual reality devices and smart wearable devices. For details not disclosed in the device embodiment, refer to the method embodiments of the present application.
Fig. 5 is a schematic structural diagram of a content extraction device provided by an embodiment of the present application. As shown in Fig. 5, the device includes:
A question matching module 501, configured to obtain the target classification of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node corresponds to one classification of questions, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions;
A content extraction module 502, configured to obtain the target extraction node corresponding to the target classification in an extraction tree, and to extract target content from a reading text using the extraction expressions contained in the target extraction node, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one classification of questions, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions;
A post-processing module 503, configured to post-process the target content according to the post-processing rule corresponding to the target classification, to obtain the answer to the question.
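The relationship between the three modules can be sketched as one small class. This is a simplification under several assumptions: both trees are flattened into lists of nodes, regular expressions stand in for the classification and extraction expressions, and the class, method and field names are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    category: str                                          # one question classification
    expressions: List[str] = field(default_factory=list)   # regex stand-ins


class ContentExtractor:
    def __init__(self, question_tree: List[Node], extraction_tree: List[Node],
                 post_rules: Dict[str, Callable[[List[str]], str]]):
        self.question_tree = question_tree
        self.extraction_tree = {n.category: n for n in extraction_tree}
        self.post_rules = post_rules

    def match_question(self, question: str) -> Optional[str]:
        """Module 501: the first node with a matching expression gives the classification."""
        for node in self.question_tree:
            if any(re.search(e, question) for e in node.expressions):
                return node.category
        return None

    def extract(self, category: str, text: str) -> List[str]:
        """Module 502: run the target extraction node's expressions over the text."""
        node = self.extraction_tree.get(category)
        if node is None:
            return []
        return [m.group(0) for e in node.expressions for m in re.finditer(e, text)]

    def answer(self, question: str, text: str) -> Optional[str]:
        """Module 503: post-process the extracted targets with the classification's rule."""
        category = self.match_question(question)
        if category is None:
            return None
        targets = self.extract(category, text)
        if not targets:
            return None
        rule = self.post_rules.get(category, lambda ts: ts[0])
        return rule(targets)
```

Because the trees and rules are keyed only by classification, the same instance can be reused across different reading texts, which is the universality the description claims.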
As can be seen from the above technical solution, the embodiment of the present application provides a content extraction device, configured to: obtain the target classification of a question according to the classification expressions contained in a question tree, the question tree containing classification nodes, each classification node containing a classification expression list, and the classification expression list containing multiple classification expressions; obtain the target extraction node corresponding to the target classification in an extraction tree, and extract target content from a reading text using the extraction expressions contained in the target extraction node, the extraction tree containing extraction nodes, each extraction node containing an extraction expression list, and the extraction expression list containing multiple extraction expressions; and post-process the target content according to the post-processing rule corresponding to the target classification, to obtain the answer to the question. Thus, when the device provided by the embodiment of the present application is applied to machine reading comprehension, only the question tree and the extraction tree need to be constructed according to the classifications of questions. Once the classifications of questions are determined, the question tree and the extraction tree are also relatively fixed and can be used to extract the answers to questions from different reading texts. The device is therefore universal and can improve the accuracy of machine reading comprehension.
Other embodiments of the present application will readily occur to those skilled in the art after considering the specification and practicing the application disclosed herein. The present application is intended to cover any variations, uses or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the art not disclosed in the present application. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present application are indicated by the following claims.
It should be understood that the present application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.

Claims (10)

1. A content extraction method, characterized by comprising:
obtaining the target classification of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node corresponds to one classification of questions, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions;
obtaining the target extraction node corresponding to the target classification in an extraction tree, and extracting target content from a reading text using the extraction expressions contained in the target extraction node, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one classification of questions, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions;
post-processing the target content according to the post-processing rule corresponding to the target classification, to obtain the answer to the question.
2. The method according to claim 1, characterized in that the classification expressions and the extraction expressions are composed of text concepts, keywords and operators, wherein each text concept contains at least one concept value, each concept value is one way of expressing the text concept, and the operators are used to combine the text concepts and the keywords to form the matching rules of the expressions.
3. The method according to claim 2, characterized by further comprising a concept tree, wherein the concept tree contains multiple text concepts, each text concept contains multiple concept nodes, and each concept node corresponds to one concept value.
4. The method according to claim 1, characterized in that, before obtaining the target classification of the question according to the classification expressions contained in the question tree, the method further comprises:
removing the space characters from the question;
removing the specific content contained at the start position or end position of the reading text;
finding the whitespace characters that occur consecutively in the reading text, and replacing each run of consecutive whitespace characters with a single space character.
5. The method according to claim 1, characterized in that the post-processing rule comprises:
removing the character strings contained in the target content that are noise;
removing the space characters located before and after the target content, to obtain the answer.
6. The method according to claim 1, characterized in that the post-processing rule comprises:
when the extraction expression extracts multiple pieces of target content, taking as the answer to the question the target content that lies within a preset range of the end of the reading text and is nearest to the end of the reading text;
if no target content lies within the preset range of the end of the reading text, taking as the answer to the question the target content that lies within a preset range of the beginning of the reading text and is nearest to the beginning of the reading text.
7. The method according to claim 1, characterized in that the post-processing rule comprises:
when the extraction expression extracts multiple pieces of target content, obtaining the end position of each piece of target content in the reading text, the end position being the position in the reading text of the last character of the target content;
for each piece of target content, calculating the difference obtained by subtracting its end position from the character length of the reading text;
taking the target content with the smallest difference as the answer to the question.
8. The method according to claim 2, characterized in that extracting target content from the reading text using the extraction expressions contained in the target extraction node comprises:
if a text concept contained in the extraction expression has multiple concept values, and the concept value with the most characters contains the other concept values, extracting the target content using the concept value with the most characters.
9. The method according to any one of claims 1-8, characterized by further comprising:
if the target classification of the question is not obtained according to the classification expressions, or the target content is not extracted according to the extraction expressions, obtaining the answer to the question from the reading text using a pre-trained machine learning model.
10. A content extraction device, characterized by comprising:
a question matching module, configured to obtain the target classification of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node corresponds to one classification of questions, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions;
a content extraction module, configured to obtain the target extraction node corresponding to the target classification in an extraction tree, and to extract target content from a reading text using the extraction expressions contained in the target extraction node, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one classification of questions, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions;
a post-processing module, configured to post-process the target content according to the post-processing rule corresponding to the target classification, to obtain the answer to the question.
CN201910155040.6A 2019-03-01 2019-03-01 Content extraction method and device Active CN109918490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910155040.6A CN109918490B (en) 2019-03-01 2019-03-01 Content extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910155040.6A CN109918490B (en) 2019-03-01 2019-03-01 Content extraction method and device

Publications (2)

Publication Number Publication Date
CN109918490A true CN109918490A (en) 2019-06-21
CN109918490B CN109918490B (en) 2022-12-16

Family

ID=66962894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910155040.6A Active CN109918490B (en) 2019-03-01 2019-03-01 Content extraction method and device

Country Status (1)

Country Link
CN (1) CN109918490B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413636A (en) * 2019-08-01 2019-11-05 北京香侬慧语科技有限责任公司 A kind of data processing method and device
CN110457597A (en) * 2019-08-08 2019-11-15 中科鼎富(北京)科技发展有限公司 A kind of advertisement recognition method and device
CN111008523A (en) * 2019-11-21 2020-04-14 中科鼎富(北京)科技发展有限公司 Information extraction method and device and server

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6115683A (en) * 1997-03-31 2000-09-05 Educational Testing Service Automatic essay scoring system using content-based techniques
CN107608949A (en) * 2017-10-16 2018-01-19 北京神州泰岳软件股份有限公司 A kind of Text Information Extraction method and device based on semantic model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王素格等: "面向高考阅读理解观点类问题的答案抽取方法", 《郑州大学学报(理学版)》 *

Also Published As

Publication number Publication date
CN109918490B (en) 2022-12-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant