CN109918490A - Content extraction method and device - Google Patents
- Publication number
- CN109918490A (application CN201910155040.6A)
- Authority
- CN
- China
- Prior art keywords
- classification
- text
- extraction
- expression
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the present application provide a content extraction method and device. The method comprises: obtaining a target classification of a question according to the classification expressions contained in a question tree; obtaining the target extraction node that corresponds to the target classification in an extraction tree, and extracting target content from a reading text using the extraction expressions contained in the target extraction node; and post-processing the target content according to the post-processing rule corresponding to the target classification to obtain the answer to the question. When the method is applied to machine reading comprehension, only a question tree and an extraction tree need to be built according to the classifications of questions. Once the classifications are determined, the question tree and the extraction tree are also relatively fixed and can be used to extract answers from different reading texts, so the method is general and can improve the accuracy of machine reading comprehension.
Description
Technical field
The present application relates to the technical field of natural language processing, and in particular to a content extraction method and device.
Background
Machine reading comprehension is a technical topic that emerged with the development of deep learning. Its research goal is to let a machine read a text as a human does and then answer questions based on its understanding of that text. Specifically, given a corpus and a question, the machine finds the correct answer to the question in the corpus.
Current machine reading comprehension is usually implemented with deep-learning-based models: a manually constructed data set annotated with questions and answers is used to train a deep learning model, so that the model can extract answers to some simple questions from simple texts. However, owing to the limitations of deep learning algorithms and of data set size, the accuracy of current deep-learning-based machine reading comprehension methods is not high. For example, in some practical applications with open-ended context, a deep learning model achieves only about 60% accuracy when extracting the answer to a given question from an article, far below what a production environment requires. The accuracy of machine reading comprehension therefore still has considerable room for improvement.
Summary of the invention
Embodiments of the present application provide a content extraction method and device to solve the problem that prior-art machine reading comprehension methods extract answers from articles with low accuracy.
In a first aspect, an embodiment of the present application provides a content extraction method, comprising:
obtaining a target classification of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node corresponds to one classification of questions, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions;
obtaining the target extraction node that corresponds to the target classification in an extraction tree, and extracting target content from a reading text using the extraction expressions contained in the target extraction node, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one classification of questions, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions; and
post-processing the target content according to the post-processing rule corresponding to the target classification to obtain the answer to the question.
In a second aspect, an embodiment of the present application provides a content extraction device, comprising:
a question matching module, configured to obtain a target classification of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node corresponds to one classification of questions, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions;
a content extraction module, configured to obtain the target extraction node that corresponds to the target classification in an extraction tree, and to extract target content from a reading text using the extraction expressions contained in the target extraction node, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one classification of questions, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions; and
a post-processing module, configured to post-process the target content according to the post-processing rule corresponding to the target classification to obtain the answer to the question.
As can be seen from the above technical solutions, embodiments of the present application provide a content extraction method and device, comprising: obtaining a target classification of a question according to the classification expressions contained in a question tree, where the question tree contains classification nodes, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions; obtaining the target extraction node that corresponds to the target classification in an extraction tree, and extracting target content from a reading text using the extraction expressions contained in the target extraction node, where the extraction tree contains extraction nodes, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions; and post-processing the target content according to the post-processing rule corresponding to the target classification to obtain the answer to the question. When the technical solution is applied to machine reading comprehension, only a question tree and an extraction tree need to be built according to the classifications of questions. Once the classifications are determined, the question tree and the extraction tree are also relatively fixed and can be used to extract answers from different reading texts, so the solution is general and can improve the accuracy of machine reading comprehension.
Brief description of the drawings
To explain the technical solutions of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a flow chart of a content extraction method provided by an embodiment of the present application;
Fig. 2 is a flow chart of a preprocessing method provided by an embodiment of the present application;
Fig. 3 is a flow chart of a post-processing rule provided by an embodiment of the present application;
Fig. 4 is a flow chart of a post-processing rule provided by an embodiment of the present application;
Fig. 5 is a structural schematic diagram of a content extraction device provided by an embodiment of the present application.
Detailed description
To help those skilled in the art better understand the technical solutions of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present application without creative effort fall within the protection scope of the present application.
Machine reading comprehension is a technical topic that emerged with the development of deep learning. Its research goal is to let a machine read a text as a human does and then answer questions based on its understanding of that text. Specifically, given a corpus and a question, the machine finds the correct answer to the question in the corpus.
For example, take the following article as the reading text for machine reading comprehension:
The Russian Defence Ministry's information bureau announced on the 13th that "Bridge of Friendship 2015", the first joint military exercise of the Russian and Egyptian navies, ended in the Mediterranean that day. ... An Egyptian military spokesman said earlier that this was the largest military exercise ever held between the two countries, aimed at strengthening Egyptian-Russian strategic military and security cooperation and promoting the exchange of military technology between the two sides. According to reports, the Russian vessels taking part in the exercise included the guided missile cruiser "Moscow", the guided missile hovercraft "Samum", the large landing ship "Aleksandr Shabalin" and the deep-sea tug "MB-31". The Egyptian navy sent two cruise warships, two guided missile boats and other vessels to take part in the exercise. (reporter high mountain XX)
A question is then given: Who is the author of this article?
The purpose of machine reading comprehension is then to find the author of the article in the reading text. For this question, the correct answer should be: high mountain XX.
Current machine reading comprehension is usually implemented with deep-learning-based models, that is, a manually constructed data set annotated with questions and answers is used to train a deep learning model. As a result, the limitations of the algorithms and of data set size create an accuracy bottleneck for deep-learning-based machine reading comprehension methods. In some practical applications with open-ended context in particular, the data set is far too small to train a deep learning model adequately, which leads to low accuracy.
Embodiments of the present application provide a content extraction method and device to solve the problem that prior-art machine reading comprehension methods extract answers from articles with low accuracy.
The following is a method embodiment of the present application. It provides a content extraction method that can be applied to a variety of devices, such as servers, personal computers (PCs), tablet computers, mobile phones, smart televisions, smart speakers, virtual reality devices and smart wearable devices.
Fig. 1 is a flow chart of a content extraction method provided by an embodiment of the present application. As shown in Fig. 1, the content extraction method comprises the following steps:
Step S101: obtain the target classification of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node corresponds to one classification of questions, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions.
Specifically, the embodiment of the present application extracts content from the reading text, according to a specified question, as the answer to that question. The reading text may be an article, for example a news release, a self-media article, a popular science article, a novel, a piece of prose, a monograph or an article in a professional field. Correspondingly, depending on what reading comprehension is mainly about, questions may fall into multiple classifications, such as: article theme, author, article source, article time, who the title is, and so on.
Based on this classification of questions, the embodiment of the present application can build a question tree. The question tree contains at least one parent node and multiple child nodes one level below the parent node, and the child nodes serve as classification nodes. Each classification node corresponds to one classification of questions and contains at least one classification expression list. Each row of the classification expression list contains, as a pair, a classification name and the corresponding classification expression.
For example, the question tree can take the following form:
Question classification (parent node)
    Author (child node)
    Article theme
    Article source
    Who the title is
    Article time
Thus, by matching the content of a question against the classification expressions in the question tree, it can be determined which classification expression in the question tree the question matches, and the classification node containing that classification expression determines the target classification of the question.
For example, if the question is "Please summarize the theme of the article", the classification expression matched in the question tree is [c_summary+{0,0}c_theme]. Because this expression is located at the child node "Article theme", the target classification of the question is: article theme.
In addition, as one possible implementation, each row of the classification expression list may also include a check box, which the user can use to select or deselect a classification expression and then modify or delete it.
In addition, as one possible implementation, a weight value may be set for each classification expression. The weight value may be, for example, a natural number. When a question matches two or more classification expressions at the same time, the target classification of the question is determined by the classification node containing the classification expression with the highest weight value (the largest number).
In addition, as one possible implementation, an identification state may be set for each classification expression. The identification state may be, for example, "identify" or "exclude". When the identification state is "identify", the classification expression is matched positively against the question: if the question matches a classification expression whose identification state is "identify", the classification node containing that expression determines the target classification of the question. When the identification state is "exclude", the classification expression is matched negatively against the question: if the question matches a classification expression whose identification state is "exclude", the classification corresponding to that expression is definitely not the target classification of the question.
In addition, as one possible implementation, an enabled state may be set for each classification expression. The enabled state may be, for example, "valid" or "invalid". When the enabled state is "valid", the classification expression takes part in matching the question; when the enabled state is "invalid", it does not.
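The weight value, identification state and enabled state described above can be combined as in the following minimal Python sketch. It is an illustration only, not the implementation of the application: the `ClassExpr` record, the `classify` function and the use of a plain callable in place of a real classification expression matcher are all assumptions made here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ClassExpr:
    """One row of a classification expression list (hypothetical layout)."""
    category: str                    # classification node the row belongs to
    matches: Callable[[str], bool]   # stand-in for a real expression matcher
    weight: int = 1                  # natural number; larger wins
    identify: bool = True            # True = "identify", False = "exclude"
    enabled: bool = True             # False = row does not take part in matching


def classify(question: str, exprs: List[ClassExpr]) -> Optional[str]:
    # Only enabled rows that actually match the question are considered.
    active = [e for e in exprs if e.enabled and e.matches(question)]
    # A matching "exclude" row rules its classification out entirely.
    excluded = {e.category for e in active if not e.identify}
    candidates = [e for e in active if e.identify and e.category not in excluded]
    if not candidates:
        return None
    # Several rows matched: the highest weight decides the target classification.
    return max(candidates, key=lambda e: e.weight).category


exprs = [
    ClassExpr("Author", lambda q: "reporter" in q, weight=2),
    ClassExpr("Article theme", lambda q: "article" in q, weight=1),
    ClassExpr("Article theme", lambda q: "reporter" in q, identify=False),
    ClassExpr("Article time", lambda q: "article" in q, weight=9, enabled=False),
]
target = classify("who is the article reporter", exprs)
```

In this sketch the disabled "Article time" row is ignored despite its high weight, and the "exclude" row removes "Article theme" from the candidates, so `target` is "Author".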
Step S102: obtain the target extraction node that corresponds to the target classification in an extraction tree, and extract target content from the reading text using the extraction expressions contained in the target extraction node, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one classification of questions, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions.
Corresponding to the question tree above, the embodiment of the present application can build an extraction tree. The extraction tree contains at least one parent node and multiple child nodes one level below the parent node, and the child nodes serve as extraction nodes. The extraction nodes correspond one-to-one to the classification nodes of the question tree, so each extraction node also corresponds to one classification of questions. Each extraction node contains at least one extraction expression list. Each row of the extraction expression list contains, as a pair, an extraction name (corresponding to the classification name in the classification expression list) and the corresponding extraction expression.
For example, the extraction tree can take the following form:
Answer extraction (parent node)
    Article theme (child node)
    Author
    Article source
    Who the title is
    Article time
Thus, after step S101 has determined the target classification of the question, the target extraction node corresponding to the target classification can be obtained in the extraction tree, and the extraction expressions contained in the target extraction node can be used to extract target content from the reading text.
For example, the question is: Who is the author? The question can be matched to the classification expression [c_article+{0,0}c_reporter+{0,0}c_who is] in the question tree, which determines that the target classification is: author. Therefore, in step S102, the extraction expressions contained in the "Author" node of the extraction tree are matched against the reading text. For the reading text shown above, for example, "k_reporter{0,1}@c_name@" can be used to match "reporter high mountain XX" in the reading text.
In addition, as one possible implementation, each row of the extraction expression list may also include a check box, which the user can use to select or deselect an extraction expression and then modify or delete it.
In addition, as one possible implementation, a weight value may be set for each extraction expression. The weight value may be, for example, a natural number. When multiple extraction expressions match different content in the reading text at the same time, the content matched by the extraction expression with the highest weight value (the largest number) can be taken as the target content.
In addition, as one possible implementation, an enabled state may be set for each extraction expression. The enabled state may be, for example, "valid" or "invalid". When the enabled state is "valid", the extraction expression takes part in matching the reading text; when the enabled state is "invalid", it does not.
In addition, as one possible implementation, an extraction range may be set for each extraction expression. The extraction range may be, for example, "match within clauses only" or "match across clauses", where commas, semicolons and full stops in the reading text serve as clause boundaries and the content between two boundaries is one clause. When the extraction range is "match within clauses only", the extraction expression matches target content separately within each clause of the reading text and never across clauses. When the extraction range is "match across clauses", the extraction expression can match across clauses of the reading text.
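The two extraction ranges can be sketched as follows. This is a hedged illustration that substitutes an ordinary Python regular expression with one capture group for a real extraction expression; the function names `split_clauses` and `extract`, and the inclusion of both ASCII and full-width punctuation as boundaries, are assumptions of this sketch.

```python
import re
from typing import List

# Commas, semicolons and full stops act as clause boundaries
# (ASCII and full-width forms are both included here as an assumption).
CLAUSE_BOUNDARY = re.compile(r"[,;.，；。]")


def split_clauses(text: str) -> List[str]:
    """Split the reading text into clauses at the boundary punctuation."""
    return [c for c in CLAUSE_BOUNDARY.split(text) if c.strip()]


def extract(pattern: str, text: str, within_clause: bool) -> List[str]:
    """Return capture group 1 of every match of `pattern`.

    within_clause=True  : each clause is searched on its own.
    within_clause=False : the whole text is searched, so a match may
                          span a clause boundary.
    """
    regex = re.compile(pattern)
    units = split_clauses(text) if within_clause else [text]
    return [m.group(1) for unit in units for m in regex.finditer(unit)]


text = "The drill ended, reporter Gao wrote. Thanks"
pattern = r"ended\W+reporter (\w+)"
in_clause = extract(pattern, text, within_clause=True)
cross_clause = extract(pattern, text, within_clause=False)
```

Here the pattern needs text from both sides of a comma, so it finds nothing when restricted to single clauses and finds "Gao" when cross-clause matching is allowed. Treating every full stop as a boundary would also split decimals and abbreviations; a production splitter would need to handle those cases.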
Step S103: post-process the target content according to the post-processing rule corresponding to the target classification to obtain the answer to the question.
Besides the answer to the question, the target content may also contain other content. For example, when the question is "Who is the author of the article" and the extracted target content is "reporter Xing XX", only "Xing XX" is the answer. Alternatively, when multiple pieces of target content are extracted in step S102, only one of them should be selected so that the answer is uniquely determined, and the answer is generated from the selected target content.
Thus, the embodiment of the present application can set different post-processing rules for different classifications of questions, and filter, screen and refine the target content according to the post-processing rule to obtain the answer to the question.
As can be seen from the above technical solutions, an embodiment of the present application provides a content extraction method, comprising: obtaining a target classification of a question according to the classification expressions contained in a question tree, where the question tree contains classification nodes, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions; obtaining the target extraction node that corresponds to the target classification in an extraction tree, and extracting target content from a reading text using the extraction expressions contained in the target extraction node, where the extraction tree contains extraction nodes, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions; and post-processing the target content according to the post-processing rule corresponding to the target classification to obtain the answer to the question. When the method is applied to machine reading comprehension, only a question tree and an extraction tree need to be built according to the classifications of questions. Once the classifications are determined, the question tree and the extraction tree are also relatively fixed and can be used to extract answers from different reading texts, so the method is general and can improve the accuracy of machine reading comprehension.
In one embodiment, classification expressions and extraction expressions can be composed of text concepts, keywords, operators and the like. A text concept contains at least one concept value, and each concept value is one way of expressing the text concept. Operators combine with text concepts and keywords to form the matching rule of an expression.
The composition of classification expressions and extraction expressions is explained below with some examples.
For example, consider the following classification expression:
[c_article+{0,0}c_who is+{0,0}k_report]
"c_article" and "c_who is" are text concepts, where "c" is the marker of a text concept and "article" is the name of the text concept. "article" can have several different concept values, such as: article, news, text, report. When any of these concept values appears in the text being matched, it is matched by "c_article" in the classification expression.
"k_report" is the form of a keyword, where "k" is the marker of a keyword and "report" is the keyword. When the text being matched contains "report", it is matched by "k_report" in the classification expression.
"+", "{0,0}" and "[]" are operators. "+" is the AND operator; its matching rule is that the text concepts or keywords before and after "+" are both present. "{0,0}" is the distance operator; its format is {x,y}, where x and y are non-negative integers and x is less than or equal to y. The two values express a distance interval, and the matching rule is that the distance between the text concepts or keywords is between x and y character positions. "[]" is the order operator; it indicates that the text concepts and keywords inside "[]" must be matched in the defined order.
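The combined effect of the order operator, the AND operator and the distance operator on two terms can be sketched as follows. This is an assumption-laden illustration that maps the rule onto Python's `re` module; the function name `ordered_within` is hypothetical, and the terms are treated as literal keywords rather than as text concepts with several concept values.

```python
import re


def ordered_within(text: str, first: str, second: str,
                   min_gap: int, max_gap: int) -> bool:
    """Check a rule of the shape [k_first+{min_gap,max_gap}k_second].

    Both terms must occur (AND), `first` must come before `second`
    (order), and between min_gap and max_gap characters may separate
    them (distance).
    """
    pattern = f"{re.escape(first)}.{{{min_gap},{max_gap}}}{re.escape(second)}"
    return re.search(pattern, text, flags=re.DOTALL) is not None


question = "please summarize the theme"
near = ordered_within(question, "summarize", "theme", 0, 5)    # gap " the " is 5 characters
strict = ordered_within(question, "summarize", "theme", 0, 0)  # terms are not adjacent
```

With a {0,5} interval the rule matches because exactly five characters separate the two terms, while {0,0} requires them to be adjacent and therefore fails on this question.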
For example, consider the following extraction expression:
c_news media{0,1}@c_name@
Here "c_news media" and "c_name" are text concepts and "{0,1}" is the distance operator. The two "@" characters are the front boundary marker and the rear boundary marker, and whatever the part between the two "@" characters matches in the reading text is the target content that the extraction expression is meant to extract. The full meaning of this extraction expression is therefore: match a "news media" concept and a "name" concept that are within 0 to 1 characters of each other, and extract the "name" concept.
It should be added that the forms of text concepts, keywords and operators shown in the examples above are only one optional way of composing classification expressions and extraction expressions, not the only way. On the basis of what the embodiments of the present application disclose, those skilled in the art can design other forms of text concepts, keywords and operators as needed and build other classification expressions and extraction expressions from them. Such designs and ideas do not depart from the protection scope of the embodiments of the present application.
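As a rough illustration of how an extraction expression of the form c_A{x,y}@c_B@ could be evaluated, the sketch below translates it into a Python regular expression whose capture group plays the role of the two "@" boundary markers. The helper `build_extraction_regex` and the concept values passed to it are hypothetical; the application does not specify how expressions are compiled.

```python
import re
from typing import List, Pattern


def build_extraction_regex(left_values: List[str], min_gap: int,
                           max_gap: int, target_pattern: str) -> Pattern[str]:
    """Build a regex for c_left{min_gap,max_gap}@c_target@.

    left_values    : literal concept values of the left-hand concept
    target_pattern : regex describing the concept between the '@' markers
    The capture group stands in for the '@...@' boundaries.
    """
    left = "|".join(re.escape(v) for v in left_values)
    return re.compile(f"(?:{left}).{{{min_gap},{max_gap}}}({target_pattern})")


# Hypothetical concept values: one news outlet, and names written as
# "reporter <surname> XX", mirroring the example texts in this document.
regex = build_extraction_regex(["Xinhua News Agency"], 0, 1, r"reporter \w+ XX")
match = regex.search("Photo by Xinhua News Agency reporter Xing XX")
target_content = match.group(1) if match else None
```

Only the text captured between the boundary markers becomes the target content; the left-hand concept and the gap are used for locating it but are not returned.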
In one embodiment, to make the concept values of text concepts easier to maintain, the embodiment of the present application also builds a concept tree. The concept tree contains multiple text concepts, each text concept contains multiple concept nodes, and each concept node corresponds to one concept value.
Specifically, the concept tree may include a question concept node and an answer concept node, each of which contains multiple classification nodes. Each classification node corresponds to one classification of questions and contains multiple child nodes. Each child node corresponds to one text concept and contains a concept value list, which records the concept name and all concept values of the text concept.
For example, the concept tree can take the following form:
Question concept
    Article theme (classification node)
        Article (child node)
        Probably
        Theme
        What
        Introduce
        Summarize
    Who the title is
    Article time
Answer concept
As an optional implementation, a concept value may contain syntax such as "()", "|" and "?" to extend the forms and range that the concept value can express. Those skilled in the art can design the syntax of concept values after the syntax of regular expressions. For example, "|" denotes alternation, "()" delimits the scope of a syntax rule, and "?" means the preceding character occurs at most once.
Thus, when a classification expression or an extraction expression contains a text concept, the concept values recorded for that text concept in the concept tree can be used for matching, which extends the matching range of the text concept.
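The expansion of a text concept into its concept values can be sketched as below. The dictionary `CONCEPT_VALUES` is a hypothetical flat stand-in for the concept tree; the values "article", "news", "text" and "report" come from the earlier example, while writing the last one as "reports?" is an assumption added here to show the regular-expression-style "?" syntax.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical concept value lists, keyed by concept name.
CONCEPT_VALUES: Dict[str, List[str]] = {
    "article": ["article", "news", "text", "reports?"],
}


def concept_regex(name: str) -> Pattern[str]:
    """Compile all concept values of one text concept into a single regex.

    Each value is wrapped in a non-capturing group so that '|' inside one
    value cannot leak into its neighbours.
    """
    values = CONCEPT_VALUES[name]
    return re.compile("|".join(f"(?:{v})" for v in values))


article = concept_regex("article")
hit = article.search("two reports today")
matched_value = hit.group(0) if hit else None
```

Because the values are kept in one place, adding a new way of expressing a concept changes only the value list, and every expression that refers to the concept picks it up.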
Fig. 2 is a flow chart of a preprocessing method provided by an embodiment of the present application.
In one possible implementation, the embodiment of the present application provides a method for preprocessing the question and the reading text. The preprocessing method is applied before the target classification of the question is obtained according to the classification expressions contained in the question tree. As shown in Fig. 2, it may include the following steps:
Step S201: remove the space characters in the question.
A space character also occupies one character position, so it affects how classification expressions match. For example, if two text concepts in a question are separated by two space characters, the distance between them is at least two character positions. If the distance operator between those two text concepts in the classification expression is {0,1}, the space characters prevent the question from matching the classification expression. Removing the space characters from the question therefore improves the accuracy with which classification expressions match questions.
In production environment, some reading texts are obtained from network, therefore, these read texts in starting position or
End position may include some specific contents, such as: the text of news report finally generally comprises " comment " " message "
Equal word contents, when being only loaded with the part body content of news report in webpage, text is finally also possible that " load is more
It is more " etc. word contents, the answer of problem is not included in these word contents, the matching process for extracting expression formula may but be caused
Interference, therefore, the embodiment of the present application is in step S202, to the starting position or the end position above-mentioned text that includes for reading text
Word content is removed.
Illustratively, for reading the starting position of text, the following contents can be removed: any blank character (including space
Symbol, tab, form feed character etc.), it is the carriage return character, newline, specific format content (such as the format of " editor: SN+ number "), specific
Word content (such as " load is more " " news load is more " " comment load is more " " obtaining authorization " etc.).For reading text
End position, the following contents can be removed: specific character content (such as " video load in, please later " " automatic to play "
" play "), any blank character (including space, tab, form feed character etc.) etc..
Step S203: find the whitespace characters that occur consecutively in the reading text, and replace each run of consecutive whitespace characters with a single space character.
The whitespace characters that occur consecutively in the reading text may include space characters, tabs, form feeds, etc. Consecutive occurrences of these characters can interfere with the matching of extraction expressions against the reading text. Therefore, in step S203 the embodiment of the present application replaces the consecutive whitespace characters with a single space character to reduce the interference.
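The three preprocessing steps S201 to S203 can be sketched in Python as follows. This is a minimal illustration only: the function names and the two noise lists are hypothetical, the noise phrases are English stand-ins for the examples given above, and the sketch assumes every noise phrase is a non-empty string.

```python
import re

# Hypothetical noise lists; the text above only gives examples of such content.
LEADING_NOISE = ["Load more", "Obtain authorization"]
TRAILING_NOISE = ["Video loading, please wait", "Autoplay", "Play"]


def clean_question(question: str) -> str:
    """Step S201: remove every whitespace character from the question."""
    return re.sub(r"\s+", "", question)


def strip_edges(text: str, leading=LEADING_NOISE, trailing=TRAILING_NOISE) -> str:
    """Step S202: strip whitespace and known noise phrases from both ends."""
    changed = True
    while changed:  # repeat until nothing more can be removed
        changed = False
        stripped = text.strip()
        if stripped != text:
            text, changed = stripped, True
        for phrase in leading:
            if text.startswith(phrase):
                text, changed = text[len(phrase):], True
        for phrase in trailing:
            if text.endswith(phrase):
                text, changed = text[: len(text) - len(phrase)], True
    return text


def collapse_whitespace(text: str) -> str:
    """Step S203: replace each run of two or more whitespace characters
    with a single space character."""
    return re.sub(r"\s{2,}", " ", text)
```

For a question such as "文章 记者是谁?", `clean_question` removes the embedded space so that a distance operator such as {0,1} is not disturbed by it.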
In some implementable embodiments, the present application also provides post-processing rules for the target content according to the different classifications of the question.
In one embodiment, for the "author" classification, the post-processing rule may include: removing the character strings contained in the target content that are noise; and removing the space characters located before and after the target content, to obtain the answer.
Illustratively, the reading text is:
On April 5, 2016, the Ministry of Transport held a ceremony at Zhubi Reef in the South China Sea to put the Zhubi Lighthouse into service. Putting the lighthouse into use will effectively improve navigation aid, navigation scheduling and emergency rescue capabilities in the surrounding waters. Photo by Xinhua News Agency reporter Xing XX
The question is: Who is the reporter of the article?
The question can then be matched to the "question classification -- author" node of the question tree. Using the expression "c_news media {0,1} @c_name@" of the "article author" node, the extraction tree can extract "reporter Xing XX" from the reading text. Because "reporter" is not a name, it is noise; during post-processing the present application removes "reporter", and also removes any space characters that may appear before or after "Xing XX", to obtain the answer to the question.
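The "author" post-processing rule can be illustrated with a short Python sketch. The noise list and the function name are hypothetical; the text only names "reporter" as an example of noise.

```python
# Hypothetical list of noise strings for the "author" classification.
AUTHOR_NOISE = ["reporter", "correspondent"]


def postprocess_author(target: str, noise_words=AUTHOR_NOISE) -> str:
    """Remove noise strings from the target content, then strip the
    whitespace left before and after it."""
    for word in noise_words:
        target = target.replace(word, "")
    return target.strip()
```

Applied to the extracted content "reporter Xing XX", the sketch drops "reporter" and the leading space, leaving only the name.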
Fig. 3 is a flow chart of a post-processing rule provided by an embodiment of the present application.
In one embodiment, for the "article source" classification, when multiple target contents are extracted from the reading text, the post-processing rule shown in Fig. 3 can be used to select one target content as the answer to the question. Specifically, the post-processing rule shown in Fig. 3 includes the following steps:
Step S301: when the extraction expression extracts multiple target contents, take the target content that lies within a preset range of the end of the reading text and is nearest to the end of the reading text as the answer to the question.
Step S302: if no target content lies within the preset range of the end of the reading text, take the target content that lies within a preset range of the beginning of the reading text and is nearest to the beginning of the reading text as the answer to the question.
Thus, for the "article source" classification, when multiple target contents are extracted from the reading text, the present application provides at least two priorities for determining the answer during post-processing. The highest priority, step S301, first focuses the selection range on the end of the reading text and obtains the answer from there. The second priority, step S302, focuses the selection range on the beginning of the reading text and obtains the answer from there.
Illustratively, suppose the preset range is 30 characters. For the news item shown below, the target content that the extraction expression can extract is the part shown in bold, and the range from which the answer is obtained in step S301 is the underlined part:
The U.S. "Space Daily" reported on June 29, 2009 ... that the UK space sector had previously lobbied Members of Parliament, pointing out that the UK's reliance on U.S. imaging satellites is a matter that bears on national capability. ... The report also suggests that cyberspace will become increasingly important as a field of national security and will continue to grow in importance in almost all forms of human activity. China Engineering Technology Information Network
Within this range, the extraction expression extracts only the single result "China Engineering Technology Information Network"; therefore, "China Engineering Technology Information Network" is the answer to the "article source" question.
Illustratively, suppose the preset range is 30 characters. For the news item shown below, the target content that the extraction expression can extract is the part shown in bold, the range from which the answer is obtained in step S301 is the underlined part at the end of the text, and the range from which the answer is obtained in step S302 is the underlined part at the beginning of the text:
Reference News Network, December 18 report: Western media say that China describes its construction on disputed islands in the South China Sea as normal activity. ... According to a December 17 report on the website of the "Philippine Daily Inquirer", the Asia Maritime Transparency Initiative website says that China is building reefs in the Nansha Islands and Xisha Islands into islands while deploying military facilities and equipment. In response, Chinese Foreign Ministry spokesperson Lu Kang said at a regular press conference on the 15th: "It is entirely normal for China to carry out peaceful construction activities and deploy necessary defense facilities on its own territory; this is a matter within the scope of China's sovereignty.
Because the extraction expression extracts no target content within the range at the end of the news item, the answer is obtained from the beginning of the news item according to the second priority, where "Reference News Network" is extracted. Therefore, "Reference News Network" is the answer to the "article source" question.
Thus, when the extraction expression extracts multiple target contents from the reading text for a question of the "article source" classification, one target content is uniquely determined as the answer to the question by selecting according to the set priorities.
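The two-priority selection of steps S301 and S302 can be sketched as follows. The span representation, the function name and the reading of "within the preset range" (the whole span lies inside the window) are assumptions made for illustration; the text does not say what happens when neither window contains a target, so the sketch returns `None` in that case.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) offsets in the reading text, end exclusive


def select_source(text_len: int, spans: List[Span], window: int = 30) -> Optional[Span]:
    """Pick one extracted span for the "article source" classification."""
    # Priority 1 (S301): spans inside the last `window` characters,
    # preferring the one that ends closest to the end of the text.
    tail = [s for s in spans if s[0] >= text_len - window]
    if tail:
        return max(tail, key=lambda s: s[1])
    # Priority 2 (S302): spans inside the first `window` characters,
    # preferring the one that starts closest to the beginning.
    head = [s for s in spans if s[1] <= window]
    if head:
        return min(head, key=lambda s: s[0])
    return None  # assumption: no rule is given for this case
```

With a 200-character text and spans at (0, 10), (100, 110) and (185, 200), the tail span is chosen; if the tail span is absent, the selection falls back to the span at the beginning.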
Fig. 4 is a flow chart of a post-processing rule provided by an embodiment of the present application.
In one embodiment, for the "article time" classification, when multiple target contents are extracted from the reading text, the post-processing rule shown in Fig. 4 can be used to select one target content as the answer to the question. Specifically, the post-processing rule shown in Fig. 4 includes the following steps:
Step S401: when the extraction expression extracts multiple target contents, obtain the end position of each target content in the reading text, where the end position is the position of the last character of the target content in the reading text.
Step S402: for each target content, calculate the difference obtained by subtracting its end position from the character length of the reading text.
Step S403: take the target content corresponding to the smallest difference as the answer to the question.
Illustratively, for a news item:
Xinhua News Agency, February 12 report: XXXXXXXXXXXXXXX, XXXXXXXXXXXXX. (Xinhua News Agency, February 13)
Its character length is 54 characters (counted on the original Chinese text), and the underlined content, that is, the two dates, is the target content extracted by the extraction expression. The start position and end position of each target content in the reading text, and the difference obtained by subtracting the end position from the character length, are listed in the following table (the position of the first character of the reading text is taken as 0):
Target content (time) | Start position | End position | Character length - end position |
---|---|---|---|
February 12 | 3 | 7 | 47 |
February 13 | 47 | 51 | 3 |
The smallest difference between the character length and the end position is therefore 3, and the corresponding time is "February 13", so "February 13" is the answer to the "article time" question.
Thus, when the extraction expression extracts multiple target contents from the reading text for a question of the "article time" classification, one target content is uniquely determined as the answer to the question according to the positions of the target contents in the reading text.
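Steps S401 to S403 amount to choosing the target content whose last character is closest to the end of the reading text. A minimal Python sketch follows; the tuple layout and the function name are hypothetical, and the positions in the usage note are the ones from the table above.

```python
from typing import List, Tuple

# (content, start position, end position); the end position is the index
# of the last character of the target content, counted from 0.
Target = Tuple[str, int, int]


def select_time(text_len: int, targets: List[Target]) -> str:
    """Return the target content with the smallest value of
    (character length of the text - end position)."""
    best = min(targets, key=lambda t: text_len - t[2])
    return best[0]
```

Calling `select_time(54, [("February 12", 3, 7), ("February 13", 47, 51)])` compares the differences 47 and 3 and selects "February 13".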
In one embodiment, if the target extraction node is "who holds the title", then in step S102, when the extraction expression contained in the target extraction node is used to extract the target content from the reading text, if a text concept in the extraction expression has multiple concept values and the concept value with the most characters contains the other concept values, the concept value with the most characters is used to extract the target content.
Illustratively, for a news item:
China Net, December 29: Foreign Ministry spokesperson XXX answered reporters' questions on China's participation in the search and rescue for the missing AirAsia passenger plane ... ...
The extraction expressions contained in the extraction node "who holds the title" of the extraction tree include:
(c_title)[^,.?!...]{0,4}(c_name)
Then, if the text concept "c_title" has multiple concept values in the concept tree, for example "spokesperson" and "Foreign Ministry spokesperson", where "Foreign Ministry spokesperson" contains "spokesperson", the expression uses "Foreign Ministry spokesperson" to match the reading text and extracts the matched target content. In the news item above, the extracted target content is therefore "Foreign Ministry spokesperson XXX" rather than "spokesperson XXX". Extracting more characters as the target content in this way makes the final answer more complete and accurate.
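The preference for the concept value with the most characters can be sketched in Python. This is an illustration under stated assumptions: an ordinary regular expression stands in for the extraction expression syntax, the concept values are passed in as plain lists rather than read from a concept tree, and the function names are hypothetical.

```python
import re
from typing import List, Optional


def effective_values(values: List[str]) -> List[str]:
    """If the longest concept value contains all the others, keep only it;
    otherwise keep every value, longest first."""
    longest = max(values, key=len)
    if all(v in longest for v in values):
        return [longest]
    return sorted(values, key=len, reverse=True)


def extract_title_holder(text: str, titles: List[str], names: List[str]) -> Optional[str]:
    """Match a title, at most four non-punctuation characters, then a name."""
    title_alt = "|".join(re.escape(v) for v in effective_values(titles))
    name_alt = "|".join(re.escape(v) for v in sorted(names, key=len, reverse=True))
    pattern = re.compile(rf"({title_alt})[^,.?!]{{0,4}}({name_alt})")
    match = pattern.search(text)
    return match.group(0) if match else None
```

With the concept values "spokesperson" and "Foreign Ministry spokesperson", only the longer value is used for matching, so the whole title is extracted together with the name.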
In addition, as an optional embodiment, if the target classification of the question is not obtained according to the classification expressions, or no target content is extracted according to the extraction expressions, a pre-trained machine learning model is used to obtain the answer to the question from the reading text. The technical solution of the embodiment of the present application and the reading comprehension method based on a machine learning model thus serve as alternatives to each other, so that an answer can be extracted from the reading text according to the question in any case.
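The fallback described above can be sketched as a thin wrapper. Both callables are hypothetical placeholders: `rule_based` stands for the expression-based method and is assumed to return `None` when classification or extraction fails, and `ml_model` stands for any pre-trained reading comprehension model.

```python
from typing import Callable, Optional

Answerer = Callable[[str, str], Optional[str]]  # (question, text) -> answer


def answer_with_fallback(question: str, text: str,
                         rule_based: Answerer, ml_model: Answerer) -> Optional[str]:
    """Try the expression-based method first; fall back to the model
    only when it yields no answer."""
    answer = rule_based(question, text)
    if answer is None:
        answer = ml_model(question, text)
    return answer
```

Keeping the two methods behind the same call signature lets either one be replaced without touching the caller.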
The following is the device embodiment of the present application, which provides a content extraction device. The device can be applied to a variety of equipment, such as servers, personal computers (PCs), tablet computers, mobile phones, smart TVs, smart speakers, virtual reality devices and smart wearable devices. For details not disclosed in the device embodiment of the present application, please refer to the method embodiments of the present application.
Fig. 5 is a structural schematic diagram of a content extraction device provided by an embodiment of the present application. As shown in Fig. 5, the device includes:
a question matching module 501, configured to obtain the target classification of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node corresponds to one classification of questions, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions;
a content extraction module 502, configured to obtain the target extraction node corresponding to the target classification in an extraction tree, and to extract target content from the reading text using the extraction expressions contained in the target extraction node, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one classification of questions, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions;
a post-processing module 503, configured to post-process the target content according to the post-processing rule corresponding to the target classification, to obtain the answer to the question.
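The cooperation of the three modules can be sketched in Python. Several simplifications are assumed here and are not taken from the text: the question tree and the extraction tree are flattened into lists of nodes, ordinary regular expressions stand in for classification and extraction expressions, and the class and method names are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """A classification or extraction node: one category and its expression list."""
    category: str
    expressions: List[str] = field(default_factory=list)


class ContentExtractor:
    def __init__(self, question_tree: List[Node], extraction_tree: List[Node],
                 post_rules: Dict[str, Callable[[List[str]], str]]):
        self.question_tree = question_tree
        self.extraction = {n.category: n for n in extraction_tree}
        self.post_rules = post_rules

    def classify(self, question: str) -> Optional[str]:
        """Role of module 501: first category whose expression matches."""
        for node in self.question_tree:
            if any(re.search(e, question) for e in node.expressions):
                return node.category
        return None

    def extract(self, category: str, text: str) -> List[str]:
        """Role of module 502: collect every match of the node's expressions."""
        node = self.extraction.get(category)
        if node is None:
            return []
        found: List[str] = []
        for e in node.expressions:
            found.extend(m.group(0) for m in re.finditer(e, text))
        return found

    def answer(self, question: str, text: str) -> Optional[str]:
        """Role of module 503: apply the category's post-processing rule."""
        category = self.classify(question)
        if category is None:
            return None
        targets = self.extract(category, text)
        if not targets:
            return None
        rule = self.post_rules.get(category, lambda ts: ts[0])
        return rule(targets)
```

Because the nodes and rules are passed in as data, the same class can be reused across reading texts once the question categories are fixed, which is the universality the embodiment describes.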
As can be seen from the above technical solution, the embodiment of the present application provides a content extraction device configured to: obtain the target classification of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions; obtain the target extraction node corresponding to the target classification in an extraction tree, and extract target content from the reading text using the extraction expressions contained in the target extraction node, wherein the extraction tree contains extraction nodes, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions; and post-process the target content according to the post-processing rule corresponding to the target classification, to obtain the answer to the question. Thus, when the device provided by the embodiment of the present application is applied to machine reading comprehension, it is only necessary to construct the question tree and the extraction tree according to the classifications of questions. Once the classifications of questions are determined, the question tree and the extraction tree are also relatively fixed and can be used to extract answers to questions from different reading texts. The device therefore has universality and can improve the accuracy of machine reading comprehension.
Those skilled in the art will readily conceive of other embodiments of the present application after considering the specification and practicing the application disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the art not disclosed by the present application. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present application are indicated by the following claims.
It should be understood that the present application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.
Claims (10)
1. A content extraction method, characterized by comprising:
obtaining the target classification of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node corresponds to one classification of questions, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions;
obtaining the target extraction node corresponding to the target classification in an extraction tree, and extracting target content from a reading text using the extraction expressions contained in the target extraction node, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one classification of questions, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions;
post-processing the target content according to the post-processing rule corresponding to the target classification, to obtain the answer to the question.
2. The method according to claim 1, characterized in that the classification expressions and the extraction expressions are composed of text concepts, keywords and operators, wherein each text concept contains at least one concept value, each concept value is one way of expressing the text concept, and the operators are used to combine the text concepts and the keywords into the matching rule of an expression.
3. The method according to claim 2, characterized by further comprising a concept tree, wherein the concept tree contains multiple text concepts, each text concept contains multiple concept nodes, and each concept node corresponds to one concept value.
4. The method according to claim 1, characterized in that, before obtaining the target classification of the question according to the classification expressions contained in the question tree, the method further comprises:
removing the space characters in the question;
removing the specific content contained at the start position or end position of the reading text;
finding the whitespace characters that occur consecutively in the reading text, and replacing the consecutive whitespace characters with a single space character.
5. The method according to claim 1, characterized in that the post-processing rule comprises:
removing the character strings contained in the target content that are noise;
removing the space characters located before and after the target content, to obtain the answer.
6. The method according to claim 1, characterized in that the post-processing rule comprises:
when the extraction expression extracts multiple target contents, taking the target content that lies within a preset range of the end of the reading text and is nearest to the end of the reading text as the answer to the question;
if no target content lies within the preset range of the end of the reading text, taking the target content that lies within a preset range of the beginning of the reading text and is nearest to the beginning of the reading text as the answer to the question.
7. The method according to claim 1, characterized in that the post-processing rule comprises:
when the extraction expression extracts multiple target contents, obtaining the end position of each target content in the reading text, the end position being the position of the last character of the target content in the reading text;
calculating, for each target content, the difference obtained by subtracting its end position from the character length of the reading text;
taking the target content corresponding to the smallest difference as the answer to the question.
8. The method according to claim 2, characterized in that extracting target content from the reading text using the extraction expressions contained in the target extraction node comprises:
if a text concept contained in the extraction expression has multiple concept values, and the concept value with the most characters contains the other concept values, extracting the target content using the concept value with the most characters.
9. The method according to any one of claims 1 to 8, characterized by further comprising:
if the target classification of the question is not obtained according to the classification expressions, or the target content is not extracted according to the extraction expressions, obtaining the answer to the question from the reading text using a pre-trained machine learning model.
10. A content extraction device, characterized by comprising:
a question matching module, configured to obtain the target classification of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node corresponds to one classification of questions, each classification node contains a classification expression list, and the classification expression list contains multiple classification expressions;
a content extraction module, configured to obtain the target extraction node corresponding to the target classification in an extraction tree, and to extract target content from a reading text using the extraction expressions contained in the target extraction node, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one classification of questions, each extraction node contains an extraction expression list, and the extraction expression list contains multiple extraction expressions;
a post-processing module, configured to post-process the target content according to the post-processing rule corresponding to the target classification, to obtain the answer to the question.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910155040.6A CN109918490B (en) | 2019-03-01 | 2019-03-01 | Content extraction method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910155040.6A CN109918490B (en) | 2019-03-01 | 2019-03-01 | Content extraction method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109918490A true CN109918490A (en) | 2019-06-21 |
CN109918490B CN109918490B (en) | 2022-12-16 |
Family
ID=66962894
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910155040.6A Active CN109918490B (en) | 2019-03-01 | 2019-03-01 | Content extraction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109918490B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110413636A (en) * | 2019-08-01 | 2019-11-05 | 北京香侬慧语科技有限责任公司 | A kind of data processing method and device |
CN110457597A (en) * | 2019-08-08 | 2019-11-15 | 中科鼎富(北京)科技发展有限公司 | A kind of advertisement recognition method and device |
CN111008523A (en) * | 2019-11-21 | 2020-04-14 | 中科鼎富(北京)科技发展有限公司 | Information extraction method and device and server |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6115683A (en) * | 1997-03-31 | 2000-09-05 | Educational Testing Service | Automatic essay scoring system using content-based techniques |
CN107608949A (en) * | 2017-10-16 | 2018-01-19 | 北京神州泰岳软件股份有限公司 | A kind of Text Information Extraction method and device based on semantic model |
Non-Patent Citations (1)
Title |
---|
王素格等: "面向高考阅读理解观点类问题的答案抽取方法", 《郑州大学学报(理学版)》 * |
Also Published As
Publication number | Publication date |
---|---|
CN109918490B (en) | 2022-12-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||