CN109918490B - Content extraction method and device - Google Patents

Info

Publication number
CN109918490B
Authority
CN
China
Prior art keywords
extraction
classification
target
text
concept
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910155040.6A
Other languages
Chinese (zh)
Other versions
CN109918490A (en)
Inventor
任宁
晋耀红
李德彦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Taiyue Xiangsheng Software Co ltd
Original Assignee
Anhui Taiyue Xiangsheng Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Taiyue Xiangsheng Software Co ltd
Priority to CN201910155040.6A
Publication of CN109918490A
Application granted
Publication of CN109918490B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present application provides a content extraction method and device. The method comprises: obtaining a target classification of a question according to the classification expressions contained in a question tree; acquiring the target extraction node corresponding to the target classification in an extraction tree, and extracting target content from a reading text by using the extraction expressions contained in the target extraction node; and post-processing the target content according to the post-processing rule corresponding to the target classification to obtain the answer to the question. When the method is applied to machine reading comprehension, the question tree and the extraction tree only need to be constructed according to the categories of questions. Once the categories are determined, the two trees are essentially fixed and can be used to extract answers from different reading texts, so the method is general and can improve the accuracy of machine reading comprehension.

Description

Content extraction method and device
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a content extraction method and apparatus.
Background
Machine reading comprehension is a technical topic that has emerged with the development of deep learning. Its research goal is to have a machine read a text as a human would and answer questions based on its understanding of the text; specifically, given a corpus and a question, the machine finds the correct answer to the question in the corpus.
Machine reading comprehension is currently usually implemented with deep learning models: a model is trained on a manually constructed data set annotated with questions and answers, so that it can extract answers to simple questions from simple texts. However, owing to the limitations of the algorithms and of data set size, the accuracy of such methods is currently not high. For example, in some open-context practical applications, a deep learning model achieves only about 60% accuracy when extracting the answer to a given question from an article, which falls far short of what a production environment requires. Machine reading comprehension therefore has considerable room for improvement in accuracy.
Disclosure of Invention
The embodiment of the application provides a content extraction method and device, and aims to solve the problem that the accuracy of extracting answers to questions from articles by a machine reading understanding method in the prior art is low.
In a first aspect, an embodiment of the present application provides a content extraction method, including:
obtaining a target classification of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node corresponds to one classification of questions and contains a classification expression list, and each classification expression list contains a plurality of classification expressions;
acquiring the target extraction node corresponding to the target classification in an extraction tree, and extracting target content from a reading text by using the extraction expressions contained in the target extraction node, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one classification of questions and contains an extraction expression list, and each extraction expression list contains a plurality of extraction expressions;
and post-processing the target content according to the post-processing rule corresponding to the target classification to obtain the answer to the question.
In a second aspect, an embodiment of the present application provides a content extraction apparatus, including:
a question matching module, configured to obtain a target classification of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node corresponds to one classification of questions and contains a classification expression list, and each classification expression list contains a plurality of classification expressions;
a content extraction module, configured to acquire the target extraction node corresponding to the target classification in an extraction tree and to extract target content from a reading text by using the extraction expressions contained in the target extraction node, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one classification of questions and contains an extraction expression list, and each extraction expression list contains a plurality of extraction expressions;
and a post-processing module, configured to post-process the target content according to the post-processing rule corresponding to the target classification to obtain the answer to the question.
As can be seen from the foregoing technical solutions, an embodiment of the present application provides a content extraction method and device, including: obtaining a target classification of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node contains a classification expression list, and each classification expression list contains a plurality of classification expressions; acquiring the target extraction node corresponding to the target classification in an extraction tree, and extracting target content from a reading text by using the extraction expressions contained in the target extraction node, wherein the extraction tree contains extraction nodes, each extraction node contains an extraction expression list, and each extraction expression list contains a plurality of extraction expressions; and post-processing the target content according to the post-processing rule corresponding to the target classification to obtain the answer to the question. When the technical solution is applied to machine reading comprehension, the question tree and the extraction tree only need to be constructed according to the categories of questions. Once the categories are determined, the two trees are essentially fixed and can be used to extract answers from different reading texts, so the solution is general and can improve the accuracy of machine reading comprehension.
Drawings
In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without any creative effort.
Fig. 1 is a flowchart of a content extraction method according to an embodiment of the present application;
Fig. 2 is a flowchart of a preprocessing method according to an embodiment of the present application;
Fig. 3 is a flowchart of a post-processing rule according to an embodiment of the present application;
Fig. 4 is a flowchart of a post-processing rule according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a content extraction device according to an embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Machine reading comprehension is a technical topic that has emerged with the development of deep learning. Its research goal is to have a machine read a text as a human would and answer questions based on its understanding of the text; specifically, given a corpus and a question, the machine finds the correct answer to the question in the corpus.
For example, suppose the reading text for machine reading comprehension is the following article:
XXX released a message on the 13th: XXX … XXX. (Reporter Yue XX)
and the question posed is: who is the author of this article?
The purpose of machine reading comprehension is to find the author of the article in the reading text, so the correct answer to this question is: Yue XX.
Machine reading comprehension is currently usually implemented with deep learning models, that is, a model is trained on a manually constructed data set annotated with questions and answers. The limitations of the algorithms and of data set size therefore create an accuracy bottleneck for such methods. In some open-context practical applications in particular, the available data set is far smaller than what training a deep learning model requires, so the accuracy of machine reading comprehension is low.
The embodiment of the application provides a content extraction method and device, and aims to solve the problem that the accuracy of extracting answers to questions from articles by a machine reading understanding method in the prior art is low.
The following is a method embodiment of the present application, which provides a content extraction method, and the method may be applied to a server, a PC (personal computer), a tablet computer, a mobile phone, a smart television, a smart speaker, a virtual reality device, an intelligent wearable device, and other devices.
Fig. 1 is a flowchart of a content extraction method according to an embodiment of the present application. As shown in fig. 1, the content extraction method includes the steps of:
Step S101, obtaining a target classification of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node corresponds to one classification of questions and contains a classification expression list, and each classification expression list contains a plurality of classification expressions.
Specifically, given a specified question, the embodiment of the present application extracts the corresponding content from the reading text as the answer to the question. The reading text may be an article, for example a press release, a self-media article, a popular science article, a novel, prose, a topical article, or an article in a professional field. Accordingly, questions may fall into multiple categories depending on what reading comprehension mainly targets, such as: article topic, article author, article source, article time, who holds a title, and so on.
Based on the above classification of questions, the embodiment of the present application may construct a question tree. The question tree comprises at least one parent node and a plurality of child nodes one level below the parent node. The child nodes serve as classification nodes, each corresponding to one classification of questions. Each classification node contains at least one classification expression list, and each row of the list pairs a classification name with its corresponding classification expression.
Illustratively, the question tree may be of the form:
Question classification (parent node)
    Article author (child node)
        [classification expression list; image GDA0003932123880000031 in the original]
    Article topic
        [classification expression list; images GDA0003932123880000032 and GDA0003932123880000041 in the original]
    Article source
    Who holds the title
    Article time
Thus, by matching the content of the question against the classification expressions in the question tree, it can be determined which classification expression the question matches, and the target classification of the question can then be determined from the classification node where that expression is located.
For example, if the question is "Please summarize the topic of the article", the classification expression in the question tree that it matches is [c_summarization + {0,0} c_topic]. Since this expression is located at the child node "article topic", the target classification of the question is: article topic.
In addition, as one possible implementation, each row of the classification expression list may further include a check box, which the user can check or uncheck to select a classification expression and then modify or delete it.
In addition, as one possible implementation, a weight may be set for each classification expression, for example a natural number. When a question matches two or more classification expressions at the same time, the target classification of the question is determined from the classification node containing the expression with the highest weight (the largest value).
In addition, as one possible implementation, a recognition state may be set for each classification expression, for example "identification" or "exclusion". When the recognition state is identification, the classification expression performs forward matching on the question: if the question matches the expression, the target classification of the question can be determined from the classification node where the expression is located. When the recognition state is exclusion, the classification expression performs reverse matching on the question: if the question matches the expression, the classification corresponding to the expression is ruled out as the target classification of the question.
In addition, as one possible implementation, an enable state may be set for each classification expression, for example "valid" or "invalid". When the enable state is valid, the classification expression takes part in matching the question; when it is invalid, the expression does not take part in matching.
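As a non-limiting illustration, the matching in step S101 together with the optional weight, recognition state, and enable state might be sketched as follows. This is a minimal sketch under stated assumptions: ordinary regular expressions stand in for classification expressions, and the names `ClassificationExpression`, `QUESTION_TREE`, and `classify` are hypothetical and do not come from the application.

```python
import re
from dataclasses import dataclass


@dataclass
class ClassificationExpression:
    pattern: str             # assumption: a regex stands in for a classification expression
    weight: int = 1          # the highest weight wins when several expressions match
    state: str = "identify"  # "identify" (forward match) or "exclude" (reverse match)
    enabled: bool = True     # invalid (disabled) expressions take no part in matching


# Hypothetical question tree: classification name -> classification expression list.
QUESTION_TREE = {
    "article author": [
        ClassificationExpression(r"author|reporter", weight=2),
        ClassificationExpression(r"who wrote", weight=1),
    ],
    "article topic": [
        ClassificationExpression(r"(summarize|summary).*(topic|theme)", weight=2),
        ClassificationExpression(r"topic", weight=1),
        ClassificationExpression(r"author", state="exclude"),
    ],
}


def classify(question, tree=QUESTION_TREE):
    """Return the target classification of a question, or None if nothing matches."""
    best_name, best_weight = None, -1
    for name, expressions in tree.items():
        active = [e for e in expressions if e.enabled]
        # A matching "exclude" expression rules this classification out.
        if any(e.state == "exclude" and re.search(e.pattern, question) for e in active):
            continue
        for e in active:
            if (e.state == "identify" and e.weight > best_weight
                    and re.search(e.pattern, question)):
                best_name, best_weight = name, e.weight
    return best_name
```

For instance, `classify("Please summarize the topic of the article")` selects the "article topic" node in this sketch, while a question containing "author" is kept away from that node by its exclusion expression.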
Step S102, acquiring the target extraction node corresponding to the target classification in an extraction tree, and extracting target content from a reading text by using the extraction expressions contained in the target extraction node, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one classification of questions and contains an extraction expression list, and each extraction expression list contains a plurality of extraction expressions.
Corresponding to the question tree, the embodiment of the present application may construct an extraction tree. The extraction tree comprises at least one parent node and a plurality of child nodes one level below the parent node. The child nodes serve as extraction nodes and correspond one-to-one with the classification nodes of the question tree, so each extraction node also corresponds to one classification of questions. Each extraction node contains at least one extraction expression list, and each row of the list pairs an extraction name (corresponding to the classification name in the classification expression list) with its corresponding extraction expression.
Illustratively, the extraction tree may be of the form:
Answer extraction (parent node)
    Article topic (child node)
    Article author
        [extraction expression list; image GDA0003932123880000051 in the original]
    Article source
    Who holds the title
    Article time
Thus, after the target classification of the question is determined in step S101, the target extraction node corresponding to the target classification in the extraction tree is obtained, and the target content is extracted from the reading text using the extraction expression included in the target extraction node.
For example, suppose the question is: who is the author of the article? The classification expression that the question matches in the question tree is [c_article + {0,0} c_reporter + {0,0} c_is], from which the target classification can be determined to be: article author. In step S102, the reading text is therefore matched using the extraction expressions contained in the "article author" node of the extraction tree. For the reading text shown above, for example, "k_reporter {0,1} @c_name@" can match "reporter Yue XX" in the reading text.
In addition, as one possible implementation, each row of the extraction expression list may further include a check box, which the user can check or uncheck to select an extraction expression and then modify or delete it.
In addition, as one possible implementation, a weight may be set for each extraction expression, for example a natural number. When several extraction expressions match different content in the reading text at the same time, the content matched by the expression with the highest weight (the largest value) may be used as the target content.
In addition, as one possible implementation, an enable state may be set for each extraction expression, for example "valid" or "invalid". When the enable state is valid, the extraction expression takes part in matching the reading text; when it is invalid, the expression does not take part in matching.
In addition, as one possible implementation, an extraction range may be set for each extraction expression, for example "match within clauses only" or "match across clauses". Commas, semicolons, and periods in the reading text may be taken as clause boundaries, with the content between two boundaries forming one clause. When the extraction range is "match within clauses only", the extraction expression matches target content only inside each individual clause of the reading text and never across clauses; when the extraction range is "match across clauses", the extraction expression may match content that spans clauses.
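As a non-limiting illustration, the weight, enable state, and extraction range described above might interact as in the following sketch. It assumes that a regular expression with one capture group stands in for an extraction expression; the names `ExtractionExpression`, `CLAUSE_BOUNDARY`, and `extract` are hypothetical.

```python
import re
from dataclasses import dataclass

# Commas, semicolons, and periods (ASCII and full-width) act as clause boundaries.
CLAUSE_BOUNDARY = re.compile(r"[,;.，；。]")


@dataclass
class ExtractionExpression:
    pattern: str                # assumption: regex whose group 1 is the target content
    weight: int = 1             # the highest weight wins when several expressions match
    enabled: bool = True        # invalid (disabled) expressions are skipped
    cross_clause: bool = False  # False: match inside single clauses only


def extract(text, expressions):
    """Return the content matched by the highest-weight enabled expression, or None."""
    best, best_weight = None, -1
    for e in expressions:
        if not e.enabled or e.weight <= best_weight:
            continue
        # Within-clause matching tries each clause on its own;
        # cross-clause matching searches the whole text.
        units = [text] if e.cross_clause else CLAUSE_BOUNDARY.split(text)
        for unit in units:
            m = re.search(e.pattern, unit)
            if m:
                best, best_weight = m.group(1), e.weight
                break
    return best
```

In this sketch an expression limited to single clauses cannot match content that straddles a comma, whereas the same pattern with `cross_clause=True` can.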
Step S103, post-processing the target content according to the post-processing rule corresponding to the target classification to obtain the answer to the question.
The target content may contain content other than the answer to the question. For example, when the question is "who is the author of the article" and the extracted target content is "reporter Chen XX", only "Chen XX" is the answer. Alternatively, when several target contents are extracted in step S102, only one of them should be selected and the answer generated from it, so that the answer is uniquely determined.
Therefore, the embodiment of the present application can set different post-processing rules for different classifications of questions, and filter, screen, and refine the target content according to these rules to obtain the answer to the question.
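As a non-limiting illustration, selecting a post-processing rule by target classification might be sketched as a simple lookup. The rule table, the function names, and the choice of the first candidate when several target contents exist are all assumptions made for this sketch, not rules stated in the application.

```python
def strip_reporter_title(content):
    # Hypothetical rule for "article author": drop the noise word, then trim spaces.
    return content.replace("reporter", "").strip(" ")


# Hypothetical mapping from target classification to post-processing rule.
POST_PROCESSING_RULES = {
    "article author": strip_reporter_title,
}


def post_process(classification, candidates):
    """Reduce the extracted target contents to a single answer, or None if empty."""
    if not candidates:
        return None
    # Assumption: fall back to plain trimming when no rule is registered.
    rule = POST_PROCESSING_RULES.get(classification, str.strip)
    # Assumption: the first candidate is selected so that the answer is unique.
    return rule(candidates[0])
```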
As can be seen from the foregoing technical solutions, an embodiment of the present application provides a content extraction method, including: obtaining a target classification of a question according to the classification expressions contained in a question tree, wherein the question tree contains classification nodes, each classification node contains a classification expression list, and each classification expression list contains a plurality of classification expressions; acquiring the target extraction node corresponding to the target classification in an extraction tree, and extracting target content from a reading text by using the extraction expressions contained in the target extraction node, wherein the extraction tree contains extraction nodes, each extraction node contains an extraction expression list, and each extraction expression list contains a plurality of extraction expressions; and post-processing the target content according to the post-processing rule corresponding to the target classification to obtain the answer to the question. When the method is applied to machine reading comprehension, the question tree and the extraction tree only need to be constructed according to the categories of questions. Once the categories are determined, the two trees are essentially fixed and can be used to extract answers from different reading texts, so the method is general and can improve the accuracy of machine reading comprehension.
In one embodiment, classification expressions and extraction expressions may be composed of text concepts, keywords, operators, and the like. A text concept includes at least one concept value, each concept value being one way of expressing the text concept, and operators combine text concepts and keywords to form the matching rule of an expression.
The composition of classification expressions and extraction expressions is explained in detail below with some examples.
Illustratively, consider the following classification expression:
[c_article + {0,0} c_is_who + {0,0} k_report]
"c_article" and "c_is_who" are text concepts, where "c" is the identifier of a text concept and "article" is the name of the text concept. "Article" may have a number of different concept values, for example: article, news, text, report, and so on. When any of these concept values appears in the text being matched, it can be matched by "c_article" in the classification expression.
"k_report" is the form in which a keyword is written, where "k" is the identifier of a keyword and "report" is the keyword itself. When "report" appears in the text being matched, it can be matched by "k_report" in the classification expression.
"+", "{0,0}", "[ ]", and so on are operators. "+" is the AND operator: its matching rule is that the text concepts or keywords before and after "+" must both be present. "{0,0}" is a distance operator of the form {x,y}, where x and y are non-negative integers and x is less than or equal to y; the two values define a distance interval, and the matching rule is that the distance between the adjacent text concepts or keywords is between x and y characters. "[ ]" is the order operator, which indicates that the text concepts and keywords inside "[ ]" must be matched in the given order.
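As a non-limiting illustration, an ordered expression built from text concepts, keywords, and distance operators might be translated into a regular expression as sketched below. The concept values in `CONCEPTS` and the function name `compile_ordered` are hypothetical, and the sketch handles only the ordered form in which terms appear inside "[ ]"; two terms joined by "+" with no distance operator are treated as "any distance", which is an assumption.

```python
import re

# Hypothetical concept values; the application keeps these in a concept tree.
CONCEPTS = {
    "article": ["article", "news", "text", "report"],
    "topic": ["topic", "theme"],
}

# A token is either a distance operator {x,y} or a term c_<name> / k_<word>.
TOKEN = re.compile(r"\{(\d+),(\d+)\}|([ck])_(\w+)")


def compile_ordered(expression, concepts=CONCEPTS):
    """Translate e.g. '[c_article + {0,4} k_topic]' into a compiled regex."""
    parts = []
    previous_was_term = False
    for m in TOKEN.finditer(expression):
        low, high, kind, name = m.groups()
        if low is not None:
            # Distance operator: between x and y arbitrary characters.
            parts.append(".{%s,%s}" % (low, high))
            previous_was_term = False
            continue
        if previous_was_term:
            # Assumption: no distance operator means any distance is allowed.
            parts.append(".*")
        if kind == "c":
            # Text concept: any one of its concept values may appear.
            parts.append("(?:" + "|".join(re.escape(v) for v in concepts[name]) + ")")
        else:
            # Keyword: the literal word itself.
            parts.append(re.escape(name))
        previous_was_term = True
    return re.compile("".join(parts))
```

With a distance of {0,0} the two terms must be directly adjacent, which is one reason the removal of space characters from the question matters for matching.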
Illustratively, consider the following extraction expression:
c_news_media {0,1} @c_name@ c_news
"c_news_media" and "c_name" are text concepts, "{0,1}" is a distance operator, and the two "@" symbols are the front boundary marker and the rear boundary marker respectively. The content that the part between the two "@" symbols matches in the reading text is the target content to be extracted by the extraction expression. Thus, the full meaning of the extraction expression is to match the "news media" concept and the "person name" concept within a distance of 0 to 1 character and to extract the "person name" concept.
It should be added that the forms of text concepts, keywords, and operators shown in the above examples are only one optional way of composing classification expressions and extraction expressions, not the only one. On the basis of the content disclosed in the embodiments of the present application, a person skilled in the art may design other forms of text concepts, keywords, and operators as needed and compose other classification expressions and extraction expressions from them; such designs do not exceed the protection scope of the embodiments of the present application.
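As a non-limiting illustration, the two "@" boundary markers might be mapped onto a regex capture group, as in the following sketch. The concept values (including the regex used for person names) and the name `compile_extraction` are hypothetical, and only the part "c_news_media {0,1} @c_name@" of the expression is exercised.

```python
import re

# Hypothetical concept values; here each value is itself written as a regex.
CONCEPTS = {
    "news_media": ["reporter", "correspondent"],
    "name": [r"[A-Z][a-z]+ XX"],  # stand-in pattern for anonymised person names
}

TOKEN = re.compile(r"\{(\d+),(\d+)\}|@|([ck])_(\w+)")


def compile_extraction(expression, concepts=CONCEPTS):
    """Translate an extraction expression into a regex; '@...@' becomes group 1."""
    parts, inside = [], False
    for m in TOKEN.finditer(expression):
        token = m.group(0)
        if token == "@":
            # The front boundary opens the capture group, the rear boundary closes it.
            parts.append(")" if inside else "(")
            inside = not inside
        elif token.startswith("{"):
            parts.append(".{%s,%s}" % (m.group(1), m.group(2)))
        elif m.group(3) == "c":
            parts.append("(?:" + "|".join(concepts[m.group(4)]) + ")")
        else:
            parts.append(re.escape(m.group(4)))
    return re.compile("".join(parts))


pattern = compile_extraction("c_news_media {0,1} @c_name@")
match = pattern.search("XXX released a message on the 13th. (reporter Yue XX)")
target_content = match.group(1) if match else None
```

Here only the text between the boundary markers is captured, so the media word "reporter" takes part in matching but is not part of the extracted content.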
In one embodiment, in order to maintain concept values of text concepts, a concept tree is constructed in the embodiment of the present application. The concept tree comprises a plurality of text concepts, each text concept comprises a plurality of concept nodes, and each concept node corresponds to one concept value.
Specifically, the concept tree may include a question concept node and an answer concept node, each of which includes a plurality of classification nodes. Each classification node corresponds to one classification of questions and further includes a plurality of child nodes. Each child node corresponds to one text concept and includes a concept value list, which records the concept name and all concept values of the text concept.
Illustratively, the concept tree may be in the form of:
Question concepts
    Article topic (classification node)
        Article (child node)
            [concept value list; image GDA0003932123880000071 in the original]
        Probably because of
        Themes
        What is
        Introduction to
        Summary of the invention
    Who holds the title
    Article time
Answer concepts
As an alternative embodiment, a concept value may include grammar symbols such as "()", "|", and "?" to extend its expressive form and range. Those skilled in the art can design the grammar of concept values after the grammar of regular expressions, for example: "|" represents alternation (a choice among several options), "()" delimits the scope of a grammar rule, "?" indicates that the preceding character appears at most once, and so on.
Therefore, when the classification expression and the extraction expression contain text concepts, the read text can be matched by using the concept values corresponding to the text concepts in the concept tree, and the matching range of the text concepts is expanded.
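As a non-limiting illustration, a concept tree whose concept values use the "()", "|", and "?" grammar might be stored and compiled as sketched below. The tree contents and the name `concept_regex` are hypothetical; the sketch relies on the fact that these three symbols carry the same meaning in Python regular expressions.

```python
import re

# Hypothetical concept tree: section -> classification node -> text concept -> values.
CONCEPT_TREE = {
    "question concepts": {
        "article topic": {
            "article": ["article", "news( report)?", "text"],
            "topic": ["topic|theme", "subject"],
        },
    },
    "answer concepts": {},
}


def concept_regex(tree, section, classification, concept):
    """Combine all concept values of one text concept into a single regex."""
    values = tree[section][classification][concept]
    # Each value is wrapped in a group so that '|' inside a value stays local to it.
    return re.compile("|".join("(?:%s)" % v for v in values))
```

Adding a new concept value to a list widens what every expression using that text concept can match, without editing the expressions themselves.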
Fig. 2 is a flowchart of a preprocessing method according to an embodiment of the present disclosure.
In an implementable implementation, the present application provides a method for preprocessing a question and a read text, where the preprocessing method is applied before obtaining a target classification of the question according to a classification expression included in a question tree, and the preprocessing method may specifically include the following steps, as shown in fig. 2:
step S201, removing space characters in the problem.
Since the space character also occupies a character position, the matching process of the classification expression is affected, for example: if two space characters exist between two text concepts of the problem, the distance between the two text concepts is at least greater than or equal to two character bits, at which time, if the distance operator between the two text concepts in the taxonomy expression is {0,1}, the problem will not successfully match the taxonomy expression due to the interference of the space characters. Therefore, the space characters in the problem are removed, and the accuracy of problem matching of the classification expression can be improved.
In step S202, the specific content included in the start position or the end position of the reading text is removed.
In a production environment, some reading texts are obtained from the network, and therefore, the reading texts may contain some specific contents at the starting position or the ending position, such as: the text of the news report usually includes text contents such as "comment" and "message", and when only a part of the text contents of the news report is loaded in the web page, the text contents such as "load more" may also appear in the end, and these text contents do not include answers to questions but may interfere with the matching process of extracting expressions, so in step S202, in the embodiment of the present application, the text contents included in the start position or the end position of the reading text are removed.
For example, for a starting position of reading a text, the following may be removed: any blank characters (including space characters, tab characters, page changers, etc.), carriage returns, line changers, specific format content (e.g., "edit: SN + number format"), specific textual content (e.g., "load more," "news load more," "comment load more," "obtain authorization," etc.). For the end position of reading the text, the following can be removed: specific textual content (e.g., "in video load, please later," "auto play," "play"), any blank characters (including spaces, tabs, page changers, etc.), and the like.
Step S203, acquiring blank characters continuously appearing in the read text, and replacing the continuously appearing blank characters with a space character.
Blank characters that appear consecutively in the reading text may include space characters, tab characters, form feed characters and the like, and such runs may interfere with the matching of extraction expressions against the reading text. Therefore, in step S203, the embodiment of the present application replaces each run of consecutive blank characters with a single space character, thereby reducing the interference.
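Steps S202 and S203 can be sketched as follows. This is a minimal illustration, not the implementation of the application: the noise lists are hypothetical samples based on the strings quoted above, the function names are invented for the example, and Python's notion of whitespace is assumed to stand in for "blank characters".

```python
import re

# Hypothetical boilerplate lists; a real deployment would maintain its own.
HEAD_NOISE = ["load more", "news load more", "comment load more"]
TAIL_NOISE = ["auto play", "play"]


def strip_edges(text: str) -> str:
    """Step S202 (sketch): repeatedly strip whitespace and known
    boilerplate strings from the start and end of the reading text."""
    changed = True
    while changed:
        changed = False
        stripped = text.strip()  # spaces, tabs, form feeds, CR, LF
        if stripped != text:
            text, changed = stripped, True
        for noise in HEAD_NOISE:
            if text.startswith(noise):
                text, changed = text[len(noise):], True
        for noise in TAIL_NOISE:
            if text.endswith(noise):
                text, changed = text[: len(text) - len(noise)], True
    return text


def collapse_whitespace(text: str) -> str:
    """Step S203 (sketch): replace each run of two or more whitespace
    characters with a single space character."""
    return re.sub(r"\s{2,}", " ", text)


def preprocess_reading_text(text: str) -> str:
    """Apply step S202 and then step S203."""
    return collapse_whitespace(strip_edges(text))


print(preprocess_reading_text("load more \t Report  body\t\ttext.  auto play\n"))
```

The loop in `strip_edges` repeats until nothing changes, so boilerplate hidden behind trailing whitespace is also removed.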
In some implementable embodiments, the present application further provides post-processing rules for the target content, and these rules differ according to the classification of the question.
In one embodiment, corresponding to the "article author" classification, the post-processing rules may include: removing a character string contained in the target content as noise; and removing the space characters positioned before the target content and after the target content to obtain the answer.
Illustratively, the reading text is:
April 5, 2016, XXX … XXX. Photo by reporter Chen XX
The question is: who is the reporter of the article?
The question can then be matched to the "question category - article author" node of the question tree. Using the extraction expression "c _ news media {0,1} @ c _ name @" of the "article author" node, the extraction tree can extract "reporter Chen XX" from the reading text. "Reporter" is not part of a name and is therefore noise, so it is removed in post-processing, and any space characters before or after "Chen XX" are removed to obtain the answer to the question.
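The "article author" post-processing rule described above can be sketched in a few lines. The noise list and the function name are hypothetical; only the two operations (removing noise strings, then trimming surrounding spaces) come from the description.

```python
NOISE_STRINGS = ["reporter", "journalist"]  # hypothetical noise list


def postprocess_author(target: str, noise_strings=NOISE_STRINGS) -> str:
    """Remove noise strings from the extracted content, then trim the
    space characters located before and after what remains."""
    for noise in noise_strings:
        target = target.replace(noise, "")
    return target.strip(" ")


print(postprocess_author("reporter Chen XX "))  # Chen XX
```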
Fig. 3 is a flowchart of a post-processing rule according to an embodiment of the present application.
In one embodiment, when multiple target contents are extracted from the read text corresponding to the "article source" category, one target content may be selected as the answer to the question using the post-processing rules shown in fig. 3. Specifically, the post-processing rule shown in fig. 3 includes the following steps:
Step S301, when the extraction expression extracts a plurality of target contents, taking the target content which is within a preset range from the end of the reading text and is closest to the end of the reading text as the answer to the question.
Step S302, if the target content is not included in the preset range from the end of the reading text, the target content which is within the preset range from the beginning of the reading text and is closest to the beginning of the reading text is taken as the answer of the question.
Therefore, corresponding to the classification of the article sources, when a plurality of target contents are extracted from the reading text, at least two priorities are set for determining answers to the questions in the post-processing process, wherein the highest priority is step S301, the selection range is focused to the tail of the reading text, and the answers to the questions are obtained from the tail of the reading text; the second priority, step S302, focuses the selection range on the beginning of the reading text, and obtains the answer to the question from the beginning of the reading text.
For example, if the preset range is 30 characters, then for the piece of news shown below, the target contents that the extraction expression can extract are the parts shown in bold in the original, and the range from which step S301 obtains the answer is the part underlined in the original:
"XX Daily", report of June 29, 2009 …… XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.
China Engineering Technology Information Network
Within this range, the extraction expression extracts only "China Engineering Technology Information Network", so "China Engineering Technology Information Network" is the answer to the "article source" question.
For example, if the set range is 30 characters, for a piece of news shown below, the target content that the extraction expression can extract is a font-bolded part below, the range of obtaining the answer in step S301 is a tail-underlined part below, and the range of obtaining the answer in step S302 is a head-underlined part below:
XX Message Net, December 18 report: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX…XXX.
Because the extraction expression extracts no target content within the range at the end of the news, the answer is obtained from the beginning of the news according to the second priority. "XX Message Net" is extracted there, so "XX Message Net" is the answer to the "article source" question.
Therefore, for questions classified as "article source", when the extraction expression extracts a plurality of target contents from the reading text, this priority-based selection uniquely determines one target content as the answer to the question.
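The two-priority selection of steps S301 and S302 can be sketched as below. The sketch makes assumptions the description leaves open: a target content counts as "within the preset range" of the end when it starts inside the last `window` characters, and as within range of the beginning when it ends inside the first `window` characters. The `Match` type, the function name, and the sample strings are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Match(NamedTuple):
    start: int  # index of the first character of the target content
    end: int    # index one past the last character of the target content
    text: str


def select_source(matches: List[Match], text_len: int,
                  window: int = 30) -> Optional[str]:
    """Pick one target content for an "article source" question."""
    # Step S301: candidates lying inside the last `window` characters;
    # the one ending closest to the end of the text wins.
    tail = [m for m in matches if m.start >= text_len - window]
    if tail:
        return max(tail, key=lambda m: m.end).text
    # Step S302: otherwise, candidates inside the first `window` characters;
    # the one starting closest to the beginning of the text wins.
    head = [m for m in matches if m.end <= window]
    if head:
        return min(head, key=lambda m: m.start).text
    return None


text = "SourceA " + "x" * 100 + " SourceB"
found = [Match(0, 7, "SourceA"), Match(109, 116, "SourceB")]
print(select_source(found, len(text)))      # SourceB (tail has priority)
print(select_source(found[:1], len(text)))  # SourceA (falls back to head)
```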
Fig. 4 is a flowchart of a post-processing rule according to an embodiment of the present application.
In one embodiment, when multiple target contents are extracted from the reading text corresponding to the "article time" category, one target content may be selected as the answer to the question using the post-processing rule shown in fig. 4. Specifically, the post-processing rule shown in fig. 4 includes the following steps:
Step S401, when the extraction expression extracts a plurality of target contents, acquiring the end position of each target content in the reading text, where the end position is the position of the last character of the target content in the reading text.
Step S402, for each target content, calculating the difference obtained by subtracting its end position from the character length of the reading text.
Step S403, taking the target content with the smallest difference as the answer to the question.
For example, for a news:
Xinhua News Agency reported on February 12, XXXXXXXXXXXXXXX. (Xinhua News Agency, February 13 dispatch)
The character length of this text is 54 characters, counted over the original Chinese text, and the underlined contents in the original are the target contents extracted by the extraction expression. The start position and end position of each target content in the reading text, and the difference obtained by subtracting the end position from the character length, are listed in the following table (the position of the first character of the reading text is taken as 0):
Target content (time)    Start position    End position    Character length - end position
February 12              3                 7               47
February 13              47                51              3
The smallest difference between the character length and the end position is 3, and the corresponding time is "February 13", so "February 13" is taken as the answer to the "article time" question.
Thus, when the extraction expression extracts a plurality of target contents from the read text for the question classified by the "article time", one target content is uniquely determined as the answer to the question according to the position of the target content in the read text.
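Steps S401 to S403 amount to choosing the target content whose last character lies closest to the end of the reading text. A minimal sketch, using the end positions and the character length of 54 from the example above (the function name and the English date labels are illustrative assumptions):

```python
from typing import List, Tuple


def select_time(targets: List[Tuple[str, int]], text_len: int) -> str:
    """targets holds (content, end_position) pairs, where end_position is
    the index of the last character of the content in the reading text.
    The content with the smallest (text_len - end_position) is returned."""
    return min(targets, key=lambda t: text_len - t[1])[0]


targets = [("February 12", 7), ("February 13", 51)]
print(select_time(targets, 54))  # February 13, since 54 - 51 = 3 < 54 - 7 = 47
```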
In one embodiment, if the target extraction node is "who's title", then when the target content is extracted from the reading text using the extraction expression contained in the target extraction node in step S102, if the extraction expression contains a text concept that has a plurality of concept values, and the concept value with the largest number of characters contains the other concept values, the concept value with the largest number of characters is used to extract the target content.
For example, for a news article:
China Net, December 29: Ministry of Foreign Affairs spokesperson XXX answered reporters' questions on the search and rescue work for the AirAsia passenger plane that lost contact ……
The extraction expressions contained in the extraction node "who's title" of the extraction tree include:
(c_title)[^,。?!…]{0,4}(c_name)
If the text concept "c_title" has a plurality of concept values in the concept tree, for example "spokesperson" and "Ministry of Foreign Affairs spokesperson", where "Ministry of Foreign Affairs spokesperson" contains "spokesperson", the expression matches the reading text using "Ministry of Foreign Affairs spokesperson" and the matched target content is extracted. In this news, the extracted target content is therefore "Ministry of Foreign Affairs spokesperson XXX" rather than "spokesperson XXX". Extracting more characters as the target content makes the final answer more complete and accurate.
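The preference for the longest concept value can be sketched with a regular expression, assuming for illustration that a text concept behaves like an alternation of its concept values and that the punctuation class and {0,4} gap carry over directly. The helper name `concept_pattern`, the English concept values, and the sample sentence are hypothetical.

```python
import re
from typing import Iterable


def concept_pattern(values: Iterable[str]) -> str:
    """Build an alternation that tries longer concept values first, so a
    value containing another value is preferred at the same position."""
    ordered = sorted(values, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(v) for v in ordered) + ")"


TITLE = concept_pattern(["spokesperson",
                         "Ministry of Foreign Affairs spokesperson"])
NAME = concept_pattern(["XXX"])

# Title, then at most four non-punctuation characters, then a name.
pattern = re.compile(f"({TITLE})[^,.?!…]{{0,4}}({NAME})")

text = ("China Net, December 29: Ministry of Foreign Affairs "
        "spokesperson XXX answered questions.")
m = pattern.search(text)
print(m.group(0))  # Ministry of Foreign Affairs spokesperson XXX
```

Sorting by length matters most when one value is a prefix of another; when the shorter value is a suffix of the longer one, as here, the leftmost match already yields the longer value.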
In addition, as an alternative implementation, if the target classification of the question is not obtained according to the classification expressions, or the target content is not extracted according to the extraction expressions, the answer to the question is obtained from the reading text by using a machine learning model trained in advance. The technical scheme of the embodiment of the application thus uses a reading comprehension method based on a machine learning model as a fallback, so that an answer can be extracted from the reading text according to the question in any case.
The following is an apparatus embodiment of the present application, which provides a content extraction apparatus that may be applied to a server, a PC (personal computer), a tablet computer, a mobile phone, a smart television, a smart speaker, a virtual reality device, an intelligent wearable device, and other devices. For details not disclosed in the apparatus embodiment of the present application, please refer to the method embodiments of the present application.
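The fallback behaviour can be sketched as a thin wrapper around the rule-based steps. All function and parameter names here are hypothetical; the classifier, extractor, post-processor and machine learning model are passed in as plain callables, and the stubs at the bottom exist only to make the sketch runnable.

```python
from typing import Callable, List, Optional


def answer_question(
    question: str,
    text: str,
    classify: Callable[[str], Optional[str]],
    extract: Callable[[str, str], List[str]],
    postprocess: Callable[[str, List[str]], str],
    ml_model: Callable[[str, str], str],
) -> str:
    """Try the rule-based path first; fall back to the pre-trained model
    when no classification or no target content is obtained."""
    category = classify(question)
    if category is not None:
        targets = extract(category, text)
        if targets:
            return postprocess(category, targets)
    return ml_model(question, text)


# Stub components, for illustration only.
classify = lambda q: "article author" if "reporter" in q else None
extract = lambda c, t: ["reporter Chen XX"] if "Chen XX" in t else []
postprocess = lambda c, ts: ts[0].replace("reporter", "").strip(" ")
ml_model = lambda q, t: "answer from model"

print(answer_question("who is the reporter?", "... reporter Chen XX",
                      classify, extract, postprocess, ml_model))
print(answer_question("what happened?", "... reporter Chen XX",
                      classify, extract, postprocess, ml_model))
```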
Fig. 5 is a schematic structural diagram of a content extraction device according to an embodiment of the present application. As shown in fig. 5, the apparatus includes:
a problem matching module 501, configured to obtain a target classification of a problem according to a classification expression included in a problem tree, where the problem tree includes classification nodes, each classification node corresponds to a classification of the problem, each classification node includes a classification expression list, and the classification expression list includes multiple classification expressions;
a content extraction module 502, configured to obtain a target extraction node corresponding to the target classification in an extraction tree, and extract target content from a read text by using an extraction expression included in the target extraction node, where the extraction tree includes extraction nodes, each extraction node corresponds to a category of a problem, the extraction node includes an extraction expression list, and the extraction expression list includes multiple extraction expressions;
and a post-processing module 503, configured to perform post-processing on the target content according to a post-processing rule corresponding to the target classification, to obtain an answer to the question.
As can be seen from the foregoing technical solutions, an embodiment of the present application provides a content extraction device, configured to obtain a target classification of a problem according to a classification expression included in a problem tree, where the problem tree includes classification nodes, the classification nodes include a classification expression list, and the classification expression list includes multiple classification expressions; acquiring a target extraction node corresponding to the target classification in an extraction tree, and extracting target content from a reading text by using an extraction expression contained in the target extraction node, wherein the extraction tree contains the extraction node, the extraction node contains an extraction expression list, and the extraction expression list contains a plurality of extraction expressions; and post-processing the target content according to post-processing rules corresponding to the target classification to obtain answers of the questions. Therefore, when the device provided by the embodiment of the application is applied to machine reading understanding, the question tree and the extraction tree are only required to be constructed according to the category of the question, when the category of the question is determined, the question tree and the extraction tree are also relatively determined, the device can be used for extracting answers of the question from different reading texts, has universality, and can improve the accuracy of machine reading understanding.
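The node structure shared by the question tree and the extraction tree can be sketched as below. This is a simplified illustration under stated assumptions: the trees are flattened to a list and a dictionary, classification and extraction expressions are replaced by ordinary regular expressions, and all names (`Node`, `match_question`, `extract_content`) are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One node of the question tree or the extraction tree (sketch)."""
    category: str
    expressions: List[str] = field(default_factory=list)  # regex stand-ins


def match_question(question_tree: List[Node], question: str) -> Optional[str]:
    """Question matching module: return the first classification whose
    classification expression list contains a matching expression."""
    for node in question_tree:
        if any(re.search(e, question) for e in node.expressions):
            return node.category
    return None


def extract_content(extraction_tree: Dict[str, Node], category: str,
                    text: str) -> List[str]:
    """Content extraction module: apply every extraction expression of the
    extraction node that corresponds to the target classification."""
    node = extraction_tree[category]
    found: List[str] = []
    for e in node.expressions:
        found.extend(m.group(0) for m in re.finditer(e, text))
    return found


question_tree = [Node("article author", [r"who.*(reporter|author)"])]
extraction_tree = {"article author": Node("article author",
                                          [r"reporter \w+ XX"])}

category = match_question(question_tree, "who is the article reporter?")
print(category)
print(extract_content(extraction_tree, category, "XXX. reporter Chen XX"))
```

Because both trees are keyed by the classification of the question, the same pair of trees can be reused across different reading texts, which is the generality the paragraph above refers to.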
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (7)

1. A method for extracting content, comprising:
removing space characters in the problem;
removing specific content contained in the starting position or the ending position of the reading text;
acquiring blank characters continuously appearing in a read text, and replacing the continuously appearing blank characters with a space character;
obtaining a target classification of a problem according to classification expressions contained in a problem tree, wherein the problem tree contains classification nodes, each classification node corresponds to one classification of the problem, each classification node contains a classification expression list, and each classification expression list contains a plurality of classification expressions;
acquiring target extraction nodes corresponding to the target classification in an extraction tree, and extracting target content from a reading text by using extraction expressions contained in the target extraction nodes, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one classification of a problem, the extraction nodes contain an extraction expression list, and the extraction expression list contains a plurality of extraction expressions;
post-processing the target content according to post-processing rules corresponding to the target classification to obtain answers of the questions;
the classification expression and the extraction expression are each composed of a text concept, a keyword and an operator, the text concept comprises at least one concept value, the concept value is used as an expression mode of the text concept, and the operator is used for combining the text concept and the keyword to form a matching rule of the expression; a plurality of said text concepts are located in a concept tree, each said text concept containing a plurality of concept nodes, each said concept node corresponding to a concept value.
2. The method of claim 1, wherein the post-processing rule comprises:
removing a character string contained in the target content as noise;
and removing the space characters positioned before the target content and after the target content to obtain the answer.
3. The method of claim 1, wherein the post-processing rule comprises:
when the extraction expression extracts a plurality of target contents, the target contents which are within a preset range from the end of the reading text and are closest to the end of the reading text are used as answers of the questions;
and if the target content is not contained in the preset range from the end of the reading text, the target content which is in the preset range from the beginning of the reading text and is closest to the beginning of the reading text is taken as the answer of the question.
4. The method of claim 1, wherein the post-processing rule comprises:
when the extraction expression extracts a plurality of target contents, acquiring the end position of each target content in the reading text, wherein the end position is the position of the last character of the target content in the reading text;
calculating, for each target content, the difference obtained by subtracting the end position of the target content from the character length of the reading text;
and taking the target content corresponding to the minimum subtraction difference as the answer of the question.
5. The method of claim 1, wherein extracting the target content from the reading text by using the extraction expression contained in the target extraction node comprises:
and if the extraction expression contains a text concept with a plurality of concept values and the concept value with the largest number of characters contains other concept values, extracting the target content by using the concept value with the largest number of characters.
6. The method of any one of claims 1-5, further comprising:
and if the target classification of the question is not obtained according to the classification expression or the target content is not extracted according to the extraction expression, obtaining an answer of the question from a reading text by using a machine learning model trained in advance.
7. A content extraction apparatus, comprising:
the question matching module is used for removing space characters in the question; removing specific content contained in the starting position or the ending position of the reading text; acquiring blank characters continuously appearing in a read text, and replacing the continuously appearing blank characters with a space character; obtaining a target classification of a problem according to classification expressions contained in a problem tree, wherein the problem tree contains classification nodes, each classification node corresponds to one classification of the problem, each classification node contains a classification expression list, and each classification expression list contains a plurality of classification expressions;
the content extraction module is used for acquiring target extraction nodes corresponding to the target classification in an extraction tree and extracting target content from a read text by using extraction expressions contained in the target extraction nodes, wherein the extraction tree contains extraction nodes, each extraction node corresponds to one classification of a problem, each extraction node contains an extraction expression list, and the extraction expression list contains a plurality of extraction expressions;
the post-processing module is used for post-processing the target content according to post-processing rules corresponding to the target classification to obtain answers of the questions;
the classification expression and the extraction expression are each composed of a text concept, a keyword and an operator, the text concept comprises at least one concept value, the concept value is used as an expression mode of the text concept, and the operator is used for combining the text concept and the keyword to form a matching rule of the expression; a plurality of said text concepts are located in a concept tree, each said text concept containing a plurality of concept nodes, each said concept node corresponding to a concept value.
CN201910155040.6A 2019-03-01 2019-03-01 Content extraction method and device Active CN109918490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910155040.6A CN109918490B (en) 2019-03-01 2019-03-01 Content extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910155040.6A CN109918490B (en) 2019-03-01 2019-03-01 Content extraction method and device

Publications (2)

Publication Number Publication Date
CN109918490A CN109918490A (en) 2019-06-21
CN109918490B true CN109918490B (en) 2022-12-16

Family

ID=66962894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910155040.6A Active CN109918490B (en) 2019-03-01 2019-03-01 Content extraction method and device

Country Status (1)

Country Link
CN (1) CN109918490B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413636A (en) * 2019-08-01 2019-11-05 北京香侬慧语科技有限责任公司 A kind of data processing method and device
CN110457597A (en) * 2019-08-08 2019-11-15 中科鼎富(北京)科技发展有限公司 A kind of advertisement recognition method and device
CN111008523A (en) * 2019-11-21 2020-04-14 中科鼎富(北京)科技发展有限公司 Information extraction method and device and server

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6115683A (en) * 1997-03-31 2000-09-05 Educational Testing Service Automatic essay scoring system using content-based techniques
CN107608949A (en) * 2017-10-16 2018-01-19 北京神州泰岳软件股份有限公司 A kind of Text Information Extraction method and device based on semantic model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
面向高考阅读理解观点类问题的答案抽取方法;王素格等;《郑州大学学报(理学版)》;20180125(第01期);全文 *

Also Published As

Publication number Publication date
CN109918490A (en) 2019-06-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant