CN111274792A - Method and system for generating abstract of text - Google Patents

Publication number
CN111274792A
Authority
CN
China
Prior art keywords
abstract
text
word
category
semantic role
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010065621.3A
Other languages
Chinese (zh)
Other versions
CN111274792B (en)
Inventor
王欣晟
周继恩
陆堃彪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Co Ltd filed Critical China Unionpay Co Ltd
Priority to CN202010065621.3A priority Critical patent/CN111274792B/en
Publication of CN111274792A publication Critical patent/CN111274792A/en
Application granted granted Critical
Publication of CN111274792B publication Critical patent/CN111274792B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for generating an abstract of a text, which comprises the following steps: preprocessing the text; marking parts of speech of words in the text and determining a grammatical structure existing in the text; determining a category of semantic roles for the word based on the part of speech and the grammar structure; extracting the abstract of the text from the clauses of the text according to a preset algorithm; and adjusting the summary.

Description

Method and system for generating abstract of text
Technical Field
The invention relates to the field of text processing, and in particular to a method and a system for generating an abstract of a text.
Background
Currently, there are a number of methods for extracting an abstract from a text. One example is extraction based on simple statistics: the method scores each sentence in the text according to the frequency of occurrence of the individual words in the sentence, ranks the sentences by importance, and takes the most important sentences as the abstract of the text.
However, this type of method has the disadvantage that the minimum unit of the extracted abstract is one sentence. In Chinese, a sentence may be a compound sentence formed of a plurality of clauses connected by commas, pause marks (enumeration commas), or semicolons. Thus, a sentence taken as the abstract may still be very long and take considerable time to read.
Disclosure of Invention
One aspect of the present invention provides a method for generating a summary of a text, comprising: preprocessing the text; marking parts of speech of words in the text and determining a grammatical structure existing in the text; determining a category of semantic roles for the word based on the part of speech and the grammar structure; extracting the abstract of the text from the clauses of the text according to a preset algorithm; and adjusting the summary.
Another aspect of the present invention provides a method for generating a summary of a text, including: preprocessing the text; extracting a first abstract of the text from sentences of the text according to a preset algorithm; marking the part of speech of the words in the first abstract and determining a grammatical structure existing in the first abstract; determining a category of semantic roles for the word based on the part of speech and the grammar structure; extracting a second abstract of the first abstract from the clauses of the first abstract according to a preset algorithm; and adjusting the second summary.
Yet another aspect of the present invention provides a system for generating a summary of text, comprising: a text pre-processing system for pre-processing the text; a part-of-speech tagging and syntactic structure analysis system for tagging parts of speech of words in the text and determining syntactic structures present in the text; a semantic role tagging system for determining a category of a semantic role for the word based on the part of speech and the syntactic structure; a single sentence abstract extraction system for extracting the abstract of the text from the clauses of the text according to a preset algorithm; and a summary adjustment system for adjusting the abstract.
Yet another aspect of the present invention provides a system for generating a summary of a text, comprising: a text pre-processing system for pre-processing the text; a single sentence abstract extraction system for extracting a first abstract of the text from the sentences of the text according to a preset algorithm; a part-of-speech tagging and syntactic structure analysis system for tagging parts of speech of words in the first abstract and determining syntactic structures existing in the first abstract; a semantic role tagging system for determining a category of a semantic role for the word based on the part of speech and the syntactic structure; a single sentence abstract extraction system for extracting a second abstract of the first abstract from the clauses of the first abstract according to the preset algorithm; and a summary adjustment system for adjusting the second abstract.
The present invention also provides a computer readable medium having stored thereon computer readable instructions which, when executed by a computer, are capable of performing a method according to embodiments of the present invention.
The embodiments of the invention can filter out non-text information in rich text in HTML format, extract the abstract of the text using punctuation such as periods, question marks and exclamation marks as sentence boundaries, and, by labeling semantic roles, extract the main body of each clause in a compound sentence as the abstract (the main body being the words in the clause whose semantic role categories are subject, object, time and place). Embodiments of the present invention are also capable of extracting a second abstract from a single-sentence abstract and optimizing that second abstract.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 shows a general schematic of a system for summarization according to an embodiment of the invention.
FIG. 2 illustrates a schematic diagram of determining related words from predicates according to an embodiment of the invention.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The following detailed description of embodiments of the invention refers to the accompanying drawings.
FIG. 1 shows a general schematic of a system for summarization according to an embodiment of the invention. As shown in the figure, the system comprises a text preprocessing system, a single sentence abstract extracting system, a part-of-speech tagging and syntactic structure analyzing system, a semantic role tagging system and an abstract generating system. Each system will be described separately below.
Text preprocessing system
The text pre-processing system may implement the following steps:
(1) Extracting the plain text.
The text pre-processing system may use regular expressions to extract plain text content from rich text in HTML format. In some embodiments, since multimedia information such as pictures, attachments and the like is included in the rich text in the HTML format, regular expressions can be used to filter out tags in the rich text, so as to extract plain text information therein. In some embodiments, this step may not be performed if the text to be abstracted is already plain text.
(2) Segmenting the text.
Unlike English, which naturally separates words with spaces, Chinese has no similar symbol for segmenting text, so the words in the text must be segmented for the subsequent steps. In some embodiments, the text pre-processing system may use the jieba tokenizer to segment the text.
(3) Removing stop words.
There are many stop words in Chinese, such as common function words and particles. These stop words generally contribute little to the semantics and may degrade the quality of the generated abstract, so they may be removed before the abstract is extracted.
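The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the regular expressions and the tiny stop-word set are all assumptions, and the token list is hard-coded in place of the output of a segmenter such as jieba.

```python
import html
import re

# Hypothetical patterns: the description only says that regular expressions are used.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")

# Illustrative stop-word set (an assumption); a real system would load a full list.
STOP_WORDS = frozenset({"的", "了", "是", "在", "和"})


def extract_plain_text(rich_text):
    """Step (1): strip script/style blocks and all tags, then decode HTML entities."""
    text = _SCRIPT_STYLE.sub("", rich_text)
    text = _TAG.sub("", text)
    text = html.unescape(text)
    return _WHITESPACE.sub(" ", text).strip()


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Step (3): drop stop words and blank tokens, keeping the original order."""
    return [t for t in tokens if t.strip() and t not in stop_words]


plain = extract_plain_text('<p>银联推出了<b>ETC</b>发行平台。<img src="a.png"/></p>')

# Step (2) is assumed to have produced this token list (hard-coded here).
tokens = ["银联", "推出", "了", "ETC", "发行", "平台"]
content_words = remove_stop_words(tokens)
```

Tags are replaced with the empty string rather than a space because Chinese text does not rely on spaces between words; for space-delimited languages a different choice may be preferable.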
Single sentence abstract extraction system
The single sentence abstract extraction system may extract a single sentence with the highest weight value in a text based on a plurality of sentences (sentences are also referred to as "single sentences") included in a piece of text to take the single sentence as an abstract of the text. In some embodiments, the method for extracting a single sentence may adopt an existing TextRank algorithm, and the specific steps of the algorithm are as follows:
(1) From the individual sentences and the results of the word segmentation, a graph G = (V, E) can be constructed, consisting of a set of vertices V and a set of edges E, where E is a subset of V × V. For a text, the vertices are the sentences delimited by periods, question marks or exclamation marks. For example, Vi represents the i-th sentence in a paragraph, and W(Vi) represents the weight of the i-th sentence, calculated by the following formula:
[Formula image BDA0002375880580000041: W(Vi), the weight of the i-th sentence]
where sentences denotes the number of sentences (clauses) in a paragraph; d is a damping factor between 0 and 1, related to the probability of jumping from a given vertex to another random vertex in the graph, and is typically set to 0.85; and wji indicates how similar the j-th sentence is to the i-th sentence.
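The formula itself is only available as an image. For orientation, the standard weighted TextRank update can be written with the symbols defined above as follows. This is a reconstruction under assumptions, not the formula from the figure: the normalized term (1 − d)/sentences is assumed because the symbol sentences is defined for the formula and because the example weights given later in the description sum to 1, and the neighbor sets In(Vi) and Out(Vj) are notation borrowed from the TextRank literature.

```latex
W(V_i) = \frac{1 - d}{\mathit{sentences}}
       + d \sum_{V_j \in \mathrm{In}(V_i)}
         \frac{w_{ji}}{\sum_{V_k \in \mathrm{Out}(V_j)} w_{jk}} \, W(V_j),
\qquad
W^{(0)}(V_i) = \frac{1}{\mathit{sentences}}
```

The second expression is likewise an assumed uniform initialization; any positive starting weights lead to the same fixed point when d < 1.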
TextRank computes the weight of each sentence as follows:
1) Initialize a weight for each sentence:
[Formula image BDA0002375880580000042: initial weight of each sentence]
2) Obtain the final weight of each sentence through iterative calculation.
The TextRank algorithm thus finds the single sentence with the highest weight in the text, and that single sentence is taken as the abstract. In some embodiments, if the text includes multiple single sentences, the single sentence found may be referred to as the first abstract. In some embodiments, if the text includes only one single sentence, the single sentence abstract extraction system can be skipped, and the part-of-speech tagging and syntactic structure analysis system and the semantic role labeling system can be applied to the text directly.
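The iterative calculation can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the word-overlap similarity is the one commonly used with TextRank and is not confirmed by the description, the normalized (1 - d)/n form and the uniform initial weights are assumed, and the names `similarity` and `textrank` are hypothetical.

```python
import math


def similarity(a, b):
    """Word-overlap similarity of two token lists (an assumed choice of w_ji)."""
    denom = math.log(len(a)) + math.log(len(b)) if a and b else 0.0
    if denom <= 0:  # both sentences have a single word, or one is empty
        return 0.0
    return len(set(a) & set(b)) / denom


def textrank(sentences, d=0.85, max_iter=100, tol=1e-8):
    """Return one weight per sentence; `sentences` is a list of token lists."""
    n = len(sentences)
    if n == 0:
        return []
    # w[j][i] is the similarity of sentence j to sentence i (no self-loops).
    w = [[similarity(sentences[j], sentences[i]) if i != j else 0.0
          for i in range(n)] for j in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0 / n] * n  # assumed uniform initial weights
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Sentences with no similar neighbor (out_sum == 0) pass on no weight.
            rank = sum(w[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - d) / n + d * rank)
        delta = max(abs(x - y) for x, y in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


example = [["a", "b", "c"], ["a", "b", "d"], ["c", "d", "e"]]
weights = textrank(example)
best_index = max(range(len(weights)), key=weights.__getitem__)
```

In this toy input the first two sentences share two words with each other and one with the third, so they receive equal weights that exceed the weight of the third sentence.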
In other embodiments, other algorithms such as Luhn, Edmundson, LSA, LexRank, TF-IDF, etc. may be used to extract the summary from the text.
Part-of-speech tagging and syntactic structure analysis system
After the step of segmenting the text, or after a single sentence has been extracted using the single sentence abstract extraction system, the part-of-speech tagging and syntactic structure analysis system may determine the part of speech of each word based on existing methods such as conditional random fields, and then determine the grammatical structures in the text from the obtained parts of speech using existing dependency parsing methods. Exemplary grammatical structures are shown in Table 1 below.
TABLE 1
[Table image BDA0002375880580000051, continued in BDA0002375880580000061]
Semantic role labeling system
Based on the parts of speech and the grammatical structures, the semantic role labeling system can use a machine learning model to find the predicates in the single sentence, classify all words related to the predicates in each grammatical structure of the single sentence, and analyze which semantic role each word may belong to. The machine learning model can also score each word for the semantic roles it may belong to, set a threshold for each semantic role, and determine a word whose score is higher than the threshold as having that semantic role. In some embodiments, the machine learning model may be a classification model. In some embodiments, the classification model may be a maximum entropy model (also known as a maximum entropy classifier; see, for example, the paper "Semantic Role Labeling Based on Maximum Entropy Classifiers" published in the Journal of Software). In other embodiments, other classification models may be used instead of the maximum entropy model to achieve the same effect.
In some embodiments, the semantic role annotation system may implement the following steps:
(1) Based on the parts of speech, the verbs in each grammatical structure of the single sentence are taken as predicates.
(2) Related words are determined according to the predicates. For example, a word connected to a predicate by a grammatical structure may be determined as a related word. FIG. 2 shows an example of determining related words based on a predicate, in which the predicate is "punishment" and the words connected to the predicate by a grammatical structure are "party center", "want", "insistence" and "crime".
(3) The actual semantic role of the related words is determined based on the maximum entropy model.
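Steps (1) and (2) can be illustrated on a toy dependency parse. This is a sketch under assumptions: the arc format (head index, dependent index, relation), the tag "v" for verbs, the relation labels, and the choice to treat both the head and the dependents of a predicate as related words are illustrative and are not specified in the description.

```python
VERB_TAG = "v"  # hypothetical part-of-speech tag for verbs


def find_predicates(pos_tags):
    """Step (1): indices of the words tagged as verbs."""
    return [i for i, tag in enumerate(pos_tags) if tag == VERB_TAG]


def related_words(predicate_index, arcs):
    """Step (2): indices of words joined to the predicate by a dependency arc."""
    related = set()
    for head, dependent, _relation in arcs:
        if head == predicate_index:
            related.add(dependent)
        elif dependent == predicate_index and head >= 0:  # head -1 marks the root
            related.add(head)
    return sorted(related)


# Toy parse; the relation labels are illustrative only.
words = ["China UnionPay", "launched", "platform"]
pos_tags = ["n", "v", "n"]
arcs = [(1, 0, "SBV"), (1, 2, "VOB"), (-1, 1, "HED")]

predicates = find_predicates(pos_tags)
related = {words[p]: [words[i] for i in related_words(p, arcs)] for p in predicates}
```

The related words found this way are the candidates that step (3) passes to the classification model.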
As described above, step (3) may include classifying the related words to determine which semantic roles they may belong to. Generally, each semantic role corresponds to certain parts of speech; for example, A0 (which usually represents the performer of the action, i.e., the subject) corresponds to a noun or pronoun and cannot be a verb.
In addition, step (3) may further include scoring each related word and determining a related word whose score is higher than a preset threshold as having the semantic role.
Semantic roles may include, but are not limited to: A0 (typically the performer of the action, i.e., the subject), A1 (typically the entity affected by the action, i.e., the object), ADV (adverbial), TMP (time), LOC (location), MNR (manner), BNE (beneficiary), CND (condition), DIR (direction), DGR (degree), EXT (extent), FRQ (frequency), PRP (purpose or cause), etc.
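The threshold rule for assigning roles can be sketched as follows. The scores would come from the classification model (for example, a maximum entropy classifier); here they are hard-coded. The threshold values, the function name `assign_roles`, and the tie-breaking choice of keeping only the highest-scoring role that passes its threshold are all assumptions made for illustration.

```python
# Hypothetical per-role thresholds; the description does not give concrete values.
ROLE_THRESHOLDS = {"A0": 0.5, "A1": 0.5, "TMP": 0.6, "LOC": 0.6}


def assign_roles(candidate_scores, thresholds=ROLE_THRESHOLDS, default_threshold=0.5):
    """Map each word to its best-scoring role among those above the role's threshold.

    candidate_scores: {word: {role: score}} as produced by some classifier.
    Words with no role above threshold are left out of the result.
    """
    assigned = {}
    for word, role_scores in candidate_scores.items():
        passing = {role: score for role, score in role_scores.items()
                   if score > thresholds.get(role, default_threshold)}
        if passing:
            assigned[word] = max(passing, key=passing.get)
    return assigned


scores = {
    "China UnionPay": {"A0": 0.9, "A1": 0.2},
    "June 2019": {"TMP": 0.8},
    "quickly": {"LOC": 0.3},
}
roles = assign_roles(scores)
```

Using a separate threshold per role lets rarer roles such as TMP or LOC be held to a stricter standard than the core arguments A0 and A1.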
Abstract generation system
The summary generation system may perform the following steps:
(1) The separators such as commas and semicolons in the single sentence extracted as the abstract (i.e., where the text includes a plurality of single sentences), or in the single sentence itself (i.e., where the text is itself a single sentence), are changed to periods, so that each clause of the single sentence becomes an individual sentence.
(2) The TextRank algorithm is used to abstract all individual sentences (i.e., using the single sentence abstract extraction system described above).
(3) The extracted abstract is adjusted according to certain rules.
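Step (1) amounts to splitting the single sentence at its clause separators. A minimal sketch follows, assuming that only commas and semicolons (full-width and half-width) act as separators; the pause mark is deliberately left out here so that enumerations such as lists of place names stay inside one clause, which is an assumption rather than something the description states.

```python
import re

# Assumed separator set: commas and semicolons, full-width and half-width.
_CLAUSE_SEPARATORS = re.compile(r"[，,；;]")
_SENTENCE_END = re.compile(r"[。！？.!?]+$")


def split_into_clauses(single_sentence):
    """Return the clauses of a single sentence as individual sentences."""
    body = _SENTENCE_END.sub("", single_sentence.strip())
    return [c.strip() for c in _CLAUSE_SEPARATORS.split(body) if c.strip()]


clauses = split_into_clauses("银联推出平台，截至今年10月，已完成对接；车主可申请。")
as_sentences = "。".join(clauses) + "。"
```

The resulting clauses can then be segmented and ranked with the same algorithm that was used for whole sentences, as step (2) describes.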
Experiments show that the abstract extracted according to the step (2) can have the following problems:
(a) The single sentence originally extracted as the abstract, or the single sentence itself, has a definite subject, but the abstract extracted in this step uses a pronoun to refer to that subject;
(b) The single sentence originally extracted as the abstract, or the single sentence itself, has a definite subject, but the subject is omitted in the abstract extracted in this step;
(c) The single sentence originally extracted as the abstract, or the single sentence itself, contains time words or place words, but the abstract extracted in this step lacks them.
Therefore, the method of the present invention can also adjust the summary. For example, semantic roles are added to the abstracts extracted at this step based on the following rules:
(a) If the first word of the abstract extracted in this step is a pronoun (e.g., "he", "she", "they", "it", "this", etc.), the pronoun is replaced with the semantic role A1 of the preceding atomic sentence (i.e., the individual sentence immediately before the current one);
(b) If the first word of the abstract extracted in this step is not a noun or pronoun, the semantic role A0 of the preceding atomic sentence is added to the beginning of the abstract extracted in this step (if the A0 of the preceding atomic sentence is empty, successively earlier atomic sentences are searched until an A0 is found);
(c) If the abstract extracted in this step does not contain a time word TMP and/or a location word LOC, and the single sentence originally extracted as the abstract, or the single sentence itself, does contain a time word TMP and/or a location word LOC, the TMP and/or LOC is extracted and added to the beginning of the abstract extracted in this step.
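Rules (a) to (c) can be sketched over a list of annotated clauses. The data layout (a dict with "words", "pos" and "roles" per clause), the tag names "r" and "n", and the choice to search only the clauses before the abstract, nearest first, are assumptions for illustration; the description does not prescribe a data structure.

```python
PRONOUN, NOUN = "r", "n"  # hypothetical part-of-speech tags


def nearest_role(clauses, index, role):
    """Search the clauses before `index`, nearest first, for a word with `role`."""
    for k in range(index - 1, -1, -1):
        found = clauses[k]["roles"].get(role)
        if found:
            return found[-1]  # the last such word is closest to the abstract
    return None


def adjust_abstract(clauses, index):
    """Apply rules (a)-(c) to clause `index` and return the adjusted word list."""
    clause = clauses[index]
    words = list(clause["words"])
    first_pos = clause["pos"][0]
    if first_pos == PRONOUN:                      # rule (a): replace the pronoun
        obj = nearest_role(clauses, index, "A1")
        if obj is not None:
            words[0] = obj
    elif first_pos != NOUN:                       # rule (b): restore the subject
        subj = nearest_role(clauses, index, "A0")
        if subj is not None:
            words.insert(0, subj)
    prefix = []
    for role in ("TMP", "LOC"):                   # rule (c): restore time and place
        if not clause["roles"].get(role):
            word = nearest_role(clauses, index, role)
            if word is not None:
                prefix.append(word)
    return prefix + words


clauses = [
    {"words": ["China UnionPay", "launched", "platform"], "pos": ["n", "v", "n"],
     "roles": {"A0": ["China UnionPay"], "A1": ["platform"], "TMP": ["June 2019"]}},
    {"words": ["as of", "October this year"], "pos": ["p", "t"],
     "roles": {"TMP": ["October this year"]}},
    {"words": ["has been docked with", "ETC issuers"], "pos": ["v", "n"],
     "roles": {"LOC": ["Beijing", "Shanghai"]}},
]
adjusted = adjust_abstract(clauses, 2)
```

For the third clause, rule (b) skips the second clause (which has no A0) and takes "China UnionPay" from the first, while rule (c) takes the nearest TMP, "October this year".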
The overall scheme of the invention will be described below as an embodiment with reference to a piece of text.
The text is as follows:
"China UnionPay launched the ETC issuing platform in June 2019, and as of October this year, the platform had been docked with ETC issuers in the 8 provinces and cities of Beijing, Shanghai, Guangdong, Henan, Zhejiang, Jilin, Hubei and Tianjin, vehicle owners in these areas can quickly apply for the ETC cards provided by local issuers through the UnionPay ETC issuing platform, and can enjoy non-inductive payment of expressway tolls after binding a designated UnionPay card for deduction. In addition to these areas, UnionPay is also carrying out cooperative docking with expressways in multiple places, and the related functions will go online subsequently, so as to actively respond to the specific requirements of the 'Implementation Plan for Deepening the Reform of the Toll Road System and Canceling Expressway Provincial Boundary Toll Stations' issued by the General Office of the State Council, and to provide convenient ETC application and payment services for the majority of vehicle owners.
The UnionPay ETC issuing platform supports users in initiating applications through different online and offline channels, and users in Beijing, Shanghai, Guangdong, Henan, Zhejiang, Jilin, Hubei and other places can open the Cloud Flash app to apply online. Taking the application for a Yuetong card in the Guangdong area as an example, a vehicle owner goes from the 'Preferential' home page of the Cloud Flash app through 'More', 'Life Choice' and 'Intelligent Passing' to the ETC service, quickly enters the vehicle information and binds a designated UnionPay card, can choose on-site pickup or mailing of the OBU device after the application is completed, and can pass quickly through the ETC lanes on the expressway after self-service installation and activation. Users in the Tianjin area can take their identity card and driving license to an offline service outlet of the Tianjin expressway network toll center to apply for ETC, and the device can be used after on-site installation and activation, with tolls identified and deducted automatically without stopping. On the basis of providing convenient ETC application services for users, UnionPay, together with industry partners, has successively launched preferential activities for ETC UnionPay non-inductive toll payment in multiple places.
Going forward, China UnionPay will continue to uphold an open and cooperative service philosophy, aim at 'convenience for the people and benefit for the people', further exert its platform advantages, assist in ETC issuance and promotion, and promote rapid access in more regions. In addition, UnionPay will carry out ETC promotion and issuance with partners in multiple fields such as parking, refueling, logistics, insurance and communication, so as to provide more convenient travel services for users."
The weight value of each sentence is obtained according to the TextRank algorithm as follows:
[0.16841363, 0.11995843, 0.11883978, 0.13119057, 0.11465486, 0.12456722, 0.10932548, 0.11305002]
According to the weight values, the first sentence of the text, namely "China UnionPay launched the ETC issuing platform in June 2019, and as of October this year, the platform had been docked with ETC issuers in the 8 provinces and cities of Beijing, Shanghai, Guangdong, Henan, Zhejiang, Jilin, Hubei and Tianjin, vehicle owners in these areas can quickly apply for the ETC cards provided by local issuers through the UnionPay ETC issuing platform, and can enjoy non-inductive payment of expressway tolls after binding a designated UnionPay card for deduction.", may be taken as the abstract.
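The selection of the first sentence can be checked directly from the weight values. The short sketch below assumes the list contains eight values, one per sentence of the example text, and uses 0-based indices.

```python
# Weight values from the example, one per sentence (assumed to be eight values).
weights = [0.16841363, 0.11995843, 0.11883978, 0.13119057,
           0.11465486, 0.12456722, 0.10932548, 0.11305002]

# Index of the highest-weighted sentence (0 corresponds to the first sentence).
best_index = max(range(len(weights)), key=weights.__getitem__)
total = sum(weights)
```

The highest weight belongs to the first sentence, and the weights sum to 1 up to rounding, which is what a normalized ranking would produce.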
Labeling the semantic roles of this first abstract gives the following results:
First clause: predicate "launch", A0 "China UnionPay", TMP "June 2019", A1 "platform".
Second clause: predicate "as of", TMP "October this year".
Third clause: predicate "dock", LOC "Beijing", LOC "Shanghai", LOC "Guangdong", LOC "Henan", LOC "Zhejiang", LOC "Jilin", LOC "Hubei", LOC "Tianjin".
Fourth clause: A0 "vehicle owners in the above areas", MNR "through", ADV "quickly", predicate "apply", A1 "card"; A0 "local issuer", predicate "provide", A1 "card".
Fifth clause: predicate "deduct", ADV "binding the designated UnionPay card".
Sixth clause: predicate "enjoy", ADV "just", A1 "payment".
The second abstract extracted is: has been docked with ETC issuers in the 8 provinces and cities of Beijing, Shanghai, Guangdong, Henan, Zhejiang, Jilin, Hubei and Tianjin.
"China UnionPay" is the action performer A0 nearest to the clause containing the second abstract, and "October this year" is the time TMP nearest to the clause containing the second abstract, so the abstract generated after adding the semantic roles is: In October this year, China UnionPay has been docked with ETC issuers in the 8 provinces and cities of Beijing, Shanghai, Guangdong, Henan, Zhejiang, Jilin, Hubei and Tianjin.
The system, method and apparatus of the embodiments of the present invention can be implemented as pure software (e.g., a software program written in Java), as pure hardware (e.g., a dedicated ASIC chip or FPGA chip), or as a system combining software and hardware (e.g., a firmware system storing fixed code or a system with a general-purpose memory and a processor), as desired.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
It should be noted that although several software devices/modules and sub-devices/modules implementing the above method are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the devices described above may be embodied in one device/module. Conversely, the features and functions of one device/module described above may be further divided so as to be embodied by a plurality of devices/modules.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, and that the division into aspects is for convenience of presentation only and does not mean that the features in those aspects cannot be combined to advantage. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (29)

1. A method for generating a summary of text, comprising:
preprocessing the text;
marking parts of speech of words in the text and determining a grammatical structure existing in the text;
determining a category of semantic roles for the word based on the part of speech and the grammar structure;
extracting the abstract of the text from the clauses of the text according to a preset algorithm; and
adjusting the abstract.
2. The method of claim 1, wherein the step of pre-processing the text comprises:
performing word segmentation on the text to obtain the word.
3. The method of claim 2, wherein the step of pre-processing the text further comprises:
removing stop words in the text; or
extracting the text from rich text in HTML format using regular expressions.
4. The method of claim 1, wherein the step of determining a category of semantic roles for the word based on the part of speech and the grammatical structure further comprises:
taking verbs in the grammatical structure as predicates based on the parts of speech;
determining related words related to the predicates according to the predicates; and
determining a category of semantic roles for the related words based on a classification model.
5. The method of claim 4, wherein the step of determining a category of semantic roles for the related words based on a classification model comprises:
classifying each word in the related words according to the categories of semantic roles, and determining the categories to which each word can belong; and
for any one of the categories, evaluating the score of each word subordinate to the category, and determining a word whose score is higher than a threshold value as having the semantic role of that category.
6. The method of claim 1, wherein the step of adjusting the summary comprises:
if the part of speech of the first word in the abstract is a pronoun, searching, in the sentences of the text before the abstract, for the word that is closest to the abstract and whose semantic role category is object, and replacing the first word with that word; or
if the part of speech of the first word in the abstract is not a noun or a pronoun, searching, in the sentences of the text before the abstract, for the word that is closest to the abstract and whose semantic role category is subject, and adding that word to the beginning of the abstract.
7. The method of claim 1, wherein the step of adjusting the summary comprises:
if the words of which the category of the semantic role is time and/or place are not contained in the abstract and the words of which the category of the semantic role is time and/or place exist in the text, the words of which the category of the semantic role is time and/or place are added to the beginning of the abstract.
8. A method for generating a summary of text, comprising:
preprocessing the text;
extracting a first abstract of the text from sentences of the text according to a preset algorithm;
marking the part of speech of the words in the first abstract and determining a grammatical structure existing in the first abstract;
determining a category of semantic roles for the word based on the part of speech and the grammar structure;
extracting a second abstract of the first abstract from the clauses of the first abstract according to the preset algorithm; and
adjusting the second abstract.
9. The method of claim 8, wherein the step of pre-processing the text comprises:
performing word segmentation on the text to obtain the word.
10. The method of claim 9, wherein the step of pre-processing the text further comprises:
removing stop words in the text; or
extracting the text from rich text in HTML format using regular expressions.
11. The method of claim 8, wherein the step of determining a category of semantic roles for the word based on the part of speech and the grammatical structure further comprises:
taking verbs in the grammatical structure as predicates based on the parts of speech;
determining related words related to the predicates according to the predicates; and
determining a category of semantic roles for the related words based on a classification model.
12. The method of claim 11, wherein the step of determining a category of semantic roles for the related words based on a classification model comprises:
classifying each word in the related words according to the categories of semantic roles, and determining the categories to which each word can belong; and
for any one of the categories, evaluating the score of each word subordinate to the category, and determining a word whose score is higher than a threshold value as having the semantic role of that category.
13. The method of claim 8, wherein the step of adjusting the second summary comprises:
if the part of speech of the first word in the second abstract is a pronoun, searching, in the clauses of the first abstract before the second abstract, for the word that is closest to the second abstract and whose semantic role category is object, and replacing the first word with that word; or
if the part of speech of the first word in the second abstract is not a noun or a pronoun, searching, in the clauses of the first abstract before the second abstract, for the word that is closest to the second abstract and whose semantic role category is subject, and adding that word to the beginning of the second abstract.
14. The method of claim 8, wherein the step of adjusting the second summary comprises:
if the words of which the category of the semantic role is time and/or place are not contained in the second abstract and the words of which the category of the semantic role is time and/or place exist in the first abstract, the words of which the category of the semantic role is time and/or place are added to the beginning of the second abstract.
15. A system for generating a summary of text, comprising:
a text pre-processing system for pre-processing the text;
a part-of-speech tagging and syntactic structure analysis system for tagging parts-of-speech of words in the text and determining syntactic structures present in the text;
a semantic role tagging system for determining a category of a semantic role for the word based on the part of speech and the syntactic structure;
a single sentence abstract extraction system for extracting an abstract of the text from the clauses of the text according to a predetermined algorithm; and
a summary adjustment system for adjusting the abstract.
16. The system of claim 15, wherein the text pre-processing system comprises:
means for segmenting the text to obtain the words.
17. The system of claim 16, wherein the text pre-processing system further comprises:
means for removing stop words in the text; or
means for extracting the text from rich text in HTML format using regular expressions.
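The pre-processing of claims 16 and 17 can be sketched as below. The tag pattern, the tiny stop-word set, and the regex-based `segment` function are illustrative assumptions; in particular, Chinese text would need a dedicated word segmenter, and the claim lists stop-word removal and HTML extraction as alternatives, whereas this sketch simply chains them.

```python
import re
from typing import List

TAG_RE = re.compile(r"<[^>]+>")          # naive HTML tag pattern (assumption)
STOP_WORDS = {"the", "a", "of"}          # illustrative stop-word list


def extract_text(html: str) -> str:
    """Strip HTML tags with a regular expression and collapse whitespace."""
    text = TAG_RE.sub(" ", html)
    return re.sub(r"\s+", " ", text).strip()


def segment(text: str) -> List[str]:
    """Placeholder segmentation: lower-cased runs of word characters."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(words: List[str]) -> List[str]:
    return [w for w in words if w not in STOP_WORDS]


plain = extract_text("<p>The <b>summary</b> of a text</p>")
words = remove_stop_words(segment(plain))
```

A regular expression is enough for simple markup like this, but it does not handle scripts, comments, or character entities; those cases are outside this sketch.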
18. The system of claim 15, wherein the semantic role tagging system comprises:
means for determining a predicate from the syntactic structure based on the part of speech;
means for determining a related word related to the predicate from the predicate; and
means for determining a category of semantic role for the related word based on a classification model.
19. The system of claim 18, wherein the means for determining the category of the semantic role for the related term based on a classification model further comprises:
a module for classifying each of the related words by semantic role category and determining the categories to which each word can belong; and
a module for evaluating, for any one of the categories, the score of each word belonging to that category, and determining a word whose score is higher than a threshold value as having the semantic role of that category.
20. The system of claim 15, wherein the summary adjustment system comprises:
a module for, if the part of speech of the first word in the abstract is a pronoun, searching the clauses of the text that precede the abstract for the word that is closest to the abstract and whose semantic role category is object, and replacing the first word with that word; or
a module for, if the part of speech of the first word in the abstract is neither a noun nor a pronoun, searching the clauses of the text that precede the abstract for the word that is closest to the abstract and whose semantic role category is subject, and adding that word to the beginning of the abstract.
21. The system of claim 15, wherein the summary adjustment system further comprises:
means for adding a word whose semantic role category is time and/or place to the beginning of the abstract if the abstract contains no word whose semantic role category is time and/or place and such a word is present in the text.
22. A system for generating a summary of text, comprising:
a text pre-processing system for pre-processing the text;
a single sentence abstract extraction system for extracting a first abstract of the text from the sentences of the text according to a predetermined algorithm;
a part-of-speech tagging and syntactic structure analysis system for tagging parts of speech of words in the first abstract and determining syntactic structures existing in the first abstract;
a semantic role tagging system for determining a category of a semantic role for the word based on the part of speech and the syntactic structure;
a single sentence abstract extraction system for extracting a second abstract of the first abstract from the clauses of the first abstract according to the predetermined algorithm; and
a summary adjustment system for adjusting the second abstract.
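The two extraction passes of claim 22 (a first abstract chosen among sentences, then a second abstract chosen among the clauses of the first) can be sketched as below. The claim only refers to "a predetermined algorithm"; the word-frequency scorer `pick_best` used here is a stand-in chosen for brevity, not the algorithm of the patent, and the delimiter sets are assumptions. Part-of-speech tagging, semantic role tagging, and summary adjustment are omitted.

```python
import re
from collections import Counter
from typing import List, Tuple


def split_units(text: str, delimiters: str) -> List[str]:
    """Split text on any of the given delimiter characters, dropping empty parts."""
    parts = re.split(f"[{re.escape(delimiters)}]", text)
    return [p.strip() for p in parts if p.strip()]


def pick_best(units: List[str]) -> str:
    """Stand-in scorer: pick the unit whose words are most frequent across all units."""
    freq = Counter(w for u in units for w in u.lower().split())
    return max(units, key=lambda u: sum(freq[w] for w in u.lower().split()))


def two_pass_summary(text: str) -> Tuple[str, str]:
    first = pick_best(split_units(text, ".!?"))    # pass 1: sentence level
    second = pick_best(split_units(first, ",;"))   # pass 2: clause level
    return first, second


text = ("Payments grew fast. "
        "Payments by card grew in cities, and mobile use rose. "
        "Weather was mild.")
first_abstract, second_abstract = two_pass_summary(text)
```

Because the stand-in scorer sums word frequencies, it favors longer units; a graph-based ranking or any other scoring function could be substituted for `pick_best` without changing the two-pass structure.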
23. The system of claim 22, wherein the text pre-processing system comprises:
means for segmenting the text to obtain the words.
24. The system of claim 23, wherein the text pre-processing system further comprises:
means for removing stop words in the text; or
means for extracting the text from rich text in HTML format using regular expressions.
25. The system of claim 22, wherein the semantic role tagging system comprises:
means for determining a predicate from the syntactic structure based on the part of speech;
means for determining a related word related to the predicate from the predicate; and
means for determining a category of semantic role for the related word based on a classification model.
26. The system of claim 25, wherein the means for determining the category of the semantic role for the related term based on a classification model further comprises:
a module for classifying each of the related words by semantic role category and determining the categories to which each word can belong; and
a module for evaluating, for any one of the categories, the score of each word belonging to that category, and determining a word whose score is higher than a threshold value as having the semantic role of that category.
27. The system of claim 22, wherein the summary adjustment system comprises:
a module for, if the part of speech of the first word in the second abstract is a pronoun, searching the clauses of the first abstract that precede the second abstract for the word that is closest to the second abstract and whose semantic role category is object, and replacing the first word with that word; or
a module for, if the part of speech of the first word in the second abstract is neither a noun nor a pronoun, searching the clauses of the first abstract that precede the second abstract for the word that is closest to the second abstract and whose semantic role category is subject, and adding that word to the beginning of the second abstract.
28. The system of claim 22, wherein the summary adjustment system comprises:
means for adding a word whose semantic role category is time and/or place to the beginning of the second abstract if the second abstract contains no word whose semantic role category is time and/or place and such a word is present in the first abstract.
29. A computer readable medium having computer readable instructions stored thereon which, when executed by a computer, are capable of performing the method of any one of claims 1-4.
CN202010065621.3A 2020-01-20 2020-01-20 Method and system for generating abstract of text Active CN111274792B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010065621.3A CN111274792B (en) 2020-01-20 2020-01-20 Method and system for generating abstract of text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010065621.3A CN111274792B (en) 2020-01-20 2020-01-20 Method and system for generating abstract of text

Publications (2)

Publication Number Publication Date
CN111274792A (en) 2020-06-12
CN111274792B (en) 2023-06-27

Family

ID=70999005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010065621.3A Active CN111274792B (en) 2020-01-20 2020-01-20 Method and system for generating abstract of text

Country Status (1)

Country Link
CN (1) CN111274792B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104360993A (en) * 2014-11-19 2015-02-18 广州极盛信息科技开发有限公司 Method for extracting needed content from text
US20170060826A1 (en) * 2015-08-26 2017-03-02 Subrata Das Automatic Sentence And Clause Level Topic Extraction And Text Summarization
CN106599041A (en) * 2016-11-07 2017-04-26 中国电子科技集团公司第三十二研究所 Text processing and retrieval system based on big data platform
CN106777275A (en) * 2016-12-29 2017-05-31 北京理工大学 Entity attribute and property value extracting method based on many granularity semantic chunks
US20170161372A1 (en) * 2015-12-04 2017-06-08 Codeq Llc Method and system for summarizing emails and extracting tasks
CN109271626A (en) * 2018-08-31 2019-01-25 北京工业大学 Text semantic analysis method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xu Chi et al.: "Automatic text summarization algorithm based on TextRank and GloVe" *

Also Published As

Publication number Publication date
CN111274792B (en) 2023-06-27

Similar Documents

Publication Publication Date Title
CN101510221B (en) Enquiry statement analytical method and system for information retrieval
US9430742B2 (en) Method and apparatus for extracting entity names and their relations
CN106951530B (en) Event type extraction method and device
CN106777275A (en) Entity attribute and property value extracting method based on many granularity semantic chunks
CN108875059B (en) Method and device for generating document tag, electronic equipment and storage medium
CN110263248A (en) A kind of information-pushing method, device, storage medium and server
CN109635082A (en) Policy implication analysis method, device, computer equipment and storage medium
Curtotti et al. Corpus based classification of text in Australian contracts
CN110909122A (en) Information processing method and related equipment
CN112699677B (en) Event extraction method and device, electronic equipment and storage medium
CN109271624B (en) Target word determination method, device and storage medium
CN113722492A (en) Intention identification method and device
CN110781669A (en) Text key information extraction method and device, electronic equipment and storage medium
Pan et al. Charge prediction for multi-defendant cases with multi-scale attention
Abdurakhmonova Formal-Functional Models of The Uzbek Electron Corpus
CN102163189A (en) Method and device for extracting evaluative information from critical texts
Subha et al. Quality factor assessment and text summarization of unambiguous natural language requirements
Reddy et al. An efficient approach for web document summarization by sentence ranking
CN111274792B (en) Method and system for generating abstract of text
Liu et al. Japanese named entity recognition for question answering system
CN103942188B (en) A kind of method and apparatus identifying language material language
Li-Juan et al. A classification method of Vietnamese news events based on maximum entropy model
Das et al. Sentence level emotion tagging
Kitoogo et al. Towards domain independent named entity recognition
Xu et al. Building comparative product relation maps by mining consumer opinions on the web

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant