CN116501862B - Automatic text extraction system based on dynamic distributed collection - Google Patents
- Publication number
- CN116501862B (application CN202310748841.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- extracted
- words
- sentences
- level
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of automatic text extraction, and particularly relates to an automatic text extraction system based on dynamic distributed collection. The system classifies text content into a plurality of sections to be extracted according to the multi-level titles of the text content, and extracts from each section the associated words and associated sentences corresponding to its title. The sections are then preprocessed by the pre-extraction module to obtain a transfer data set that contains the associated words and associated sentences of each section, which reduces the amount of data to be processed in subsequent steps. The evaluation module determines the priority of the associated words and associated sentences, and finally the extraction module extracts the contents of the transfer data set.
Description
Technical Field
The invention belongs to the technical field of automatic text extraction, and particularly relates to an automatic text extraction system based on dynamic distributed collection.
Background
Text extraction condenses complicated text content into simple, clear summary sentences or keywords, so that readers can grasp the meaning the text actually expresses through those keywords or summary sentences. The traditional approach accurately reflects the central idea of the text, but extraction can only be performed after the central idea has been determined. With the development of information technology, text can be extracted automatically by recognizing the document, which saves reading time and effectively helps readers understand the text content.
In the prior art, automatic text extraction scans and recognizes the entire text content to obtain the keywords or sentences that form a text summary. However, text content is often laid out in several sections whose central ideas may differ, so the central ideas of some sections may be missed when keywords or associated sentences are extracted, and the extraction result then fails to meet user requirements.
Disclosure of Invention
The invention aims to provide a text automatic extraction system based on dynamic distributed collection, which can classify text contents into a plurality of sections to be extracted according to multi-level titles of the text contents, and extract the contents of the sections to be extracted respectively.
The technical scheme adopted by the invention is as follows:
a text automatic extraction system based on dynamic distributed collection comprises a text acquisition module, a classification and identification module, an association extraction module, a pre-extraction module, an evaluation module and an extraction module;
the text acquisition module is used for scanning and acquiring text contents to obtain a text to be extracted;
the classification and identification module is used for identifying multi-level titles in the text to be extracted and classifying the text content to be extracted into a plurality of sections to be extracted according to the multi-level titles;
the association extraction module is used for extracting association words and association sentences from the sections to be extracted;
the pre-extraction module is used for extracting sample words and sample sentences from the text to be extracted according to the associated words and the associated sentences to obtain a transfer data set;
the evaluation module is used for evaluating the priority of the sample word and the consistency of the sample sentence according to the weight value of the multi-level title;
the extraction module is used for obtaining user requirements, extracting key words from a plurality of sample words according to the user requirements, and summarizing the sample sentences to obtain text summaries corresponding to the text contents.
In a preferred scheme, when the text to be extracted is classified, determining the subordinate relation of the multi-level title to obtain an upper title and a lower title, and judging whether text contents exist between the upper title and the lower title;
if the text content exists, determining the text content between the upper title and the lower title as a section to be extracted;
if not, the upper title is screened out and the lower title takes its place.
In a preferred scheme, when the association extraction module executes, the text content of the multi-level title is identified and split to obtain a plurality of reference words;
calling a plurality of sections to be extracted, which correspond to the multi-level titles, respectively, extracting vocabularies associated with the multi-level titles from the sections to be extracted, and calibrating the vocabularies as associated words;
and identifying the sentence in which the associated word is located, and calibrating the sentence as the associated sentence.
In a preferred scheme, the association extraction module comprises a screening unit. After the associated sentences are determined, the screening unit executes: it counts the number of associated words in each associated sentence, marks that count as the parameter to be compared, and sorts all the associated sentences according to the size of the parameter to be compared to obtain a plurality of parallel associated sentences;
a screening threshold value for comparing with the parameter to be compared is preset in the screening unit;
if the screening threshold value is smaller than the parameter to be compared, screening the corresponding associated sentence;
and if the screening threshold value is greater than or equal to the parameter to be compared, reserving the corresponding associated statement.
In a preferred scheme, the screening threshold is set according to user requirements, wherein the user requirements comprise the required number of keywords, the required number of associated sentences, and the required word count of the text summary.
In a preferred scheme, when the pre-extraction module executes, generating a transit data set, wherein sample words in the transit data set are arranged according to occurrence frequency, and the sample sentences are arranged according to the subordinate relations of the multi-level titles and the distribution positions of the sample words in the text content;
the pre-extraction module comprises a screening unit, wherein the screening unit is used for screening sample sentences in each section to be extracted;
the screening unit is internally preset with a first-level screening threshold value and a second-level screening threshold value, the first-level screening threshold value is used for evaluating the word number of sample sentences, the second-level screening threshold value is used for evaluating the number of the sample sentences, and the evaluation priority of the first-level screening threshold value is higher than that of the second-level screening threshold value.
In a preferred embodiment, the evaluation module assigns weight values according to the membership of the multi-level title when executed, wherein the assignment of weight values is performed based on a factor analysis method.
In a preferred scheme, a transition vocabulary library is preset in the evaluation module, wherein the transition vocabulary library comprises a plurality of transition words, and the transition words are used to bridge sample sentences in adjacent positions.
In a preferred scheme, the extraction module, when executed, obtains sample words and sample sentences from the transfer data set and determines them as the keywords and the text summary, respectively.
The invention also provides a text automatic extraction terminal based on dynamic distributed collection, which comprises:
at least one processor;
and a memory communicatively coupled to the at least one processor;
wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the operations of the above-described automatic text extraction system based on dynamic distributed collection.
The invention has the technical effects that:
The system classifies text content into a plurality of sections to be extracted according to the multi-level titles of the text content, and extracts from each section the associated words and associated sentences corresponding to its title. The sections are then preprocessed by the pre-extraction module to obtain a transfer data set that contains the associated words and associated sentences of each section, which reduces the amount of data to be processed in subsequent steps. The evaluation module determines the priority of the associated words and associated sentences, and finally the extraction module extracts the contents of the transfer data set.
Drawings
FIG. 1 is a system operational diagram provided by the present invention;
FIG. 2 is a block diagram of a system provided by the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one preferred embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
Referring to FIG. 1 and FIG. 2, the invention provides an automatic text extraction system based on dynamic distributed collection, which comprises a text acquisition module, a classification and identification module, an association extraction module, a pre-extraction module, an evaluation module and an extraction module;
the text acquisition module is used for scanning and acquiring text contents to obtain a text to be extracted;
the classification and identification module is used for identifying multi-level titles in the text to be extracted and classifying the text content to be extracted into a plurality of sections to be extracted according to the multi-level titles;
the association extraction module is used for extracting association words and association sentences from the sections to be extracted;
the pre-extraction module is used for extracting sample words and sample sentences from the text to be extracted according to the associated words and the associated sentences to obtain a transfer data set;
the evaluation module is used for evaluating the priority of the sample word and the consistency of the sample sentence according to the weight value of the multi-level title;
the extraction module is used for obtaining user requirements, extracting key words from a plurality of sample words according to the user requirements, and summarizing the sample sentences to obtain text summaries corresponding to the text contents.
Specifically, text extraction condenses complicated text content into simple summary sentences or keywords, so that readers can grasp the meaning the text actually expresses and the result accurately reflects the central idea of the text. In this embodiment, the text acquisition module first acquires the content, which may be in the form of pictures, charts and the like, and scans it to convert it into literal text for subsequent recognition and extraction; this is a conventional technique for those skilled in the art and is not described further. The classification and identification module then determines the multi-level titles in the text to be extracted, among which corresponding parallel and subordinate relations exist, and classifies the text into a plurality of sections to be extracted by taking the text content between adjacent titles; the content of each section necessarily centers on its title. Next, the association extraction module extracts associated words and associated sentences from each section. To extract associated words, the title of the section is split into a plurality of reference words, and the associated words in the section are determined one by one according to the reference words so that they can be counted; to extract associated sentences, only the sentences containing the associated words need to be identified. The pre-extraction module then performs the pre-extraction operation to obtain a transfer data set, and the evaluation module determines the priority of the sample words and sample sentences. Finally, the extraction module automatically extracts the sample words and sample sentences, so that keywords and a text summary meeting the user's requirements are obtained.
In a preferred embodiment, when the text to be extracted is classified, determining the subordinate relation of the multi-level title to obtain an upper-level title and a lower-level title, and judging whether text contents exist between the upper-level title and the lower-level title;
if the text content exists, determining the text content between the upper title and the lower title as a section to be extracted;
if not, the upper title is screened out and the lower title takes its place.
In this embodiment, after the text to be extracted is obtained, the subordinate relations between its multi-level titles are determined, and the titles are classified into upper-level titles and lower-level titles accordingly; parallel relations may also exist among the titles. Because there may be no text content between an adjacent upper-level title and lower-level title, this check prevents a blank section to be extracted from arising in the first place.
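The sectioning rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes titles are written as Markdown-style lines (`#`, `##`, ...), which the text does not specify, and the names `Section`, `TITLE_RE` and `split_into_sections` are hypothetical. A title with no text before the next title produces no section, which mirrors screening out an upper title that has no content of its own.

```python
import re
from dataclasses import dataclass

# Assumption: a title is a line such as "# Title" or "## Sub-title".
# The number of "#" characters stands in for the title level.
TITLE_RE = re.compile(r"^(#+)\s+(.*\S)\s*$")


@dataclass
class Section:
    level: int   # 1 = top-level title, 2 = its sub-title, ...
    title: str
    body: str    # text found between this title and the next one


def split_into_sections(text: str) -> list[Section]:
    """Split text into sections to be extracted, one per title that has content."""
    sections: list[Section] = []
    level, title, body = 0, None, []

    def close_current() -> None:
        # Emit a section only when text exists between this title and the next;
        # a title with no content is dropped, so no blank section is created.
        content = "\n".join(body).strip()
        if title is not None and content:
            sections.append(Section(level, title, content))

    for line in text.splitlines():
        match = TITLE_RE.match(line)
        if match:
            close_current()
            level, title, body = len(match.group(1)), match.group(2), []
        elif title is not None:
            body.append(line)
    close_current()
    return sections
```

For a document in which `# Methods` is followed immediately by `## Sampling`, only the `Sampling` section is produced; `Methods` contributes no blank section.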
Secondly, when the association extraction module executes, recognizing the text content of the multi-level title, and splitting the text content to obtain a plurality of reference words;
calling a plurality of sections to be extracted, which correspond to the multi-level titles, respectively, extracting vocabularies associated with the multi-level titles from the sections to be extracted, and calibrating the vocabularies as associated words;
and identifying the sentence in which the associated word is located, and calibrating the sentence as the associated sentence.
In the above, when the association extraction module is executed, the text content in the multi-level title needs to be identified first and split into a plurality of reference words, then based on the reference words, a plurality of vocabularies related to the reference words are extracted from the section to be extracted corresponding to the title as association words, and the sentences where the association words are located are determined as association sentences.
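A hedged sketch of this step is given below. The text does not define how a word is judged to be "associated" with a reference word, so the sketch assumes a crude rule (sharing a four-letter prefix) purely for illustration; the function names and the minimum word length are likewise hypothetical.

```python
import re


def reference_words(title: str) -> set[str]:
    """Split a title into reference words (assumption: ignore words of 3 letters or fewer)."""
    return {w for w in re.findall(r"[a-z]+", title.lower()) if len(w) > 3}


def split_sentences(body: str) -> list[str]:
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", body) if s.strip()]


def is_associated(word: str, refs: set[str]) -> bool:
    # Assumption: two words are associated when they share a 4-letter prefix.
    return any(word[:4] == ref[:4] for ref in refs)


def extract_associations(title: str, body: str) -> tuple[set[str], list[str]]:
    """Return (associated words, associated sentences) for one section."""
    refs = reference_words(title)
    words: set[str] = set()
    sentences: list[str] = []
    for sentence in split_sentences(body):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        hits = {t for t in tokens if len(t) > 3 and is_associated(t, refs)}
        if hits:
            words |= hits
            # The sentence in which an associated word is located
            # is calibrated as an associated sentence.
            sentences.append(sentence)
    return words, sentences
```

A real system would likely replace `is_associated` with a proper lexical or semantic similarity measure; the prefix rule is only there to keep the sketch self-contained.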
Secondly, the association extraction module comprises a screening unit. After the associated sentences are determined, the screening unit executes: it counts the number of associated words in each associated sentence, marks that count as the parameter to be compared, and sorts all the associated sentences according to the size of the parameter to be compared to obtain a plurality of parallel associated sentences;
a screening threshold value for comparing with the parameter to be compared is preset in the screening unit;
if the screening threshold value is smaller than the parameter to be compared, screening the corresponding associated sentence;
and if the screening threshold value is greater than or equal to the parameter to be compared, reserving the corresponding associated statement.
In this embodiment, the screening unit screens the associated sentences by the parameter to be compared, which is the number of associated words in a given associated sentence. The more associated words a sentence contains, the higher its priority and the more likely it is to be extracted. A screening threshold is preset in the screening unit; once the parameter to be compared has been determined, it is compared with the screening threshold, and the corresponding associated sentence is retained only when the screening threshold is greater than or equal to the parameter to be compared. This reduces the amount of subsequent extraction without affecting the extraction result.
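The counting, sorting and threshold comparison can be sketched as below. The comparison follows the wording of the text literally (a sentence is retained when the screening threshold is greater than or equal to its count, and screened otherwise); the function names and the tokenization are assumptions made for the example.

```python
import re


def count_associated(sentence: str, associated_words: set[str]) -> int:
    """Parameter to be compared: number of associated-word occurrences in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return sum(1 for token in tokens if token in associated_words)


def screen_sentences(
    sentences: list[str], associated_words: set[str], threshold: int
) -> tuple[list[tuple[int, str]], list[tuple[int, str]]]:
    """Sort sentences by their count, then partition them against the threshold.

    Returns (retained, screened), each a list of (count, sentence) pairs
    ordered from the highest count to the lowest.
    """
    scored = [(count_associated(s, associated_words), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Literal reading of the text: threshold >= count means "retained",
    # threshold < count means "screened".
    retained = [(n, s) for n, s in scored if threshold >= n]
    screened = [(n, s) for n, s in scored if threshold < n]
    return retained, screened
```

Returning both partitions keeps the sketch neutral: a caller who reads the threshold the other way round can simply use the second list.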
In a preferred embodiment, the screening threshold is set according to a user requirement, wherein the user requirement includes a keyword requirement amount, an associated sentence requirement amount, and a text summary requirement word number.
In this embodiment, setting the screening threshold requires the user's actual needs to be made clear, such as the required number of keywords, the required number of associated sentences, and the required word count of the text summary. For example, when a paper that requires 3 keywords is being written, the screening threshold may be set to 6, 9, and so on, so that the user has candidates to choose from. Similarly, when a document is being read and the central idea of each section to be extracted needs to be summarized, the screening threshold may be set in the same way, which is not repeated here.
Secondly, when the pre-extraction module executes, a transfer data set is generated, wherein sample words in the transfer data set are arranged according to occurrence frequency, and the sample sentences are arranged according to the subordinate relations of the multi-level titles and the distribution positions of the sample words in the text content;
the pre-extraction module comprises a screening unit, wherein the screening unit is used for screening sample sentences in each section to be extracted;
the screening unit is internally preset with a first-level screening threshold value and a second-level screening threshold value, the first-level screening threshold value is used for evaluating the word number of sample sentences, the second-level screening threshold value is used for evaluating the number of the sample sentences, and the evaluation priority of the first-level screening threshold value is higher than that of the second-level screening threshold value.
After the pre-extraction module is executed, a transfer data set is obtained. The sample words in the transfer data set are arranged according to their occurrence frequency, and the sample sentences are arranged according to the subordinate relations of the multi-level titles and their distribution positions in the text content. This ordering preserves the continuity of the sample sentences, so that readers can clearly follow the sequence of the whole text. To keep the sample sentences in each section to be extracted from becoming too complicated, the screening unit limits both the word count and the number of sample sentences in each section. The text summary formed from the resulting sample sentences can therefore accurately reflect the overall meaning of the text content, and the extraction result can meet user requirements.
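One way to picture the transfer data set and its two screening thresholds is the sketch below. It assumes both thresholds are upper limits (maximum words per sentence, maximum sentences per section) and applies the first-level threshold before the second, matching its higher evaluation priority; the data layout and names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SampleSentence:
    section_index: int  # position of the section in title order
    position: int       # position of the sentence inside its section
    text: str


def build_transfer_dataset(
    words: list[str],
    sentences: list[SampleSentence],
    max_words: int,      # first-level threshold: word count of a sample sentence
    max_sentences: int,  # second-level threshold: sample sentences per section
) -> dict:
    # Sample words are arranged by occurrence frequency, most frequent first.
    word_order = [word for word, _ in Counter(words).most_common()]

    # Sample sentences follow the title hierarchy, then their position in the text.
    ordered = sorted(sentences, key=lambda s: (s.section_index, s.position))

    # First-level screening (higher priority): drop over-long sentences.
    short_enough = [s for s in ordered if len(s.text.split()) <= max_words]

    # Second-level screening: cap the number of sentences kept per section.
    kept: list[SampleSentence] = []
    per_section: Counter = Counter()
    for sentence in short_enough:
        if per_section[sentence.section_index] < max_sentences:
            kept.append(sentence)
            per_section[sentence.section_index] += 1

    return {"sample_words": word_order, "sample_sentences": kept}
```

Applying the word-count limit first means a long sentence never occupies one of the limited per-section slots, which is one plausible reading of the stated priority.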
And then, when the evaluation module executes, weight values are distributed according to the subordinate relations of the multi-level titles, wherein the distribution of the weight values is executed based on a factor analysis method.
In this embodiment, the evaluation module evaluates the priority of the keywords and the sample sentences, with an upper-level title carrying a higher weight value than a lower-level title. Keyword extraction therefore stays closer to the topic of the article while the keywords remain linked to each section to be extracted, and the resulting sample sentences summarize the text and the central idea of each section more concisely, which keeps the summary relevant to the text content and on topic.
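The effect of title-level weights on word priority can be illustrated with the sketch below. It does not implement factor analysis, which the text names as the basis for weight assignment; instead it substitutes a simple assumed rule (weight = 1 / title level) that only preserves the stated property that an upper-level title outweighs a lower-level one. All names are hypothetical.

```python
from collections import Counter


def title_weight(level: int) -> float:
    # Assumed stand-in for the factor-analysis weights: level 1 -> 1.0,
    # level 2 -> 0.5, and so on, so upper titles always weigh more.
    return 1.0 / level


def word_priorities(sections: list[tuple[int, list[str]]]) -> list[tuple[str, float]]:
    """Rank sample words by weighted frequency.

    `sections` holds (title level, sample words found in that section) pairs.
    Each occurrence of a word adds the weight of its section's title.
    """
    scores: Counter = Counter()
    for level, words in sections:
        for word in words:
            scores[word] += title_weight(level)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

With this rule a word that appears once under a top-level title scores the same as a word that appears twice under a second-level title.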
In a preferred embodiment, a transition vocabulary library is preset in the evaluation module, where the transition vocabulary library includes a plurality of transition words, and the transition words are used to bridge sample sentences in adjacent positions.
In this embodiment, when sample sentences are combined into a text summary, the transitions between sentences may read stiffly, so a corresponding transition word such as "but", "another" or "and" is inserted to bridge them (further examples are not listed here). The word count of the transition words is included in the total word count of the text summary. If a sentence still reads stiffly after a transition word is added, the next sample sentence is matched automatically, which ensures the coherence of the text summary and the applicability of the extraction result.
When the extraction module executes, sample words and sample sentences are obtained from the transfer data set and are determined as the keywords and the text summary, respectively.
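The word-count bookkeeping for transition words can be sketched as follows. The sketch assumes a tiny transition list and simply cycles through it, because the text does not say how a suitable transition word is chosen or how stiffness is detected; it only demonstrates that the transition words count toward the summary's word total. Skipping a sentence that would exceed the budget is likewise an assumption.

```python
from itertools import cycle

# Assumed miniature transition vocabulary library.
TRANSITIONS = ["But", "And"]


def word_count(text: str) -> int:
    return len(text.split())


def assemble_summary(
    sentences: list[str], max_words: int, transitions: list[str] = TRANSITIONS
) -> str:
    """Join non-empty sample sentences with transition words under a word budget."""
    parts: list[str] = []
    used = 0
    picker = cycle(transitions)
    for sentence in sentences:
        if parts:
            # Prefix a transition word and lower-case the sentence's first letter.
            candidate = f"{next(picker)} {sentence[0].lower()}{sentence[1:]}"
        else:
            candidate = sentence
        cost = word_count(candidate)  # transition words are counted too
        if used + cost > max_words:
            continue  # does not fit; try the next sample sentence
        parts.append(candidate)
        used += cost
    return " ".join(parts)
```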
When text extraction is executed, the text acquisition module first acquires the content, which may be in the form of pictures, charts and the like, and scans it to obtain literal text content for subsequent recognition and extraction; this is a conventional technique for those skilled in the art and is not described further. The classification and identification module then determines the multi-level titles in the text to be extracted, among which corresponding parallel and subordinate relations exist, and classifies the text into a plurality of sections to be extracted by taking the text content between adjacent titles; the content of each section necessarily centers on its title. Next, the association extraction module extracts associated words and associated sentences from each section. To extract associated words, the title of the section is split into a plurality of reference words, and the associated words in the section are determined one by one according to the reference words so that they can be counted; to extract associated sentences, only the sentences containing the associated words need to be identified. The pre-extraction module then performs the pre-extraction operation to obtain a transfer data set, and the evaluation module determines the priority of the sample words and sample sentences. Finally, the extraction module automatically extracts the sample words and sample sentences. For example, if the user requires 3 keywords and a text summary of no more than 200 words, the three highest-priority words in the transfer data set are selected as keywords, and the text summary is formed by splicing sentences from each section to be extracted. In this way, keywords and a text summary meeting the user's requirements are obtained.
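The final selection step in the example (3 keywords, at most 200 words of summary) can be sketched as below. It assumes the sample words arrive already ranked by priority and the sample sentences already ordered by section, as earlier stages would provide; `UserRequirement` and `extract` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class UserRequirement:
    keyword_count: int = 3        # number of keywords the user asks for
    max_summary_words: int = 200  # upper limit on the summary's word count


def extract(
    ranked_words: list[str], ordered_sentences: list[str], req: UserRequirement
) -> tuple[list[str], str]:
    """Pick the top-priority keywords and splice sentences into a summary."""
    keywords = ranked_words[: req.keyword_count]

    summary_parts: list[str] = []
    used = 0
    for sentence in ordered_sentences:
        length = len(sentence.split())
        if used + length > req.max_summary_words:
            break  # stop once the next sentence would exceed the word limit
        summary_parts.append(sentence)
        used += length
    return keywords, " ".join(summary_parts)
```

Calling `extract(ranked_words, ordered_sentences, UserRequirement())` applies the defaults from the example in the text.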
The invention also provides a text automatic extraction terminal based on dynamic distributed collection, which comprises:
at least one processor;
and a memory communicatively coupled to the at least one processor;
wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the operations of the above-described automatic text extraction system based on dynamic distributed collection.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.
The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention. Structures, devices and methods of operation not specifically described and illustrated herein, unless otherwise indicated and limited, are implemented according to conventional means in the art.
Claims (8)
1. An automatic text extraction system based on dynamic distributed collection, comprising a text acquisition module, a classification and identification module, an association extraction module, a pre-extraction module, an evaluation module and an extraction module, characterized in that:
the text acquisition module is used for scanning and acquiring text contents to obtain a text to be extracted;
the classification and identification module is used for identifying multi-level titles in the text to be extracted and classifying the text content to be extracted into a plurality of sections to be extracted according to the multi-level titles;
the association extraction module is used for extracting association words and association sentences from the sections to be extracted;
when the association extraction module executes, recognizing the text content of the multi-level title, and splitting the text content to obtain a plurality of reference words;
calling a plurality of sections to be extracted, which correspond to the multi-level titles, respectively, extracting vocabularies associated with the multi-level titles from the sections to be extracted, and calibrating the vocabularies as associated words;
identifying the sentence in which the related word is located, and calibrating the sentence as the related sentence;
the association extraction module comprises a screening unit, after the association sentences are determined, the screening unit performs the operation, counts the number of the association words in each association sentence, marks the association words as parameters to be compared, and sorts all the association sentences according to the size of the parameters to be compared to obtain a plurality of parallel association sentences;
a screening threshold value for comparing with the parameter to be compared is preset in the screening unit;
if the screening threshold value is smaller than the parameter to be compared, screening the corresponding associated sentence;
if the screening threshold value is larger than or equal to the parameter to be compared, reserving the corresponding associated statement;
the pre-extraction module is used for extracting sample words and sample sentences from the text to be extracted according to the associated words and the associated sentences to obtain a transfer data set;
the evaluation module is used for evaluating the priority of the sample word and the consistency of the sample sentence according to the weight value of the multi-level title;
the extraction module is used for obtaining user requirements, extracting key words from a plurality of sample words according to the user requirements, and summarizing the sample sentences to obtain text summaries corresponding to the text contents.
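For illustration only, the association extraction and screening steps recited in claim 1 might be sketched in Python as follows. This is a minimal sketch, not the patented implementation: the function names (`tokenize`, `split_sentences`, `associated_sentences`, `screen`), the regular-expression tokenizer and sentence splitter, the descending sort order, and the reading of an associated word as a section word that matches a reference word taken from the title are all assumptions, since the claim does not fix any of them.

```python
import re

def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def split_sentences(text):
    # Assumed sentence splitter: break after '.', '!' or '?'.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def associated_sentences(title, section):
    """Return (sentence, count) pairs, where count is the number of
    associated words (the parameter to be compared) in the sentence."""
    reference_words = set(tokenize(title))  # title split into reference words
    scored = []
    for sentence in split_sentences(section):
        count = sum(1 for w in tokenize(sentence) if w in reference_words)
        if count > 0:  # only sentences containing an associated word
            scored.append((sentence, count))
    # Sort by the parameter to be compared (descending order is an assumption).
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored

def screen(scored, threshold):
    # Follows the claim wording literally: a sentence is retained when
    # threshold >= count and screened out when threshold < count.
    return [(s, n) for s, n in scored if threshold >= n]
```

With the title "Text extraction system", a sentence that mentions "text" twice plus "extraction" and "system" receives a count of 4, so a threshold of 3 screens it out under the literal reading of the claim.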
2. The automatic text extraction system based on dynamic distributed collection of claim 1, wherein: when the text to be extracted is classified, the subordination relationship of the multi-level titles is determined to obtain an upper-level title and a lower-level title, and it is judged whether text content exists between the upper-level title and the lower-level title;
if text content exists, the text content between the upper-level title and the lower-level title is determined as a section to be extracted;
if not, the upper-level title is screened out and the lower-level title takes its place.
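The title handling of claim 2 could look roughly like the Python sketch below. It assumes the titles have already been recognized and are supplied in document order as hypothetical `(level, title, body)` tuples, where a larger `level` means a lower-level title and `body` is the text found between that title and the next one. The function name `sections_to_extract` and this input shape are illustrative assumptions, not part of the claim.

```python
def sections_to_extract(headings):
    """headings: list of (level, title, body) tuples in document order.
    Returns (title, body) pairs for the sections to be extracted."""
    sections = []
    for i, (level, title, body) in enumerate(headings):
        nxt = headings[i + 1] if i + 1 < len(headings) else None
        # The next title is a lower-level title of this one if it is deeper.
        has_lower_title = nxt is not None and nxt[0] > level
        if has_lower_title and not body.strip():
            # No text between the upper-level and lower-level title:
            # the upper-level title is screened out and the lower-level
            # title (handled on the next iteration) takes its place.
            continue
        # Titles without a lower-level title are kept unchanged; the claim
        # does not address that case.
        sections.append((title, body.strip()))
    return sections
```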
3. The automatic text extraction system based on dynamic distributed collection of claim 1, wherein: the screening threshold is set according to user requirements, the user requirements comprising keyword requirements, associated sentence requirements and text summary requirements.
4. The automatic text extraction system based on dynamic distributed collection of claim 1, wherein: when the pre-extraction module executes, a transfer data set is generated, in which the sample words are arranged by frequency of occurrence and the sample sentences are arranged according to the subordination relationship of the multi-level titles and the positions of the sample sentences in the text content;
the pre-extraction module comprises a screening unit, the screening unit being used for screening the sample sentences in each section to be extracted;
a first-level screening threshold and a second-level screening threshold are preset in the screening unit, the first-level screening threshold being used for evaluating the word count of the sample sentences and the second-level screening threshold being used for evaluating the number of sample sentences, and the evaluation priority of the first-level screening threshold is higher than that of the second-level screening threshold.
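A rough Python sketch of the ordering and two-level screening in claim 4 is given below. Several points are assumptions that the claim leaves open: both thresholds are treated as upper bounds, sample sentences are ordered by title level first and then by position, and the names `order_sample_words`, `order_sample_sentences` and `screen_section` are hypothetical.

```python
from collections import Counter

def order_sample_words(words):
    # Most frequent first; Counter.most_common keeps first-seen order on ties.
    return [word for word, _ in Counter(words).most_common()]

def order_sample_sentences(sentences):
    # sentences: list of (title_level, position, text) tuples.
    # Assumed ordering: by title level, then by position in the text.
    return sorted(sentences, key=lambda s: (s[0], s[1]))

def screen_section(sentences, max_words, max_sentences):
    # First-level threshold (higher priority): word count of each sentence.
    kept = [s for s in sentences if len(s.split()) <= max_words]
    # Second-level threshold: number of sentences kept for the section.
    return kept[:max_sentences]
```

Applying the word-count check before the sentence-count cap is one way to reflect the higher evaluation priority of the first-level threshold.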
5. The automatic text extraction system based on dynamic distributed collection of claim 1, wherein: when the evaluation module executes, weight values are assigned according to the subordination relationship of the multi-level titles, the assignment of the weight values being performed by factor analysis.
6. The automatic text extraction system based on dynamic distributed collection of claim 1, wherein: a transition vocabulary library is preset in the evaluation module, the transition vocabulary library comprising a plurality of transition words, and the transition words being used to provide transitions between sample sentences under adjacent levels.
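One possible reading of claim 6 is that a word from the preset vocabulary library is inserted wherever consecutive sample sentences come from different title levels. The Python sketch below illustrates that reading only. The function name `join_with_transitions`, the round-robin choice of word, the comma after the word and the naive lowercasing of the following sentence are assumptions; the sketch also assumes a non-empty vocabulary list and non-empty sentences.

```python
from itertools import cycle

def join_with_transitions(sentences, transitions):
    """sentences: list of (title_level, text) pairs in summary order.
    transitions: non-empty list of words from the preset vocabulary library."""
    picker = cycle(transitions)  # assumed policy: use the words in rotation
    parts = []
    prev_level = None
    for level, text in sentences:
        if prev_level is not None and level != prev_level:
            # The title level changed: prepend a word from the library.
            # Lowercasing the first letter is naive and may be wrong for
            # sentences that begin with a proper noun.
            text = f"{next(picker)}, {text[0].lower()}{text[1:]}"
        parts.append(text)
        prev_level = level
    return " ".join(parts)
```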
7. The automatic text extraction system based on dynamic distributed collection of claim 1, wherein: when the extraction module executes, sample words and sample sentences are extracted from the transfer data set and determined as the keywords and the text summary, respectively.
8. An automatic text extraction terminal based on dynamic distributed collection, characterized by comprising:
at least one processor;
and a memory communicatively coupled to the at least one processor;
wherein the memory stores a computer program executable by the at least one processor, so that the at least one processor can perform the operations of the automatic text extraction system based on dynamic distributed collection of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310748841.XA CN116501862B (en) | 2023-06-25 | 2023-06-25 | Automatic text extraction system based on dynamic distributed collection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116501862A (en) | 2023-07-28 |
CN116501862B (en) | 2023-09-12 |
Family
ID=87323415
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310748841.XA (Active) CN116501862B (en) | 2023-06-25 | 2023-06-25 | Automatic text extraction system based on dynamic distributed collection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116501862B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH10254900A (en) * | 1997-03-14 | 1998-09-25 | Omron Corp | Automatic document summarizing device and its method |
CA2363834A1 (en) * | 1999-02-19 | 2001-01-25 | The Trustees Of Columbia University In The City Of New York | Cut and paste document summarization system and method |
JP2006309347A (en) * | 2005-04-26 | 2006-11-09 | Saga Univ | Method, system, and program for extracting keyword from object document |
JP2011103075A (en) * | 2009-11-11 | 2011-05-26 | Kansai Electric Power Co Inc:The | Method for extracting excerpt sentence |
CN104361111A (en) * | 2014-11-28 | 2015-02-18 | 青岛大学 | Automatic archive editing method |
CN104462306A (en) * | 2014-11-28 | 2015-03-25 | 青岛大学 | Automatic archive compiling and researching device |
WO2018150244A1 (en) * | 2017-02-18 | 2018-08-23 | Yogesh Chunilal Rathod | Registering, auto generating and accessing unique word(s) including unique geotags |
WO2021164231A1 (en) * | 2020-02-18 | 2021-08-26 | 平安科技(深圳)有限公司 | Official document abstract extraction method and apparatus, and device and computer readable storage medium |
CN113919336A (en) * | 2021-10-20 | 2022-01-11 | 平安科技(深圳)有限公司 | Article generation method and device based on deep learning and related equipment |
CN114611520A (en) * | 2022-04-12 | 2022-06-10 | 北京澜舟科技有限公司 | Text abstract generating method |
WO2022241950A1 (en) * | 2021-05-21 | 2022-11-24 | 平安科技(深圳)有限公司 | Text summarization generation method and apparatus, and device and storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8346534B2 (en) * | 2008-11-06 | 2013-01-01 | University of North Texas System | Method, system and apparatus for automatic keyword extraction |
US20170075877A1 (en) * | 2015-09-16 | 2017-03-16 | Marie-Therese LEPELTIER | Methods and systems of handling patent claims |
WO2021076606A1 (en) * | 2019-10-14 | 2021-04-22 | Stacks LLC | Conceptual, contextual, and semantic-based research system and method |
Non-Patent Citations (1)
Title |
---|
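Mathematical model for extracting and polishing base-set sentences of automatic abstracts; Wu Yan (吴岩); Li Xiukun (李秀坤); Application Research of Computers (计算机应用研究) (05); full text * |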
Also Published As
Publication number | Publication date |
---|---|
CN116501862A (en) | 2023-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112347244B (en) | Yellow-based and gambling-based website detection method based on mixed feature analysis | |
US9218326B2 (en) | Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents | |
US6654744B2 (en) | Method and apparatus for categorizing information, and a computer product | |
KR101276602B1 (en) | System and method for searching and matching data having ideogrammatic content | |
Martins et al. | Language identification in web pages | |
US20140307959A1 (en) | Method and system of pre-analysis and automated classification of documents | |
US20110188759A1 (en) | Method and System of Pre-Analysis and Automated Classification of Documents | |
CN107992633A (en) | Electronic document automatic classification method and system based on keyword feature | |
CN105279277A (en) | Knowledge data processing method and device | |
CN110738033B (en) | Report template generation method, device and storage medium | |
CN110472203B (en) | Article duplicate checking and detecting method, device, equipment and storage medium | |
KR102445443B1 (en) | Method and system for automating keyword extraction in documents | |
CN113486664A (en) | Text data visualization analysis method, device, equipment and storage medium | |
CN103246655A (en) | Text categorizing method, device and system | |
CN115618014A (en) | Standard document analysis management system and method applying big data technology | |
KR101803150B1 (en) | Important precedents extraction and sorting method using Big Data | |
CN112199499A (en) | Text division method, text classification method, device, equipment and storage medium | |
CN114117038A (en) | Document classification method, device and system and electronic equipment | |
CN111291535B (en) | Scenario processing method and device, electronic equipment and computer readable storage medium | |
TW201508525A (en) | Document sorting system, document sorting method, and document sorting program | |
CN116501862B (en) | Automatic text extraction system based on dynamic distributed collection | |
KR101951910B1 (en) | An E-book Production System Using Automatic Placement Of Illustration And Text | |
CN110888977B (en) | Text classification method, apparatus, computer device and storage medium | |
CN110765107A (en) | Question type identification method and system based on digital coding | |
CN113609864B (en) | Text semantic recognition processing system and method based on industrial control system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | Effective date of registration: 20230823; Address after: 541000 No.1 Jinji Road, Qixing District, Guilin City, Guangxi Zhuang Autonomous Region; Applicant after: GUILIN University OF ELECTRONIC TECHNOLOGY; Address before: 710000 room 61203, floor 12, unit 6, building 1, Weiyang impression city, No. 33, Weiyang Road, Weiyang District, Xi'an City, Shaanxi Province; Applicant before: Xi'an outstanding technology Co.,Ltd.; Applicant before: GUILIN University OF ELECTRONIC TECHNOLOGY |
GR01 | Patent grant | ||