CN116501862B - Automatic text extraction system based on dynamic distributed collection - Google Patents

Publication number
CN116501862B
Authority
CN
China
Prior art keywords
text
extracted
words
sentences
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310748841.XA
Other languages
Chinese (zh)
Other versions
CN116501862A (en
Inventor
林国义
刘雨露
张发明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN202310748841.XA
Publication of CN116501862A
Application granted
Publication of CN116501862B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of automatic text extraction, and in particular relates to an automatic text extraction system based on dynamic distributed collection. The system classifies text content into a plurality of sections to be extracted according to the multi-level titles of the text content, and extracts from each section the associated words and associated sentences that correspond to the title of that section. A pre-extraction module then preprocesses the sections to obtain a transfer data set that contains the associated words and associated sentences of every section, which reduces the amount of data to be processed in subsequent steps. An evaluation module determines the priority of the associated words and associated sentences, and finally an extraction module extracts the contents of the transfer data set.

Description

Automatic text extraction system based on dynamic distributed collection
Technical Field
The invention belongs to the technical field of automatic text extraction, and particularly relates to an automatic text extraction system based on dynamic distributed collection.
Background
A text excerpt condenses complicated text content into simple, clear summary sentences or keywords, so that readers can grasp the meaning the text actually expresses, and it should accurately reflect the central idea of the text. In the traditional approach, an excerpt can be made only after the central idea of the text has been identified. With the development of information technology, a text can now be excerpted automatically by recognizing the document, which saves reading time and helps readers understand the text content.
In the prior art, automatic text extraction scans and recognizes the entire text content to obtain a number of keywords or sentences that form the text summary. However, text content is often laid out in several sections, and the central ideas expressed by different sections may differ. As a result, when keywords or associated sentences are extracted, the central ideas of some sections may be missed, and the extraction result fails to meet the user's needs.
Disclosure of Invention
The invention aims to provide an automatic text extraction system based on dynamic distributed collection that classifies text content into a plurality of sections to be extracted according to the multi-level titles of the text content and extracts the content of each section separately.
The technical scheme adopted by the invention is as follows:
an automatic text extraction system based on dynamic distributed collection comprises a text acquisition module, a classification and identification module, an association extraction module, a pre-extraction module, an evaluation module and an extraction module;
the text acquisition module is used for scanning and acquiring text content to obtain a text to be extracted;
the classification and identification module is used for identifying multi-level titles in the text to be extracted and classifying the text to be extracted into a plurality of sections to be extracted according to the multi-level titles;
the association extraction module is used for extracting associated words and associated sentences from the sections to be extracted;
the pre-extraction module is used for extracting sample words and sample sentences from the text to be extracted according to the associated words and the associated sentences to obtain a transfer data set;
the evaluation module is used for evaluating the priority of the sample words and the coherence of the sample sentences according to the weight values of the multi-level titles;
the extraction module is used for obtaining user requirements, extracting keywords from the plurality of sample words according to the user requirements, and summarizing the sample sentences to obtain a text summary corresponding to the text content.
In a preferred scheme, when the text to be extracted is classified, the subordinate relations among the multi-level titles are determined to obtain upper-level titles and lower-level titles, and it is judged whether text content exists between an upper-level title and its lower-level title;
if text content exists, the text content between the upper-level title and the lower-level title is determined as a section to be extracted;
if not, the upper-level title is filtered out and the lower-level title takes its place.
In a preferred scheme, when the association extraction module executes, the text of each multi-level title is recognized and split to obtain a plurality of reference words;
the sections to be extracted that correspond to the multi-level titles are retrieved, and the words associated with each title are extracted from its section and marked as associated words;
the sentence in which each associated word is located is identified and marked as an associated sentence.
In a preferred scheme, the association extraction module comprises a screening unit. After the associated sentences are determined, the screening unit executes: it counts the number of associated words in each associated sentence, marks that number as the parameter to be compared, and sorts all the associated sentences by the parameter to be compared to obtain a plurality of parallel associated sentences;
a screening threshold for comparison with the parameter to be compared is preset in the screening unit;
if the screening threshold is smaller than the parameter to be compared, the corresponding associated sentence is screened out;
if the screening threshold is greater than or equal to the parameter to be compared, the corresponding associated sentence is retained.
In a preferred scheme, the screening threshold is set according to user requirements, wherein the user requirements comprise the required number of keywords, the required number of associated sentences and the required word count of the text summary.
In a preferred scheme, when the pre-extraction module executes, a transfer data set is generated, in which the sample words are arranged by frequency of occurrence and the sample sentences are arranged according to the subordinate relations of the multi-level titles and their distribution positions in the text content;
the pre-extraction module comprises a screening unit for screening the sample sentences in each section to be extracted;
a first-level screening threshold and a second-level screening threshold are preset in the screening unit; the first-level screening threshold is used for evaluating the word count of a sample sentence, the second-level screening threshold is used for evaluating the number of sample sentences, and the first-level screening threshold has a higher evaluation priority than the second-level screening threshold.
In a preferred scheme, when the evaluation module executes, weight values are assigned according to the subordinate relations of the multi-level titles, wherein the assignment of weight values is performed based on a factor analysis method.
In a preferred scheme, a transition-word library is preset in the evaluation module; the library comprises a plurality of transition words, which are used to provide transitions between sample sentences that are adjacent in order.
In a preferred scheme, when the extraction module executes, the sample words and sample sentences to be extracted are obtained from the transfer data set and determined as the keywords and the text summary, respectively.
The invention also provides an automatic text extraction terminal based on dynamic distributed collection, which comprises:
at least one processor;
and a memory communicatively coupled to the at least one processor;
wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the operations of the above-described automatic text extraction system based on dynamic distributed collection.
The invention has the technical effects that:
According to the invention, text content can be classified into a plurality of sections to be extracted according to the multi-level titles of the text content, and the associated words and associated sentences that correspond to the title of each section are extracted from that section. The pre-extraction module then preprocesses the sections to obtain a transfer data set that contains the associated words and associated sentences of every section, which reduces the amount of data to be processed in subsequent steps. The evaluation module determines the priority of the associated words and associated sentences, and finally the extraction module extracts the contents of the transfer data set.
Drawings
FIG. 1 is an operational diagram of the system provided by the present invention;
FIG. 2 is a block diagram of the system provided by the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one preferred embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
Referring to FIG. 1 and FIG. 2, the invention provides an automatic text extraction system based on dynamic distributed collection, which comprises a text acquisition module, a classification and identification module, an association extraction module, a pre-extraction module, an evaluation module and an extraction module;
the text acquisition module is used for scanning and acquiring text content to obtain a text to be extracted;
the classification and identification module is used for identifying multi-level titles in the text to be extracted and classifying the text to be extracted into a plurality of sections to be extracted according to the multi-level titles;
the association extraction module is used for extracting associated words and associated sentences from the sections to be extracted;
the pre-extraction module is used for extracting sample words and sample sentences from the text to be extracted according to the associated words and the associated sentences to obtain a transfer data set;
the evaluation module is used for evaluating the priority of the sample words and the coherence of the sample sentences according to the weight values of the multi-level titles;
the extraction module is used for obtaining user requirements, extracting keywords from the plurality of sample words according to the user requirements, and summarizing the sample sentences to obtain a text summary corresponding to the text content.
Specifically, text extraction condenses complicated text content into simple summary sentences or keywords, so that readers can grasp the meaning the text actually expresses through those keywords or summary sentences, which should also accurately reflect the central idea of the text. In this embodiment, the text content is first acquired by the text acquisition module. The content may be in the form of pictures, charts and the like, and the text acquisition module scans it and converts it into literal text for subsequent recognition and extraction; this is a conventional technique for those skilled in the art and is not described further. The classification and identification module then determines the multi-level titles in the text to be extracted, among which parallel and subordinate relations exist, and classifies the text into a plurality of sections to be extracted by taking the text content between adjacent titles; the content of each section necessarily revolves around its title. Next, the association extraction module extracts associated words and associated sentences from each section: the title of the section is split into reference words, the associated words in the section are determined from the reference words, and the sentences containing the associated words are taken as associated sentences. The pre-extraction module then produces the transfer data set, the evaluation module determines the priority of the sample words and sample sentences, and the extraction module extracts them automatically. In this way, sample sentences and a text summary that meet the user's requirements can be obtained.
In a preferred embodiment, when the text to be extracted is classified, the subordinate relations among the multi-level titles are determined to obtain upper-level titles and lower-level titles, and it is judged whether text content exists between an upper-level title and its lower-level title;
if text content exists, the text content between the upper-level title and the lower-level title is determined as a section to be extracted;
if not, the upper-level title is filtered out and the lower-level title takes its place.
In this embodiment, after the text to be extracted is obtained, the subordinate relations among the multi-level titles in the text are determined, and the titles are divided into upper-level titles and lower-level titles accordingly; parallel relations may also exist among the upper-level titles and among the lower-level titles. Because there may be no text content between an adjacent upper-level title and lower-level title, which would produce an empty section to be extracted, such an upper-level title is filtered out so that empty sections are avoided from the outset.
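The classification step can be illustrated with a minimal Python sketch. It is not the patented implementation: the input format (a list of already-recognized "title" and "body" items), the Section type and the function name split_into_sections are all assumptions made for illustration. The sketch drops any title that has no text before the next title, which covers the upper-level/lower-level case described above; body text that appears before the first title is ignored.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Section:
    """One section to be extracted: a title, its level, and its body text."""
    title: str
    level: int
    text: List[str] = field(default_factory=list)


def split_into_sections(items: List[Tuple[str, Optional[int], str]]) -> List[Section]:
    """Group body text under the nearest preceding title.

    items is a hypothetical pre-parsed document: ("title", level, text)
    or ("body", None, text), in document order.
    """
    sections: List[Section] = []
    current: Optional[Section] = None
    for kind, level, content in items:
        if kind == "title":
            # Keep the previous title only if some text followed it;
            # an empty title is filtered out and the new title takes over.
            if current is not None and current.text:
                sections.append(current)
            current = Section(title=content, level=level)
        elif current is not None:
            current.text.append(content)
    # Flush the last section, again only if it is non-empty.
    if current is not None and current.text:
        sections.append(current)
    return sections
```

With a level-1 title followed directly by a level-2 title, only the level-2 title yields a section, so no empty section reaches the later modules.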
Secondly, when the association extraction module executes, the text of each multi-level title is recognized and split to obtain a plurality of reference words;
the sections to be extracted that correspond to the multi-level titles are retrieved, and the words associated with each title are extracted from its section and marked as associated words;
the sentence in which each associated word is located is identified and marked as an associated sentence.
As described above, when the association extraction module executes, the text of the multi-level title is first recognized and split into a plurality of reference words; then, based on the reference words, words related to them are extracted from the section to be extracted that corresponds to the title and taken as associated words, and the sentences in which the associated words are located are determined as associated sentences.
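A minimal Python sketch of this step follows. The text does not specify how a word is judged to be "associated" with a reference word, so the sketch assumes the simplest rule, a case-insensitive exact match; the whitespace and punctuation tokenization is likewise an assumption that suits space-delimited text (Chinese text would need a word segmenter). All function names are hypothetical.

```python
import re
from typing import List, Set, Tuple


def reference_words(title: str) -> Set[str]:
    """Split a title into lower-cased reference words."""
    return {w.lower() for w in re.findall(r"\w+", title)}


def split_sentences(text: str) -> List[str]:
    """Split text into sentences at common Latin and CJK terminators."""
    parts = re.findall(r"[^.!?。！？]+[.!?。！？]?", text)
    return [p.strip() for p in parts if p.strip()]


def extract_associations(title: str, section_text: str) -> Tuple[List[str], List[str]]:
    """Return (associated_words, associated_sentences) for one section.

    Assumption: a word is associated when it equals a reference word.
    """
    refs = reference_words(title)
    associated_words: List[str] = []
    associated_sentences: List[str] = []
    for sentence in split_sentences(section_text):
        hits = [w for w in re.findall(r"\w+", sentence.lower()) if w in refs]
        if hits:
            # The sentence containing an associated word is an associated sentence.
            associated_sentences.append(sentence)
            associated_words.extend(hits)
    return associated_words, associated_sentences
```

A practical system would probably also remove stop words from the reference words and use a looser notion of association (stems, synonyms or embeddings); those choices are outside what the text states.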
Secondly, the association extraction module comprises a screening unit. After the associated sentences are determined, the screening unit executes: it counts the number of associated words in each associated sentence, marks that number as the parameter to be compared, and sorts all the associated sentences by the parameter to be compared to obtain a plurality of parallel associated sentences;
a screening threshold for comparison with the parameter to be compared is preset in the screening unit;
if the screening threshold is smaller than the parameter to be compared, the corresponding associated sentence is screened out;
if the screening threshold is greater than or equal to the parameter to be compared, the corresponding associated sentence is retained.
In this embodiment, the screening unit filters the associated sentences by the parameter to be compared, which is the number of associated words in a given associated sentence. The more associated words a sentence contains, the higher its extraction priority and the more likely it is to be extracted. A screening threshold is preset in the screening unit; once the parameter to be compared has been determined, it is compared with the screening threshold, and the corresponding associated sentence is retained only when the screening threshold is greater than or equal to the parameter to be compared. This reduces the amount of subsequent extraction work without affecting the extraction result.
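The counting, sorting and comparison performed by the screening unit can be sketched in Python as below. The function name and the tokenization are assumptions. The retention rule is copied literally from the wording above (a sentence is retained when the screening threshold is greater than or equal to its parameter to be compared); the sketch does not try to interpret that rule further.

```python
import re
from typing import List, Tuple


def screen_sentences(
    sentences: List[str], associated_words: List[str], threshold: int
) -> Tuple[List[Tuple[int, str]], List[Tuple[int, str]]]:
    """Return (ranked, retained) lists of (count, sentence) pairs."""
    vocab = {w.lower() for w in associated_words}
    scored = []
    for sentence in sentences:
        # Parameter to be compared: number of associated words in the sentence.
        count = sum(1 for w in re.findall(r"\w+", sentence.lower()) if w in vocab)
        scored.append((count, sentence))
    # Sort by the parameter, largest first; ties keep document order.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Literal rule from the text: retain when threshold >= parameter.
    retained = [(c, s) for c, s in ranked if threshold >= c]
    return ranked, retained
```

Returning the full ranking alongside the retained subset keeps the comparison step easy to swap out if a different reading of the threshold is intended.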
In a preferred embodiment, the screening threshold is set according to user requirements, wherein the user requirements include the required number of keywords, the required number of associated sentences and the required word count of the text summary.
In this embodiment, when the screening threshold is set, the actual needs of the user must be made clear, such as the required number of keywords, the required number of associated sentences and the required word count of the text summary. For example, when a paper is being written and 3 keywords are required, the screening threshold may be set to 6, 9 and so on, so that the user has candidates to choose from. Similarly, when a document is being read and the central idea of each section to be extracted needs to be summarized, the screening threshold may be set in the same way, which is not repeated here.
Secondly, when the pre-extraction module executes, a transfer data set is generated, in which the sample words are arranged by frequency of occurrence and the sample sentences are arranged according to the subordinate relations of the multi-level titles and their distribution positions in the text content;
the pre-extraction module comprises a screening unit for screening the sample sentences in each section to be extracted;
a first-level screening threshold and a second-level screening threshold are preset in the screening unit; the first-level screening threshold is used for evaluating the word count of a sample sentence, the second-level screening threshold is used for evaluating the number of sample sentences, and the first-level screening threshold has a higher evaluation priority than the second-level screening threshold.
After the pre-extraction module executes, the transfer data set is obtained. The sample words in the transfer data set are arranged by their frequency of occurrence, and the sample sentences are arranged according to the subordinate relations of the multi-level titles and their distribution positions in the text content. This arrangement preserves the continuity of the sample sentences, so that readers can clearly follow the order of the whole text. To keep the sample sentences in each section to be extracted from becoming too long or too numerous, the screening unit limits both the word count and the number of sample sentences in each section, so that the text summary formed from the final sample sentences accurately reflects the overall meaning of the text content and the extraction result meets the user's requirements.
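The construction of the transfer data set can be sketched as follows. This is a minimal Python illustration under stated assumptions: each section is passed in as a (title, associated_words, associated_sentences) tuple in document order, both thresholds are treated as upper limits, and the dictionary layout and function name are hypothetical. The first-level threshold (word count per sentence) is applied before the second-level threshold (number of sentences per section), reflecting its higher evaluation priority.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_transfer_dataset(
    sections: List[Tuple[str, List[str], List[str]]],
    max_words_per_sentence: int,
    max_sentences_per_section: int,
) -> Dict[str, list]:
    """Build sample words (by frequency) and sample sentences (in document order)."""
    word_counts: Counter = Counter()
    sample_sentences: List[Tuple[str, str]] = []
    for title, words, sentences in sections:
        word_counts.update(w.lower() for w in words)
        # First-level threshold: drop sentences with too many words.
        short_enough = [
            s for s in sentences if len(s.split()) <= max_words_per_sentence
        ]
        # Second-level threshold: cap the number of sentences per section.
        kept = short_enough[:max_sentences_per_section]
        # Sections arrive in title order, so appending preserves position.
        sample_sentences.extend((title, s) for s in kept)
    sample_words = [w for w, _ in word_counts.most_common()]
    return {"sample_words": sample_words, "sample_sentences": sample_sentences}
```

Because later modules read only this data set, they no longer need to rescan the full text of each section.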
Next, when the evaluation module executes, weight values are assigned according to the subordinate relations of the multi-level titles, wherein the assignment of weight values is performed based on a factor analysis method.
In this embodiment, the evaluation module evaluates the priorities of the keywords and the sample sentences, where the weight value of an upper-level title is higher than that of a lower-level title. This keeps the extracted keywords close to the topic of the article while maintaining a connection between the keywords and each section to be extracted, and allows the resulting sample sentences to summarize both the text as a whole and the central idea of each section, so that the extracted content stays relevant to the topic.
In a preferred embodiment, a transition-word library is preset in the evaluation module; the library includes a plurality of transition words, which are used to provide transitions between sample sentences that are adjacent in order.
In this embodiment, when the sample sentences are combined into the text summary, the sentences may read stiffly, and corresponding transition words, such as "but", "another" and "and" (not listed exhaustively here), are needed to bridge them. The word count of the transition words is also counted toward the total word count of the text summary. If the sentences still read stiffly after a transition word is added, the next sample sentence is automatically matched instead, which ensures the coherence of the text summary and the applicability of the extraction result.
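How transition words enter the summary and its word count can be sketched in Python as below. The text does not define how a "stiff" junction is detected, so this sketch substitutes a simple proxy that is purely an assumption: a transition word is inserted whenever two consecutive sample sentences come from different sections. Cycling through the library in order is also an arbitrary illustrative choice, as is the function name.

```python
from typing import List, Tuple


def join_with_transitions(
    sentences: List[Tuple[str, str]], transitions: List[str]
) -> Tuple[str, int]:
    """Join (section_title, sentence) pairs into a summary.

    Returns the summary and its total word count, which includes
    the inserted transition words.
    """
    parts: List[str] = []
    previous_title = None
    used = 0
    for title, sentence in sentences:
        crosses_sections = previous_title is not None and title != previous_title
        if crosses_sections and transitions and sentence:
            # Assumed proxy for a stiff junction: the section changed.
            word = transitions[used % len(transitions)]
            used += 1
            sentence = f"{word.capitalize()}, {sentence[:1].lower()}{sentence[1:]}"
        parts.append(sentence)
        previous_title = title
    summary = " ".join(parts)
    return summary, len(summary.split())
```

The returned word count can be checked against the user's required summary length, since the transition words count toward the total.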
When the extraction module executes, the sample words and sample sentences to be extracted are obtained from the transfer data set and determined as the keywords and the text summary, respectively.
When text extraction is executed, the text content is first acquired by the text acquisition module. The content may be in the form of pictures, charts and the like, and the text acquisition module scans it to obtain literal text for subsequent recognition and extraction; this is a conventional technique for those skilled in the art and is not described further. The classification and identification module then determines the multi-level titles in the text to be extracted, among which parallel and subordinate relations exist, and classifies the text into a plurality of sections to be extracted by taking the text content between adjacent titles; the content of each section necessarily revolves around its title. Next, the association extraction module extracts associated words and associated sentences from each section. To extract the associated words, the title of the section is split into a plurality of reference words, and the associated words in the section are determined one by one from the reference words so that they can be counted. Correspondingly, to extract the associated sentences, it is only necessary to identify the sentences that contain the associated words. The pre-extraction module then performs the pre-extraction operation to obtain the transfer data set, the evaluation module determines the priority of the sample words and sample sentences, and finally the extraction module extracts the sample words and sample sentences automatically. For example, if the user requires 3 keywords and a text summary of no more than 200 words, the three words with the highest priority are selected from the transfer data set as the keywords, and the text summary is formed by splicing together the sentences of each section to be extracted. In this way, sample sentences and a text summary that meet the user's requirements can be obtained.
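The final extraction in the example above (3 keywords, a summary of at most 200 words) can be sketched in Python. The text assigns title weights by factor analysis, which is not reproduced here; the weight 1/level used below is a hypothetical stand-in that only preserves the stated property that upper-level titles weigh more than lower-level ones. The input layout and function names are likewise assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def level_weight(level: int) -> float:
    """Hypothetical weight: higher for upper-level titles (level 1 is the top)."""
    return 1.0 / level


def extract(
    sections: List[Tuple[int, List[str], List[str]]],
    num_keywords: int = 3,
    max_summary_words: int = 200,
) -> Tuple[List[str], str]:
    """sections: (title_level, sample_words, sample_sentences) in document order."""
    scores: Dict[str, float] = defaultdict(float)
    for level, words, _ in sections:
        for w in words:
            # Each occurrence contributes the weight of its section's title.
            scores[w.lower()] += level_weight(level)
    # Highest priority first; ties are broken alphabetically.
    keywords = sorted(scores, key=lambda w: (-scores[w], w))[:num_keywords]

    summary: List[str] = []
    used = 0
    for _, _, sentences in sections:
        for s in sentences:
            n = len(s.split())
            if used + n > max_summary_words:
                continue  # skip a sentence that would exceed the word budget
            summary.append(s)
            used += n
    return keywords, " ".join(summary)
```

Skipping an over-budget sentence and trying the next one keeps the summary within the limit while still drawing on later sections.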
The invention also provides an automatic text extraction terminal based on dynamic distributed collection, which comprises:
at least one processor;
and a memory communicatively coupled to the at least one processor;
wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the operations of the above-described automatic text extraction system based on dynamic distributed collection.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and these are intended to fall within the scope of the present invention. Structures, devices and methods of operation not specifically described and illustrated herein, unless otherwise indicated and limited, are implemented according to conventional means in the art.

Claims (8)

1. The utility model provides a text automatic extraction system based on dynamic distributed gathers, includes text collection module, categorised recognition module, association extraction module, pre-extraction module, evaluation module and extraction module, its characterized in that:
the text acquisition module is used for scanning and acquiring text contents to obtain a text to be extracted;
the classification and identification module is used for identifying multi-level titles in the text to be extracted and classifying the text content to be extracted into a plurality of sections to be extracted according to the multi-level titles;
the association extraction module is used for extracting associated words and associated sentences from the sections to be extracted;
when the association extraction module executes, it recognizes the text content of the multi-level titles and splits that text content to obtain a plurality of reference words;
it retrieves the sections to be extracted that respectively correspond to the multi-level titles, extracts from those sections the words associated with the multi-level titles, and marks those words as associated words;
it identifies the sentence in which each associated word is located and marks that sentence as an associated sentence;
the association extraction module comprises a screening unit; after the associated sentences are determined, the screening unit executes, counts the number of associated words in each associated sentence, marks that count as the parameter to be compared, and sorts all the associated sentences by the size of the parameter to be compared to obtain a plurality of parallel associated sentences;
a screening threshold for comparison with the parameter to be compared is preset in the screening unit;
if the screening threshold is smaller than the parameter to be compared, the corresponding associated sentence is screened;
if the screening threshold is greater than or equal to the parameter to be compared, the corresponding associated sentence is reserved;
the pre-extraction module is used for extracting sample words and sample sentences from the text to be extracted according to the associated words and the associated sentences to obtain a transfer data set;
the evaluation module is used for evaluating the priority of the sample word and the consistency of the sample sentence according to the weight value of the multi-level title;
the extraction module is used for obtaining user requirements, extracting keywords from the plurality of sample words according to the user requirements, and summarizing the sample sentences to obtain a text summary corresponding to the text content.
2. The automatic text extraction system based on dynamic distributed collection according to claim 1, wherein: when the text to be extracted is classified, the subordinate relationship of the multi-level titles is determined to obtain an upper title and a lower title, and it is judged whether text content exists between the upper title and the lower title;
if text content exists, the text content between the upper title and the lower title is determined as a section to be extracted;
if not, the upper title is screened out and the lower title is replaced by the upper title.
3. The automatic text extraction system based on dynamic distributed collection according to claim 1, wherein: the screening threshold is set according to the user requirements, and the user requirements comprise a keyword requirement, an associated sentence requirement and a text summary requirement.
4. The automatic text extraction system based on dynamic distributed collection according to claim 1, wherein: when the pre-extraction module executes, the transfer data set is generated, in which the sample words are arranged by frequency of occurrence and the sample sentences are arranged according to the subordinate relationship of the multi-level titles and the positions of the sample sentences in the text content;
the pre-extraction module comprises a screening unit, which is used for screening the sample sentences in each section to be extracted;
a first-level screening threshold and a second-level screening threshold are preset in the screening unit, the first-level screening threshold being used for evaluating the word count of each sample sentence and the second-level screening threshold being used for evaluating the number of sample sentences, and the evaluation priority of the first-level screening threshold is higher than that of the second-level screening threshold.
5. The automatic text extraction system based on dynamic distributed collection according to claim 1, wherein: when the evaluation module executes, weight values are assigned according to the subordinate relationship of the multi-level titles, the assignment of the weight values being performed based on a factor analysis method.
6. The automatic text extraction system based on dynamic distributed collection according to claim 1, wherein: a transition word library is preset in the evaluation module, the transition word library comprises a plurality of transition words, and the transition words are used for connecting sample sentences under adjacent levels.
7. The automatic text extraction system based on dynamic distributed collection according to claim 1, wherein: when the extraction module executes, sample words and sample sentences are extracted from the transfer data set and determined as the keywords and the text summary, respectively.
8. An automatic text extraction terminal based on dynamic distributed collection, characterized by comprising:
at least one processor;
and a memory communicatively coupled to the at least one processor;
wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the operations of the automatic text extraction system based on dynamic distributed collection according to any one of claims 1 to 7.
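
For illustration only, the association extraction and screening steps recited in claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names, the regular-expression tokenizer, the punctuation-based sentence splitter, and the rule that a word is "associated" when it exactly matches a reference word from the title are all hypothetical choices that the claims do not specify. Because the claim does not state what happens downstream to "screened" and "reserved" sentences, the sketch only returns both groups.

```python
import re


def tokenize(text):
    # Assumption: lowercase alphanumeric tokens stand in for the word splitting in the claim.
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    # Assumption: a naive split after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_associated_sentences(title, section):
    """Return (sentence, parameter) pairs sorted by parameter, largest first.

    The parameter to be compared is the number of associated-word occurrences
    in the sentence; a word counts as associated when it matches a reference
    word obtained by splitting the title (an assumed matching rule).
    """
    reference_words = set(tokenize(title))
    pairs = []
    for sentence in split_sentences(section):
        parameter = sum(1 for word in tokenize(sentence) if word in reference_words)
        if parameter > 0:  # only sentences containing an associated word qualify
            pairs.append((sentence, parameter))
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs


def partition_by_threshold(pairs, threshold):
    """Split associated sentences into the two groups named in claim 1."""
    screened = [s for s, parameter in pairs if threshold < parameter]
    reserved = [s for s, parameter in pairs if threshold >= parameter]
    return screened, reserved


if __name__ == "__main__":
    title = "Text extraction"
    section = (
        "Text extraction pulls key text from documents. "
        "The weather is nice. Extraction can be slow."
    )
    pairs = extract_associated_sentences(title, section)
    screened, reserved = partition_by_threshold(pairs, threshold=1)
    print(pairs)
    print(screened)
    print(reserved)
```

In this sketch the threshold is passed in as a plain integer; claim 3 says only that it is set according to user requirements, so how a requirement maps to a number is left open here.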
CN202310748841.XA 2023-06-25 2023-06-25 Automatic text extraction system based on dynamic distributed collection Active CN116501862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310748841.XA CN116501862B (en) 2023-06-25 2023-06-25 Automatic text extraction system based on dynamic distributed collection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310748841.XA CN116501862B (en) 2023-06-25 2023-06-25 Automatic text extraction system based on dynamic distributed collection

Publications (2)

Publication Number Publication Date
CN116501862A (en) 2023-07-28
CN116501862B (en) 2023-09-12

Family

ID=87323415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310748841.XA Active CN116501862B (en) 2023-06-25 2023-06-25 Automatic text extraction system based on dynamic distributed collection

Country Status (1)

Country Link
CN (1) CN116501862B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10254900A (en) * 1997-03-14 1998-09-25 Omron Corp Automatic document summarizing device and its method
CA2363834A1 (en) * 1999-02-19 2001-01-25 The Trustees Of Columbia University In The City Of New York Cut and paste document summarization system and method
JP2006309347A (en) * 2005-04-26 2006-11-09 Saga Univ Method, system, and program for extracting keyword from object document
JP2011103075A (en) * 2009-11-11 2011-05-26 Kansai Electric Power Co Inc:The Method for extracting excerpt sentence
CN104361111A (en) * 2014-11-28 2015-02-18 青岛大学 Automatic archive editing method
CN104462306A (en) * 2014-11-28 2015-03-25 青岛大学 Automatic archive compiling and researching device
WO2018150244A1 (en) * 2017-02-18 2018-08-23 Yogesh Chunilal Rathod Registering, auto generating and accessing unique word(s) including unique geotags
WO2021164231A1 (en) * 2020-02-18 2021-08-26 平安科技(深圳)有限公司 Official document abstract extraction method and apparatus, and device and computer readable storage medium
CN113919336A (en) * 2021-10-20 2022-01-11 平安科技(深圳)有限公司 Article generation method and device based on deep learning and related equipment
CN114611520A (en) * 2022-04-12 2022-06-10 北京澜舟科技有限公司 Text abstract generating method
WO2022241950A1 (en) * 2021-05-21 2022-11-24 平安科技(深圳)有限公司 Text summarization generation method and apparatus, and device and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8346534B2 (en) * 2008-11-06 2013-01-01 University of North Texas System Method, system and apparatus for automatic keyword extraction
US20170075877A1 (en) * 2015-09-16 2017-03-16 Marie-Therese LEPELTIER Methods and systems of handling patent claims
WO2021076606A1 (en) * 2019-10-14 2021-04-22 Stacks LLC Conceptual, contextual, and semantic-based research system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
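A mathematical model for extracting and polishing base-set sentences in automatic summarization; Wu Yan; Li Xiukun; Application Research of Computers (05); full text *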

Also Published As

Publication number Publication date
CN116501862A (en) 2023-07-28

Similar Documents

Publication Publication Date Title
CN112347244B (en) Pornography-related and gambling-related website detection method based on mixed feature analysis
US9218326B2 (en) Method and apparatus for detecting pagination constructs including a header and a footer in legacy documents
US6654744B2 (en) Method and apparatus for categorizing information, and a computer product
KR101276602B1 (en) System and method for searching and matching data having ideogrammatic content
Martins et al. Language identification in web pages
US20140307959A1 (en) Method and system of pre-analysis and automated classification of documents
US20110188759A1 (en) Method and System of Pre-Analysis and Automated Classification of Documents
CN107992633A (en) Electronic document automatic classification method and system based on keyword feature
CN105279277A (en) Knowledge data processing method and device
CN110738033B (en) Report template generation method, device and storage medium
CN110472203B (en) Article duplicate checking and detecting method, device, equipment and storage medium
KR102445443B1 (en) Method and system for automating keyword extraction in documents
CN113486664A (en) Text data visualization analysis method, device, equipment and storage medium
CN103246655A (en) Text categorizing method, device and system
CN115618014A (en) Standard document analysis management system and method applying big data technology
KR101803150B1 (en) Important precedents extraction and sorting method using Big Data
CN112199499A (en) Text division method, text classification method, device, equipment and storage medium
CN114117038A (en) Document classification method, device and system and electronic equipment
CN111291535B (en) Scenario processing method and device, electronic equipment and computer readable storage medium
TW201508525A (en) Document sorting system, document sorting method, and document sorting program
CN116501862B (en) Automatic text extraction system based on dynamic distributed collection
KR101951910B1 (en) An E-book Production System Using Automatic Placement Of Illustration And Text
CN110888977B (en) Text classification method, apparatus, computer device and storage medium
CN110765107A (en) Question type identification method and system based on digital coding
CN113609864B (en) Text semantic recognition processing system and method based on industrial control system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230823

Address after: 541000 No.1 Jinji Road, Qixing District, Guilin City, Guangxi Zhuang Autonomous Region

Applicant after: Guilin University of Electronic Technology

Address before: 710000 room 61203, floor 12, unit 6, building 1, Weiyang impression city, No. 33, Weiyang Road, Weiyang District, Xi'an City, Shaanxi Province

Applicant before: Xi'an outstanding technology Co.,Ltd.

Applicant before: Guilin University of Electronic Technology

GR01 Patent grant