CN116501862B

CN116501862B - Automatic text extraction system based on dynamic distributed collection

Info

Publication number: CN116501862B
Application number: CN202310748841.XA
Authority: CN
Inventors: 林国义; 刘雨露; 张发明
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2023-06-25
Filing date: 2023-06-25
Publication date: 2023-09-12
Anticipated expiration: 2043-06-25
Also published as: CN116501862A

Abstract

The invention belongs to the technical field of automatic text extraction, and particularly relates to an automatic text extraction system based on dynamic distributed collection. The system classifies text content into a plurality of sections to be extracted according to the multi-level titles of the text content, and extracts from each section the associated words and associated sentences corresponding to its title. The pre-extraction module then preprocesses the sections to be extracted to obtain a transfer data set containing the associated words and associated sentences of each section, which reduces the amount of data processed in subsequent steps. The evaluation module determines the priority of the associated words and associated sentences, and the extraction module finally extracts the contents of the transfer data set.

Description

Automatic text extraction system based on dynamic distributed collection

Technical Field

The invention belongs to the technical field of automatic text extraction, and particularly relates to an automatic text extraction system based on dynamic distributed collection.

Background

Text extraction condenses complicated text content into simple, clear summary sentences or keywords, so that readers can grasp the meaning actually expressed by the text content from those keywords or summary sentences. The traditional approach can accurately reflect the central idea of the text content, but extraction can be performed only once that central idea has been identified. With the development of information technology, a text can be extracted automatically by recognizing the document, which saves reading time and effectively helps readers understand the text content.

In the prior art, automatic text extraction scans and recognizes the entire text content to obtain the keywords or sentences that form the text summary. However, text content is often laid out in a plurality of sections, and the central ideas expressed by different sections may be inconsistent. As a result, the central ideas of some sections may be missed when keywords or related sentences are extracted, and the extraction result cannot meet the requirements of users.

Disclosure of Invention

The invention aims to provide an automatic text extraction system based on dynamic distributed collection, which can classify text content into a plurality of sections to be extracted according to the multi-level titles of the text content, and extract the content of each section to be extracted separately.

The technical scheme adopted by the invention is as follows:

a text automatic extraction system based on dynamic distributed collection comprises a text acquisition module, a classification and identification module, an association extraction module, a pre-extraction module, an evaluation module and an extraction module;

the text acquisition module is used for scanning and acquiring text contents to obtain a text to be extracted;

the classification and identification module is used for identifying multi-level titles in the text to be extracted and classifying the text content to be extracted into a plurality of sections to be extracted according to the multi-level titles;

the association extraction module is used for extracting association words and association sentences from the sections to be extracted;

the pre-extraction module is used for extracting sample words and sample sentences from the text to be extracted according to the associated words and the associated sentences to obtain a transfer data set;

the evaluation module is used for evaluating the priority of the sample word and the consistency of the sample sentence according to the weight value of the multi-level title;

the extraction module is used for obtaining user requirements, extracting key words from a plurality of sample words according to the user requirements, and summarizing the sample sentences to obtain text summaries corresponding to the text contents.

In a preferred scheme, when the text to be extracted is classified, determining the subordinate relation of the multi-level title to obtain an upper title and a lower title, and judging whether text contents exist between the upper title and the lower title;

if the text content exists, determining the text content between the upper title and the lower title as a section to be extracted;

if not, the upper title is screened out and the lower title is replaced by the upper title.

In a preferred scheme, when the association extraction module executes, the text content of the multi-level title is identified and split to obtain a plurality of reference words;

calling a plurality of sections to be extracted, which correspond to the multi-level titles, respectively, extracting vocabularies associated with the multi-level titles from the sections to be extracted, and calibrating the vocabularies as associated words;

and identifying the sentence in which the associated word is located, and calibrating the sentence as the associated sentence.

In a preferred scheme, the association extraction module comprises a screening unit. After the associated sentences are determined, the screening unit executes: it counts the number of associated words in each associated sentence, marks this number as the parameter to be compared, and sorts all the associated sentences according to the size of the parameter to be compared to obtain a plurality of parallel associated sentences;

a screening threshold value for comparing with the parameter to be compared is preset in the screening unit;

if the screening threshold value is smaller than the parameter to be compared, screening the corresponding associated sentence;

and if the screening threshold value is greater than or equal to the parameter to be compared, retaining the corresponding associated sentence.

In a preferred scheme, the screening threshold is set according to user requirements, wherein the user requirements comprise the required number of keywords, the required number of associated sentences and the required word count of the text summary.

In a preferred scheme, when the pre-extraction module executes, a transfer data set is generated, wherein the sample words in the transfer data set are arranged according to occurrence frequency, and the sample sentences are arranged according to the subordinate relations of the multi-level titles and the distribution positions of the sample sentences in the text content;

the pre-extraction module comprises a screening unit, wherein the screening unit is used for screening sample sentences in each section to be extracted;

the screening unit is internally preset with a first-level screening threshold value and a second-level screening threshold value, the first-level screening threshold value is used for evaluating the word number of sample sentences, the second-level screening threshold value is used for evaluating the number of the sample sentences, and the evaluation priority of the first-level screening threshold value is higher than that of the second-level screening threshold value.

In a preferred scheme, when the evaluation module executes, weight values are assigned according to the subordinate relations of the multi-level titles, wherein the assignment of weight values is performed based on a factor analysis method.

In a preferred scheme, a turning vocabulary library is preset in the evaluation module, wherein the turning vocabulary library comprises a plurality of turning vocabularies, and the turning vocabularies are used for transiting sample sentences under adjacent orders.

In a preferred scheme, when the extraction module executes, sample words and sample sentences are obtained from the transfer data set and determined as the keywords and the text summary, respectively.

The invention also provides a text automatic extraction terminal based on dynamic distributed collection, which comprises:

at least one processor;

and a memory communicatively coupled to the at least one processor;

wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the operations of the above-described automatic text extraction system based on dynamic distributed collection.

The invention has the technical effects that:

The system classifies text content into a plurality of sections to be extracted according to the multi-level titles of the text content, and extracts from each section the associated words and associated sentences corresponding to its title. The pre-extraction module then preprocesses the sections to be extracted to obtain a transfer data set containing the associated words and associated sentences of each section, which reduces the amount of data processed in subsequent steps. The evaluation module determines the priority of the associated words and associated sentences, and the extraction module finally extracts the contents of the transfer data set.

Drawings

FIG. 1 is a system operational diagram provided by the present invention;

FIG. 2 is a block diagram of a system provided by the present invention.

Detailed Description

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.

Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one preferred embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

Referring to FIG. 1 and FIG. 2, the invention provides an automatic text extraction system based on dynamic distributed collection, which comprises a text acquisition module, a classification and identification module, an association extraction module, a pre-extraction module, an evaluation module and an extraction module;

Specifically, text extraction condenses complicated text content into simple summary sentences or keywords, so that readers can grasp the meaning actually expressed by the text content and its central idea is accurately reflected. In this embodiment, the text content is first acquired through the text acquisition module. The content may be in the form of pictures, charts and the like, and is scanned and converted into literal text by the text acquisition module to facilitate subsequent recognition and extraction; this is a conventional technical means for those skilled in the art and is not described further. The multi-level titles in the text to be extracted are then determined through the classification and identification module. Corresponding parallel and subordinate relations exist among the multi-level titles, and the text to be extracted is classified into a plurality of sections to be extracted by taking the text content between adjacent titles, the content of each section necessarily revolving around its title. Associated words and associated sentences are then extracted from the sections through the association extraction module. When the associated words are extracted, the title of the section to be extracted is split into a plurality of reference words, and the associated words in the section are determined one by one according to the reference words so that they can be counted; when the associated sentences are extracted, only the sentences containing the associated words need to be determined. The pre-extraction module then produces the transfer data set, the evaluation module determines the priority of the sample words and sample sentences, and the extraction module finally extracts them automatically, so that sample sentences and text summaries meeting the requirements of users can be obtained.

In a preferred embodiment, when the text to be extracted is classified, determining the subordinate relation of the multi-level title to obtain an upper-level title and a lower-level title, and judging whether text contents exist between the upper-level title and the lower-level title;

In this embodiment, after the text to be extracted is obtained, the subordinate relations between the multi-level titles in the text are determined, and the titles are divided into upper-level titles and lower-level titles accordingly; parallel relations may also exist among the titles. Because there may be no text content between an adjacent upper-level title and lower-level title, this check prevents a blank section to be extracted from arising in the first place.
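The classification step can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that titles are numbered in the style "1", "1.1", "1.1.1" at the start of a line; the patent does not specify how titles are recognized, and the names `HEADING` and `split_sections` are hypothetical. The sketch follows one reading of the rule above: a title with no text before the next title is dropped, and the next title takes over.

```python
import re

# Assumption: a title is a line such as "1 Overview" or "1.2 Goals".
# This numbering convention is hypothetical and not taken from the patent.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


def split_sections(text):
    """Return (level, title, body) tuples for every title that has body text."""
    sections = []
    current = None  # [level, title, list of body lines]
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADING.match(line)
        if match:
            # Close the previous section only if text was found under it;
            # a title without text is dropped so no blank section appears.
            if current is not None and current[2]:
                sections.append((current[0], current[1], " ".join(current[2])))
            level = match.group(1).count(".") + 1
            current = [level, match.group(2), []]
        elif line and current is not None:
            current[2].append(line)
    if current is not None and current[2]:
        sections.append((current[0], current[1], " ".join(current[2])))
    return sections
```

A body line that happens to start with a number would be mistaken for a title by this simple pattern, so a real implementation would need a more robust title detector.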

Secondly, when the association extraction module executes, recognizing the text content of the multi-level title, and splitting the text content to obtain a plurality of reference words;

In the above, when the association extraction module is executed, the text content in the multi-level title needs to be identified first and split into a plurality of reference words, then based on the reference words, a plurality of vocabularies related to the reference words are extracted from the section to be extracted corresponding to the title as association words, and the sentences where the association words are located are determined as association sentences.
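As a rough sketch of this step, the Python below splits a title into reference words and marks the sentences that contain them. How a word is judged to be "associated" with a reference word is not specified in the text, so the sketch assumes the simplest possible rule, a case-insensitive exact match; the stop-word list and the function names are likewise hypothetical.

```python
import re

# Hypothetical stop-word list; the patent does not define one.
STOPWORDS = {"a", "an", "and", "of", "the", "for", "in", "on", "to"}


def reference_words(title):
    """Split a title into lower-cased reference words, dropping stop words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return [w for w in words if w not in STOPWORDS]


def associated_sentences(title, body):
    """Return (sentence, associated_words) pairs for one section.

    Assumption: a word is 'associated' if it equals a reference word.
    """
    refs = set(reference_words(title))
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", body.strip()):
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        hits = [t for t in tokens if t in refs]
        if hits:
            results.append((sentence, hits))
    return results
```

A production system would likely replace the exact-match rule with stemming, synonyms, or embedding similarity; the structure of the step stays the same.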

Secondly, the association extraction module comprises a screening unit. After the associated sentences are determined, the screening unit executes: it counts the number of associated words in each associated sentence, marks this number as the parameter to be compared, and sorts all the associated sentences according to the size of the parameter to be compared to obtain a plurality of parallel associated sentences;

In this embodiment, the screening unit screens the associated sentences by means of the parameter to be compared, which is the number of associated words in a single associated sentence. The more associated words a sentence contains, the higher its priority and the more likely it is to be extracted. A screening threshold is preset in the screening unit; once the parameter to be compared has been determined, it is compared with the screening threshold, and the corresponding associated sentence is retained only when the screening threshold is greater than or equal to the parameter to be compared. This reduces the amount of subsequent extraction work without affecting the extraction result.
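The sorting and comparison performed by the screening unit might be sketched as follows. The function name `screen` and the input shape are assumptions; the retention rule is coded exactly as it is worded in the text (a sentence is retained when the threshold is greater than or equal to its parameter to be compared), without interpreting it further.

```python
def screen(sentences_with_counts, threshold):
    """Sort associated sentences by their parameter to be compared and screen them.

    sentences_with_counts: list of (sentence, number_of_associated_words).
    """
    # Sort by the parameter to be compared, largest first.
    ranked = sorted(sentences_with_counts, key=lambda pair: pair[1], reverse=True)
    # Retention rule as literally stated in the text: threshold >= parameter.
    return [(sentence, count) for sentence, count in ranked if threshold >= count]
```

For example, `screen([("a", 1), ("b", 3), ("c", 2)], 2)` sorts the sentences as b, c, a and retains c and a under the stated rule.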

In a preferred embodiment, the screening threshold is set according to a user requirement, wherein the user requirement includes a keyword requirement amount, an associated sentence requirement amount, and a text summary requirement word number.

In this embodiment, the actual needs of the user, such as the required number of keywords, the required number of associated sentences and the required word count of the text summary, must be clarified before the screening threshold is set. For example, when writing a paper that requires 3 keywords, the screening threshold may be set to 6, 9, etc., so that the user has candidates to choose from. Similarly, when reading a document whose central ideas must be summarized section by section, the screening threshold may be set in the same way, which is not repeated here.

Secondly, when the pre-extraction module executes, a transfer data set is generated, wherein the sample words in the transfer data set are arranged according to occurrence frequency, and the sample sentences are arranged according to the subordinate relations of the multi-level titles and the distribution positions of the sample sentences in the text content;

After the pre-extraction module executes, a transfer data set is obtained. The sample words in the transfer data set are arranged according to their occurrence frequency, and the sample sentences are arranged according to the subordinate relations of the multi-level titles and the distribution positions of the sample sentences in the text content. This ordering ensures the continuity of the sample sentences, so that readers can clearly follow the sequence of the whole text content. To keep the sample sentences in each section to be extracted from becoming too complicated, the screening unit limits both the word count and the number of sample sentences in each section. The text summary formed from the resulting sample sentences can therefore accurately reflect the overall meaning of the text content, and the extraction result can meet user requirements.
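A minimal sketch of how such a transfer data set could be assembled is given below. The dictionary layout, the `order` tuple used to encode title hierarchy and position (for example `(1, 1)` for title 1.1), and the parameters `max_words` and `max_sentences` standing in for the first-level and second-level screening thresholds are all assumptions made for illustration.

```python
from collections import Counter


def build_transfer_set(sections, max_words, max_sentences):
    """Build a transfer data set from per-section words and sentences.

    sections: list of dicts with keys
      'order'     - tuple encoding title hierarchy and position, e.g. (1, 1)
      'words'     - associated words found in the section
      'sentences' - associated sentences in their original order
    """
    word_counts = Counter()
    sample_sentences = []
    # Tuples sort so that (1,) < (1, 1) < (2,), which mirrors document order.
    for section in sorted(sections, key=lambda s: s["order"]):
        word_counts.update(section["words"])
        # First-level screening (evaluated first): word count of each sentence.
        short = [s for s in section["sentences"] if len(s.split()) <= max_words]
        # Second-level screening: number of sentences kept per section.
        sample_sentences.extend(short[:max_sentences])
    # Sample words are arranged by occurrence frequency, most frequent first.
    sample_words = [word for word, _ in word_counts.most_common()]
    return {"sample_words": sample_words, "sample_sentences": sample_sentences}
```

Applying the word-count limit before the sentence-count limit reflects the statement that the first-level threshold has the higher evaluation priority.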

And then, when the evaluation module executes, weight values are distributed according to the subordinate relations of the multi-level titles, wherein the distribution of the weight values is executed based on a factor analysis method.

In this embodiment, the evaluation module evaluates the priorities of the keywords and the sample sentences, the weight value of an upper-level title being higher than that of a lower-level title. The extracted keywords therefore fit the topic of the article more closely while remaining connected to each section to be extracted, and the resulting sample sentences summarize the text and the central idea of each section to be extracted more concisely, which keeps the extracted content relevant to the topic.
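The effect of title-level weights on word priority can be sketched as below. The text states that weights are assigned by a factor analysis method but gives no details, so this sketch substitutes a simple stand-in, a weight of 1/level, chosen only because it makes upper-level titles outweigh lower-level ones. The function names are hypothetical.

```python
from collections import defaultdict


def level_weight(level):
    """Stand-in for the factor analysis step: weight shrinks with title depth."""
    return 1.0 / level


def word_priorities(occurrences):
    """Rank sample words by the summed weight of the titles they occur under.

    occurrences: iterable of (word, title_level) pairs, level 1 being the top.
    """
    scores = defaultdict(float)
    for word, level in occurrences:
        scores[word] += level_weight(level)
    return sorted(scores, key=lambda w: scores[w], reverse=True)
```

With occurrences `[("model", 1), ("data", 2), ("data", 3), ("model", 2)]`, "model" scores 1.5 and "data" scores about 0.83, so "model" is ranked first.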

In a preferred embodiment, the evaluation module is preset with a turn vocabulary library, where the turn vocabulary library includes a plurality of turn vocabularies, and the turn vocabularies are used for transiting sample sentences under adjacent orders.

In this embodiment, when the sample sentences are combined into the text summary, the transitions between sentences may be stiff, so corresponding turning vocabulary is needed as a transition, for example "but", "in addition", "and", etc., which are not listed exhaustively here. The word count of the turning vocabulary is included in the total word count of the text summary. If the sentences are still stiff after the turning vocabulary is added, the next sample sentence is automatically matched instead, which ensures the coherence of the text summary and the applicability of the text extraction result.
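One way to picture the use of the turning vocabulary library is the sketch below, which inserts a transition word whenever the summary moves from one section to the next and counts the inserted words toward the total. The library contents, the rule of cycling through it, and the function name are assumptions; the text does not say how stiffness is detected, so that check and the fallback to the next sample sentence are not modeled here.

```python
from itertools import cycle

# Hypothetical turning vocabulary library.
TRANSITIONS = ["However", "In addition", "Meanwhile"]


def join_with_transitions(samples):
    """Join sample sentences, adding a transition when the section changes.

    samples: list of (section_id, sentence) with non-empty sentences.
    Returns (summary, word_count); transition words count toward the total.
    """
    picker = cycle(TRANSITIONS)
    parts = []
    previous = None
    for section_id, sentence in samples:
        if previous is not None and section_id != previous:
            # Lower-case the first letter so the sentence reads on naturally.
            sentence = f"{next(picker)}, {sentence[0].lower()}{sentence[1:]}"
        parts.append(sentence)
        previous = section_id
    summary = " ".join(parts)
    return summary, len(summary.split())
```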

When the extraction module executes, sample words and sample sentences are obtained from the transfer data set and determined as the keywords and the text summary, respectively.

When text extraction is executed, text content is first acquired through the text acquisition module. The content may be in the form of pictures, charts and the like, and is scanned by the text acquisition module to obtain literal text content that is convenient for subsequent recognition and extraction; this is a technical means commonly used by those skilled in the art and is not described further. The multi-level titles in the text to be extracted are then determined through the classification and identification module. Corresponding parallel and subordinate relations exist among the multi-level titles, and the text to be extracted is classified into a plurality of sections to be extracted by taking the text content between adjacent titles, the content of each section necessarily revolving around its title. Associated words and associated sentences are then extracted from the sections through the association extraction module. When the associated words are extracted, the title of the section to be extracted is split into a plurality of reference words, and the associated words in the section are determined one by one according to the reference words so that they can be counted; correspondingly, when the associated sentences are extracted, only the sentences containing the associated words need to be determined. The pre-extraction module then performs the pre-extraction operation to obtain the transfer data set, and the evaluation module determines the priority of the sample words and sample sentences. Finally, the extraction module automatically extracts the sample words and sample sentences. For example, if the user requires 3 keywords and a text summary of no more than 200 words, the three words with the highest priority are selected from the transfer data set as the keywords, and the text summary is formed by splicing the sample sentences of each section to be extracted. In this way, sample sentences and text summaries meeting the requirements of users can be obtained.
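The final extraction in the example (3 keywords, a summary of at most 200 words) can be sketched as follows. The function name and the assumption that the sample words arrive already sorted by priority and the sample sentences already ordered by section are illustrative choices, not details given in the text.

```python
def extract(sample_words, sample_sentences, keyword_count=3, max_summary_words=200):
    """Pick the top-priority keywords and splice sentences into a summary.

    sample_words: words already sorted by priority, highest first.
    sample_sentences: sentences already ordered by section and position.
    """
    keywords = sample_words[:keyword_count]
    chosen = []
    used = 0
    for sentence in sample_sentences:
        length = len(sentence.split())
        # Stop before the summary would exceed the user's word limit.
        if used + length > max_summary_words:
            break
        chosen.append(sentence)
        used += length
    return keywords, " ".join(chosen)
```

Stopping at the first sentence that would exceed the limit keeps the summary contiguous; skipping that sentence and trying later, shorter ones would be an equally plausible design.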

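The invention also provides an automatic text extraction terminal based on dynamic distributed collection, which comprises:

at least one processor;

and a memory communicatively coupled to the at least one processor;

wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the operations of the above-described automatic text extraction system based on dynamic distributed collection.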

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.

The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention. Structures, devices and methods of operation not specifically described and illustrated herein, unless otherwise indicated and limited, are implemented according to conventional means in the art.

Claims

1. An automatic text extraction system based on dynamic distributed collection, comprising a text acquisition module, a classification and identification module, an association extraction module, a pre-extraction module, an evaluation module and an extraction module, characterized in that:

when the association extraction module executes, recognizing the text content of the multi-level title, and splitting the text content to obtain a plurality of reference words;

identifying the sentence in which the associated word is located, and calibrating the sentence as the associated sentence;

the association extraction module comprises a screening unit; after the associated sentences are determined, the screening unit executes, counts the number of associated words in each associated sentence, marks this number as the parameter to be compared, and sorts all the associated sentences according to the size of the parameter to be compared to obtain a plurality of parallel associated sentences;

if the screening threshold value is greater than or equal to the parameter to be compared, retaining the corresponding associated sentence;

2. The automatic text extraction system based on dynamic distributed collection of claim 1, wherein: when the text to be extracted is classified, the subordinate relation of the multi-level titles is determined to obtain an upper title and a lower title, and whether text content exists between the upper title and the lower title is judged;

3. The automatic text extraction system based on dynamic distributed collection of claim 1, wherein: the screening threshold is set according to user requirements, wherein the user requirements comprise keyword requirements, associated sentence requirements and text summary requirements.

4. The automatic text extraction system based on dynamic distributed collection of claim 1, wherein: when the pre-extraction module executes, a transfer data set is generated, wherein the sample words in the transfer data set are arranged according to occurrence frequency, and the sample sentences are arranged according to the subordinate relations of the multi-level titles and the distribution positions of the sample sentences in the text content;

5. The automatic text extraction system based on dynamic distributed collection of claim 1, wherein: when the evaluation module executes, weight values are assigned according to the subordinate relations of the multi-level titles, wherein the assignment of weight values is performed based on a factor analysis method.

6. The automatic text extraction system based on dynamic distributed collection of claim 1, wherein: a turning vocabulary library is preset in the evaluation module, wherein the turning vocabulary library comprises a plurality of turning vocabularies, and the turning vocabularies are used for transitioning between sample sentences in adjacent positions.

7. The automatic text extraction system based on dynamic distributed collection of claim 1, wherein: when the extraction module executes, sample words and sample sentences are obtained from the transfer data set and determined as the keywords and the text summary, respectively.

8. An automatic text extraction terminal based on dynamic distributed collection, characterized by comprising:

at least one processor;

and a memory communicatively coupled to the at least one processor;

wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the operations of the automatic text extraction system based on dynamic distributed collection of any one of claims 1 to 7.