CN115952279A - Text outline extraction method and device, electronic device and storage medium

Info

Publication number: CN115952279A
Application number: CN202211533215.0A
Authority: CN
Inventors: 金征雷; 周创; 张俊
Original assignee: Hangzhou Ruicheng Information Technology Co ltd
Current assignee: Hangzhou Ruicheng Information Technology Co ltd
Priority date: 2022-12-02
Filing date: 2022-12-02
Publication date: 2023-04-11
Anticipated expiration: 2042-12-02
Also published as: CN115952279B

Abstract

The application relates to a method, a device, an electronic device and a storage medium for extracting a text outline. The method comprises the following steps: acquiring sentence content features of each sentence text in the text to be extracted based on readable characters of the text to be extracted, and acquiring sentence format features of each sentence text based on the format of the text to be extracted, wherein the sentence content features comprise character features of the corresponding sentence text; acquiring sentence fusion features of each sentence text based on the sentence content features and the sentence format features; acquiring paragraph features of each paragraph text in the text to be extracted based on the sentence content features and corresponding weights of each sentence text in that paragraph text; and obtaining outline information corresponding to the text to be extracted based on the sentence fusion features and the paragraph features. The method addresses the low accuracy of text outline extraction in the related art: it enriches the levels of text features and fuses the correlations among text features of different levels, thereby improving the accuracy of text outline extraction.

Description

Text outline extraction method and device, electronic device and storage medium

Technical Field

The present application relates to the field of semantic recognition, and in particular, to a method and an apparatus for extracting a text outline, an electronic apparatus, and a storage medium.

Background

With the continuous development of information technology, the application of semantic recognition technology becomes more and more extensive. The text outline extraction technology is used as an important branch of the semantic recognition field and has important application in the scenes of government affairs, medicine and the like. For example, outline contents of texts such as government official documents and medical documents can be automatically extracted by an outline extraction technique.

Existing outline extraction techniques generally extract text features at the character, word and sentence levels, input them into a preset sequence feature extraction model, and analyze them with that model to obtain the outline content. However, the related art tends to analyze each feature of a given dimension in isolation: it considers neither the correlation between different features of the same dimension nor the correlation between features of different dimensions, and it often ignores the context of a feature. As a result, the accuracy of text outline extraction in the related art is low.

No effective solution has yet been proposed for the technical problem of low accuracy of text outline extraction in the related art.

Disclosure of Invention

The embodiment provides a method, a device, an electronic device and a storage medium for extracting a text outline, so as to solve the problem that the accuracy of extracting the text outline is not high in the related art.

In a first aspect, in this embodiment, a method for extracting a text outline is provided, including:

acquiring sentence content characteristics of each sentence of text in the text to be extracted based on readable characters of the text to be extracted, and acquiring sentence format characteristics of each sentence of text in the text to be extracted based on the format of the text to be extracted;

acquiring sentence fusion characteristics of each sentence in the text to be extracted based on the sentence content characteristics and the sentence format characteristics;

acquiring paragraph features of each paragraph text in the text to be extracted based on the sentence content features and corresponding weights of each sentence text in each paragraph text;

and acquiring outline information corresponding to the text to be extracted based on the sentence fusion characteristics and the paragraph characteristics.

In some embodiments, the obtaining sentence content characteristics of each sentence of text in the text to be extracted based on the readable characters of the text to be extracted includes:

acquiring character features of the text to be extracted based on the readable characters of the text to be extracted;

and acquiring sentence content characteristics of each sentence text in the text to be extracted based on the character characteristics and corresponding weights of a plurality of readable characters in each sentence text.

In some of these embodiments, the sentence format features include a sentence position feature, a sentence length feature, and a sentence placeholder feature.

In some embodiments, the sentence placeholder feature obtaining method includes:

and acquiring sentence placeholder characteristics of each sentence of text in the text to be extracted based on the format placeholder in the text to be extracted.

In some embodiments, the obtaining sentence fusion characteristics of each sentence of text in the text to be extracted based on the sentence content characteristics and the sentence format characteristics includes:

performing fusion processing on the sentence length characteristic, the sentence placeholder characteristic and the sentence content characteristic to obtain a sentence initial fusion characteristic;

and performing fusion processing on the sentence initial fusion characteristic and the sentence position characteristic to obtain the sentence fusion characteristic.

In some embodiments, the obtaining paragraph features of each paragraph text in the text to be extracted based on the sentence content features and the corresponding weights of each sentence text in each paragraph text includes:

constructing a weight matrix and a bias matrix corresponding to the sentence content characteristics of all the sentence texts;

obtaining paragraph initial features based on the sentence content features, the weight matrix and the bias matrix;

and carrying out normalization processing and aggregation processing on the paragraph initial features to obtain the paragraph features.

In some embodiments, the obtaining outline information corresponding to the text to be extracted based on the sentence fusion feature and the paragraph feature includes:

weighting the sentence fusion characteristics and the paragraph characteristics, and normalizing the processing result;

and determining outline information of the text to be extracted based on the result of the normalization processing.

In a second aspect, in this embodiment, there is provided an apparatus for extracting a text outline, including:

the first acquisition module is used for acquiring sentence content characteristics of each sentence of text in the text to be extracted based on readable characters of the text to be extracted and acquiring sentence format characteristics of each sentence of text in the text to be extracted based on the format of the text to be extracted, wherein the sentence content characteristics comprise character characteristics of the corresponding sentence of text;

the second obtaining module is used for obtaining sentence fusion characteristics of each sentence of text in the text to be extracted based on the sentence content characteristics and the sentence format characteristics;

the third acquisition module is used for acquiring paragraph features of each paragraph text in the text to be extracted based on the sentence content features and corresponding weights of each sentence text in each paragraph text;

and the fourth acquisition module is used for acquiring the outline information corresponding to the text to be extracted based on the sentence fusion characteristics and the paragraph characteristics.

In a third aspect, in this embodiment, there is provided an electronic apparatus, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the method for extracting the text outline according to the first aspect.

In a fourth aspect, in the present embodiment, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the method for extracting a text outline according to the first aspect.

Compared with the related art, the application provides a method, a device, an electronic device and a storage medium for extracting a text outline, wherein the method comprises the following steps: acquiring sentence content features of each sentence text in the text to be extracted based on readable characters of the text to be extracted, and acquiring sentence format features of each sentence text based on the format of the text to be extracted, wherein the sentence content features comprise character features of the corresponding sentence text; acquiring sentence fusion features of each sentence text based on the sentence content features and the sentence format features; acquiring paragraph features of each paragraph text in the text to be extracted based on the sentence content features and corresponding weights of each sentence text in each paragraph text; and obtaining outline information corresponding to the text to be extracted based on the sentence fusion features and the paragraph features. Fusing the sentence content features and the sentence format features of each sentence text captures the correlation between the content and the format of that sentence; fusing the sentence fusion features with the paragraph features further captures the implicit relationship between the sentence text and the paragraph text. Because the outline information is obtained by fusing text at multiple levels, the text features are no longer analyzed in isolation and their context is no longer ignored. This solves the technical problem of low accuracy of text outline extraction in the related art, enriches the levels of text features, and fuses the correlations among text features of different levels, thereby improving the accuracy of text outline extraction.

The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below, so that other features, objects, and advantages of the application can be understood more readily.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

fig. 1 is a block diagram of a terminal hardware structure of a method for extracting a text outline according to an embodiment of the present application;

fig. 2 is a schematic flow chart of a method for extracting a text outline according to an embodiment of the present application;

fig. 3 is a schematic flow chart of a method for extracting a text outline according to another embodiment of the present application;

fig. 4 is a block diagram of a configuration of an apparatus for extracting a text outline according to an embodiment of the present application.

Detailed Description

For a clearer understanding of the objects, aspects and advantages of the present application, reference is made to the following description and accompanying drawings.

Unless defined otherwise, technical or scientific terms referred to herein shall have the same general meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the" and similar referents in the context of describing the application (including the specification and claims) are to be construed to cover both the singular and the plural. The terms "comprises," "comprising," "has," "having," and any variations thereof, as referred to in this application, are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or modules, but may include other steps or modules (elements) not listed or inherent to such process, method, article, or apparatus. Reference throughout this application to "connected," "coupled," and the like is not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. Reference to "a plurality" in this application means two or more. "And/or" describes an association relationship between associated objects and means that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In general, the character "/" indicates an "or" relationship between the objects before and after it. The terms "first," "second," "third," and the like in this application are used for distinguishing between similar items and not necessarily for describing a particular sequential or chronological order.

The method embodiments provided in the present embodiment may be executed in a terminal, a computer, or a similar computing device. Taking execution on a terminal as an example, fig. 1 is a block diagram of the hardware structure of a terminal for the method for extracting a text outline in this embodiment. As shown in fig. 1, the terminal may include one or more processors 102 (only one is shown in fig. 1) and a memory 104 for storing data, wherein the processor 102 may include, but is not limited to, a processing device such as a microcontroller unit (MCU) or a field-programmable gate array (FPGA). Specifically, the processor 102 may be configured as a Central Processing Unit (CPU), and the processor 102 includes an arithmetic unit and a controller. The arithmetic unit is mainly used by the terminal to execute arithmetic and logic operations; its basic operations comprise the four arithmetic operations of addition, subtraction, multiplication and division, logical operations such as AND, OR, NOT and XOR, and also tensor operations, matrix operations, and operations such as shifting, comparing and transferring. The controller is mainly used for decoding instructions and issuing the corresponding control signals. The terminal may also include an input-output device 106. It will be understood by those of ordinary skill in the art that the structure shown in fig. 1 is merely an illustration and is not intended to limit the structure of the terminal described above. For example, the terminal may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.

The memory 104 may be used to store a computer program, for example, a software program and a module of application software, such as a computer program corresponding to the text outline extraction method in the embodiment, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

In the related art, features are usually extracted at the character, word and sentence levels and then input into a preset sequence feature extraction model, which analyzes the target features to finally obtain the outline content. However, when features of the same dimension are analyzed in the related art, each feature is often analyzed in isolation: the correlations between different features of the same dimension and between features of different dimensions are not considered, and the context of a feature is often ignored.

Specifically, the related art has the following main drawbacks: 1) it does not consider the proportional relationship between the length of the outline and the length of the text content, nor the relative positions of the individual parts within the text content; 2) it does not consider the inherent formatting regularities of outlines in texts from different fields: although text content differs between fields, outline text is usually highlighted in a document with a particular format, serving as a key summarizing cue; 3) it does not consider that the outline is a summary of the text content, so that the semantics of an outline sentence are correlated with the other sentences, and this correlation is usually high within the range of content that the outline covers.

Referring to fig. 2, fig. 2 is a schematic flow chart illustrating a method for extracting a text outline according to an embodiment of the present application.

In one embodiment, the method for extracting the text outline comprises the following steps:

S202: acquiring sentence content features of each sentence text in the text to be extracted based on the readable characters of the text to be extracted, and acquiring sentence format features of each sentence text in the text to be extracted based on the format of the text to be extracted.

Exemplarily, the content of the text to be extracted is processed to obtain its readable characters. The text to be extracted is the text from which outline information is to be extracted, and includes but is not limited to documents such as government official documents, academic documents and news reports. The readable characters are the characters that can be displayed in the text to be extracted, and include but are not limited to Chinese characters, English letters, numbers and punctuation marks.

Illustratively, after the readable characters in the text to be extracted are acquired, the sentence content feature corresponding to each sentence text is acquired based on the readable characters of that sentence text; the sentence content feature represents the content information of the corresponding sentence text. Specifically, a character feature may be extracted for each readable character, and the character features of all characters in each sentence text may then be fused to obtain the sentence content feature of that sentence text; for example, a character feature is extracted from the encoding of each readable character, and all character features in each sentence text are then fused by weighting. Alternatively, the sentence content feature of each sentence text may be constructed directly from all characters of that sentence text; for example, the encodings of all readable characters of the sentence text are concatenated into a sentence encoding, and the sentence content feature is then extracted from the sentence encoding.

Exemplarily, the format of the text to be extracted is identified to obtain the format information of the text to be extracted, and then the sentence format feature of each sentence of text is obtained. The sentence format characteristics of each sentence text are used for representing the format information of the sentence text, and the format information includes, but is not limited to, the position, the length, the format control characters and the like of the sentence text.

S204: acquiring sentence fusion features of each sentence text in the text to be extracted based on the sentence content features and the sentence format features.

Illustratively, after the sentence content characteristic and the sentence format characteristic of each sentence text are obtained, the sentence content characteristic and the sentence format characteristic are fused, so as to obtain the sentence fusion characteristic of the sentence text. It can be understood that the sentence fusion feature contains both the content information and the format information of the text of the corresponding sentence.

S206: acquiring paragraph features of each paragraph text in the text to be extracted based on the sentence content features and the corresponding weights of each sentence text in each paragraph text.

Illustratively, a weight is determined for each sentence text in each paragraph text according to its sentence content feature; for example, the sentence content feature of a sentence text containing general (summarizing) words can be assigned a higher weight. The weights corresponding to the sentence content features can be stored in the form of a sentence weight matrix. After the weight corresponding to each sentence content feature is determined, all sentence content features are weighted by their weights to obtain a paragraph feature that represents the content information of all sentence texts in the paragraph. It will be appreciated that the paragraph feature reflects the context of the corresponding paragraph text.
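The weighting step above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes each sentence already has a raw score (how the score is produced is left open), normalizes the scores with a softmax, and takes the weighted sum of the sentence content features. The names `softmax` and `paragraph_feature` are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def paragraph_feature(sentence_features, sentence_scores):
    """Weighted sum of sentence content features for one paragraph.

    sentence_features: list of equal-length vectors, one per sentence.
    sentence_scores:   one raw score per sentence (assumed to be given).
    """
    weights = softmax(sentence_scores)
    dim = len(sentence_features[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, sentence_features))
        for i in range(dim)
    ]


# Two sentences with equal scores contribute equally to the paragraph feature.
p = paragraph_feature([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

A sentence with a higher raw score pulls the paragraph feature toward its own content feature, which matches the idea of giving summarizing sentences more weight.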

S208: obtaining outline information corresponding to the text to be extracted based on the sentence fusion features and the paragraph features.

Illustratively, each sentence text is analyzed by combining the sentence fusion features and the paragraph features, and a sentence text that meets the condition is taken as the outline information of the corresponding paragraph text. Specifically, for each sentence text, the format information in the sentence fusion feature indicates whether the sentence is prominent in terms of format, and the correlation between the content information in the sentence fusion feature and the paragraph feature indicates whether the sentence is highly relevant to the overall context of the paragraph text; together these determine whether the sentence text can serve as outline information.

In this embodiment, sentence content features of each sentence text in the text to be extracted are obtained based on readable characters of the text to be extracted, and sentence format features of each sentence text are obtained based on the format of the text to be extracted, wherein the sentence content features include character features of the corresponding sentence text; sentence fusion features of each sentence text are obtained based on the sentence content features and the sentence format features; paragraph features of each paragraph text in the text to be extracted are obtained based on the sentence content features and the corresponding weights of each sentence text in each paragraph text; and outline information corresponding to the text to be extracted is obtained based on the sentence fusion features and the paragraph features. Fusing the sentence content features and the sentence format features of each sentence text captures the correlation between the content and the format of that sentence; fusing the sentence fusion features with the paragraph features further captures the implicit relationship between the sentence text and the paragraph text. Because the outline information is obtained by fusing text at multiple levels, the text features are no longer analyzed in isolation and their context is no longer ignored. This solves the technical problem of low accuracy of text outline extraction in the related art, enriches the levels of text features, and fuses the correlations among text features of different levels, thereby improving the accuracy of text outline extraction.

In another embodiment, the obtaining sentence content characteristics of each sentence of text in the text to be extracted based on readable characters of the text to be extracted includes:

step 1: acquiring character features of the text to be extracted based on the readable characters of the text to be extracted;

step 2: and acquiring sentence content characteristics of each sentence of text in the text to be extracted based on the character characteristics and corresponding weights of a plurality of readable characters in each sentence of text.

Exemplarily, characters in the text to be extracted are divided into readable characters and format placeholders, and character features of the text to be extracted are extracted based on the readable characters. The readable characters are characters which can be displayed in the text to be extracted, and include but are not limited to Chinese characters, English letters, numbers, punctuation marks and the like; a format placeholder is a character that is not displayed in the text to be extracted but occupies a text position and controls the text format, including but not limited to "\t", "\r", "\n", "\s", etc.
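The division into readable characters and format placeholders can be sketched as below. This is a minimal illustration under stated assumptions: the placeholder set is hypothetical, and "\s" is read here as the ordinary space character, which the source does not confirm. The function name `split_characters` is also an illustrative choice.

```python
# Hypothetical placeholder set: tab, carriage return and newline come from the
# text above; treating "\s" as a plain space is an assumption of this sketch.
FORMAT_PLACEHOLDERS = {"\t", "\r", "\n", " "}


def split_characters(text):
    """Split text into (readable characters, format placeholders).

    Both lists preserve the order in which the characters occur in the text.
    """
    readable, placeholders = [], []
    for ch in text:
        if ch in FORMAT_PLACEHOLDERS:
            placeholders.append(ch)
        else:
            readable.append(ch)
    return readable, placeholders


readable, placeholders = split_characters("\tOutline 1\r\nBody text.")
```

In a fuller pipeline the two lists would be kept per sentence, so that the placeholders of a sentence can later feed its sentence placeholder feature.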

Specifically, after the readable characters of the text to be extracted are obtained, they are processed by a pre-trained network model to obtain character features at the character level. The pre-trained network model performs feature extraction on the encoding of each input readable character to generate a feature vector; such models include but are not limited to GPT (Generative Pre-Training) and BERT (Bidirectional Encoder Representations from Transformers).

Exemplarily, after the character features of the readable characters are obtained, the weights corresponding to the different readable characters are determined, and the character features of all readable characters in each sentence text are weighted by their corresponding weights and fused to generate the sentence content feature of that sentence text.

Specifically, the weights corresponding to the readable characters are determined based on the different readable characters, and the corresponding character weight matrices W_w and u_w and the character bias matrix b_w are constructed. After the character features corresponding to all readable characters in each sentence text are obtained, the weights corresponding to all readable characters in the sentence text are extracted from the character weight matrices W_w and u_w and the character bias matrix b_w, and the character features corresponding to the readable characters are then weighted with the extracted weights to obtain a weighted result for each character feature. The specific calculation process is as follows:

wherein j is the sequence number of the sentence text in the paragraph text, t is the sequence number of the readable character in the sentence text, h_jt is the character feature corresponding to the t-th readable character of the j-th sentence text in the paragraph text, and α_jt is the weighted character feature of the t-th readable character of the j-th sentence text in the paragraph text.

After the character features with the additional weight are obtained, the character features are normalized to obtain a normalization result corresponding to each character feature:

wherein a_jt is the normalized result of the character feature corresponding to the t-th readable character of the j-th sentence text in the paragraph text.

After the normalization results are obtained, the normalization results of the character features corresponding to all readable characters of each sentence text in the paragraph text are aggregated to obtain the sentence content feature S_j of the sentence text:
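The formulas for the three steps are not reproduced in this text. One form that is consistent with the symbol definitions given here, assuming the familiar attention-pooling pattern, is shown below. This is an assumption rather than the confirmed formulas of the application; in particular, treating u_w as a vector and using tanh as the activation are illustrative choices.

```latex
\begin{aligned}
\alpha_{jt} &= \tanh\left(W_w h_{jt} + b_w\right) \\
a_{jt} &= \frac{\exp\left(\alpha_{jt}^{\top} u_w\right)}{\sum_{t'} \exp\left(\alpha_{jt'}^{\top} u_w\right)} \\
S_j &= \sum_{t} a_{jt}\, h_{jt}
\end{aligned}
```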

In this embodiment, character features of the text to be extracted are obtained based on the readable characters of the text to be extracted, and the sentence content feature of each sentence text is obtained based on the character features and the corresponding weights of the readable characters in that sentence text. The feature information of the readable characters and the associations between them are thus fully combined, which improves the accuracy of the sentence content features and, in turn, the accuracy of text outline extraction.
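The weight, normalize and aggregate sequence described in this embodiment can be sketched in plain Python as follows. It is a minimal illustration that assumes the common attention-pooling form (a tanh projection followed by a softmax over the characters of one sentence); the source does not confirm the exact formulas, and the function names `matvec` and `attention_pool` are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention_pool(H, W_w, b_w, u_w):
    """Pool the character features H of one sentence into S_j.

    H:   list of character feature vectors h_jt for sentence j.
    W_w: weight matrix, b_w: bias vector, u_w: context vector (all assumed).
    Returns (a, S_j), where a holds the normalized weights a_jt.
    """
    scores = []
    for h in H:
        # Weighted character feature: tanh(W_w h + b_w), an assumed form.
        hidden = [math.tanh(v + b) for v, b in zip(matvec(W_w, h), b_w)]
        scores.append(sum(x * u for x, u in zip(hidden, u_w)))
    # Normalize the scores over the characters of the sentence (softmax).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    a = [e / total for e in exps]
    # Aggregate: weighted sum of the character features.
    dim = len(H[0])
    S_j = [sum(a_t * h[i] for a_t, h in zip(a, H)) for i in range(dim)]
    return a, S_j


identity = [[1.0, 0.0], [0.0, 1.0]]
a, S_j = attention_pool([[1.0, 2.0], [3.0, 4.0]], identity, [0.0, 0.0], [0.0, 0.0])
```

With an all-zero context vector every character receives the same weight, so S_j reduces to the plain average of the character features; a trained u_w would instead emphasize the characters that matter most for the sentence.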

In another embodiment, the sentence format features include a sentence position feature, a sentence length feature, and a sentence placeholder feature.

Illustratively, the sentence format features in this embodiment include at least a sentence position feature, a sentence length feature and a sentence placeholder feature. The sentence position characteristic is used for representing position information of the sentence text in the paragraph text, the sentence length characteristic is used for representing length information of the sentence text in the paragraph text, and the length ratio of the sentence text in the paragraph text is generally used as the sentence length characteristic; the sentence placeholder feature is used to characterize the format placeholder in the sentence text.

Specifically, the sentence position feature includes a paragraph-head feature, a paragraph-middle feature and a paragraph-end feature, which respectively indicate that the sentence text is located at the head, in the middle or at the end of the paragraph text. In one embodiment, when the sentence position feature of the sentence text is obtained, the marker "<PAS>" is added at the beginning of the sentence if the sentence text is located at the head of the paragraph; the marker "<PAB>" is added at the beginning of the sentence if the sentence text is in the middle of the paragraph; and the marker "<PAE>" is added at the beginning of the sentence if the sentence text is at the end of the paragraph. The position feature of the sentence text is then determined from the marker added at the beginning of the sentence.
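The position marking can be sketched as below. This is a minimal illustration: the markers are written without internal spaces, the function name `tag_sentence_positions` is hypothetical, and the handling of a one-sentence paragraph (tagged as paragraph head) is an assumption, since the text does not cover that case.

```python
def tag_sentence_positions(sentences):
    """Prefix each sentence of one paragraph with its position marker.

    "<PAS>" marks the first sentence, "<PAE>" the last and "<PAB>" the ones
    in between. A single-sentence paragraph gets "<PAS>" (assumed behavior).
    """
    tagged = []
    last = len(sentences) - 1
    for i, sentence in enumerate(sentences):
        if i == 0:
            marker = "<PAS>"
        elif i == last:
            marker = "<PAE>"
        else:
            marker = "<PAB>"
        tagged.append(marker + sentence)
    return tagged


tagged = tag_sentence_positions(["1. Scope.", "This part applies to X.", "See Annex A."])
```

Because the marker becomes part of the sentence text, a downstream encoder can learn the position feature together with the sentence content.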

In particular, the sentence length feature may be determined based on the proportion of the paragraph text's length taken up by the sentence text. In one embodiment, the sentence length feature is set to S1 if the length ratio of the sentence text in the paragraph text is lower than 0.15; it is set to F1 if the length ratio is higher than 0.98; and it is set to L1 if the length ratio is between 0.15 and 0.98.
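The thresholds above map directly onto a small function. This is a minimal sketch: measuring length in characters and assigning the boundary values 0.15 and 0.98 to L1 are assumptions, and the name `sentence_length_feature` is hypothetical.

```python
def sentence_length_feature(sentence, paragraph):
    """Return "S1", "F1" or "L1" from the sentence-to-paragraph length ratio.

    Length is counted in characters (an assumption of this sketch).
    """
    if not paragraph:
        raise ValueError("paragraph must not be empty")
    ratio = len(sentence) / len(paragraph)
    if ratio < 0.15:
        return "S1"  # short sentence relative to its paragraph
    if ratio > 0.98:
        return "F1"  # the sentence makes up (almost) the whole paragraph
    return "L1"      # everything in between, boundaries included


label = sentence_length_feature("1. Scope", "1. Scope" + " text" * 20)
```

A heading that forms its own paragraph yields F1, while a brief heading that opens a long paragraph yields S1, so both typical outline layouts produce a distinctive label.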

Specifically, the sentence placeholder feature can be determined from the format placeholders in the sentence text. In one specific embodiment, feature extraction is performed on the encodings of the format placeholders to obtain corresponding feature vectors, which are used as the sentence placeholder feature.

In another embodiment, the method for acquiring sentence placeholder features comprises the following steps:

Illustratively, the characters in the text to be extracted are divided into readable characters and format placeholders, and the sentence placeholder feature corresponding to each sentence text is determined based on the format placeholders in that sentence text.

Specifically, after the format placeholders of each sentence text are obtained, they are fed into a trained network model to obtain the sentence placeholder feature corresponding to that sentence text. The trained network model performs feature extraction on the encodings of the input format placeholders to generate feature vectors; it includes, but is not limited to, GPT (Generative Pre-trained Transformer) or BERT (Bidirectional Encoder Representations from Transformers).

In the embodiment, the sentence placeholder characteristics of each sentence of text in the text to be extracted are acquired based on the format placeholders in the text to be extracted, so that the sentence placeholder characteristics of each sentence of text are associated with each format placeholder, the accuracy of the sentence placeholder characteristics is improved, and the accuracy of text outline extraction is improved.
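
The character-division step of this embodiment might look like the sketch below. This passage does not define which characters count as format placeholders, so the sketch assumes they are whitespace characters such as spaces, tabs and line breaks; the function name `split_characters` is likewise hypothetical.

```python
def split_characters(sentence):
    """Split a sentence into readable characters and format placeholders.

    Assumption: format placeholders are whitespace characters
    (space, tab, newline and similar); everything else is readable.
    """
    readable, placeholders = [], []
    for ch in sentence:
        if ch.isspace():
            placeholders.append(ch)
        else:
            readable.append(ch)
    return readable, placeholders
```

The placeholder list of each sentence would then be encoded and passed to the feature extraction model.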

In another embodiment, the obtaining of the sentence fusion feature of each sentence in the text to be extracted based on the sentence content feature and the sentence format feature includes:

step 1: performing fusion processing on the sentence length feature, the sentence placeholder feature and the sentence content feature to obtain a sentence initial fusion feature;

step 2: performing fusion processing on the sentence initial fusion feature and the sentence position feature to obtain the sentence fusion feature.

Illustratively, the sentence format features in this embodiment include the sentence position feature, the sentence length feature and the sentence placeholder feature at the same time. After the sentence format features are obtained, the sentence length feature $F_l$, the sentence placeholder feature $F_b$ and the sentence content feature $S_j$ are first added together in a weighted fusion to obtain the sentence initial fusion feature $S_r$:

$$S_r = (w_l F_l + w_b F_b + w_r S_j) + b_{rr}$$

where $w_l$, $w_b$, $w_r$ and $b_{rr}$ are learnable parameters. Further, the position information of the sentence text within the paragraph text is added to the sentence initial fusion feature; that is, the sentence initial fusion feature $S_r$ and the sentence position feature $F_p$ are concatenated to obtain the final sentence fusion feature $S_{rr}$.

Optionally, the method in this embodiment is only an example; in this application the sentence length feature $F_l$, the sentence placeholder feature $F_b$, the sentence content feature $S_j$ and the sentence position feature $F_p$ may also be concatenated directly to obtain the sentence fusion feature $S_{rr}$.
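
The two-stage fusion can be sketched in plain Python as follows. The sketch assumes that the three added features share one dimension, that the learnable weights are scalars and the bias is a scalar, and that the second stage is a simple concatenation; the function name is hypothetical and none of these choices is confirmed by the text.

```python
def fuse_sentence_features(F_l, F_b, S_j, F_p, w_l, w_b, w_r, b_rr):
    """Weighted addition followed by concatenation with the position feature.

    Assumptions: F_l, F_b and S_j are equal-length lists of floats;
    w_l, w_b, w_r and b_rr are scalar parameters.
    """
    # Stage 1: S_r = (w_l * F_l + w_b * F_b + w_r * S_j) + b_rr, element by element
    S_r = [w_l * l + w_b * b + w_r * s + b_rr for l, b, s in zip(F_l, F_b, S_j)]
    # Stage 2: append the sentence position feature F_p to S_r
    S_rr = S_r + list(F_p)
    return S_r, S_rr
```

In a trained model the weights would more likely be matrices learned jointly with the rest of the network; scalars are used here only to keep the arithmetic visible.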

The embodiment combines the sentence length characteristic, the sentence placeholder characteristic, the sentence position characteristic and the sentence content characteristic to generate the sentence fusion characteristic, thereby fully combining the text characteristics of different dimensions such as the relevant content information of characters, sentences, paragraphs and punctuations in the text to be extracted, the length information of the sentence text, the expression space of the outline and the text, the implicit relationship of the mutual positions and the like, improving the richness of the sentence fusion characteristic and further improving the accuracy of the sentence fusion characteristic.

In another embodiment, the obtaining paragraph features of each text segment in the text to be extracted based on the sentence content features and the corresponding weights of each sentence in each text segment includes:

step 1: constructing a weight matrix and a bias matrix corresponding to sentence content characteristics of all sentence texts;

step 2: obtaining paragraph initial features based on the sentence content features, the weight matrix and the bias matrix;

step 3: performing normalization processing and aggregation processing on the paragraph initial features to obtain the paragraph features.

Illustratively, based on sentence content characteristics of each sentence text in the paragraph text, corresponding weights are determined, and then a weight matrix and a bias matrix are constructed. And performing weighting processing on the sentence content characteristics based on the weight matrix and the bias matrix to obtain corresponding paragraph initial characteristics. Further, all the paragraph initial features are subjected to normalization processing and aggregation processing, so that the final paragraph features are obtained.

Optionally, before the weighting calculation, the sentence content feature $s_{ij}$ may first be fed into a sequence feature extraction model for feature extraction; the extracted feature is then weighted based on the constructed weight matrices $W_{w2}$ and $u_{w2}$ and the bias matrix $b_{w2}$ to obtain the paragraph initial feature $\beta_{ij}$, where i is the sequence number of the paragraph text and j is the sequence number of the sentence text within the paragraph text. After the paragraph initial features are calculated, they are normalized over all sentence texts in each paragraph text to obtain the normalization result $e_{ij}$.

Further, the normalization result and the features extracted by the sequence feature extraction model are aggregated to obtain the paragraph feature $PS_i$.

Specifically, the sequence feature extraction model in this embodiment includes, but is not limited to, a Transformer (self-attention model) and a BiLSTM (bidirectional long short-term memory model). Extracting the sentence content features again with the sequence feature extraction model improves their expressive power.
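
The exact formulas for the paragraph initial feature, the normalization result and the paragraph feature are not reproduced in this text. The sketch below therefore assumes a common additive attention pooling form: a score u · tanh(W h + b) per sentence, softmax normalization within the paragraph, and a weighted sum of the sentence vectors. The function name and the use of already extracted sentence vectors `h` as input are assumptions.

```python
import math

def paragraph_feature(h, W, b, u):
    """Attention-style pooling of sentence vectors into one paragraph vector.

    h: sentence vectors of one paragraph (assumed output of the sequence model)
    W: weight matrix (list of rows), b: bias vector, u: scoring vector
    """
    # Paragraph initial feature (assumed form): beta_j = u . tanh(W h_j + b)
    beta = []
    for h_j in h:
        hidden = [
            math.tanh(sum(w * x for w, x in zip(W_row, h_j)) + b_k)
            for W_row, b_k in zip(W, b)
        ]
        beta.append(sum(u_k * v for u_k, v in zip(u, hidden)))
    # Normalization over the sentences of the paragraph (assumed softmax)
    peak = max(beta)
    exps = [math.exp(x - peak) for x in beta]
    total = sum(exps)
    e = [x / total for x in exps]
    # Aggregation: weighted sum of the sentence vectors
    PS = [sum(e_j * h_j[t] for e_j, h_j in zip(e, h)) for t in range(len(h[0]))]
    return beta, e, PS
```

With this form, sentences whose vectors score higher contribute more to the paragraph feature, which matches the stated goal of weighting each sentence within its paragraph.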

In the embodiment, weight matrixes and bias matrixes corresponding to sentence content characteristics of all sentence texts are constructed; obtaining paragraph initial characteristics based on the sentence content characteristics, the weight matrix and the bias matrix; the paragraph initial features are subjected to normalization processing and aggregation processing to obtain the paragraph features, so that the paragraph features can fully reflect content information of the paragraph text, the accuracy of the paragraph features is improved, and the accuracy of text outline extraction is further improved.

In another embodiment, obtaining outline information corresponding to a text to be extracted based on the sentence fusion characteristics and the paragraph characteristics includes:

step 1: weighting the sentence fusion characteristics and the paragraph characteristics, and normalizing the processing result;

step 2: and determining outline information of the text to be extracted based on the result of the normalization processing.

Illustratively, after sentence fusion characteristics and paragraph characteristics are obtained, weighted fusion and normalization processing are performed on the sentence fusion characteristics and the paragraph characteristics to obtain corresponding processing results. And further, analyzing and predicting the processing result to obtain a corresponding prediction result, and determining whether the sentence text is an outline sentence or not based on the prediction result corresponding to each sentence text.

Specifically, in the training stage, after the sentence fusion feature $S_{rr}$ and the paragraph feature $PS_i$ are obtained, they are stacked into a single column feature. The stacked feature is weighted by a weight matrix $w_i$ and a bias matrix $b_i$, and the weighted result is passed through a normalization function to calculate the probability value P that each sentence text belongs to an outline sentence.

Further, a cross-entropy loss is calculated from the probability values of each sentence text at each level, and the model is adjusted according to this loss. In the cross-entropy loss $L(y, p)$, N is the total number of samples, K is the total number of label values, i is the sample index, k is the label index, $P_{i,k}$ is the predicted probability of the k-th label value for the i-th sample, and $y_{i,k}$ is the corresponding ground-truth label value.
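
The prediction and loss computation can be sketched as below. The formulas themselves are not reproduced in this text, so the sketch assumes that the normalization function is a softmax and that the loss is the standard categorical cross-entropy averaged over the N samples; all function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]

def outline_probabilities(S_rr, PS_i, w, b):
    """Stack S_rr and PS_i, apply the weights and bias, then normalize."""
    stacked = list(S_rr) + list(PS_i)
    logits = [
        sum(w_kt * x_t for w_kt, x_t in zip(w_k, stacked)) + b_k
        for w_k, b_k in zip(w, b)
    ]
    return softmax(logits)

def cross_entropy(y, p):
    """Assumed form: L(y, p) = -(1/N) * sum_i sum_k y[i][k] * log(p[i][k])."""
    total = 0.0
    for y_i, p_i in zip(y, p):
        for y_ik, p_ik in zip(y_i, p_i):
            if y_ik > 0:          # zero labels contribute nothing
                total -= y_ik * math.log(p_ik)
    return total / len(y)
```

Here `y` holds one-hot label rows and `p` holds the predicted probability rows, one row per sentence sample.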

Specifically, during training, after each training round (or after a certain number of training rounds), a test result is obtained on the validation set and the best validation-set accuracy is recorded. If the test error of the network model on the validation set rises as the number of training rounds increases, training is stopped. After training is finished, the outline information of the text to be extracted is extracted by the trained network model.
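
The stopping rule can be illustrated with a generic early-stopping loop. The callables `train_one_round` and `evaluate` and the `patience` parameter are hypothetical; the text only says that training stops when the validation error rises, so stopping after one rise (patience of 1) is an assumption.

```python
def train_with_early_stopping(train_one_round, evaluate, max_rounds, patience=1):
    """Train until the validation error rises `patience` rounds in a row.

    evaluate() is assumed to return (validation_accuracy, validation_error).
    Returns the best validation accuracy and the round in which it occurred.
    """
    best_accuracy = float("-inf")
    best_round = None
    previous_error = float("inf")
    consecutive_rises = 0
    for round_index in range(max_rounds):
        train_one_round()
        accuracy, error = evaluate()
        if accuracy > best_accuracy:
            # Record the best validation-set accuracy seen so far
            best_accuracy, best_round = accuracy, round_index
        if error > previous_error:
            consecutive_rises += 1
            if consecutive_rises >= patience:
                break             # validation error is rising: stop training
        else:
            consecutive_rises = 0
        previous_error = error
    return best_accuracy, best_round
```

In practice the model parameters from the best round would also be saved so that they can be restored after stopping.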

In this embodiment, weighting processing is performed on the sentence fusion characteristics and the paragraph characteristics, and normalization processing is performed on the processing result; and determining outline information of the text to be extracted based on the result of the normalization processing, thereby fully combining the correlation weight relationship between each sentence of text and other sentences of text, and considering the context of the paragraph of the text and the format information of the sentence when determining whether each sentence of text is an outline sentence, thereby improving the accuracy of extracting the outline information.

In another embodiment, with reference to the above embodiments, the present application further discloses a flow diagram of a specific text outline extraction method. Referring to fig. 3, fig. 3 is a schematic flow chart of a method for extracting a text outline according to another embodiment of the present application. Specifically, as shown in fig. 3, the method for extracting the text outline includes:

S1: dividing the text to be extracted into readable characters and format placeholders, where Cjt denotes the t-th readable character in the j-th sentence and Bt denotes the t-th format placeholder;

S2: processing the readable characters Cjt and the format placeholders Bt with a trained model to obtain character features hjt and format placeholder features Fb;

S3: constructing a word weight matrix and obtaining the sentence content feature Sj through aggregation training;

S4: obtaining the sentence format features: extracting the sentence position feature Fp of the sentence text in the paragraph, which covers three kinds of information (paragraph head, paragraph middle and paragraph end); extracting the length proportion feature Fl of the sentence, classified according to the proportion of the paragraph's length that the sentence occupies; and extracting the sentence placeholder feature Fb contained in the sentence;

S5: performing feature fusion on the sentence content feature Sj and the sentence format features, namely the sentence position feature Fp, the sentence length feature Fl and the sentence placeholder feature Fb, to obtain the sentence fusion feature Srr;

S6: performing feature extraction on the sentence content feature Sj again, constructing a sentence weight matrix, performing a weighted calculation on the extracted features, and obtaining the paragraph feature PSi, in which the sentence weights are fused, through aggregation training;

S7: performing fusion training on the sentence fusion feature Srr and the paragraph feature PSi to obtain a trained outline extraction model.

It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.

In this embodiment, a device for extracting a text outline is further provided. The device is used to implement the foregoing embodiments and preferred implementations, and what has already been described is not repeated. The terms "module", "unit", "sub-unit" and the like as used below may refer to a combination of software and/or hardware that implements a predetermined function. Although the devices described in the embodiments below are preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 4 is a block diagram of a configuration of an apparatus for extracting a text outline according to the present embodiment, and as shown in fig. 4, the apparatus includes:

the first obtaining module 10 is configured to obtain, based on readable characters of a text to be extracted, sentence content characteristics of each sentence of the text in the text to be extracted, and obtain, based on a format of the text to be extracted, sentence format characteristics of each sentence of the text in the text to be extracted, where the sentence content characteristics include character characteristics of a corresponding sentence of the text;

the first obtaining module 10 is further configured to obtain character features of the text to be extracted based on the readable characters of the text to be extracted;

acquiring sentence content characteristics of each sentence of text in the text to be extracted based on character characteristics and corresponding weights of a plurality of readable characters in each sentence of text;

the first obtaining module 10 is further configured to obtain a sentence placeholder feature of each sentence of text in the text to be extracted based on the format placeholder in the text to be extracted;

the second obtaining module 20 is configured to obtain a sentence fusion feature of each sentence of text in the text to be extracted based on the sentence content feature and the sentence format feature;

the second obtaining module 20 is further configured to perform fusion processing on the sentence length characteristic, the sentence placeholder characteristic, and the sentence content characteristic to obtain an initial sentence fusion characteristic;

performing fusion processing on the sentence initial fusion characteristics and the sentence position characteristics to obtain sentence fusion characteristics;

a third obtaining module 30, configured to obtain paragraph features of each text segment in the text to be extracted based on the sentence content features and the corresponding weights of each sentence text in each text segment;

the third obtaining module 30 is further configured to construct a weight matrix and a bias matrix corresponding to sentence content characteristics of all sentence texts;

obtaining paragraph initial characteristics based on the sentence content characteristics, the weight matrix and the bias matrix;

carrying out normalization processing and aggregation processing on the paragraph initial features to obtain paragraph features;

a fourth obtaining module 40, configured to obtain outline information corresponding to the text to be extracted based on the sentence fusion feature and the paragraph feature;

the fourth obtaining module 40 is further configured to perform weighting processing on the sentence fusion features and the paragraph features, perform normalization processing on the processing result, and determine the outline information of the text to be extracted based on the result of the normalization processing.

The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.

There is also provided in this embodiment an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.

Optionally, the electronic apparatus may further include an input/output device, and the input/output device is connected to the processor.

Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:

step 1: acquiring sentence content characteristics of each sentence of text in the text to be extracted based on readable characters of the text to be extracted, and acquiring sentence format characteristics of each sentence of text in the text to be extracted based on the format of the text to be extracted;

step 2: acquiring sentence fusion characteristics of each sentence of text in the text to be extracted based on the sentence content characteristics and the sentence format characteristics;

step 3: acquiring paragraph characteristics of each text segment in the text to be extracted based on the sentence content characteristics and the corresponding weight of each sentence text in each text segment;

step 4: acquiring outline information corresponding to the text to be extracted based on the sentence fusion characteristics and the paragraph characteristics.

It should be noted that, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and optional implementations, and details are not described again in this embodiment.

In addition, in combination with the method for extracting the outline of the text provided in the above embodiment, a storage medium may also be provided in this embodiment. The storage medium having stored thereon a computer program; the computer program, when executed by a processor, implements the method for extracting a text outline in any of the above embodiments.

It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to be limiting. All other embodiments, which can be derived by a person skilled in the art from the examples provided herein without any inventive step, shall fall within the scope of protection of the present application.

It is obvious that the drawings are only examples or embodiments of the present application, and it is obvious to those skilled in the art that the present application can be applied to other similar cases according to the drawings without creative efforts. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.

Reference throughout this application to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by one of ordinary skill in the art that the embodiments described in this application may be combined with other embodiments without conflict.

The above-mentioned embodiments express only several implementations of the present application. Their description is specific and detailed, but should not be construed as limiting the scope of patent protection. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present application should be subject to the appended claims.

Claims

1. A method for extracting a text outline is characterized by comprising the following steps:

acquiring sentence content characteristics of each sentence of text in the text to be extracted based on readable characters of the text to be extracted, and acquiring sentence format characteristics of each sentence of text in the text to be extracted based on a format of the text to be extracted;

and obtaining outline information corresponding to the text to be extracted based on the sentence fusion characteristics and the paragraph characteristics.

2. The method for extracting the outline of the text according to claim 1, wherein the obtaining sentence content features of each sentence of text in the text to be extracted based on the readable characters of the text to be extracted comprises:

and acquiring sentence content characteristics of each sentence of text in the text to be extracted based on the character characteristics and corresponding weights of a plurality of readable characters in each sentence of text.

3. The method of extracting a textual outline according to claim 1, wherein the sentence format features include a sentence position feature, a sentence length feature and a sentence placeholder feature.

4. The method for extracting textual outline according to claim 3, wherein the method for obtaining sentence placeholder features comprises:

5. The method for extracting the text outline according to claim 3, wherein the obtaining sentence fusion characteristics of each sentence of text in the text to be extracted based on the sentence content characteristics and the sentence format characteristics comprises:

6. The method for extracting the text outline according to claim 1, wherein the obtaining paragraph features of each text segment in the text to be extracted based on the sentence content features and the corresponding weights of each sentence text in each text segment comprises:

constructing a weight matrix and a bias matrix corresponding to the sentence content characteristics of all sentence texts;

7. The method for extracting outline of text according to claim 1, wherein the obtaining of outline information corresponding to the text to be extracted based on the sentence fusion feature and the paragraph feature comprises:

8. An apparatus for extracting a text outline, comprising:

the first acquisition module is used for acquiring sentence content characteristics of each sentence in the text to be extracted based on readable characters of the text to be extracted and acquiring sentence format characteristics of each sentence in the text to be extracted based on the format of the text to be extracted, wherein the sentence content characteristics comprise character characteristics of a corresponding sentence text;

the third acquisition module is used for acquiring paragraph characteristics of each text segment in the text to be extracted based on the sentence content characteristics and the corresponding weight of each sentence text in each text segment;

and the fourth obtaining module is used for obtaining outline information corresponding to the text to be extracted based on the sentence fusion characteristics and the paragraph characteristics.

9. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to perform the method of extracting a text outline according to any one of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for extracting a textual outline according to any one of claims 1 to 7.