CN112818077B - Text processing method, device, equipment and storage medium - Google Patents

Text processing method, device, equipment and storage medium

Info

Publication number
CN112818077B
Authority
CN
China
Prior art keywords
text
sentence
boundary position
candidate
type
Prior art date
Legal status
Active
Application number
CN202011632673.0A
Other languages
Chinese (zh)
Other versions
CN112818077A (en)
Inventor
闫莉
万根顺
高建清
刘聪
王智国
胡国平
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202011632673.0A
Publication of CN112818077A
Application granted
Publication of CN112818077B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data; Database structures therefor; File system structures therefor
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The embodiment of the present application discloses a text processing method, in which each sentence in a text is processed according to the text features of each sentence in the text to obtain a boundary position sequence. Each boundary position in the boundary position sequence indicates a start sentence or an end sentence of one valid segment, where the start sentence of the kth valid segment is determined based on the end sentence of the (k-1)th valid segment, and the end sentence of the kth valid segment is determined based on the start sentence of the kth valid segment. Based on the boundary position sequence, the valid segments in the text are obtained to construct the target text. The scheme of the present application realizes automatic extraction of the valid segments in the text and improves the efficiency of text normalization.

Description

Text processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a text processing method, apparatus, device, and storage medium.
Background
Currently, text is the main form of recording information, but the recorded information may contain invalid information, such as information irrelevant to the field to which the recorded content belongs, and the presence of such invalid information reduces the readability of the text. Therefore, it is necessary to normalize the text to remove the invalid information in it.
Manually normalizing text is one implementation, but it is inefficient. Therefore, how to improve the efficiency of text normalization is a technical problem to be solved.
Disclosure of Invention
In view of the foregoing, the present application provides a text processing method, apparatus, device, and storage medium, so as to improve efficiency of text normalization.
In order to achieve the above object, the following solutions have been proposed:
a text processing method, comprising:
processing each sentence in the text according to the text features of each sentence in the text to obtain a boundary position sequence; each boundary position in the boundary position sequence indicates a start sentence or an end sentence of one valid segment, where the start sentence of the Kth valid segment is determined based on the end sentence of the (K-1)th valid segment, and the end sentence of the Kth valid segment is determined based on the start sentence of the Kth valid segment; K is a positive integer;
and acquiring effective fragments in the text based on the boundary position sequence to form target text.
In the above method, preferably, the processing each sentence in the text according to the text feature of each sentence in the text to obtain a boundary position sequence includes:
Acquiring a plurality of candidate boundary position sequences according to the text characteristics of each sentence in the text;
calculating the score of each candidate boundary position sequence, wherein the score of a candidate boundary position sequence characterizes its confidence, and the higher the score, the higher the confidence;
and taking the candidate boundary position sequence with the highest score as the boundary position sequence.
In the above method, preferably, the obtaining a plurality of candidate boundary position sequences according to text features of each sentence in the text includes:
acquiring a first type candidate boundary position sequence according to the text characteristics of each sentence in the text, and then acquiring a second type candidate boundary position sequence based on the first type candidate boundary position sequence;
each candidate boundary position in the first-type candidate boundary position sequence indicates a candidate start sentence or a candidate end sentence of one of the 1st through (K-1)th valid segments in the text;
each boundary position in the second-type candidate boundary position sequence indicates a candidate start sentence or a candidate end sentence of one of the 1st through (K-1)th valid segments in the text, or indicates a candidate start sentence of the Kth valid segment in the text.
In the above method, preferably, obtaining a second-type candidate boundary position sequence based on the first-type candidate boundary position sequence includes:
for the candidate end sentence of the (K-1)th valid segment indicated by a candidate boundary position in each first-type candidate boundary position sequence, calculating, according to the text features of the candidate end sentence and the text features of each sentence located after the candidate end sentence in the text, the probability that each sentence located after the candidate end sentence belongs to the start sentence of the Kth valid segment;
for each sentence located after the candidate end sentence, calculating, according to the first-type candidate boundary position sequence containing the candidate boundary position that indicates the candidate end sentence and the probability that the sentence belongs to the start sentence of the Kth valid segment, the score of the new candidate boundary position sequence obtained by adding the sentence to that first-type candidate boundary position sequence;
and determining second-type candidate boundary position sequences among all new candidate boundary position sequences according to the scores of all new candidate boundary position sequences obtained based on the candidate end sentences of the (K-1)th valid segment indicated by candidate boundary positions in the first-type candidate boundary position sequences.
In the above method, preferably, the obtaining a plurality of candidate boundary position sequences according to text features of each sentence in the text includes:
acquiring a first type candidate boundary position sequence based on the second type candidate boundary position sequence after acquiring the second type candidate boundary position sequence according to the text characteristics of each sentence in the text;
each candidate boundary position in the first-type candidate boundary position sequence indicates a candidate start sentence or a candidate end sentence of one of the 1st through Kth valid segments in the text;
each boundary position in the second-type candidate boundary position sequence indicates a candidate start sentence or a candidate end sentence of one of the 1st through (K-1)th valid segments in the text, or indicates a candidate start sentence of the Kth valid segment in the text.
In the above method, preferably, obtaining a first-type candidate boundary position sequence based on the second-type candidate boundary position sequence includes:
for the candidate start sentence of the Kth valid segment indicated by a candidate boundary position in each second-type candidate boundary position sequence, calculating, according to the text features of the candidate start sentence and the text features of each sentence located after the candidate start sentence in the text, the probability that each sentence located after the candidate start sentence belongs to the end sentence of the Kth valid segment;
for each sentence located after the candidate start sentence, calculating, according to the second-type candidate boundary position sequence containing the candidate boundary position that indicates the candidate start sentence and the probability that the sentence belongs to the end sentence of the Kth valid segment, the score of the new candidate boundary position sequence obtained by adding the sentence to that second-type candidate boundary position sequence;
and determining first-type candidate boundary position sequences among all new candidate boundary position sequences according to the scores of all new candidate boundary position sequences obtained based on the candidate start sentences of the Kth valid segment indicated by candidate boundary positions in each second-type candidate boundary position sequence.
The above method, preferably, further comprises:
determining redundant sentences in the target text;
and deleting redundant sentences in the target text.
In the above method, preferably, the determining redundant sentences in the target text includes:
for each sentence in the target text, acquiring the probability that the sentence belongs to a redundant sentence; the probability that the sentence belongs to the redundant sentence is calculated according to the text characteristics of the sentence and the text characteristics of the initial sentence of the effective fragment where the sentence is located;
If the probability that the sentence belongs to the redundant sentence is larger than the probability threshold value, determining the sentence as the redundant sentence.
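The thresholding step above can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the function name, the example sentences, and the probabilities are hypothetical, and in practice the per-sentence probabilities would come from a model conditioned on the sentence's text features and those of its segment's start sentence:

```python
# Hypothetical sketch of the redundant-sentence filtering step: keep only the
# sentences whose probability of being redundant does not exceed the threshold.

def filter_redundant(sentences, redundancy_probs, threshold=0.5):
    """Return the sentences whose redundancy probability is <= threshold."""
    return [s for s, p in zip(sentences, redundancy_probs) if p <= threshold]

# Toy example: the second sentence repeats the first, so a model would
# plausibly assign it a high redundancy probability.
kept = filter_redundant(
    ["patient reports headache", "patient reports headache again", "prescribed rest"],
    [0.1, 0.9, 0.2],
)
```

The threshold of 0.5 is a placeholder; the patent only specifies that sentences whose redundancy probability exceeds a probability threshold are marked redundant.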
In the above method, preferably, the text feature of each sentence in the text is obtained by the following manner:
for each sentence in the text, acquiring a word vector of each word in the sentence and a code of the position of each word in the sentence;
obtaining a representation vector of the sentence according to the word vector of each word and the position code of each word in the sentence;
acquiring the codes of the positions of the sentences in the text;
and obtaining the text characteristics of the sentence according to the characterization vector of the sentence and the position code of the sentence.
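The feature-construction steps above can be sketched as follows. This is an illustrative assumption, not the patent's exact model: a sinusoidal position encoding is assumed (the patent does not fix a scheme), word vectors are combined with per-word position codes and averaged into the sentence's representation vector, and the sentence-level position code is then added:

```python
import math

DIM = 8  # embedding dimension (illustrative)

def position_encoding(pos, dim=DIM):
    # Sinusoidal position encoding, one common choice; assumed for illustration.
    enc = []
    for i in range(dim):
        angle = pos / (10000.0 ** ((2 * (i // 2)) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sentence_feature(word_vectors, sentence_index):
    # Step 1-2: word vector + code of each word's position in the sentence.
    words = [add(wv, position_encoding(k)) for k, wv in enumerate(word_vectors)]
    # Representation vector of the sentence: here, the average over its words.
    sent_repr = [sum(col) / len(words) for col in zip(*words)]
    # Step 3-4: combine with the code of the sentence's position in the text.
    return add(sent_repr, position_encoding(sentence_index))

# A two-word "sentence" at position 3 in the text, with toy word vectors.
feat = sentence_feature([[1.0] * DIM, [0.0] * DIM], sentence_index=3)
```

Averaging the word representations is one simple way to obtain the representation vector; the patent leaves the combination open.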
In the above method, preferably, the processing each sentence in the text according to the text feature of each sentence in the text to obtain a boundary position sequence includes:
and processing each sentence in the text by using a text processing model according to the text characteristics of each sentence in the text to obtain a boundary position sequence.
In the above method, preferably, the training process of the text processing model includes:
inputting a first-type text sample into the text processing model to obtain a boundary position sequence corresponding to the first-type text sample output by the text processing model; each boundary position in the boundary position sequence corresponding to the first-type text sample indicates a start sentence or an end sentence of one valid segment in the first-type text sample, where the start sentence of the Kth valid segment in the first-type text sample is determined based on the end sentence of the (K-1)th valid segment in the first-type text sample, and the end sentence of the Kth valid segment in the first-type text sample is determined based on the start sentence of the Kth valid segment in the first-type text sample;
and updating parameters of the text processing model with the goal of making the boundary position sequence corresponding to the first-type text sample approach the boundary position sequence label corresponding to the first-type text sample.
In the above method, preferably, the end sentence of the (K-1)th valid segment in the first-type text sample is determined based on the boundary position sequence label corresponding to the first-type text sample, and the start sentence of the Kth valid segment in the first-type text sample is determined based on the boundary position sequence label corresponding to the first-type text sample.
In the above method, preferably, the first-type text sample is obtained by inserting an invalid segment into valid text, or the first-type text sample is an originally acquired text containing a valid segment and an invalid segment;
or alternatively,
the first-type text sample is obtained by inserting an invalid segment and a redundant segment into valid text, or the first-type text sample is an originally acquired text containing a valid segment, an invalid segment, and a redundant segment.
In the above method, preferably, the training process of the text processing model includes:
inputting a first-type text sample into the text processing model to obtain a boundary position sequence corresponding to the first-type text sample output by the text processing model; each boundary position in the boundary position sequence corresponding to the first-type text sample indicates a start sentence or an end sentence of one valid segment in the first-type text sample, where the start sentence of the Kth valid segment in the first-type text sample is determined based on the end sentence of the (K-1)th valid segment in the first-type text sample, and the end sentence of the Kth valid segment in the first-type text sample is determined based on the start sentence of the Kth valid segment in the first-type text sample;
updating parameters of the text processing model with the goal of making the boundary position sequence corresponding to the first-type text sample approach the boundary position sequence label corresponding to the first-type text sample, to obtain an initial text processing model; the first-type text sample is obtained by inserting at least an invalid segment into valid text;
inputting a second-type text sample into the initial text processing model to obtain a boundary position sequence corresponding to the second-type text sample output by the initial text processing model; the second-type text sample is an originally acquired text containing at least a valid segment and an invalid segment;
and updating parameters of the initial text processing model with the goal of making the boundary position sequence corresponding to the second-type text sample approach the boundary position sequence label corresponding to the second-type text sample.
A text processing apparatus, comprising:
the boundary position sequence acquisition module is used for processing each sentence in the text according to the text features of each sentence in the text to obtain a boundary position sequence; each boundary position in the boundary position sequence indicates a start sentence or an end sentence of one valid segment, where the start sentence of the Kth valid segment is determined based on the end sentence of the (K-1)th valid segment, and the end sentence of the Kth valid segment is determined based on the start sentence of the Kth valid segment; K is a positive integer;
And the target text acquisition module is used for acquiring the effective fragments in the text based on the boundary position sequence so as to form a target text.
A text processing device comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the text processing method according to any one of the above.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the text processing method as claimed in any one of the preceding claims.
As can be seen from the above technical solution, according to the text processing method provided by the embodiments of the present application, each sentence in the text is processed according to the text features of each sentence in the text to obtain a boundary position sequence; each boundary position in the boundary position sequence indicates a start sentence or an end sentence of one valid segment, where the start sentence of the kth valid segment is determined based on the end sentence of the (k-1)th valid segment, and the end sentence of the kth valid segment is determined based on the start sentence of the kth valid segment; based on the boundary position sequence, the valid segments in the text are obtained to construct the target text. The scheme of the present application realizes automatic extraction of the valid segments in the text and improves the efficiency of text normalization.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from them by a person of ordinary skill in the art without inventive effort.
FIG. 1 is a flowchart of one implementation of a text processing method disclosed in an embodiment of the present application;
FIG. 2 is a flowchart of one implementation of processing each sentence in a text according to text features of each sentence in the text to obtain a boundary position sequence according to the disclosure of the embodiment of the present application;
FIG. 3 is a flowchart of one implementation of obtaining a second type of candidate boundary position sequence based on a first type of candidate boundary position sequence disclosed in an embodiment of the present application;
FIG. 4 is a flowchart of one implementation of obtaining a first type of candidate boundary position sequence based on a second type of candidate boundary position sequence disclosed in an embodiment of the present application;
FIG. 5 is a flowchart of one implementation of determining redundant sentences in target text as disclosed in an embodiment of the present application;
FIG. 6 is a flowchart of one implementation of obtaining text features of sentences disclosed in an embodiment of the present application;
FIG. 7 is a schematic diagram of a structure of a text processing model disclosed in an embodiment of the present application;
FIG. 8 is a schematic diagram of a structure of a text processing device according to an embodiment of the present application;
fig. 9 is a block diagram of a hardware structure of a text processing apparatus according to an embodiment of the present application.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the present disclosure without inventive effort fall within the scope of the present disclosure.
At present, recording information through a voice recording system is widely applied in various industries. For example, in medical scenarios, communication between doctors and patients can be recorded through a voice recording system; in conference scenarios, the communication of conference participants can be recorded through a voice recording system. After the voice recording system collects the user's voice, the collected voice is transcribed into text for storage.
In practical applications, communication often contains various kinds of irrelevant or repeated information, so the text transcribed from the voice also contains irrelevant or repeated information, which reduces the readability of the text. For example, in a medical scenario, when a doctor communicates with a patient, a patient condition document can be entered through the voice recording system for subsequent reading, analysis, and archiving. However, during actual communication, dialogue irrelevant to the illness is often interspersed: the doctor may make small talk to soothe the patient's emotions, answer questions from the patient and family members that are unrelated to the condition (such as inquiries about the physical examination address or fees), or talk with nurses and other staff. Meanwhile, because a doctor may not have formed a complete line of thought while dictating, redundant information is often recorded. These irrelevant dialogues and repeated utterances leave the transcribed text with information irrelevant to the patient's condition or redundant information, which breaks the logical thread of the condition record and greatly degrades the reading experience. Therefore, it is necessary to normalize the text to remove the irrelevant and repeated information in it.
At present, text is mainly normalized manually, but manual normalization is inefficient, and the longer the text, the more obvious this inefficiency becomes. How to improve the efficiency of text normalization has therefore become a technical problem to be solved urgently.
In order to improve the efficiency of text normalization, an implementation flowchart of the text processing method provided in the embodiment of the present application is shown in fig. 1, and may include:
step S101: and processing each sentence in the text according to the text characteristics of each sentence in the text to obtain a boundary position sequence.
The text is the text to be normalized; it may be a doctor-patient communication document obtained through transcription by a voice recording system, a conference record document, or the like.
Each boundary position in the boundary position sequence indicates a start sentence or an end sentence of one valid segment, where the start sentence of the Kth valid segment is determined based on the end sentence of the (K-1)th valid segment, and the end sentence of the Kth valid segment is determined based on the start sentence of the Kth valid segment; K is a positive integer.
The Kth valid segment is any one of the valid segments in the text. If the Kth valid segment is the 1st valid segment in the text, the end sentence of the (K-1)th valid segment can be characterized by preset information.
Assuming there are N boundary positions in total in the boundary position sequence, the sequence indicates the start sentences and end sentences of N/2 valid segments in total, i.e., N/2 valid segments in the text are predicted. The boundary positions in the sequence may be stored in the pattern A_1, B_1, A_2, B_2, A_3, B_3, …, A_{N/2-1}, B_{N/2-1}, A_{N/2}, B_{N/2}, where A_1 indicates the start sentence of the 1st valid segment, B_1 indicates the end sentence of the 1st valid segment, A_2 indicates the start sentence of the 2nd valid segment, B_2 indicates the end sentence of the 2nd valid segment, A_3 indicates the start sentence of the 3rd valid segment, B_3 indicates the end sentence of the 3rd valid segment, and so on, up to A_{N/2}, which indicates the start sentence of the (N/2)th valid segment, and B_{N/2}, which indicates the end sentence of the (N/2)th valid segment.
Here, A_i may be the position information of the start sentence of the i-th valid segment in the text, e.g., the sentence number of that start sentence, and B_i may be the position information of the end sentence of the i-th valid segment in the text, e.g., the sentence number of that end sentence.
Alternatively, A_i may be the start sentence of the i-th valid segment itself, and B_i may be the end sentence of the i-th valid segment itself.
Alternatively, preset punctuation marks (such as a comma, period, question mark, or exclamation mark) may be used as sentence-break identifiers, and the text fragment between two sentence-break identifiers is called a sentence: for example, the text fragment between two adjacent commas is a sentence, the text fragment between a comma and the period adjacent to it is a sentence, the text fragment between a question mark and the period adjacent to it is a sentence, the text fragment between a question mark and the exclamation mark adjacent to it is a sentence, and so on. That is, the text fragment between any two adjacent sentence-break identifiers is called a sentence.
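A minimal sketch of this sentence-splitting convention: any run of text between two adjacent sentence-break identifiers is treated as one sentence. The regular expression below, covering both ASCII and full-width punctuation, is an illustrative choice rather than the patent's implementation:

```python
import re

# Treat comma, period, question mark, and exclamation mark (ASCII and
# full-width forms) as sentence-break identifiers.
SENTENCE_BREAKS = r"[,.?!，。？！]"

def split_sentences(text):
    """Split text into sentences at the preset sentence-break identifiers."""
    parts = re.split(SENTENCE_BREAKS, text)
    return [p.strip() for p in parts if p.strip()]

sents = split_sentences("How are you? Fine, thanks. Great!")
```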
Step S102: based on the sequence of boundary locations, valid segments in the text are obtained to construct the target text.
In the embodiment of the present application, once the start sentence and the end sentence of each valid segment in the text have been predicted, then for each valid segment, the fragment located between its start sentence and its end sentence (including both) is the valid segment itself. After all valid segments in the text are determined, they can be extracted from the text to form the target text, or the invalid fragments can be deleted directly from the text to obtain the target text.
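Step S102 can be sketched as follows, assuming for illustration that the boundary position sequence is a flat list of sentence indices in the pattern A_1, B_1, A_2, B_2, …, with inclusive start and end positions (this index representation is an assumption; the patent also allows storing the sentences themselves):

```python
# Sketch of step S102: pair each start index with the following end index and
# slice the corresponding valid segments out of the sentence list.

def extract_valid_segments(sentences, boundary_positions):
    """boundary_positions: [A1, B1, A2, B2, ...] sentence indices, inclusive."""
    segments = []
    for start, end in zip(boundary_positions[0::2], boundary_positions[1::2]):
        segments.append(sentences[start:end + 1])
    return segments

sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
# Two predicted valid segments: sentences 0-1 and sentences 4-5.
target = extract_valid_segments(sentences, [0, 1, 4, 5])
```

Joining the extracted segments in order yields the target text; deleting everything outside them gives the same result.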
According to the text processing method provided by the embodiment of the present application, each sentence in the text is processed according to the text features of each sentence in the text to obtain a boundary position sequence; each boundary position in the boundary position sequence indicates a start sentence or an end sentence of one valid segment, where the start sentence of the kth valid segment is determined based on the end sentence of the (k-1)th valid segment, and the end sentence of the kth valid segment is determined based on the start sentence of the kth valid segment; based on the boundary position sequence, the valid segments in the text are obtained to construct the target text. This realizes automatic extraction of the valid segments in the text and improves the efficiency of text normalization.
In an optional embodiment, the processing each sentence in the text according to the text feature of each sentence in the text to obtain the boundary position sequence may be implemented as follows: and processing each sentence in the text according to the text characteristics of each sentence in the text, and directly determining an optimal boundary position sequence.
In an alternative embodiment, the processing each sentence in the text according to the text feature of each sentence in the text to obtain the boundary position sequence may include:
Step S201: and acquiring a plurality of candidate boundary position sequences according to the text characteristics of each sentence in the text. Each candidate boundary position in each candidate boundary position sequence indicates a candidate start sentence or a candidate end sentence of a valid segment.
Wherein, for the same valid segment, the candidate boundary positions indicating its candidate start sentence in different candidate boundary position sequences may be the same or different, and the candidate boundary positions indicating its candidate end sentence in different candidate boundary position sequences may likewise be the same or different.
Step S202: the score of each candidate boundary position sequence is calculated, the score of each candidate boundary position characterizes the confidence of the candidate boundary, and the higher the score is, the higher the confidence is characterized.
For any candidate boundary position sequence, the score of the candidate boundary position sequence may be determined according to the probability value corresponding to each candidate boundary position in the candidate boundary position sequence, where if the candidate boundary position indicates a candidate start sentence of a valid segment, the probability value corresponding to the candidate boundary position is the probability that the candidate start sentence indicated by the candidate boundary position belongs to the start sentence of the valid segment, and if the candidate boundary position indicates a candidate end sentence of a valid segment, the probability value corresponding to the candidate boundary position is the probability that the candidate end sentence indicated by the candidate boundary position belongs to the end sentence of the valid segment.
Alternatively, for any one candidate boundary position sequence, the Score for that candidate boundary position sequence may be determined according to the following formula:
Score = (1/N) × Σ_{j=1}^{N} log P(S_j)    (1)
wherein N is the number of candidate boundary positions in the candidate boundary position sequence, and P(S_j) is the probability corresponding to the j-th candidate boundary position in the candidate boundary position sequence.
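As a minimal sketch, assuming (per the definitions above) that the score is the length-normalized sum of the log probabilities of the N candidate boundary positions; the exact form of formula (1) in the original filing may differ:

```python
import math

def sequence_score(boundary_probs):
    """Score of a candidate boundary position sequence: the length-normalized
    sum of the log probabilities P(S_j) of its N candidate boundary positions."""
    n = len(boundary_probs)
    return sum(math.log(p) for p in boundary_probs) / n
```

Under this scoring, a sequence whose candidate boundary positions all carry high probabilities receives a higher (less negative) score, matching the statement that a higher score characterizes higher confidence.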
Step S203: taking the candidate boundary position sequence with the highest score as the boundary position sequence.
In this embodiment, instead of directly determining the optimal boundary position sequence, a plurality of candidate boundary position sequences are determined first, and then an optimal boundary position sequence is selected from the plurality of candidate boundary position sequences.
In an alternative embodiment, one implementation of obtaining a plurality of candidate boundary position sequences according to the text features of each sentence in the text may be:
after the first type candidate boundary position sequence is obtained according to the text characteristics of each sentence in the text, the second type candidate boundary position sequence is obtained based on the first type candidate boundary position sequence.
Wherein each candidate boundary position in the first type candidate boundary position sequences indicates a candidate start sentence or a candidate end sentence of one of the 1st through (K-1)th valid segments in the text;
each candidate boundary position in the second type candidate boundary position sequences indicates a candidate start sentence or a candidate end sentence of one of the 1st through (K-1)th valid segments in the text, or indicates the candidate start sentence of the Kth valid segment in the text.
That is, in the embodiment of the present application, candidate start sentences and candidate end sentences are predicted valid segment by valid segment: for each valid segment, the candidate start sentence is predicted first and then the candidate end sentence, and only after the candidate end sentence of the (K-1)th valid segment is predicted can the candidate start sentence of the Kth valid segment be predicted.
In an alternative embodiment, an implementation flowchart of the above method for obtaining the second type of candidate boundary position sequence based on the first type of candidate boundary position sequence is shown in fig. 3, and may include:
Step S301: for the candidate ending sentence of the (K-1)th valid segment indicated by a candidate boundary position in each first type candidate boundary position sequence, calculating, according to the text features of the candidate ending sentence and the text features of each sentence located after it in the text, the probability that each sentence located after the candidate ending sentence belongs to the start sentence of the Kth valid segment.
For convenience of description, the candidate ending sentence of the (K-1)th valid segment indicated by a candidate boundary position in the p-th first type candidate boundary position sequence is denoted as S_{p.1.k-1.ce}. The candidate ending sentences of the (K-1)th valid segment indicated by candidate boundary positions in different first type candidate boundary position sequences are different.
For each sentence located after the candidate ending sentence S_{p.1.k-1.ce} in the text, the text features of that sentence may be spliced with the text features of the candidate ending sentence S_{p.1.k-1.ce}, and the spliced text features used to calculate the probability that the sentence belongs to the start sentence of the Kth valid segment.
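The splicing-and-classification step might be sketched as follows; this is a hypothetical realization in which the linear `weights`/`bias` and the sigmoid output stand in for the model's learned classifier, which the text does not specify:

```python
import math

def start_sentence_probability(candidate_end_feat, sentence_feat, weights, bias):
    """Splice (concatenate) the text features of the candidate ending sentence
    with those of a later sentence, then score the spliced features to estimate
    the probability that the later sentence starts the next valid segment."""
    spliced = list(candidate_end_feat) + list(sentence_feat)  # feature splicing
    logit = sum(w * x for w, x in zip(weights, spliced)) + bias
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> probability in (0, 1)
```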
Step S302: for each sentence located after the candidate ending sentence S_{p.1.k-1.ce}, calculating, according to the first type candidate boundary position sequence containing the candidate boundary position that indicates S_{p.1.k-1.ce} and the probability that the sentence belongs to the start sentence of the Kth valid segment, the score of the new candidate boundary position sequence obtained by adding the sentence to that first type candidate boundary position sequence.
The score of the new candidate boundary position sequence obtained by adding the sentence to the first type candidate boundary position sequence containing the candidate boundary position that indicates S_{p.1.k-1.ce} may be calculated using formula (1). In this case, N in the formula is the number of candidate boundary positions in the new candidate boundary position sequence, namely 2K-1, and P(S_j) is the probability corresponding to the j-th (j = 1, 2, 3, ..., 2K-1) candidate boundary position in the new candidate boundary position sequence.
Step S303: determining the second type candidate boundary position sequences among all the new candidate boundary position sequences according to the scores of all the new candidate boundary position sequences obtained based on the candidate ending sentences S_{p.1.k-1.ce} of the (K-1)th valid segment indicated by the candidate boundary positions in the respective first type candidate boundary position sequences. Between a second type candidate boundary position sequence and a first type candidate boundary position sequence, the candidate boundary positions indicating the candidate start sentence of the same valid segment may be the same or different, and likewise for the candidate boundary positions indicating the candidate end sentence of the same valid segment.
Assume that the number of sentences located after the candidate ending sentence S_{p.1.k-1.ce} in the text is M_{S_{p.1.k-1.ce}}. Based on the first type candidate boundary position sequence containing the candidate boundary position that indicates S_{p.1.k-1.ce}, M_{S_{p.1.k-1.ce}} new candidate boundary position sequences, and hence M_{S_{p.1.k-1.ce}} scores, can be obtained. Assuming the number of first type candidate boundary position sequences is B, the number M_total_1 of all new candidate boundary position sequences obtained based on the candidate ending sentences of the (K-1)th valid segment indicated by the candidate boundary positions in the respective first type candidate boundary position sequences is:
M_total_1 = Σ_{p=1}^{B} M_{S_{p.1.k-1.ce}}
In this embodiment of the present application, the B new candidate boundary position sequences with the highest scores may be selected from the M_total_1 new candidate boundary position sequences as the B second type candidate boundary position sequences. B may be any number between 1 and 5.
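The selection of the B highest-scoring sequences is a standard beam-pruning step; a minimal sketch, with the (score, sequence) representation being illustrative:

```python
def prune_to_beam(scored_sequences, beam_width):
    """Keep the beam_width highest-scoring candidate boundary position
    sequences; each entry is a (score, sequence) pair."""
    return sorted(scored_sequences, key=lambda item: item[0], reverse=True)[:beam_width]
```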
In an alternative embodiment, one implementation of obtaining a plurality of candidate boundary position sequences according to the text features of each sentence in the text may be:
after the second type candidate boundary position sequence is obtained according to the text characteristics of each sentence in the text, the first type candidate boundary position sequence is obtained based on the second type candidate boundary position sequence.
Wherein each candidate boundary position in the first type candidate boundary position sequences indicates a candidate start sentence or a candidate end sentence of one of the 1st through Kth valid segments in the text;
each candidate boundary position in the second type candidate boundary position sequences indicates a candidate start sentence or a candidate end sentence of one of the 1st through (K-1)th valid segments in the text, or indicates the candidate start sentence of the Kth valid segment in the text.
That is, in the embodiment of the present application, candidate start sentences and candidate end sentences are predicted valid segment by valid segment: for each valid segment, the candidate start sentence is predicted first and then the candidate end sentence, and only after the candidate start sentence of the Kth valid segment is predicted can the candidate end sentence of the Kth valid segment be predicted.
In an alternative embodiment, an implementation flowchart for obtaining the first type of candidate boundary position sequence based on the second type of candidate boundary position sequence is shown in fig. 4, and may include:
Step S401: for the candidate start sentence of the Kth valid segment indicated by a candidate boundary position in each second type candidate boundary position sequence, calculating, according to the text features of the candidate start sentence and the text features of each sentence located after it in the text, the probability that each sentence located after the candidate start sentence belongs to the end sentence of the Kth valid segment.
For convenience of description, the candidate start sentence of the Kth valid segment indicated by a candidate boundary position in the p-th second type candidate boundary position sequence is denoted as S_{p.2.k.cs}. The candidate start sentences of the Kth valid segment indicated by candidate boundary positions in different second type candidate boundary position sequences are different.
For each sentence located after the candidate start sentence S_{p.2.k.cs} in the text, the text features of that sentence may be spliced with the text features of the candidate start sentence S_{p.2.k.cs}, and the spliced text features used to calculate the probability that the sentence belongs to the end sentence of the Kth valid segment.
The second type of candidate boundary position sequence in this step is the second type of candidate boundary position sequence obtained based on the embodiment shown in fig. 3.
Step S402: for each sentence located after the candidate start sentence S_{p.2.k.cs}, calculating, according to the second type candidate boundary position sequence containing the candidate boundary position that indicates S_{p.2.k.cs} and the probability that the sentence belongs to the end sentence of the Kth valid segment, the score of the new candidate boundary position sequence obtained by adding the sentence to that second type candidate boundary position sequence.
The score of the new candidate boundary position sequence obtained by adding the sentence to the second type candidate boundary position sequence containing the candidate boundary position that indicates S_{p.2.k.cs} may be calculated using formula (1). In this case, N in the formula is the number of candidate boundary positions in the new candidate boundary position sequence, namely 2K, and P(S_j) is the probability corresponding to the j-th (j = 1, 2, 3, ..., 2K) candidate boundary position in the new candidate boundary position sequence.
Step S403: determining the first type candidate boundary position sequences from all the new candidate boundary position sequences according to the scores of all the new candidate boundary position sequences obtained based on the candidate start sentences of the Kth valid segment indicated by the candidate boundary positions in the second type candidate boundary position sequences.
The first type of candidate boundary position sequences determined at this time are different from those shown in the embodiment shown in fig. 3, and indicate candidate start sentences and candidate end sentences of each of the first K effective fragments, whereas the first type of candidate boundary position sequences in the embodiment shown in fig. 3 indicate only candidate start sentences and candidate end sentences of each of the first K-1 effective fragments.
After obtaining the first type of candidate boundary position sequences based on the embodiment shown in fig. 4, the second type of candidate boundary position sequences may be obtained again, where each boundary position in the second type of candidate boundary position sequences indicates a candidate start sentence or a candidate end sentence of one of the 1 st to K th valid fragments in the text, or indicates a candidate start sentence of the k+1st valid fragment in the text. Thereafter, a first type of candidate boundary position sequences may be acquired again, at which time each boundary position in the first type of candidate boundary position sequences indicates a candidate start sentence or a candidate end sentence of one of the 1 st to k+1 th valid fragments in the text. And so on until the beginning sentence of the valid segment is predicted to be the ending identifier in the text.
That is, in the embodiment of the present application, the first type candidate boundary position sequences and the second type candidate boundary position sequences are updated alternately: after the first type candidate boundary position sequences are obtained, they are used to obtain the second type candidate boundary position sequences; after the second type candidate boundary position sequences are obtained, they are used to obtain new first type candidate boundary position sequences; the new first type candidate boundary position sequences are then used to obtain new second type candidate boundary position sequences, and so on.
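The alternating expansion amounts to a beam search over boundary positions. Below is a minimal sketch under assumptions: `next_probs` stands in for the model's prediction of the next boundary (alternately a segment start and a segment end), `end_token` is the text's end identifier, and hypotheses are scored by average log probability in the spirit of formula (1):

```python
import math

def beam_search_boundaries(next_probs, end_token, beam_width=3):
    """Beam search that alternately extends each hypothesis with a candidate
    start sentence and a candidate end sentence. next_probs(seq) returns a
    {position: probability} map for the next boundary given the partial
    sequence seq; a start prediction equal to end_token finishes a hypothesis."""
    beams = [((), 0.0)]  # (boundary position sequence, sum of log probabilities)
    finished = []
    while beams and len(finished) < beam_width:
        candidates = []
        for seq, logp in beams:
            for pos, p in next_probs(seq).items():
                new_seq = seq + (pos,)
                new_logp = logp + math.log(p)
                # odd length -> the position just added is a start prediction
                if len(new_seq) % 2 == 1 and pos == end_token:
                    finished.append((new_seq, new_logp))
                else:
                    candidates.append((new_seq, new_logp))
        # keep the best partial hypotheses by average log probability
        candidates.sort(key=lambda c: c[1] / len(c[0]), reverse=True)
        beams = candidates[:beam_width]
    best_seq, _ = max(finished, key=lambda c: c[1] / len(c[0]))
    return best_seq[:-1]  # drop the trailing end identifier

# Toy example: a 5-sentence text (indices 0-4) with end identifier 5 and one
# valid segment spanning sentences 1..3.
def toy_next_probs(seq):
    table = {
        (): {1: 0.9, 0: 0.1},        # candidate start sentences
        (1,): {3: 0.9, 2: 0.1},      # candidate end sentences after start 1
        (0,): {3: 0.5, 2: 0.5},
        (1, 3): {5: 0.95, 4: 0.05},  # next start: likely the end identifier
    }
    return table.get(seq, {5: 1.0})
```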
In the foregoing embodiment, the obtained target text may contain redundant information, such as duplicate information, which also reduces the readability of the target text. Although redundant information can be deleted manually, manual deletion is inefficient. Therefore, to further improve text normalization efficiency, the text processing method provided in the embodiment of the present application may further include:
determining redundant sentences in the target text; redundant sentences are repeated sentences in the target text. For example, if there are L identical sentences in the target text, then L-1 of those L identical sentences are redundant sentences.
and deleting the redundant sentences in the target text.
In this embodiment, by automatically determining and deleting redundant sentences in the target text, automatic normalization of the target text is achieved, further improving text normalization efficiency.
In an alternative embodiment, a flowchart of an implementation of determining redundant sentences in the target text is shown in fig. 5, which may include:
step S501: for each sentence in the target text, acquiring the probability that the sentence belongs to a redundant sentence; the probability that the sentence belongs to a redundant sentence is calculated based on the text characteristics of the sentence and the text characteristics of the starting sentence of the valid segment in which the sentence is located.
The text features of the sentence and the text features of the initial sentence of the effective fragment where the sentence is located can be spliced, and the probability that the sentence belongs to the redundant sentence is calculated by using the text features obtained by splicing.
In the embodiment of the present application, the probability that all sentences in the text belong to redundant sentences may be obtained in advance, and after the target text is obtained, the probability that each sentence in the target text belongs to redundant sentences may be directly read, or the probability that each sentence in the target text belongs to redundant sentences may be calculated after the target text is obtained.
Step S502: if the probability that the sentence belongs to the redundant sentence is larger than the probability threshold value, determining the sentence as the redundant sentence.
If the probability that the sentence belongs to the redundant sentence is less than or equal to the probability threshold, it is determined that the sentence does not belong to the redundant sentence.
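Steps S501-S502 reduce to a simple thresholding rule; a minimal sketch, with the default threshold value being illustrative:

```python
def find_redundant_sentences(redundancy_probs, threshold=0.5):
    """Indices of sentences whose redundancy probability strictly exceeds
    the threshold; all other sentences are kept as non-redundant."""
    return [i for i, p in enumerate(redundancy_probs) if p > threshold]
```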
In an alternative embodiment, for each sentence in the text, one implementation flowchart for obtaining text features of the sentence is shown in fig. 6, and may include:
step S601: the word vector of each word in the sentence and the encoding of the position of each word in the sentence are obtained.
Alternatively, the position encoding of the word in the sentence may be sinusoidal position encoding of the word in the sentence.
Step S602: obtaining the characterization vector of the sentence according to the word vector of each word and the position code of each word in the sentence.
Alternatively, the word vector and the position code corresponding to the same word may be added to obtain an initial code corresponding to the word, and the initial code of each word in the sentence is processed to obtain a word representation related to the context of each word in the sentence.
The word representations of the words in the sentence can be spliced, and the spliced vector used as the characterization vector of the sentence.
Alternatively, the word representations of all words in the sentence can be spliced, and the spliced vector compressed into a vector of a preset length, which is used as the characterization vector of the sentence.
Step S603: a code of a position of the sentence in the text is obtained. The position encoding of the sentence in the text may be a sinusoidal position encoding of the sentence in the text.
Step S604: obtaining the text features of the sentence according to the characterization vector of the sentence and the position code of the sentence.
The token vector of the sentence may be added to the position encoding of the sentence to obtain an initial text feature of the sentence, and the initial text feature of the sentence may be processed to obtain a context-dependent text feature of the sentence.
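Steps S601-S604 can be illustrated with the standard sinusoidal position encoding and element-wise addition; this is a sketch only, since the actual model dimensions and encoders are not specified in the text:

```python
import math

def sinusoidal_position_encoding(pos, dim):
    """Standard sinusoidal position encoding, applicable both to word
    positions within a sentence and to sentence positions within the text."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]

def initial_encoding(vector, pos):
    """Add a word vector (or sentence characterization vector) and its
    position encoding element-wise to obtain the initial encoding."""
    pe = sinusoidal_position_encoding(pos, len(vector))
    return [v + e for v, e in zip(vector, pe)]
```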
In an optional embodiment, the processing each sentence in the text according to the text feature of each sentence in the text to obtain the boundary position sequence may be implemented as follows:
processing each sentence in the text according to the text features of each sentence in the text by using a text processing model, to obtain the boundary position sequence.
In an alternative embodiment, a schematic structural diagram of a text processing model provided in the embodiment of the present application is shown in fig. 7, and may include:
an encoding module 701 and a decoding module 702; wherein:
the encoding module 701 is configured to encode each sentence in the text, so as to obtain a text feature of each sentence. Reference is made to the foregoing embodiments for a specific encoding process, which is not described in detail herein.
The decoding module 702 is configured to process each sentence in the text according to the text feature of each sentence in the text, so as to obtain a boundary position sequence. Reference is made to the foregoing embodiments for specific processing, which will not be described in detail herein.
In this embodiment of the present application, the text processing model may adopt a cascaded Transformer structure. Compared with an LSTM structure, the Transformer structure calculates, through a self-attention mechanism, the relationship between the current word or sentence and all other words or sentences, that is, it can see the global context, which is very important for grasping the subject matter of the text and for judging whether the current sentence is information-related. Considering that the computational complexity of the self-attention mechanism is O(c^2), where c is the number of words in the text, directly modeling the relationships between words is unacceptable for long documents. Therefore, this scheme uses a cascaded Transformer structure, i.e., a word-level Transformer encoder cascaded with a sentence-level Transformer encoder, in which the self-attention computational complexity is O(m^2 + n^2), where m is the number of words in a sentence (which may be taken as the average number of words over all sentences in the text) and n is the number of sentences in the text. This greatly reduces the computational complexity.
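The complexity claim can be made concrete by counting query-key pairs. This is a rough illustration: the text states the per-level complexity as O(m^2 + n^2), while the totals below count all n sentences of m words each:

```python
def self_attention_costs(num_sentences, words_per_sentence):
    """Count query-key pairs: a flat word-level Transformer over all
    c = n*m words, versus the cascaded word-level + sentence-level design
    (each sentence attends within its own m words, then the n sentence
    vectors attend over each other)."""
    n, m = num_sentences, words_per_sentence
    c = n * m
    flat = c * c                   # O(c^2)
    cascaded = n * m * m + n * n   # n sentence blocks of m^2 pairs, plus n^2
    return flat, cascaded
```

For example, a 100-sentence text with 20 words per sentence gives 4,000,000 pairs for the flat encoder but only 50,000 for the cascaded one.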
In an alternative embodiment, the training process of the text processing model may include:
and inputting the first type text sample into a text processing model to obtain a boundary position sequence corresponding to the first type text sample output by the text processing model. Each boundary position in the boundary position sequence corresponding to the first type text sample indicates a start sentence or an end sentence of one valid segment in the first type text sample, wherein the start sentence of the Kth valid segment in the first type text sample is determined based on the end sentence of the (K-1)th valid segment in the first type text sample, and the end sentence of the Kth valid segment in the first type text sample is determined based on the start sentence of the Kth valid segment in the first type text sample.
In an alternative embodiment, the end sentence of the (K-1)th valid segment used in determining the start sentence of the Kth valid segment in the first type text sample may be predicted by the text processing model, and the start sentence of the Kth valid segment used in determining the end sentence of the Kth valid segment may likewise be predicted by the text processing model.
Alternatively,
the end sentence of the (K-1)th valid segment used in determining the start sentence of the Kth valid segment in the first type text sample may be determined according to the boundary position sequence label corresponding to the first type text sample, and the start sentence of the Kth valid segment used in determining the end sentence of the Kth valid segment may be determined according to the boundary position sequence label corresponding to the first type text sample.
And updating parameters of the text processing model with the goal of making the boundary position sequence corresponding to the first type text sample approach the boundary position sequence label corresponding to the first type text sample.
The difference between the boundary position sequence corresponding to the first type text sample and the boundary position sequence label corresponding to the first type text sample can be calculated using a cross-entropy loss function; a back-propagation gradient is obtained based on the difference, and the parameters of the text processing model are updated based on the back-propagation gradient.
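A minimal sketch of the cross-entropy computation over a labeled boundary position sequence, assuming the model outputs, for each boundary slot, a probability distribution over sentence indices (the representation is an assumption, not the patent's exact formulation):

```python
import math

def boundary_cross_entropy(predicted_dists, label_indices):
    """Average negative log-likelihood of the labeled boundary positions under
    the model's predicted distributions; minimizing this pushes the predicted
    boundary position sequence toward the label sequence."""
    nll = -sum(math.log(dist[idx]) for dist, idx in zip(predicted_dists, label_indices))
    return nll / len(label_indices)
```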
In an alternative embodiment, the first type of text sample may be obtained by inserting an invalid segment into the valid text, or the first type of text sample may be the originally collected text containing the valid segment and the invalid segment. Based on this, the training data set for training the text processing model may contain only the first type of text samples obtained by inserting the invalid segments into the valid text, or may contain only the first type of text samples originally collected containing the valid segments and the invalid segments, or may contain both the first type of text samples obtained by inserting the invalid segments into the valid text and the first type of text samples originally collected containing the valid segments and the invalid segments.
In an alternative embodiment, the first type text is obtained by inserting an invalid segment and a redundant segment into the valid text, or the first type text sample is an originally collected text containing the valid segment, the invalid segment and the redundant segment. Based on this, the training data set for training the text processing model may contain only the first type of text samples obtained by inserting the invalid segments and the redundant segments in the valid text, or may contain only the originally collected first type of text samples containing the valid segments, the invalid segments, and the redundant segments, or may contain both the first type of text samples obtained by inserting the invalid segments and the redundant segments in the valid text and the originally collected first type of text samples containing the valid segments, the invalid segments, and the redundant segments.
In an alternative embodiment, an implementation manner of inserting the invalid segment in the valid text may be:
the M positions are randomly selected as M insertion positions in the valid text. That is, in the embodiment of the present application, M invalid segments are inserted in the valid text. Alternatively, the insertion position may be the position where the sentence-breaking mark is located, and the person-difference position may be before or after the sentence-breaking mark. As an example, M may be any one of values between 1 and 6. However, this is by way of illustration only and is not intended to limit the scope of the present invention.
M consecutive segments are randomly selected from a public corpus as the M invalid segments. The public corpus may be an open-source casual-chat corpus, such as short-message chat records; it may contain corpora from various fields, usually everyday dialogue data, and its fields differ from that of the valid text.
And inserting M invalid fragments into M insertion positions in a one-to-one correspondence manner. That is, one inactive segment is inserted at each insertion position, and inactive segments inserted at different insertion positions are different.
After the invalid segments are inserted into the valid text, the start sentence and end sentence of each valid segment may be labeled, for example, the start sentence labeled 1 and the end sentence labeled 2. Since the inserted invalid segments are completely unrelated to the valid segments, the inserted invalid segments do not need to be labeled.
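A minimal sketch of this sample construction; the boolean validity flags are illustrative (the start-sentence/end-sentence labels 1 and 2 described above can be derived from them):

```python
import random

def insert_invalid_segments(valid_sentences, invalid_segments, seed=0):
    """Insert each invalid segment (a list of sentences taken from a public
    corpus) at a distinct, randomly chosen sentence boundary of the valid
    text. Returns (sentences, flags), where flag True marks a valid sentence."""
    rng = random.Random(seed)
    sample = [(s, True) for s in valid_sentences]
    positions = rng.sample(range(len(valid_sentences) + 1), len(invalid_segments))
    # insert from the rightmost boundary first so earlier indices stay valid
    for pos, seg in sorted(zip(positions, invalid_segments),
                           key=lambda item: item[0], reverse=True):
        sample[pos:pos] = [(s, False) for s in seg]
    sentences = [s for s, _ in sample]
    flags = [flag for _, flag in sample]
    return sentences, flags
```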
In an alternative embodiment, an implementation manner of inserting the redundant segment in the valid text may be:
the Q positions are randomly selected in the valid text as Q insertion positions. That is, in the embodiment of the present application, Q redundant segments are inserted in the valid text. Alternatively, the insertion position may be the position where the sentence-breaking mark is located, and the person-difference position may be before or after the sentence-breaking mark. By way of example, Q may be any number between 1 and 6. However, this is by way of illustration only and is not intended to limit the scope of the present invention.
For each insertion position, a continuous segment of a predetermined length is selected in the region adjacent to the insertion position. Optionally, a continuous valid segment of a predetermined length may be selected for copying within a window of a predetermined size before the insertion position. For example, 1 to 10 consecutive sentences are selected from the 20 sentences before the insertion position.
Noise is added to the selected continuous segment of predetermined length, and the result is inserted at the insertion position. The noise may be added by deleting, replacing, or inserting low-information words according to TF-IDF (Term Frequency-Inverse Document Frequency), or by EDA (Easy Data Augmentation) style operations (i.e., deletion and insertion of text fragments, and synonym replacement).
The segments obtained by adding noise to the selected continuous segments of predetermined length are the redundant segments.
Typically, the insertion positions of the invalid segments and the redundant segments in the valid text are different. Since the redundant segments are related to the information of the valid segments and are interspersed among the valid segments, they do not affect the continuity of valid-segment prediction. Therefore, their start and end sentences need not be labeled; instead, each sentence in a redundant segment is simply given a redundancy label, for example, each sentence in a redundant segment is labeled 3.
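A hypothetical sketch of building one redundant segment by copy-and-noise; only random word deletion is shown, while TF-IDF-weighted replacement/insertion and synonym replacement would be analogous:

```python
import random

def make_redundant_segment(preceding_sentences, rng=None):
    """Copy 1-10 consecutive sentences from the window of (up to) 20 sentences
    before the insertion position, then noise each copied sentence by deleting
    one randomly chosen word (an EDA-style deletion)."""
    rng = rng or random.Random(0)
    window = preceding_sentences[-20:]
    length = rng.randint(1, min(10, len(window)))
    start = rng.randint(0, len(window) - length)
    copied = window[start:start + length]
    noised = []
    for sentence in copied:
        words = sentence.split()
        if len(words) > 1:
            del words[rng.randrange(len(words))]  # random word deletion
        noised.append(" ".join(words))
    return noised
```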
In an alternative embodiment, another training process of the text processing model may include:
inputting the first type text sample into a text processing model to obtain a boundary position sequence corresponding to the first type text sample output by the text processing model; each boundary position in the boundary position sequence corresponding to the first type text sample indicates a start sentence or an end sentence of one valid segment in the first type text sample, wherein the start sentence of the Kth valid segment in the first type text sample is determined based on the end sentence of the (K-1)th valid segment in the first type text sample, and the end sentence of the Kth valid segment in the first type text sample is determined based on the start sentence of the Kth valid segment in the first type text sample.
Updating parameters of the text processing model with the goal of making the boundary position sequence corresponding to the first type text sample approach its boundary position sequence label, to obtain an initial text processing model; the first type text sample is obtained by inserting at least invalid segments into valid text.
In an alternative embodiment, the end sentence of the (K-1)th valid segment used in determining the start sentence of the Kth valid segment in the first type text sample may be predicted by the text processing model, and the start sentence of the Kth valid segment used in determining the end sentence of the Kth valid segment may likewise be predicted by the text processing model.
Alternatively,
the end sentence of the (K-1)th valid segment used in determining the start sentence of the Kth valid segment in the first type text sample may be determined according to the boundary position sequence label corresponding to the first type text sample, and the start sentence of the Kth valid segment used in determining the end sentence of the Kth valid segment may be determined according to the boundary position sequence label corresponding to the first type text sample.
In an alternative embodiment, the first type of text sample may be obtained by inserting an invalid segment into the valid text, or the first type of text sample may be obtained by inserting an invalid segment and a redundant segment into the valid text.
The difference between the boundary position sequence corresponding to the first type text sample and the boundary position sequence label corresponding to the first type text sample can be calculated using a cross-entropy loss function; a back-propagation gradient is obtained based on the difference, and the parameters of the text processing model are updated based on the back-propagation gradient, thereby obtaining the initial text processing model.
Inputting the second type text sample into the initial text processing model to obtain a boundary position sequence corresponding to the second type text sample output by the initial text processing model; the second type text sample is originally collected text containing at least valid segments and invalid segments.
Optionally, if the first type text sample is obtained by inserting an invalid segment into the valid text, the second type text sample may be originally collected text containing valid segments and invalid segments.
If the first type text sample is obtained by inserting an invalid segment and a redundant segment into the valid text, the second type text sample may be originally collected text containing valid segments, invalid segments and redundant segments.
And updating the parameters of the initial text processing model with the goal of making the boundary position sequence corresponding to the second type text sample approach the boundary position sequence label corresponding to the second type text sample.
The difference between the boundary position sequence corresponding to the second type text sample and the boundary position sequence label corresponding to the second type text sample may be calculated using the cross-entropy loss function, a back-propagation gradient is obtained based on the difference, and the parameters of the initial text processing model are updated based on the back-propagation gradient, so as to obtain the final text processing model.
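The two-stage schedule — first training on first type samples to obtain the initial model, then continuing on second type samples to obtain the final model — might look like the following skeleton; `model`, `loss_fn` and `update_fn` are placeholders for the reader's own framework objects, not part of the patent:

```python
def train_two_stage(model, first_type_batches, second_type_batches,
                    loss_fn, update_fn, epochs=1):
    """Sketch of the two-stage training described above: the same parameters
    are first fit on synthetic first type samples, then fine-tuned on
    originally collected second type samples.

    model    : callable mapping a text to a predicted boundary sequence
    loss_fn  : loss_fn(pred_seq, label_seq) -> scalar loss (e.g. cross-entropy)
    update_fn: performs back-propagation and the parameter update for one loss
    """
    for stage_batches in (first_type_batches, second_type_batches):
        for _ in range(epochs):
            for text, label_seq in stage_batches:
                pred_seq = model(text)               # predicted boundary sequence
                loss = loss_fn(pred_seq, label_seq)  # difference to the label
                update_fn(loss)                      # gradient step
    return model
```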
In an alternative embodiment, processing each sentence in the text according to the text characteristics of each sentence in the text to obtain the boundary position sequence, and determining the redundant sentences in the text, may include:
and processing each sentence in the text according to the text characteristics of each sentence in the text by using the text processing model to obtain a boundary position sequence and a redundant sentence identification result.
In an alternative embodiment, the training process of the text processing model may include:
inputting the first type text sample into a text processing model to obtain a boundary position sequence corresponding to the first type text sample output by the text processing model and a redundant sentence recognition result in the first type text sample. Each boundary position in the boundary position sequence corresponding to the first type text sample indicates a start sentence or an end sentence of one valid segment in the first type text sample, wherein the start sentence of the K-th valid segment in the first type text sample is determined based on the end sentence of the (K-1)-th valid segment in the first type text sample, and the end sentence of the K-th valid segment in the first type text sample is determined based on the start sentence of the K-th valid segment in the first type text sample.
Optionally, the end sentence of the (K-1)-th valid segment used in determining the start sentence of the K-th valid segment in the first type text sample may be predicted by the text processing model, and the start sentence of the K-th valid segment used in determining the end sentence of the K-th valid segment may be predicted by the text processing model.
Alternatively,
the end sentence of the (K-1)-th valid segment used in determining the start sentence of the K-th valid segment in the first type text sample may be determined according to the boundary position sequence label corresponding to the first type text sample, and the start sentence of the K-th valid segment used in determining the end sentence of the K-th valid segment may be determined according to the boundary position sequence label corresponding to the first type text sample.
And updating the parameters of the text processing model with the goal of making the boundary position sequence corresponding to the first type text sample approach the boundary position sequence label corresponding to the first type text sample, and making the redundant sentence identification result corresponding to the first type text sample approach the redundant sentence label corresponding to the first type text sample.
A first difference between the boundary position sequence corresponding to the first type text sample and the boundary position sequence label corresponding to the first type text sample, and a second difference between the redundant sentence identification result corresponding to the first type text sample and the redundant sentence label corresponding to the first type text sample, may be calculated using a cross-entropy loss function; a back-propagation gradient is obtained based on the first difference and the second difference, and the parameters of the text processing model are updated based on the back-propagation gradient.
Wherein the first type text sample is text obtained by inserting invalid segments and redundant segments into valid text, or the first type text sample is originally collected text containing valid segments, invalid segments and redundant segments. Accordingly, the training data set for training the text processing model may contain only first type text samples obtained by inserting invalid segments and redundant segments into valid text, only originally collected first type text samples containing valid segments, invalid segments and redundant segments, or both kinds of first type text samples.
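A minimal sketch of a joint objective consistent with this description: a cross-entropy term over the redundant sentence labels combined with the boundary loss into one scalar from which a single back-propagation gradient is taken. The weighted sum and the weight `alpha` are our assumptions; the patent only states that the gradient is obtained based on both differences:

```python
import math

def binary_cross_entropy(probs, labels):
    """Redundant-sentence term: probs[i] is the predicted probability that
    sentence i is redundant, labels[i] is 0 or 1."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)          # clamp away from 0 and 1
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

def joint_loss(boundary_ce, redundant_ce, alpha=1.0):
    """Combine the first difference (boundary loss) and the second difference
    (redundant-sentence loss) into one objective. The weighting is assumed."""
    return boundary_ce + alpha * redundant_ce
```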
In an alternative embodiment, another training process of the text processing model may include:
inputting the first type text sample into a text processing model to obtain a boundary position sequence corresponding to the first type text sample output by the text processing model and a redundant sentence recognition result in the first type text sample.
And updating the parameters of the text processing model with the goal of making the boundary position sequence corresponding to the first type text sample approach the boundary position sequence label corresponding to the first type text sample, and making the redundant sentence identification result corresponding to the first type text sample approach the redundant sentence label corresponding to the first type text sample, so as to obtain an initial text processing model. Wherein the first type text sample is obtained by inserting an invalid segment and a redundant segment into valid text.
And inputting the second type text sample into the initial text processing model to obtain a boundary position sequence corresponding to the second type text sample output by the initial text processing model and a redundant sentence recognition result in the second type text sample. Wherein the second type text sample is originally collected text containing invalid segments and redundant segments.
And updating the parameters of the initial text processing model with the goal of making the boundary position sequence corresponding to the second type text sample approach the boundary position sequence label corresponding to the second type text sample, and making the redundant sentence identification result corresponding to the second type text sample approach the redundant sentence label corresponding to the second type text sample, so as to obtain a final text processing model.
In an alternative embodiment, processing each sentence in the text according to the text characteristics of each sentence in the text to obtain the boundary position sequence, and determining the redundant sentences in the text, may include:
and processing each sentence in the text according to the text characteristics of each sentence in the text by using the text processing model to obtain a boundary position sequence, obtaining the valid segments in the text based on the boundary position sequence to form a target text, and performing redundant sentence recognition on the target text to obtain a redundant sentence recognition result.
In an alternative embodiment, the training process of the text processing model may include:
inputting the first type text sample into a text processing model to obtain a boundary position sequence corresponding to the first type text sample output by the text processing model, obtaining the valid segments in the first type text sample based on the boundary position sequence to form a target text, and performing redundant sentence recognition on the target text to obtain a redundant sentence recognition result. Each boundary position in the boundary position sequence corresponding to the first type text sample indicates a start sentence or an end sentence of one valid segment in the first type text sample, wherein the start sentence of the K-th valid segment in the first type text sample is determined based on the end sentence of the (K-1)-th valid segment in the first type text sample, and the end sentence of the K-th valid segment in the first type text sample is determined based on the start sentence of the K-th valid segment in the first type text sample.
Optionally, the end sentence of the (K-1)-th valid segment used in determining the start sentence of the K-th valid segment in the first type text sample may be predicted by the text processing model, and the start sentence of the K-th valid segment used in determining the end sentence of the K-th valid segment may be predicted by the text processing model.
Alternatively,
the end sentence of the (K-1)-th valid segment used in determining the start sentence of the K-th valid segment in the first type text sample may be determined according to the boundary position sequence label corresponding to the first type text sample, and the start sentence of the K-th valid segment used in determining the end sentence of the K-th valid segment may be determined according to the boundary position sequence label corresponding to the first type text sample.
And updating the parameters of the text processing model with the goal of making the boundary position sequence corresponding to the first type text sample approach the boundary position sequence label corresponding to the first type text sample, and making the redundant sentence identification result corresponding to the first type text sample approach the redundant sentence label corresponding to the first type text sample.
A first difference between the boundary position sequence corresponding to the first type text sample and the boundary position sequence label corresponding to the first type text sample, and a second difference between the redundant sentence identification result corresponding to the first type text sample and the redundant sentence label corresponding to the first type text sample, may be calculated using a cross-entropy loss function; a back-propagation gradient is obtained based on the first difference and the second difference, and the parameters of the text processing model are updated based on the back-propagation gradient.
Wherein the first type text sample is text obtained by inserting invalid segments and redundant segments into valid text, or the first type text sample is originally collected text containing valid segments, invalid segments and redundant segments. Accordingly, the training data set for training the text processing model may contain only first type text samples obtained by inserting invalid segments and redundant segments into valid text, only originally collected first type text samples containing valid segments, invalid segments and redundant segments, or both kinds of first type text samples.
In an alternative embodiment, another training process of the text processing model may include:
inputting the first type text sample into a text processing model to obtain a boundary position sequence corresponding to the first type text sample output by the text processing model, acquiring an effective fragment in the first type text sample based on the boundary position sequence to form a target text, and performing redundant sentence recognition on the target text to obtain a redundant sentence recognition result corresponding to the first type text sample.
And updating the parameters of the text processing model with the goal of making the boundary position sequence corresponding to the first type text sample approach the boundary position sequence label corresponding to the first type text sample, and making the redundant sentence identification result corresponding to the first type text sample approach the redundant sentence label corresponding to the first type text sample, so as to obtain an initial text processing model.
Inputting the second type text sample into the initial text processing model to obtain a boundary position sequence corresponding to the second type text sample output by the initial text processing model, acquiring effective fragments in the second type text sample based on the boundary position sequence to form a target text, and performing redundant sentence recognition on the target text to obtain a redundant sentence recognition result corresponding to the second type text sample.
And updating the parameters of the initial text processing model with the goal of making the boundary position sequence corresponding to the second type text sample approach the boundary position sequence label corresponding to the second type text sample, and making the redundant sentence identification result corresponding to the second type text sample approach the redundant sentence label corresponding to the second type text sample, so as to obtain a final text processing model.
Corresponding to the method embodiment, the embodiment of the present application further provides a text processing device, and a schematic structural diagram of the text processing device provided in the embodiment of the present application is shown in fig. 8, which may include:
A boundary position sequence acquisition module 801 and a target text acquisition module 802; wherein:
the boundary position sequence obtaining module 801 is configured to process each sentence in the text according to the text feature of each sentence in the text, so as to obtain a boundary position sequence; each boundary position in the boundary position sequence indicates a start sentence or an end sentence of one valid segment, wherein the start sentence of the K-th valid segment is determined based on the end sentence of the (K-1)-th valid segment, and the end sentence of the K-th valid segment is determined based on the start sentence of the K-th valid segment; K is a positive integer;
the target text obtaining module 802 is configured to obtain, based on the sequence of boundary positions, valid segments in the text to form a target text.
According to the text processing device provided by the embodiment of the application, each sentence in the text is processed according to the text characteristics of each sentence in the text, so that a boundary position sequence is obtained; each boundary position in the boundary position sequence indicates a start sentence or an end sentence of one valid segment, wherein the start sentence of the K-th valid segment is determined based on the end sentence of the (K-1)-th valid segment, and the end sentence of the K-th valid segment is determined based on the start sentence of the K-th valid segment; based on the boundary position sequence, the valid segments in the text are obtained to form the target text. Automatic extraction of the valid segments in the text is thereby realized, and the efficiency of text regularization is improved.
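For illustration, once a boundary position sequence is available, forming the target text reduces to slicing out the sentence ranges delimited by consecutive (start, end) index pairs. This sketch assumes boundary positions are expressed as 0-based sentence indices, which the patent does not specify:

```python
def extract_target_text(sentences, boundary_positions):
    """Assemble the target text from a boundary position sequence whose
    entries alternate between the start and end sentence indices of
    successive valid segments."""
    target = []
    # Pair up even-indexed (start) and odd-indexed (end) boundary positions.
    for start, end in zip(boundary_positions[0::2], boundary_positions[1::2]):
        target.extend(sentences[start:end + 1])
    return target
```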
In an alternative embodiment, the boundary position sequence acquisition module 801 includes:
the candidate boundary position sequence acquisition module is used for acquiring a plurality of candidate boundary position sequences according to the text characteristics of each sentence in the text;
the score calculation module is used for calculating the score of each candidate boundary position sequence, wherein the score of a candidate boundary position sequence represents the confidence of that candidate boundary position sequence, and a higher score represents a higher confidence;
and the boundary position sequence determining module is used for taking the candidate boundary position sequence with the highest score as the boundary position sequence.
In an alternative embodiment, the candidate boundary position sequence obtaining module may specifically be configured to:
acquiring a first type candidate boundary position sequence according to the text characteristics of each sentence in the text, and then acquiring a second type candidate boundary position sequence based on the first type candidate boundary position sequence;
each candidate boundary position in the first class of candidate boundary position sequences indicates a candidate start sentence or a candidate end sentence of one of 1 st through K-1 st valid segments in the text;
each boundary position in the second class of candidate boundary position sequences indicates a candidate start sentence or a candidate end sentence of one of the 1 st through K-1 st valid segments in the text, or indicates a candidate start sentence of the K-th valid segment in the text.
In an optional embodiment, the candidate boundary position sequence obtaining module is specifically configured to, when obtaining the second type of candidate boundary position sequence based on the first type of candidate boundary position sequence:
for the candidate end sentence of the (K-1)-th valid segment indicated by a candidate boundary position in each first type candidate boundary position sequence, calculating, according to the text characteristics of the candidate end sentence and the text characteristics of each sentence located after the candidate end sentence in the text, the probability that each sentence located after the candidate end sentence belongs to the start sentence of the K-th valid segment;
for each sentence located after the candidate ending sentence, calculating a score of a new candidate boundary position sequence obtained by adding the sentence into the first type candidate boundary position sequence indicating the candidate boundary position of the candidate ending sentence according to the first type candidate boundary position sequence indicating the candidate boundary position of the candidate ending sentence and the probability that the sentence belongs to the starting sentence of the Kth effective fragment;
and determining the second type candidate boundary position sequences among all new candidate boundary position sequences according to the scores of all new candidate boundary position sequences obtained based on the candidate end sentences of the (K-1)-th valid segment indicated by candidate boundary positions in the first type candidate boundary position sequences.
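The candidate-sequence expansion performed by this module resembles one step of a beam search: each kept partial sequence is extended by every plausible next boundary position, the extended sequences are scored, and only the best few are retained. A generic sketch under that reading (the probability source `prob_fn` and the log-probability scoring are assumptions):

```python
import math

def expand_beam(beams, prob_fn, beam_width=4):
    """One expansion step of the alternating boundary search.

    beams   : list of (positions, score) pairs, where `positions` is a partial
              boundary position sequence and `score` its accumulated
              log-probability
    prob_fn : prob_fn(positions) -> {sentence_index: probability} over the
              sentences after the last boundary in `positions`
    Returns the `beam_width` highest-scoring extended sequences.
    """
    candidates = []
    for positions, score in beams:
        for idx, p in prob_fn(positions).items():
            # Extend the sequence by one boundary and update its score.
            candidates.append((positions + [idx], score + math.log(p)))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]
```

Calling this alternately with a start-sentence `prob_fn` and an end-sentence `prob_fn` mirrors the first-type/second-type alternation described by the module.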
In an alternative embodiment, the candidate boundary position sequence obtaining module may specifically be configured to:
acquiring a first type candidate boundary position sequence based on the second type candidate boundary position sequence after acquiring the second type candidate boundary position sequence according to the text characteristics of each sentence in the text;
each candidate boundary position in the first class of candidate boundary position sequences indicates a candidate start sentence or a candidate end sentence of one of 1 st through K-th valid fragments in the text;
each boundary position in the second class of candidate boundary position sequences indicates a candidate start sentence or a candidate end sentence of one of the 1 st through K-1 st valid segments in the text, or indicates a candidate start sentence of the K-th valid segment in the text.
In an alternative embodiment, the candidate boundary position sequence obtaining module may specifically be configured to:
for the candidate start sentence of the K-th valid segment indicated by a candidate boundary position in each second type candidate boundary position sequence, calculating, according to the text characteristics of the candidate start sentence and the text characteristics of each sentence located after the candidate start sentence in the text, the probability that each sentence located after the candidate start sentence belongs to the end sentence of the K-th valid segment;
For each sentence located after the candidate start sentence, calculating a score of a new candidate boundary position sequence obtained by adding the sentence into the second type candidate boundary position sequence indicating the candidate boundary position of the candidate start sentence according to the second type candidate boundary position sequence indicating the candidate boundary position of the candidate start sentence and the probability of the sentence belonging to the ending sentence of the Kth effective fragment;
and determining a first type candidate boundary position sequence in all new candidate boundary position sequences according to the scores of all new candidate boundary position sequences obtained based on the candidate start sentences of the Kth valid segment indicated by the candidate boundary positions in each second type candidate boundary position sequence.
In an optional embodiment, the text processing apparatus provided in the embodiment of the present application may further include:
a redundant sentence determining module, configured to determine a redundant sentence in the target text;
and the deleting module is used for deleting redundant sentences in the target text.
In an alternative embodiment, the redundant sentence determination module is specifically configured to:
for each sentence in the target text, acquiring the probability that the sentence belongs to a redundant sentence; the probability that the sentence belongs to the redundant sentence is calculated according to the text characteristics of the sentence and the text characteristics of the initial sentence of the effective fragment where the sentence is located; if the probability that the sentence belongs to the redundant sentence is larger than the probability threshold value, determining the sentence as the redundant sentence.
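A minimal sketch of this thresholding rule; `prob_fn` stands in for the model score that the patent computes from the sentence's text characteristics and the text characteristics of the start sentence of the valid segment containing it:

```python
def find_redundant(sentences, segment_starts, prob_fn, threshold=0.5):
    """Return the indices of sentences judged redundant.

    sentences      : sentences of the target text
    segment_starts : segment_starts[i] is the index of the start sentence of
                     the valid segment containing sentence i
    prob_fn        : prob_fn(sentence, start_sentence) -> redundancy
                     probability (a stand-in for the model's score)
    """
    return [i for i, s in enumerate(sentences)
            if prob_fn(s, sentences[segment_starts[i]]) > threshold]
```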
In an optional embodiment, the text processing apparatus provided in the embodiment of the present application may further include:
the text feature acquisition module is used for acquiring, for each sentence in the text, a word vector of each word in the sentence and a code of a position of each word in the sentence; obtaining a representation vector of the sentence according to the word vector of each word and the position code of each word in the sentence; acquiring the codes of the positions of the sentences in the text; and obtaining the text characteristics of the sentence according to the characterization vector of the sentence and the position code of the sentence.
In an alternative embodiment, the boundary position sequence obtaining module 801 is specifically configured to: and processing each sentence in the text by using a text processing model according to the text characteristics of each sentence in the text to obtain a boundary position sequence.
In an optional embodiment, the text processing apparatus provided in the embodiment of the present application may further include: a first model training module for:
inputting a first type text sample into the text processing model to obtain a boundary position sequence corresponding to the first type text sample output by the text processing model; each boundary position in the boundary position sequence corresponding to the first type text sample indicates a start sentence or an end sentence of one valid segment in the first type text sample, wherein the start sentence of the K-th valid segment in the first type text sample is determined based on the end sentence of the (K-1)-th valid segment in the first type text sample, and the end sentence of the K-th valid segment in the first type text sample is determined based on the start sentence of the K-th valid segment in the first type text sample;
And updating the parameters of the text processing model with the goal of making the boundary position sequence corresponding to the first type text sample approach the boundary position sequence label corresponding to the first type text sample.
In an alternative embodiment, the end sentence of the (K-1)-th valid segment in the first type text sample is determined based on the boundary position sequence label corresponding to the first type text sample, and the start sentence of the K-th valid segment in the first type text sample is determined based on the boundary position sequence label corresponding to the first type text sample.
In an optional embodiment, the first type text sample is obtained by inserting an invalid segment into the valid text, or the first type text sample is an originally acquired text containing the valid segment and the invalid segment;
alternatively,
the first type text sample is obtained by inserting an invalid segment and a redundant segment into the valid text, or the first type text sample is originally acquired text containing the valid segment, the invalid segment and the redundant segment.
In an optional embodiment, the text processing apparatus provided in the embodiment of the present application may further include: a second model training module for:
Inputting a first type text sample into the text processing model to obtain a boundary position sequence corresponding to the first type text sample output by the text processing model; each boundary position in the boundary position sequence corresponding to the first type text sample indicates a start sentence or an end sentence of one valid segment in the first type text sample, wherein the start sentence of the K-th valid segment in the first type text sample is determined based on the end sentence of the (K-1)-th valid segment in the first type text sample, and the end sentence of the K-th valid segment in the first type text sample is determined based on the start sentence of the K-th valid segment in the first type text sample;
updating the parameters of the text processing model with the goal of making the boundary position sequence corresponding to the first type text sample approach the boundary position sequence label corresponding to the first type text sample, so as to obtain an initial text processing model; the first type text sample is obtained by inserting at least an invalid segment into valid text;
inputting a second type text sample into the initial text processing model to obtain a boundary position sequence corresponding to the second type text sample output by the initial text processing model; the second type text sample is originally acquired text at least comprising a valid segment and an invalid segment;
And updating the parameters of the initial text processing model with the goal of making the boundary position sequence corresponding to the second type text sample approach the boundary position sequence label corresponding to the second type text sample.
The text processing device provided by the embodiment of the application can be applied to text processing equipment, such as a PC terminal, a cloud platform, a server cluster and the like. Alternatively, fig. 9 shows a block diagram of a hardware structure of the text processing apparatus, and referring to fig. 9, the hardware structure of the text processing apparatus may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete communication with each other through the communication bus 4;
processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, etc.;
the memory 3 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory, for example at least one magnetic disk memory;
Wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
processing each sentence in the text according to the text characteristics of each sentence in the text to obtain a boundary position sequence; each boundary position in the boundary position sequence indicates a start sentence or an end sentence of one valid segment, wherein the start sentence of the K-th valid segment is determined based on the end sentence of the (K-1)-th valid segment, and the end sentence of the K-th valid segment is determined based on the start sentence of the K-th valid segment; K is a positive integer;
and acquiring effective fragments in the text based on the boundary position sequence to form target text.
Optionally, for the refined functions and extended functions of the program, reference may be made to the description above.
The embodiment of the application also provides a storage medium, which may store a program adapted to be executed by a processor, the program being configured to:
processing each sentence in the text according to the text characteristics of each sentence in the text to obtain a boundary position sequence; each boundary position in the boundary position sequence indicates a start sentence or an end sentence of one valid segment, wherein the start sentence of the K-th valid segment is determined based on the end sentence of the (K-1)-th valid segment, and the end sentence of the K-th valid segment is determined based on the start sentence of the K-th valid segment; K is a positive integer;
And acquiring effective fragments in the text based on the boundary position sequence to form target text.
Optionally, for the refined functions and extended functions of the program, reference may be made to the description above.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between these entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts between the embodiments, reference may be made to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

1. A text processing method, comprising:
acquiring a plurality of candidate boundary position sequences according to the text characteristics of each sentence in the text, and taking the candidate boundary position sequence with the highest confidence as a boundary position sequence; each boundary position in the boundary position sequence indicates a start sentence or an end sentence of one valid segment, wherein the start sentence of the Kth valid segment is determined based on the end sentence of the (K-1)th valid segment, and the end sentence of the Kth valid segment is determined based on the start sentence of the Kth valid segment; K is a positive integer;
based on the boundary position sequence, acquiring valid segments in the text to form a target text;
the obtaining a plurality of candidate boundary position sequences according to the text characteristics of each sentence in the text comprises the following steps:
acquiring a first type candidate boundary position sequence according to the text characteristics of each sentence in the text, and then acquiring a second type candidate boundary position sequence based on the first type candidate boundary position sequence;
each candidate boundary position in the first class of candidate boundary position sequences indicates a candidate start sentence or a candidate end sentence of one of the 1st through (K-1)th valid segments in the text;
each boundary position in the second class of candidate boundary position sequences indicates a candidate start sentence or a candidate end sentence of one of the 1st through (K-1)th valid segments in the text, or indicates a candidate start sentence of the Kth valid segment in the text.
2. The method of claim 1, wherein the obtaining a second type of candidate boundary position sequence based on the first type of candidate boundary position sequence comprises:
for the candidate end sentence of the (K-1)th valid segment indicated by a candidate boundary position in each first class of candidate boundary position sequence, calculating, according to the text characteristics of the candidate end sentence and the text characteristics of each sentence located after the candidate end sentence in the text, the probability that each sentence located after the candidate end sentence belongs to the start sentence of the Kth valid segment;
for each sentence located after the candidate end sentence, calculating, according to the first class of candidate boundary position sequence containing the candidate boundary position that indicates the candidate end sentence and the probability that the sentence belongs to the start sentence of the Kth valid segment, a score of the new candidate boundary position sequence obtained by adding the sentence to that first class of candidate boundary position sequence;
and determining a second class of candidate boundary position sequences from all the new candidate boundary position sequences according to the scores of all the new candidate boundary position sequences obtained based on the candidate end sentences of the (K-1)th valid segments indicated by the candidate boundary positions in the first class of candidate boundary position sequences.
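Claims 1–2 describe what amounts to a beam-search-style expansion: each retained first-class sequence ends at the candidate end sentence of segment K-1, every later sentence is scored as a possible start of segment K, and the highest-scoring extended sequences become the second-class candidates. A hedged sketch of one such expansion step; `start_prob` stands in for the model's learned probability, and the log-probability sum used as the sequence score is an assumption, not the patent's exact scoring rule:

```python
import math

def expand_with_start_candidates(beams, num_sentences, start_prob, beam_width=2):
    """beams: list of (positions, score) pairs, where positions ends with the
    end-sentence index of segment K-1 and score is a sum of log-probabilities.
    Returns the top-`beam_width` extended sequences (second-class candidates)."""
    new_beams = []
    for positions, score in beams:
        last_end = positions[-1]  # end sentence of segment K-1
        for idx in range(last_end + 1, num_sentences):
            p = start_prob(last_end, idx)  # P(idx starts segment K | last_end)
            new_beams.append((positions + [idx], score + math.log(p)))
    new_beams.sort(key=lambda b: b[1], reverse=True)
    return new_beams[:beam_width]
```

For example, with a toy `start_prob` that favors the sentence immediately after the candidate end sentence, a beam ending at index 1 extends preferentially to `[1, 2]`.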
3. The method of claim 1, wherein the acquiring a plurality of candidate boundary position sequences according to the text characteristics of each sentence in the text further comprises:
after acquiring a second type of candidate boundary position sequence according to the text characteristics of each sentence in the text, acquiring a first type of candidate boundary position sequence based on the second type of candidate boundary position sequence;
each candidate boundary position in the first class of candidate boundary position sequences indicates a candidate start sentence or a candidate end sentence of one of the 1st through Kth valid segments in the text;
each boundary position in the second class of candidate boundary position sequences indicates a candidate start sentence or a candidate end sentence of one of the 1st through (K-1)th valid segments in the text, or indicates a candidate start sentence of the Kth valid segment in the text.
4. A method according to claim 3, wherein said obtaining a first type of candidate boundary position sequence based on said second type of candidate boundary position sequence comprises:
for the candidate start sentence of the Kth valid segment indicated by a candidate boundary position in each second class of candidate boundary position sequence, calculating, according to the text characteristics of the candidate start sentence and the text characteristics of each sentence located after the candidate start sentence in the text, the probability that each sentence located after the candidate start sentence belongs to the end sentence of the Kth valid segment;
for each sentence located after the candidate start sentence, calculating, according to the second class of candidate boundary position sequence containing the candidate boundary position that indicates the candidate start sentence and the probability that the sentence belongs to the end sentence of the Kth valid segment, a score of the new candidate boundary position sequence obtained by adding the sentence to that second class of candidate boundary position sequence;
and determining a first class of candidate boundary position sequences from all the new candidate boundary position sequences according to the scores of all the new candidate boundary position sequences obtained based on the candidate start sentences of the Kth valid segment indicated by the candidate boundary positions in the second class of candidate boundary position sequences.
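Claim 4 is the mirror image of claim 2's expansion: each second-class sequence ends at the candidate start sentence of segment K, and every sentence from that position onward is scored as its possible end sentence (the range includes the start sentence itself, so a single-sentence segment remains representable). As before, `end_prob` is an invented stand-in for the model's learned probability:

```python
import math

def expand_with_end_candidates(beams, num_sentences, end_prob, beam_width=2):
    """beams: list of (positions, score) pairs, where positions ends with the
    start-sentence index of segment K. Returns the top-`beam_width` extended
    sequences (first-class candidates covering segments 1..K)."""
    new_beams = []
    for positions, score in beams:
        last_start = positions[-1]  # start sentence of segment K
        for idx in range(last_start, num_sentences):
            p = end_prob(last_start, idx)  # P(idx ends segment K | last_start)
            new_beams.append((positions + [idx], score + math.log(p)))
    new_beams.sort(key=lambda b: b[1], reverse=True)
    return new_beams[:beam_width]
```

Alternating the two expansion functions, and stopping when no further valid segment is predicted, yields the full candidate boundary position sequences of claim 1.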
5. The method as recited in claim 1, further comprising:
determining redundant sentences in the target text;
and deleting redundant sentences in the target text.
6. The method of claim 5, wherein the determining redundant sentences in the target text comprises:
for each sentence in the target text, acquiring the probability that the sentence belongs to a redundant sentence; the probability that the sentence belongs to a redundant sentence is calculated according to the text characteristics of the sentence and the text characteristics of the start sentence of the valid segment in which the sentence is located;
and if the probability that the sentence belongs to a redundant sentence is greater than a probability threshold, determining that the sentence is a redundant sentence.
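Claims 5–6 filter the target text by thresholding a per-sentence redundancy probability conditioned on the start sentence of the sentence's valid segment. A minimal sketch; `redundancy_prob` is an invented callable standing in for the model's probability computed from the two sentences' text characteristics:

```python
def remove_redundant(segment, redundancy_prob, threshold=0.5):
    """segment: list of sentences; the first is assumed to be the segment's
    start sentence. redundancy_prob(start_sentence, sentence) -> [0, 1].
    Keeps only the sentences at or below the redundancy threshold."""
    start = segment[0]
    return [s for s in segment if redundancy_prob(start, s) <= threshold]
```

In practice the threshold would be tuned on held-out data; 0.5 here is only a placeholder default.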
7. The method of claim 1, wherein the text feature of each sentence in the text is obtained by:
for each sentence in the text, acquiring the word vector of each word in the sentence and the encoding of the position of each word in the sentence;
obtaining a characterization vector of the sentence according to the word vector and the position encoding of each word in the sentence;
acquiring the encoding of the position of the sentence in the text;
and obtaining the text characteristics of the sentence according to the characterization vector of the sentence and the position encoding of the sentence.
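Claim 7's feature construction, word vectors plus word-position encodings pooled into a sentence vector, with a sentence-position encoding added on top, can be sketched as below. The hash-based toy word vectors and the sinusoidal position encodings are illustrative assumptions only; the patent does not specify these particular choices:

```python
import hashlib
import math

DIM = 8  # toy feature dimension, chosen for illustration

def word_vector(word):
    # Deterministic stand-in for a learned word embedding.
    h = hashlib.md5(word.encode()).digest()
    return [b / 255.0 for b in h[:DIM]]

def position_encoding(pos):
    # Sinusoidal encoding, assumed here; any positional encoding would do.
    return [math.sin(pos / (10000 ** (2 * (i // 2) / DIM))) if i % 2 == 0
            else math.cos(pos / (10000 ** (2 * (i // 2) / DIM)))
            for i in range(DIM)]

def sentence_feature(words, sentence_pos):
    # Word vector + word-position encoding, averaged into a sentence
    # characterization vector.
    vec = [0.0] * DIM
    for j, w in enumerate(words):
        wv, pe = word_vector(w), position_encoding(j)
        vec = [v + (a + b) / len(words) for v, a, b in zip(vec, wv, pe)]
    # Add the encoding of the sentence's position in the text.
    sp = position_encoding(sentence_pos)
    return [v + s for v, s in zip(vec, sp)]
```

The key property the sketch preserves is that two sentences with identical words but different positions in the text receive different text characteristics.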
8. The method according to any one of claims 1-7, wherein the processing each sentence in the text according to the text feature of each sentence in the text to obtain the boundary position sequence includes:
and processing each sentence in the text by using a text processing model according to the text characteristics of each sentence in the text to obtain a boundary position sequence.
9. The method of claim 8, wherein the training process of the text processing model comprises:
inputting a first type of text sample into the text processing model to obtain a boundary position sequence corresponding to the first type of text sample output by the text processing model; each boundary position in the boundary position sequence corresponding to the first type of text sample indicates a start sentence or an end sentence of one valid segment in the first type of text sample, wherein the start sentence of the Kth valid segment in the first type of text sample is determined based on the end sentence of the (K-1)th valid segment in the first type of text sample, and the end sentence of the Kth valid segment in the first type of text sample is determined based on the start sentence of the Kth valid segment in the first type of text sample;
and updating parameters of the text processing model by taking the boundary position sequence corresponding to the first type of text sample approaching the boundary position sequence label corresponding to the first type of text sample as a target.
10. The method of claim 9, wherein the end sentence of the (K-1)th valid segment in the first type of text sample is determined based on the boundary position sequence label corresponding to the first type of text sample, and the start sentence of the Kth valid segment in the first type of text sample is determined based on the boundary position sequence label corresponding to the first type of text sample.
11. The method of claim 9, wherein the first type of text sample is obtained by inserting an invalid segment into valid text, or the first type of text sample is an originally collected text containing valid segments and invalid segments;
or,
the first type of text sample is obtained by inserting an invalid segment and a redundant segment into valid text, or the first type of text sample is an originally collected text containing valid segments, invalid segments and redundant segments.
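Claim 11's first construction, inserting an invalid segment into valid text, means boundary labels are known by construction, with no manual annotation. A sketch of the sample generator; the `(start, end)` index-pair label format is an assumed encoding chosen to match the earlier extraction sketch, not the patent's specified label format:

```python
def make_synthetic_sample(valid_sentences, invalid_sentences, insert_at):
    """Insert an invalid segment into valid text at sentence index `insert_at`
    (assumed strictly inside the valid text). The boundary label is the
    (start, end) index pair of each resulting valid span."""
    text = (valid_sentences[:insert_at] + invalid_sentences
            + valid_sentences[insert_at:])
    labels = [(0, insert_at - 1),
              (insert_at + len(invalid_sentences), len(text) - 1)]
    return text, labels
```

The same idea extends to the claim's second construction by additionally inserting redundant sentences inside the valid spans and recording their indices as redundancy labels.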
12. The method of claim 8, wherein the training process of the text processing model comprises:
inputting a first type of text sample into the text processing model to obtain a boundary position sequence corresponding to the first type of text sample output by the text processing model; each boundary position in the boundary position sequence corresponding to the first type of text sample indicates a start sentence or an end sentence of one valid segment in the first type of text sample, wherein the start sentence of the Kth valid segment in the first type of text sample is determined based on the end sentence of the (K-1)th valid segment in the first type of text sample, and the end sentence of the Kth valid segment in the first type of text sample is determined based on the start sentence of the Kth valid segment in the first type of text sample;
updating parameters of the text processing model by taking the boundary position sequence corresponding to the first type of text sample approaching the boundary position sequence label corresponding to the first type of text sample as a target, to obtain an initial text processing model; the first type of text sample is obtained by inserting at least an invalid segment into valid text;
inputting a second type of text sample into the initial text processing model to obtain a boundary position sequence corresponding to the second type of text sample output by the initial text processing model; the second type of text sample is an originally collected text containing at least valid segments and invalid segments;
and updating parameters of the initial text processing model by taking the boundary position sequence corresponding to the second type of text sample approaching the boundary position sequence label corresponding to the second type of text sample as a target.
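Claim 12's schedule, pre-training on synthetic first-type samples and then continuing training of the resulting initial model on originally collected second-type samples, reduces to a two-pass loop. A skeleton only: `update` is a stand-in for one gradient step of a real model, and the model is treated as an immutable value so the initial-model snapshot is meaningful:

```python
def train_two_stage(model, first_type_samples, second_type_samples, update):
    """Each sample is a (text, label) pair; update(model, text, label)
    returns the updated model. Returns (initial_model, final_model)."""
    # Stage 1: update on synthetic (first-type) samples.
    for text, label in first_type_samples:
        model = update(model, text, label)
    initial_model = model  # the "initial text processing model" of claim 12
    # Stage 2: continue updating on originally collected (second-type) samples.
    for text, label in second_type_samples:
        model = update(model, text, label)
    return initial_model, model
```

The benefit of the split is that cheap synthetic data gives the model its segmentation prior before the scarcer hand-labelled real recordings refine it.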
13. A text processing apparatus, comprising:
the boundary position sequence acquisition module is used for acquiring a plurality of candidate boundary position sequences according to the text characteristics of each sentence in the text, and taking the candidate boundary position sequence with the highest confidence as the boundary position sequence; each boundary position in the boundary position sequence indicates a start sentence or an end sentence of one valid segment, wherein the start sentence of the Kth valid segment is determined based on the end sentence of the (K-1)th valid segment, and the end sentence of the Kth valid segment is determined based on the start sentence of the Kth valid segment; K is a positive integer;
the target text acquisition module is used for acquiring effective fragments in the text based on the boundary position sequence to form a target text;
the obtaining a plurality of candidate boundary position sequences according to the text characteristics of each sentence in the text comprises the following steps:
acquiring a first type candidate boundary position sequence according to the text characteristics of each sentence in the text, and then acquiring a second type candidate boundary position sequence based on the first type candidate boundary position sequence;
each candidate boundary position in the first class of candidate boundary position sequences indicates a candidate start sentence or a candidate end sentence of one of the 1st through (K-1)th valid segments in the text;
each boundary position in the second class of candidate boundary position sequences indicates a candidate start sentence or a candidate end sentence of one of the 1st through (K-1)th valid segments in the text, or indicates a candidate start sentence of the Kth valid segment in the text.
14. A text processing device comprising a memory and a processor;
the memory is used for storing programs;
the processor being configured to execute the program to implement the respective steps of the text processing method according to any one of claims 1 to 12.
15. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the text processing method according to any of claims 1-12.
CN202011632673.0A 2020-12-31 2020-12-31 Text processing method, device, equipment and storage medium Active CN112818077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011632673.0A CN112818077B (en) 2020-12-31 2020-12-31 Text processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011632673.0A CN112818077B (en) 2020-12-31 2020-12-31 Text processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112818077A CN112818077A (en) 2021-05-18
CN112818077B true CN112818077B (en) 2023-05-30

Family

ID=75856482

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011632673.0A Active CN112818077B (en) 2020-12-31 2020-12-31 Text processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112818077B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107305541A (en) * 2016-04-20 2017-10-31 科大讯飞股份有限公司 Speech recognition text segmentation method and device
CN109299179A (en) * 2018-10-15 2019-02-01 西门子医疗系统有限公司 Structural data extraction element, method and storage medium
CN109977219A (en) * 2019-03-19 2019-07-05 国家计算机网络与信息安全管理中心 Text snippet automatic generation method and device based on heuristic rule
CN110888976A (en) * 2019-11-14 2020-03-17 北京香侬慧语科技有限责任公司 Text abstract generation method and device
CN111666759A (en) * 2020-04-17 2020-09-15 北京百度网讯科技有限公司 Method and device for extracting key information of text, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018232290A1 (en) * 2017-06-16 2018-12-20 Elsevier, Inc. Systems and methods for automatically generating content summaries for topics

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107305541A (en) * 2016-04-20 2017-10-31 科大讯飞股份有限公司 Speech recognition text segmentation method and device
CN109299179A (en) * 2018-10-15 2019-02-01 西门子医疗系统有限公司 Structural data extraction element, method and storage medium
CN109977219A (en) * 2019-03-19 2019-07-05 国家计算机网络与信息安全管理中心 Text snippet automatic generation method and device based on heuristic rule
CN110888976A (en) * 2019-11-14 2020-03-17 北京香侬慧语科技有限责任公司 Text abstract generation method and device
CN111666759A (en) * 2020-04-17 2020-09-15 北京百度网讯科技有限公司 Method and device for extracting key information of text, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Personalized text snippet extraction using statistical language models; Qing Li et al.; Pattern Recognition; 2010-01-31; Vol. 43, No. 1; pp. 378-386 *
Research on Key Technologies of Multi-Document Automatic Summarization; Xu Yongdong (徐永东); China Doctoral Dissertations Full-text Database, Information Science and Technology; 2009-12-25; No. 12; pp. I138-102 *

Also Published As

Publication number Publication date
CN112818077A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN111581976B (en) Medical term standardization method, device, computer equipment and storage medium
CN107767870B (en) Punctuation mark adding method and device and computer equipment
CN109887497B (en) Modeling method, device and equipment for speech recognition
WO2022121171A1 (en) Similar text matching method and apparatus, and electronic device and computer storage medium
CN109271641B (en) Text similarity calculation method and device and electronic equipment
CN112287680B (en) Entity extraction method, device and equipment of inquiry information and storage medium
CN110210028A (en) For domain feature words extracting method, device, equipment and the medium of speech translation text
CN112687332B (en) Method, apparatus and storage medium for determining sites of variation at risk of disease
CN109299227B (en) Information query method and device based on voice recognition
CN111369980B (en) Voice detection method, device, electronic equipment and storage medium
CN112784009B (en) Method and device for mining subject term, electronic equipment and storage medium
CN112507706A (en) Training method and device of knowledge pre-training model and electronic equipment
CN112614559A (en) Medical record text processing method and device, computer equipment and storage medium
CN112188311A (en) Method and apparatus for determining video material of news
CN111061877A (en) Text theme extraction method and device
CN112017744A (en) Electronic case automatic generation method, device, equipment and storage medium
CN116150621A (en) Training method, device and equipment for text model
CN115798661A (en) Knowledge mining method and device in clinical medicine field
CN110909121A (en) Method and system for medical industry data standardization
CN112818077B (en) Text processing method, device, equipment and storage medium
CN111931491A (en) Domain dictionary construction method and device
CN116402166A (en) Training method and device of prediction model, electronic equipment and storage medium
CN113553410B (en) Long document processing method, processing device, electronic equipment and storage medium
CN114254595A (en) Text generation method, device, equipment and storage medium
CN113836261A (en) Patent text novelty/creativity prediction method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant