CN112818077A - Text processing method, device, equipment and storage medium - Google Patents
Text processing method, device, equipment and storage medium
- Publication number
- CN112818077A (application number CN202011632673.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
An embodiment of the application discloses a text processing method in which each sentence in a text is processed according to the text features of each sentence to obtain a boundary position sequence. Each boundary position in the boundary position sequence indicates a starting sentence or an ending sentence of one valid segment, wherein the starting sentence of the Kth valid segment is determined based on the ending sentence of the (K-1)th valid segment, and the ending sentence of the Kth valid segment is determined based on the starting sentence of the Kth valid segment. The valid segments in the text are then acquired based on the boundary position sequence to form the target text. The scheme extracts the valid segments in the text automatically and improves the efficiency of text normalization.
Description
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a text processing method, apparatus, device, and storage medium.
Background
At present, text is the main form in which information is recorded, but a text may also contain invalid information, for example information unrelated to the field to which the recorded information belongs, and such invalid information reduces the readability of the text. It is therefore necessary to normalize the text to remove the invalid information.
Text can be normalized manually, but doing so is inefficient. How to improve the efficiency of text normalization has therefore become a technical problem to be solved urgently.
Disclosure of Invention
In view of this, the present application provides a text processing method, apparatus, device and storage medium to improve efficiency of text normalization.
In order to achieve the above object, the following solutions are proposed:
a text processing method, comprising:
processing each sentence in the text according to the text features of each sentence in the text to obtain a boundary position sequence; each boundary position in the boundary position sequence indicates a starting sentence or an ending sentence of one valid segment, wherein the starting sentence of the Kth valid segment is determined based on the ending sentence of the (K-1)th valid segment, and the ending sentence of the Kth valid segment is determined based on the starting sentence of the Kth valid segment; K is a positive integer;
and acquiring effective segments in the text based on the boundary position sequence to form a target text.
In the above method, preferably, the processing each sentence in the text according to the text feature of each sentence in the text to obtain the boundary position sequence includes:
acquiring a plurality of candidate boundary position sequences according to the text characteristics of each sentence in the text;
calculating the score of each candidate boundary position sequence, wherein the score of a candidate boundary position sequence represents the confidence of that sequence, and the higher the score, the higher the confidence;
and taking the candidate boundary position sequence with the highest score as the boundary position sequence.
In the above method, preferably, the obtaining a plurality of candidate boundary position sequences according to text features of each sentence in the text includes:
after a first class candidate boundary position sequence is obtained according to the text features of each sentence in the text, a second class candidate boundary position sequence is obtained based on the first class candidate boundary position sequence;
each candidate boundary position in the first class of candidate boundary position sequences indicates a candidate starting sentence or a candidate ending sentence of one valid segment among the 1st to (K-1)th valid segments in the text;
each candidate boundary position in the second class of candidate boundary position sequences indicates a candidate starting sentence or a candidate ending sentence of one valid segment among the 1st to (K-1)th valid segments in the text, or indicates a candidate starting sentence of the Kth valid segment in the text.
In the above method, preferably, the obtaining a second type of candidate boundary position sequence based on the first type of candidate boundary position sequence includes:
for the candidate ending sentences of the K-1 th effective segment indicated by the candidate boundary positions in each first-class candidate boundary position sequence, calculating the probability that each sentence in the text after the candidate ending sentence belongs to the starting sentence of the K-th effective segment according to the text characteristics of the candidate ending sentences and the text characteristics of each sentence in the text after the candidate ending sentence;
for each sentence after the candidate ending sentence, calculating a score of a new candidate boundary position sequence obtained by adding the sentence into the first candidate boundary position sequence indicating the candidate boundary position of the candidate ending sentence according to the first candidate boundary position sequence indicating the candidate boundary position of the candidate ending sentence and the probability of the sentence belonging to the starting sentence of the Kth effective segment;
and determining the second-class candidate boundary position sequences in all the new candidate boundary position sequences according to the scores of all the new candidate boundary position sequences obtained based on the candidate end sentences of the K-1 th effective segment indicated by the candidate boundary positions in the first-class candidate boundary position sequences.
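As a rough illustration of the expansion step just described, the following Python sketch extends each first-class candidate sequence with every later sentence as a candidate starting sentence, scores the extended sequences, and keeps the best few. The function name, the `start_prob` callback, the `beam_size` parameter, and the choice of adding log probabilities to a running score are all assumptions made for this sketch; the source only states that the new score depends on the existing sequence and on the start probability.

```python
import math

def expand_with_starts(first_class, num_sentences, start_prob, beam_size):
    """Extend first-class sequences into second-class sequences.

    first_class: list of (positions, score) pairs; positions is a list of
        0-based sentence indices whose last entry is the candidate ending
        sentence of the (K-1)th segment, score is a cumulative log score.
    start_prob: hypothetical callback (end_idx, j) -> probability that
        sentence j is the starting sentence of the Kth segment.
    """
    new_candidates = []
    for positions, score in first_class:
        end_idx = positions[-1] if positions else -1
        # Every sentence after the candidate ending sentence is a possible start.
        for j in range(end_idx + 1, num_sentences):
            p = start_prob(end_idx, j)
            if p > 0.0:
                new_candidates.append((positions + [j], score + math.log(p)))
    # Keep only the highest-scoring new sequences (assumed selection rule).
    new_candidates.sort(key=lambda c: c[1], reverse=True)
    return new_candidates[:beam_size]


# Toy probability: a start is most likely two sentences after the previous end.
toy_start_prob = lambda end_idx, j: 0.9 if j == end_idx + 2 else 0.05
beam = expand_with_starts(
    [([0, 1], math.log(0.8)), ([0, 2], math.log(0.5))],
    num_sentences=6,
    start_prob=toy_start_prob,
    beam_size=2,
)
```

The step that derives first-class sequences from second-class ones could be sketched the same way, with a callback that returns the probability of a sentence being the ending sentence of the Kth segment.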
In the above method, preferably, the obtaining a plurality of candidate boundary position sequences according to text features of each sentence in the text includes:
after a second type of candidate boundary position sequence is obtained according to the text characteristics of each sentence in the text, a first type of candidate boundary position sequence is obtained based on the second type of candidate boundary position sequence;
each candidate boundary position in the first class of candidate boundary position sequences indicates a candidate starting sentence or a candidate ending sentence of one valid segment among the 1st to Kth valid segments in the text;
each candidate boundary position in the second class of candidate boundary position sequences indicates a candidate starting sentence or a candidate ending sentence of one valid segment among the 1st to (K-1)th valid segments in the text, or indicates a candidate starting sentence of the Kth valid segment in the text.
Preferably, the obtaining the first class candidate boundary position sequence based on the second class candidate boundary position sequence includes:
for the candidate starting sentence of the Kth valid segment indicated by the candidate boundary position in each second-class candidate boundary position sequence, calculating the probability that each sentence in the text after the candidate starting sentence is the ending sentence of the Kth valid segment, according to the text features of the candidate starting sentence and the text features of each sentence in the text after the candidate starting sentence;
for each sentence after the candidate starting sentence, calculating a score of a new candidate boundary position sequence obtained by adding the sentence into the second candidate boundary position sequence indicating the candidate boundary position of the candidate starting sentence according to the second candidate boundary position sequence indicating the candidate boundary position of the candidate starting sentence and the probability of the sentence belonging to the ending sentence of the Kth effective segment;
and determining the first-class candidate boundary position sequence in all the new candidate boundary position sequences according to the scores of all the new candidate boundary position sequences obtained based on the candidate starting sentences of the Kth effective segment indicated by the candidate boundary positions in the second-class candidate boundary position sequences.
The above method, preferably, further comprises:
determining a redundant sentence in the target text;
and deleting redundant sentences in the target text.
In the above method, preferably, the determining the redundant sentence in the target text includes:
for each sentence in the target text, acquiring the probability that the sentence belongs to a redundant sentence; the probability that the sentence belongs to the redundant sentence is calculated according to the text characteristics of the sentence and the text characteristics of the initial sentence of the effective segment in which the sentence is located;
and if the probability that the sentence belongs to the redundant sentence is greater than the probability threshold value, determining the sentence as the redundant sentence.
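The threshold rule above can be sketched in Python as follows. In the source the probability is computed from the text features of the sentence and of the starting sentence of its valid segment; here that computation is abstracted into a hypothetical `redundancy_prob` callback, and the `segment_start_of` mapping and the default threshold of 0.5 are likewise assumptions for illustration.

```python
def remove_redundant(sentences, segment_start_of, redundancy_prob, threshold=0.5):
    """Drop sentences whose redundancy probability exceeds the threshold.

    sentences: sentences of the target text, in order.
    segment_start_of: for each sentence index, the index of the starting
        sentence of the valid segment that contains it.
    redundancy_prob: hypothetical callback (sentence, start_sentence) -> float.
    """
    kept = []
    for i, sentence in enumerate(sentences):
        start_sentence = sentences[segment_start_of[i]]
        if redundancy_prob(sentence, start_sentence) > threshold:
            continue  # treated as a redundant sentence and deleted
        kept.append(sentence)
    return kept


# Toy stand-in for the model: probabilities looked up from a fixed table.
toy_probs = {"a": 0.1, "b": 0.9, "c": 0.2}
cleaned = remove_redundant(
    ["a", "b", "c"],
    segment_start_of=[0, 0, 2],
    redundancy_prob=lambda sentence, start: toy_probs[sentence],
)
```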
In the above method, preferably, the text feature of each sentence in the text is obtained by:
for each sentence in the text, acquiring a word vector of each word in the sentence and a code of the position of each word in the sentence;
obtaining a representation vector of the sentence according to the word vector of each word in the sentence and the position code of each word;
acquiring the code of the position of the sentence in the text;
and obtaining the text characteristics of the sentence according to the characterization vector of the sentence and the position code of the sentence.
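One possible reading of these four steps is sketched below in plain Python. The source does not say how positions are encoded or how vectors are combined, so the sinusoidal position code, the element-wise addition, and the mean pooling used here are assumptions chosen only to make the data flow concrete; the function names are hypothetical.

```python
import math

def position_code(pos, dim):
    """Sinusoidal code for a position (an assumed encoding scheme)."""
    code = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        code.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return code

def sentence_feature(word_vectors, sentence_pos):
    """Combine word vectors, word positions and the sentence position.

    word_vectors: one vector (list of floats) per word in the sentence.
    sentence_pos: 0-based position of the sentence in the text.
    """
    dim = len(word_vectors[0])
    # Steps 1-2: add each word's position code to its word vector,
    # then mean-pool to get the representation vector of the sentence.
    encoded = [
        [w + c for w, c in zip(vec, position_code(k, dim))]
        for k, vec in enumerate(word_vectors)
    ]
    representation = [sum(col) / len(encoded) for col in zip(*encoded)]
    # Steps 3-4: add the code of the sentence's position in the text.
    return [r + c for r, c in zip(representation, position_code(sentence_pos, dim))]


feature = sentence_feature([[0.0, 0.0, 0.0, 0.0]], sentence_pos=0)
```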
In the above method, preferably, the processing each sentence in the text according to the text feature of each sentence in the text to obtain the boundary position sequence includes:
and processing each sentence in the text according to the text characteristics of each sentence in the text by using a text processing model to obtain a boundary position sequence.
In the above method, preferably, the training process of the text processing model includes:
inputting a first type of text sample into the text processing model to obtain a boundary position sequence corresponding to the first type of text sample output by the text processing model; each boundary position in the boundary position sequence corresponding to the first type of text sample indicates a starting sentence or an ending sentence of one valid segment in the first type of text sample, wherein the starting sentence of the Kth valid segment in the first type of text sample is determined based on the ending sentence of the (K-1)th valid segment in the first type of text sample, and the ending sentence of the Kth valid segment in the first type of text sample is determined based on the starting sentence of the Kth valid segment in the first type of text sample;
and updating the parameters of the text processing model by taking the boundary position sequence corresponding to the first type of text sample approaching to the boundary position sequence label corresponding to the first type of text sample as a target.
In the method, preferably, the ending sentence of the (K-1)th valid segment in the first type of text sample is determined based on the boundary position sequence label corresponding to the first type of text sample, and the starting sentence of the Kth valid segment in the first type of text sample is determined based on the boundary position sequence label corresponding to the first type of text sample.
In the method, preferably, the first type of text sample is obtained by inserting an invalid segment into an effective text, or the first type of text sample is an originally acquired text containing the effective segment and the invalid segment;
or,
the first type of text sample is obtained by inserting invalid segments and redundant segments into valid text, or the first type of text sample is originally acquired text containing valid segments, invalid segments and redundant segments.
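The construction of a first type of text sample by inserting invalid segments into valid text can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_sample`, the uniformly random choice of insertion points, and the 0-based inclusive boundary labels are not taken from the source, and redundant segments are left out for brevity.

```python
import random

def build_sample(valid_sentences, invalid_segments, rng):
    """Insert each invalid segment at a random cut point of the valid text.

    Returns (sentences, labels), where labels is the boundary position
    sequence label: start index, end index of each run of valid sentences.
    """
    n = len(valid_sentences)
    # Pick one cut point (0..n) per invalid segment and process them in order.
    cuts = sorted(
        ((rng.randint(0, n), seg) for seg in invalid_segments),
        key=lambda c: c[0],
    )
    sentences, is_valid, prev = [], [], 0
    for cut, seg in cuts:
        sentences += valid_sentences[prev:cut]
        is_valid += [True] * (cut - prev)
        sentences += seg
        is_valid += [False] * len(seg)
        prev = cut
    sentences += valid_sentences[prev:]
    is_valid += [True] * (n - prev)
    # Each maximal run of valid sentences contributes a start and an end label.
    labels = []
    for i, valid in enumerate(is_valid):
        if valid and (i == 0 or not is_valid[i - 1]):
            labels.append(i)
        if valid and (i == len(is_valid) - 1 or not is_valid[i + 1]):
            labels.append(i)
    return sentences, labels


valid = ["v0", "v1", "v2", "v3"]
sample, labels = build_sample(valid, [["x0", "x1"], ["y0"]], random.Random(0))
```

Because the labels are derived from the insertion points, no manual annotation is needed for samples built this way.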
In the above method, preferably, the training process of the text processing model includes:
inputting a first type of text sample into the text processing model to obtain a boundary position sequence corresponding to the first type of text sample output by the text processing model; each boundary position in the boundary position sequence corresponding to the first type of text sample indicates a starting sentence or an ending sentence of one valid segment in the first type of text sample, wherein the starting sentence of the Kth valid segment in the first type of text sample is determined based on the ending sentence of the (K-1)th valid segment in the first type of text sample, and the ending sentence of the Kth valid segment in the first type of text sample is determined based on the starting sentence of the Kth valid segment in the first type of text sample;
taking the boundary position sequence corresponding to the first type of text sample approaching the boundary position sequence label corresponding to the first type of text sample as a target, and updating the parameters of the text processing model to obtain an initial text processing model; the first type of text sample is obtained by inserting at least invalid segments into valid text;
inputting a second type of text sample into the initial text processing model to obtain a boundary position sequence output by the initial text processing model and corresponding to the second type of text sample; the second type of text sample is originally acquired text containing at least valid segments and invalid segments;
and updating the parameters of the initial text processing model by taking the boundary position sequence corresponding to the second type of text sample approaching to the boundary position sequence label corresponding to the second type of text sample as a target.
A text processing apparatus comprising:
the boundary position sequence acquisition module is used for processing each sentence in the text according to the text features of each sentence in the text to obtain a boundary position sequence; each boundary position in the boundary position sequence indicates a starting sentence or an ending sentence of one valid segment, wherein the starting sentence of the Kth valid segment is determined based on the ending sentence of the (K-1)th valid segment, and the ending sentence of the Kth valid segment is determined based on the starting sentence of the Kth valid segment; K is a positive integer;
and the target text acquisition module is used for acquiring the effective segments in the text based on the boundary position sequence to form a target text.
A text processing apparatus comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the text processing method according to any one of the above items.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the text processing method according to any one of the preceding claims.
According to the above technical scheme, the text processing method provided by the embodiment of the application processes each sentence in the text according to the text features of each sentence in the text to obtain a boundary position sequence. Each boundary position in the boundary position sequence indicates a starting sentence or an ending sentence of one valid segment, wherein the starting sentence of the Kth valid segment is determined based on the ending sentence of the (K-1)th valid segment, and the ending sentence of the Kth valid segment is determined based on the starting sentence of the Kth valid segment. The valid segments in the text are acquired based on the boundary position sequence to form the target text. The scheme extracts the valid segments in the text automatically and improves the efficiency of text normalization.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of one implementation of a text processing method disclosed in an embodiment of the present application;
fig. 2 is a flowchart of an implementation of processing each sentence in a text according to a text feature of each sentence in the text to obtain a boundary position sequence, according to an embodiment of the present disclosure;
fig. 3 is a flowchart of an implementation of obtaining a second type of candidate boundary position sequence based on a first type of candidate boundary position sequence disclosed in the embodiment of the present application;
fig. 4 is a flowchart of an implementation of obtaining a first type of candidate boundary position sequence based on a second type of candidate boundary position sequence disclosed in an embodiment of the present application;
FIG. 5 is a flow chart of one implementation of determining redundant sentences in a target text as disclosed in an embodiment of the present application;
FIG. 6 is a flow chart of one implementation of obtaining textual features of a sentence as disclosed in an embodiment of the present application;
FIG. 7 is a schematic diagram of a structure of a text processing model disclosed in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a text processing apparatus disclosed in an embodiment of the present application;
fig. 9 is a block diagram of a hardware configuration of a text processing apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, recording information through a voice recording system is widely applied in various industries. For example, in a medical scene, the communication between doctors and patients can be recorded through the voice recording system; in a conference scene, the communication among the participants can be recorded through the voice recording system. After the voice of the user is collected, the voice recording system transcribes the collected voice into text for storage.
In practical applications, the recorded communication often contains irrelevant or repeated information, so the text transcribed from the voice contains it as well, which reduces the readability of the text. For example, in a medical scenario, a doctor communicating with a patient can enter a medical document through the voice entry system for subsequent reading, analysis and archiving. In the actual communication, however, dialogue unrelated to the patient's condition is often interspersed: the doctor may chat casually with the patient to understand the patient's emotional state, answer questions from the patient or family members that are unrelated to the condition (such as where to go for a physical examination or what the fees are), or talk with other doctors and nurses. In addition, because the doctor may not have formed a complete line of thought while dictating, redundant information is often entered. As a result, the text transcribed by the voice entry system contains much information that is irrelevant to the patient's condition or redundant, which breaks the logical thread of the recorded condition and greatly degrades the reading experience. It is therefore necessary to normalize the text to remove the irrelevant and repeated information.
At present, texts are normalized manually, but manual normalization is inefficient, particularly when the texts are long. How to improve the efficiency of text normalization has therefore become a technical problem to be solved urgently.
In order to improve efficiency of regularizing a text, an implementation flowchart of a text processing method provided in an embodiment of the present application is shown in fig. 1, and may include:
Step S101: process each sentence in the text according to the text features of each sentence in the text to obtain a boundary position sequence.
The text is the text to be normalized, and the text can be a doctor-patient communication document transcribed by a voice input system, or a conference recording document and the like.
Each boundary position in the boundary position sequence indicates a starting sentence or an ending sentence of one valid segment, wherein the starting sentence of the Kth valid segment is determined based on the ending sentence of the (K-1)th valid segment, and the ending sentence of the Kth valid segment is determined based on the starting sentence of the Kth valid segment; K is a positive integer.
The Kth valid segment is any one of the valid segments in the text. If the Kth valid segment is the 1st valid segment in the text, so that no (K-1)th valid segment exists, the ending sentence of the (K-1)th valid segment can be represented by preset information.
Assuming that there are N boundary positions in the boundary position sequence, the boundary position sequence indicates the starting sentences and ending sentences of N/2 valid segments in total; that is, N/2 valid segments are predicted in the text. The boundary positions in the boundary position sequence may be written as $A_1, B_1, A_2, B_2, A_3, B_3, \ldots, A_{N/2-1}, B_{N/2-1}, A_{N/2}, B_{N/2}$, where $A_i$ indicates the starting sentence of the ith valid segment and $B_i$ indicates the ending sentence of the ith valid segment, for $i = 1, 2, \ldots, N/2$.
$A_i$ may be the position information of the starting sentence of the ith valid segment in the text, for example the sentence number of that starting sentence in the text, and $B_i$ may be the position information of the ending sentence of the ith valid segment in the text, for example the sentence number of that ending sentence in the text.
Optionally, $A_i$ may be the starting sentence of the ith valid segment itself, and $B_i$ may be the ending sentence of the ith valid segment itself.
Optionally, preset punctuation marks (e.g., commas, periods, question marks, exclamation marks) may be used as segment marks, and the text between two adjacent segment marks is called a sentence. For example, the text between two adjacent commas is a sentence, as is the text between a comma and an adjacent period, between a question mark and an adjacent period, or between a question mark and an adjacent exclamation mark. That is, the text segment between any two adjacent punctuation marks is called a sentence.
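The punctuation-based definition of a sentence can be sketched in a few lines of Python. The set of segment marks below (ASCII and full-width comma, period, question mark and exclamation mark) and the function name are assumptions for illustration; the source only says the marks are preset.

```python
import re

# Assumed preset segment marks: ASCII and full-width variants.
SEGMENT_MARKS = ",.?!，。？！"

def split_sentences(text):
    """Return the text spans lying between adjacent segment marks."""
    pieces = re.split("[" + re.escape(SEGMENT_MARKS) + "]", text)
    # Drop empty spans, e.g. after the final mark or between doubled marks.
    return [piece.strip() for piece in pieces if piece.strip()]


sentences = split_sentences("Hello, how are you? Fine. Thanks!")
```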
Step S102: and acquiring effective segments in the text based on the boundary position sequence to form the target text.
In the embodiment of the present application, the starting sentence and the ending sentence of each valid segment in the text are predicted, and for each valid segment, the segment (including the starting sentence and the ending sentence) between the starting sentence and the ending sentence of the valid segment is the valid segment. After all the effective segments in the text are determined, all the effective segments can be extracted from the text to form the target text, or the ineffective segments in the text can be directly deleted to obtain the target text.
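Step S102 can be illustrated with a short Python sketch that reads the boundary position sequence as consecutive (start, end) pairs and keeps the sentences inside each pair, both ends included. The function name and the use of 0-based sentence numbers as boundary positions are assumptions made for this example.

```python
def extract_valid_segments(sentences, boundaries):
    """Build the target text from a boundary position sequence.

    sentences: all sentences of the text, in order.
    boundaries: [A1, B1, A2, B2, ...] as 0-based sentence numbers, where
        each (Ai, Bi) pair delimits one valid segment inclusively.
    """
    if len(boundaries) % 2 != 0:
        raise ValueError("boundary positions must come in start/end pairs")
    target = []
    for start, end in zip(boundaries[0::2], boundaries[1::2]):
        target.extend(sentences[start:end + 1])
    return target


text = ["s0", "s1", "s2", "s3", "s4", "s5"]
target_text = extract_valid_segments(text, [0, 1, 4, 5])
```

Deleting everything outside the pairs would give the same target text, which matches the two options mentioned above.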
According to the text processing method provided by the embodiment of the application, each sentence in the text is processed according to the text features of each sentence in the text to obtain a boundary position sequence. Each boundary position in the boundary position sequence indicates a starting sentence or an ending sentence of one valid segment, wherein the starting sentence of the Kth valid segment is determined based on the ending sentence of the (K-1)th valid segment, and the ending sentence of the Kth valid segment is determined based on the starting sentence of the Kth valid segment. The valid segments in the text are acquired based on the boundary position sequence to form the target text. The valid segments in the text are thus extracted automatically, and the efficiency of text normalization is improved.
In an optional embodiment, one implementation manner of processing each sentence in the text according to the text feature of each sentence in the text to obtain the boundary position sequence may be: and processing each sentence in the text according to the text characteristics of each sentence in the text, and directly determining an optimal boundary position sequence.
In an alternative embodiment, an implementation flowchart of processing each sentence in the text according to the text feature of each sentence in the text to obtain the boundary position sequence is shown in fig. 2, and may include:
Step S201: Acquire several candidate boundary position sequences according to the text features of each sentence in the text. Each candidate boundary position in each candidate boundary position sequence indicates a candidate starting sentence or a candidate ending sentence of one valid segment.
In different candidate boundary position sequences, the candidate boundary positions indicating the candidate starting sentence of the same valid segment are the same or different, and the candidate boundary positions indicating the candidate ending sentence of the same valid segment are the same or different.
Step S202: Calculate the score of each candidate boundary position sequence, where the score of a candidate boundary position sequence represents the confidence of that candidate boundary position sequence; the higher the score, the higher the confidence.
For any candidate boundary position sequence, the score of the candidate boundary position sequence can be determined according to the probability value corresponding to each candidate boundary position in the candidate boundary position sequence, wherein if the candidate boundary position indicates a candidate starting sentence of an effective segment, the probability value corresponding to the candidate boundary position is the probability that the candidate starting sentence indicated by the candidate boundary position belongs to the starting sentence of the effective segment, and if the candidate boundary position indicates a candidate ending sentence of the effective segment, the probability value corresponding to the candidate boundary position is the probability that the candidate ending sentence indicated by the candidate boundary position belongs to the ending sentence of the effective segment.
Alternatively, for any one candidate boundary position sequence, the Score of the candidate boundary position sequence may be determined according to the following formula:
where N is the number of candidate boundary positions in the candidate boundary position sequence, and p(S_j) is the probability corresponding to the j-th candidate boundary position in the candidate boundary position sequence.
Step S203: Take the candidate boundary position sequence with the highest score as the boundary position sequence.
In this embodiment, instead of directly determining the optimal boundary position sequence, a plurality of candidate boundary position sequences are determined first, and then an optimal boundary position sequence is selected from the plurality of candidate boundary position sequences.
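Steps S202 and S203 can be sketched as follows. The body of the scoring formula is not reproduced in this text, so the sketch assumes a length-normalized sum of log-probabilities; this is only one plausible choice and is not confirmed by the application. The names `sequence_score` and `select_best` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# A candidate is (boundary positions, probability of each boundary position).
Candidate = Tuple[List[int], List[float]]


def sequence_score(probs: Sequence[float]) -> float:
    """ASSUMED scoring rule: mean log-probability over the N boundary positions."""
    return sum(math.log(p) for p in probs) / len(probs)


def select_best(candidates: List[Candidate]) -> List[int]:
    """Step S203: return the candidate boundary position sequence with the highest score."""
    best_positions, _ = max(candidates, key=lambda c: sequence_score(c[1]))
    return best_positions


candidates = [
    ([0, 3, 5, 8], [0.9, 0.8, 0.7, 0.9]),
    ([0, 3, 6, 8], [0.9, 0.8, 0.4, 0.9]),
]
best = select_best(candidates)
print(best)  # [0, 3, 5, 8]
```

Any other monotone combination of the per-position probabilities could be substituted for `sequence_score` without changing the selection logic.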
In an optional embodiment, one implementation manner of the above obtaining several candidate boundary position sequences according to the text features of each sentence in the text may be:
after a first class candidate boundary position sequence is obtained according to the text characteristics of each sentence in the text, a second class candidate boundary position sequence is obtained based on the first class candidate boundary position sequence.
Each candidate boundary position in the first-class candidate boundary position sequence indicates a candidate starting sentence or a candidate ending sentence of one effective fragment from 1 st to K-1 st effective fragments in the text;
each boundary position in the second type of sequence of candidate boundary positions indicates a candidate starting sentence or a candidate ending sentence of one of the 1 st to K-1 st valid segments in the text, or indicates a candidate starting sentence of the K-th valid segment in the text.
That is to say, in the embodiment of the present application, the candidate starting sentences and the candidate ending sentences are predicted one by one for each valid segment, and each valid segment predicts the candidate starting sentence first and then predicts the candidate ending sentence, and after the candidate ending sentence of the K-1 th valid segment is predicted, the candidate starting sentence of the K-th valid segment can be predicted.
In an optional embodiment, an implementation flowchart of the obtaining a second type of candidate boundary position sequence based on a first type of candidate boundary position sequence is shown in fig. 3, and may include:
Step S301: For the candidate ending sentence of the (K-1)th valid segment indicated by a candidate boundary position in each first-class candidate boundary position sequence, calculate the probability that each sentence located after the candidate ending sentence in the text is the starting sentence of the K-th valid segment, according to the text features of the candidate ending sentence and the text features of each sentence located after the candidate ending sentence in the text.
For convenience of description, the candidate ending sentence of the (K-1)th valid segment indicated by the candidate boundary position in the P-th first-class candidate boundary position sequence is denoted as S_{P.1.K-1.ce}. The candidate ending sentences of the (K-1)th valid segment indicated by the candidate boundary positions in different first-class candidate boundary position sequences are different.
For each sentence located after the candidate ending sentence S_{P.1.K-1.ce} in the text, the text features of the sentence and the text features of the candidate ending sentence S_{P.1.K-1.ce} may be spliced, and the probability that the sentence is the starting sentence of the K-th valid segment is calculated using the spliced text features.
Step S302: For each sentence after the candidate ending sentence S_{P.1.K-1.ce}, based on the first-class candidate boundary position sequence containing the candidate boundary position that indicates S_{P.1.K-1.ce}, and on the probability that the sentence is the starting sentence of the K-th valid segment, calculate the score of the new candidate boundary position sequence obtained by adding the sentence to that first-class candidate boundary position sequence.
The score of the new candidate boundary position sequence obtained by adding the sentence to the first-class candidate boundary position sequence containing the candidate boundary position that indicates S_{P.1.K-1.ce} can be calculated using formula (1). In this case, N in the formula is the number of candidate boundary positions in the new candidate boundary position sequence, which is 2K-1, and p(S_j) is the probability corresponding to the j-th (j = 1, 2, 3, ..., 2K-1) candidate boundary position in the new candidate boundary position sequence.
Step S303: Determine the second-class candidate boundary position sequences among all the new candidate boundary position sequences according to the scores of all the new candidate boundary position sequences obtained based on the candidate ending sentences S_{P.1.K-1.ce} of the (K-1)th valid segment indicated by the candidate boundary positions in the first-class candidate boundary position sequences. In different second-class candidate boundary position sequences, the candidate boundary positions indicating the candidate starting sentence of the same valid segment are the same or different, and the candidate boundary positions indicating the candidate ending sentence of the same valid segment are the same or different.
Suppose the number of sentences located after the candidate ending sentence S_{P.1.K-1.ce} in the text is M_S_{P.1.K-1.ce}. Then M_S_{P.1.K-1.ce} new candidate boundary position sequences, and therefore M_S_{P.1.K-1.ce} scores, can be obtained from the first-class candidate boundary position sequence containing the candidate boundary position that indicates S_{P.1.K-1.ce}. Assuming the number of first-class candidate boundary position sequences is B, the number M_total_1 of all the new candidate boundary position sequences obtained based on the candidate ending sentences of the (K-1)th valid segment indicated by the candidate boundary positions in the first-class candidate boundary position sequences is:
M_total_1 = \sum_{P=1}^{B} M_S_{P.1.K-1.ce}
In this embodiment of the present application, the B new candidate boundary position sequences with the highest scores may be selected from the M_total_1 new candidate boundary position sequences as the B second-class candidate boundary position sequences. B may be any value between 1 and 5.
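Steps S301 to S303 amount to one expansion step of a beam search with beam width B. The sketch below is illustrative only: `next_prob` is a hypothetical stand-in for the model that computes, from the spliced text features, the probability that a later sentence is the next starting sentence, and `mean_log_prob` is an assumed stand-in for formula (1), whose body is not reproduced in this text.

```python
import heapq
import math
from typing import Callable, List, Tuple

# A beam entry is (boundary positions, probability of each boundary position).
Beam = Tuple[List[int], List[float]]


def mean_log_prob(probs: List[float]) -> float:
    """ASSUMED stand-in for formula (1)."""
    return sum(math.log(p) for p in probs) / len(probs)


def expand_beams(
    beams: List[Beam],
    num_sentences: int,
    next_prob: Callable[[int, int], float],
    beam_width: int,
) -> List[Beam]:
    """Extend every sequence by each later sentence, then keep the top `beam_width`."""
    new_beams: List[Beam] = []
    for positions, probs in beams:
        last = positions[-1]  # candidate ending sentence of the (K-1)th segment
        for idx in range(last + 1, num_sentences):  # S301: every later sentence
            p = next_prob(last, idx)
            new_beams.append((positions + [idx], probs + [p]))  # S302
    # S303: keep the highest-scoring new sequences.
    return heapq.nlargest(beam_width, new_beams, key=lambda b: mean_log_prob(b[1]))


def toy_prob(last: int, idx: int) -> float:
    """Hypothetical model output: nearer sentences are more likely."""
    return 1.0 / (idx - last + 1)


first_class = [([0, 2], [0.9, 0.8]), ([0, 3], [0.9, 0.6])]
m_total_1 = sum(6 - 1 - positions[-1] for positions, _ in first_class)
second_class = expand_beams(first_class, 6, toy_prob, beam_width=2)
print(m_total_1)                                   # 5
print([positions for positions, _ in second_class])  # [[0, 2, 3], [0, 3, 4]]
```

With six sentences, the two first-class sequences yield 3 + 2 = 5 new sequences, of which the two best are retained.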
In an optional embodiment, one implementation manner of the above obtaining several candidate boundary position sequences according to the text features of each sentence in the text may be:
after a second type of candidate boundary position sequence is obtained according to the text characteristics of each sentence in the text, a first type of candidate boundary position sequence is obtained based on the second type of candidate boundary position sequence.
Here, each candidate boundary position in a second-class candidate boundary position sequence indicates a candidate starting sentence or a candidate ending sentence of one of the 1st to (K-1)th valid segments in the text, or indicates a candidate starting sentence of the K-th valid segment in the text; each candidate boundary position in a first-class candidate boundary position sequence indicates a candidate starting sentence or a candidate ending sentence of one of the 1st to K-th valid segments in the text.
That is to say, in the embodiment of the present application, the candidate starting sentences and the candidate ending sentences are predicted one by one for each valid segment, and each valid segment predicts the candidate starting sentence first and then predicts the candidate ending sentence, and after the candidate starting sentence of the kth valid segment is predicted, the candidate ending sentence of the kth valid segment can be predicted.
In an optional embodiment, an implementation flowchart of the obtaining of the first class candidate boundary position sequence based on the second class candidate boundary position sequence is shown in fig. 4, and may include:
Step S401: For the candidate starting sentence of the K-th valid segment indicated by a candidate boundary position in each second-class candidate boundary position sequence, calculate the probability that each sentence located after the candidate starting sentence in the text is the ending sentence of the K-th valid segment, according to the text features of the candidate starting sentence and the text features of each sentence located after the candidate starting sentence in the text.
For convenience of description, the candidate starting sentence of the K-th valid segment indicated by the candidate boundary position in the P-th second-class candidate boundary position sequence is denoted as S_{P.2.K.cs}. The candidate starting sentences of the K-th valid segment indicated by the candidate boundary positions in different second-class candidate boundary position sequences are different.
For each sentence located after the candidate starting sentence S_{P.2.K.cs} in the text, the text features of the sentence and the text features of the candidate starting sentence S_{P.2.K.cs} may be spliced, and the probability that the sentence is the ending sentence of the K-th valid segment is calculated using the spliced text features.
The second-class candidate boundary position sequences in this step are the second-class candidate boundary position sequences obtained based on the embodiment shown in fig. 3.
Step S402: For each sentence after the candidate starting sentence S_{P.2.K.cs}, based on the second-class candidate boundary position sequence containing the candidate boundary position that indicates S_{P.2.K.cs}, and on the probability that the sentence is the ending sentence of the K-th valid segment, calculate the score of the new candidate boundary position sequence obtained by adding the sentence to that second-class candidate boundary position sequence.
The score of the new candidate boundary position sequence obtained by adding the sentence to the second-class candidate boundary position sequence containing the candidate boundary position that indicates S_{P.2.K.cs} can be calculated using formula (1). In this case, N in the formula is the number of candidate boundary positions in the new candidate boundary position sequence, which is 2K, and p(S_j) is the probability corresponding to the j-th (j = 1, 2, 3, ..., 2K) candidate boundary position in the new candidate boundary position sequence.
Step S403: Determine the first-class candidate boundary position sequences among all the new candidate boundary position sequences according to the scores of all the new candidate boundary position sequences obtained based on the candidate starting sentences of the K-th valid segment indicated by the candidate boundary positions in the second-class candidate boundary position sequences.
The first-class candidate boundary position sequence determined at this time is different from the first-class candidate boundary position sequence shown in the embodiment shown in fig. 3, the first-class candidate boundary position sequence determined at this time indicates candidate start sentences and candidate end sentences of respective ones of the first K effective segments, and the first-class candidate boundary position sequence in the embodiment shown in fig. 3 indicates only candidate start sentences and candidate end sentences of respective ones of the first K-1 effective segments.
After the first-class candidate boundary position sequences are obtained based on the embodiment shown in fig. 4, the second-class candidate boundary position sequences may be obtained again; at this time, each boundary position in a second-class candidate boundary position sequence indicates a candidate starting sentence or a candidate ending sentence of one of the 1st to K-th valid segments in the text, or indicates a candidate starting sentence of the (K+1)th valid segment in the text. Thereafter, the first-class candidate boundary position sequences may be obtained again; at this time, each boundary position in a first-class candidate boundary position sequence indicates a candidate starting sentence or a candidate ending sentence of one of the 1st to (K+1)th valid segments in the text. And so on, until the starting sentence predicted for a valid segment is the end identifier of the text.
That is to say, in the embodiment of the present application, the first class candidate boundary position sequence and the second class candidate boundary position sequence are alternately updated, after the first class candidate boundary position sequence is obtained, the second class candidate boundary position sequence is obtained by using the obtained first class candidate boundary position sequence, after the second class candidate boundary position sequence is obtained, a new first class candidate boundary position sequence is obtained by using the obtained second class candidate boundary position sequence, then, the second class candidate boundary position sequence is obtained by using the new first class candidate boundary position sequence, and so on.
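The alternation and its stopping condition can be sketched in a simplified greedy form, which corresponds to keeping a single sequence (B = 1) rather than several candidates. The callables `start_prob` and `end_prob` are hypothetical stand-ins for the model outputs, and using the index `num_sentences` as the end identifier of the text is an assumption made for illustration.

```python
from typing import Callable, List


def decode_boundaries(
    num_sentences: int,
    start_prob: Callable[[int, int], float],
    end_prob: Callable[[int, int], float],
) -> List[int]:
    """Alternate between predicting a starting sentence and an ending sentence."""
    end_id = num_sentences  # assumed index of the end identifier
    boundaries: List[int] = []
    previous_end = -1
    while True:
        # Start of the K-th segment: any sentence after the (K-1)th ending
        # sentence, or the end identifier.
        start_candidates = list(range(previous_end + 1, num_sentences)) + [end_id]
        start = max(start_candidates, key=lambda i: start_prob(previous_end, i))
        if start == end_id:
            return boundaries
        # End of the K-th segment: any sentence after its starting sentence.
        end_candidates = range(start + 1, num_sentences)
        if not end_candidates:
            return boundaries  # no sentence left to close the segment
        end = max(end_candidates, key=lambda i: end_prob(start, i))
        boundaries.extend([start, end])
        previous_end = end


gold_start = {-1: 1, 2: 4, 5: 6}  # previous ending sentence -> next start
gold_end = {1: 2, 4: 5}           # starting sentence -> ending sentence


def toy_start(previous_end: int, idx: int) -> float:
    return 0.9 if gold_start.get(previous_end) == idx else 0.1


def toy_end(start: int, idx: int) -> float:
    return 0.9 if gold_end.get(start) == idx else 0.1


boundaries = decode_boundaries(6, toy_start, toy_end)
print(boundaries)  # [1, 2, 4, 5]
```

Replacing the two `max` calls with a top-B selection over scored sequences recovers the multi-candidate procedure described above.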
In the foregoing embodiments, the obtained target text may contain redundant information, such as repeated information, which also reduces the readability of the target text. Although the redundant information can be deleted manually, manual deletion is inefficient. Therefore, to further improve the efficiency of text normalization, the text processing method provided in the embodiment of the present application further includes:
determining redundant sentences in the target text, where a redundant sentence is a repeated sentence in the target text; for example, if there are L identical sentences in the target text, L-1 of the L identical sentences are redundant sentences; and
deleting the redundant sentences in the target text.
In the embodiment, the redundant sentences in the target text are automatically determined and deleted, so that the automatic normalization of the target text is realized, and the text normalization efficiency is further improved.
In an alternative embodiment, a flowchart of an implementation of the above determining a redundant sentence in a target text is shown in fig. 5, and may include:
Step S501: For each sentence in the target text, acquire the probability that the sentence is a redundant sentence; the probability that the sentence is a redundant sentence is calculated based on the text features of the sentence and the text features of the starting sentence of the valid segment in which the sentence is located.
The text features of the sentence and the text features of the initial sentence of the effective segment in which the sentence is located can be spliced, and the probability that the sentence belongs to the redundant sentence is calculated by using the spliced text features.
In the embodiment of the application, the probability that all sentences in the text belong to the redundant sentences can be obtained in advance, and then the probability that each sentence in the target text belongs to the redundant sentence can be directly read after the target text is obtained, or the probability that each sentence in the target text belongs to the redundant sentence can be calculated after the target text is obtained.
Step S502: If the probability that the sentence is a redundant sentence is greater than the probability threshold, determine that the sentence is a redundant sentence.
If the probability that the sentence belongs to the redundant sentence is less than or equal to the probability threshold, it is determined that the sentence does not belong to the redundant sentence.
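Steps S501 and S502, followed by the deletion, can be sketched as below. The per-sentence probabilities are taken as given inputs here, and the threshold value 0.5 is an assumption; the application does not state a specific threshold.

```python
from typing import List


def remove_redundant(
    sentences: List[str],
    redundancy_probs: List[float],
    threshold: float = 0.5,  # assumed value, not given in the application
) -> List[str]:
    """Drop every sentence whose redundancy probability exceeds the threshold."""
    if len(sentences) != len(redundancy_probs):
        raise ValueError("one probability is required per sentence")
    # A probability equal to the threshold does not mark the sentence as redundant.
    return [s for s, p in zip(sentences, redundancy_probs) if p <= threshold]


target_text = ["a", "b", "b again", "c"]
cleaned = remove_redundant(target_text, [0.1, 0.2, 0.9, 0.5])
print(cleaned)  # ['a', 'b', 'c']
```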
In an alternative embodiment, for each sentence in the text, an implementation flowchart for obtaining the text feature of the sentence is shown in fig. 6, and may include:
Step S601: Acquire the word vector of each word in the sentence and the position code of each word in the sentence.
Optionally, the position code of a word in the sentence may be a sinusoidal position code of the word in the sentence.
Step S602: Obtain the characterization vector of the sentence according to the word vector and the position code of each word in the sentence.
Optionally, the word vector and the position code corresponding to the same word may be added to obtain an initial code of the word, and the initial codes of the words in the sentence are processed to obtain context-dependent word representations of the words in the sentence.
The word representations of the words in the sentence may be spliced, and the spliced vector is used as the characterization vector of the sentence.
Or,
the word representations of the words in the sentence may be spliced, and the spliced vector is compressed to obtain a vector of preset length as the characterization vector of the sentence.
Step S603: Acquire the position code of the sentence in the text. The position code of the sentence in the text may be a sinusoidal position code of the sentence in the text.
Step S604: Obtain the text features of the sentence according to the characterization vector of the sentence and the position code of the sentence.
The characterization vector of the sentence and the position code of the sentence may be added to obtain the initial text features of the sentence, and the initial text features of the sentence are processed to obtain the context-dependent text features of the sentence.
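The addition of a vector and its position code, used at both the word level and the sentence level, can be sketched as follows. The application only says the code may be sinusoidal; the specific interleaved sine and cosine form with base 10000, familiar from the Transformer literature, is an assumption here, and the function names are hypothetical.

```python
import math
from typing import List


def sinusoidal_encoding(position: int, dim: int) -> List[float]:
    """Transformer-style sinusoidal position code (ASSUMED variant; dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    encoding: List[float] = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding


def add_position(vector: List[float], position: int) -> List[float]:
    """Add the position code element-wise to a word vector or a sentence vector."""
    code = sinusoidal_encoding(position, len(vector))
    return [v + c for v, c in zip(vector, code)]


word_vector = [0.5, -0.5, 0.25, 0.0]
initial_code = add_position(word_vector, 0)
print(initial_code)  # [0.5, 0.5, 0.25, 1.0]
```

The same `add_position` call applies to a characterization vector and the position of its sentence in the text; the subsequent context-dependent processing is left to the encoder and is not shown.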
In an optional embodiment, one implementation manner of processing each sentence in the text according to the text feature of each sentence in the text to obtain the boundary position sequence may be:
processing each sentence in the text according to the text features of each sentence in the text by using a text processing model to obtain the boundary position sequence.
In an alternative embodiment, a schematic structural diagram of the text processing model provided in the embodiment of the present application is shown in fig. 7, and may include:
an encoding module 701 and a decoding module 702; wherein,
the encoding module 701 is configured to encode each sentence in the text to obtain a text feature of each sentence. The specific encoding process can refer to the foregoing embodiments, and is not described in detail here.
The decoding module 702 is configured to process each sentence in the text according to a text feature of each sentence in the text, so as to obtain a boundary position sequence. The specific processing manner can be referred to the foregoing embodiments, and is not described in detail here.
In the embodiment of the present application, the text processing model may adopt a cascaded Transformer structure. Compared with an LSTM structure, the Transformer structure calculates the relationship between the word or sentence at the current time and all other words or sentences through a self-attention mechanism; that is, it can see the global context, which is very important for understanding the subject of the text and for judging whether the information in the current sentence is relevant. When the relationships between words are modeled directly, the computational complexity of the self-attention mechanism is O(c^2), where c is the number of words in the text, which is unacceptable for long documents. Therefore, this scheme adopts a cascaded Transformer structure, that is, a word-level Transformer encoder and a sentence-level Transformer encoder are cascaded. The computational complexity of self-attention is then O(m^2 + n^2), where m is the number of words in a sentence (which may be the average number of words over all sentences in the text) and n is the number of sentences in the text, which greatly reduces the computational complexity.
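A small numeric comparison makes the difference concrete. The sketch simply evaluates the two expressions stated above, taking c as n times m; the document size (500 sentences of 20 words each) is a made-up example.

```python
def flat_cost(num_sentences: int, words_per_sentence: int) -> int:
    """c^2, where c = n * m is the total number of words in the text."""
    c = num_sentences * words_per_sentence
    return c ** 2


def cascaded_cost(num_sentences: int, words_per_sentence: int) -> int:
    """m^2 + n^2, the expression stated for the cascaded structure."""
    return words_per_sentence ** 2 + num_sentences ** 2


n, m = 500, 20  # hypothetical document: 500 sentences, 20 words each
print(flat_cost(n, m))      # 100000000
print(cascaded_cost(n, m))  # 250400
```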
In an alternative embodiment, the training process of the text processing model may include:
A first-type text sample is input into the text processing model to obtain the boundary position sequence corresponding to the first-type text sample output by the text processing model. Each boundary position in the boundary position sequence corresponding to the first-type text sample indicates a starting sentence or an ending sentence of one valid segment in the first-type text sample, where the starting sentence of the K-th valid segment in the first-type text sample is determined based on the ending sentence of the (K-1)th valid segment in the first-type text sample, and the ending sentence of the K-th valid segment in the first-type text sample is determined based on the starting sentence of the K-th valid segment in the first-type text sample.
In an alternative embodiment, the ending sentence of the (K-1)th valid segment used for determining the starting sentence of the K-th valid segment in the first-type text sample may be predicted by the text processing model, and the starting sentence of the K-th valid segment used for determining the ending sentence of the K-th valid segment may be predicted by the text processing model.
Or,
the end sentence of the K-1 th valid segment used when determining the start sentence of the K-th valid segment in the first type of sample text may be determined according to the boundary position sequence tag corresponding to the first type of sample text, and the start sentence of the K-th valid segment used when determining the end sentence of the K-th valid segment may be determined according to the boundary position sequence tag corresponding to the first type of sample text.
The parameters of the text processing model are updated with the goal of making the boundary position sequence corresponding to the first-type text sample approach the boundary position sequence label corresponding to the first-type text sample.
The difference between the boundary position sequence corresponding to the first type of text sample and the boundary position sequence label corresponding to the first type of text sample can be calculated by using a cross entropy loss function, a back propagation gradient is obtained based on the difference, and the parameters of the text processing model are updated based on the back propagation gradient.
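The cross-entropy difference can be sketched without any training framework. The sketch assumes that, at each decoding step, the model outputs a probability distribution over candidate sentences and that the label gives the index of the correct sentence for that step; this data layout and the function name are illustrative assumptions. Back propagation and the parameter update would be handled by an automatic differentiation framework and are not shown.

```python
import math
from typing import List


def boundary_cross_entropy(
    step_distributions: List[List[float]],
    label_positions: List[int],
) -> float:
    """Mean negative log-likelihood of the labelled boundary positions."""
    if len(step_distributions) != len(label_positions):
        raise ValueError("one label is required per decoding step")
    losses = [
        -math.log(distribution[label])
        for distribution, label in zip(step_distributions, label_positions)
    ]
    return sum(losses) / len(losses)


# Two decoding steps over three candidate sentences each.
predicted = [[0.1, 0.7, 0.2], [0.2, 0.2, 0.6]]
labels = [1, 2]
loss = boundary_cross_entropy(predicted, labels)
print(round(loss, 4))
```

The loss falls to zero only when every labelled position receives probability 1, which is the sense in which the predicted sequence approaches its label.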
In an alternative embodiment, the first type text sample may be obtained by inserting invalid segments into valid text, or the first type text sample is originally acquired text containing valid segments and invalid segments. Based on this, the training data set for training the text processing model may only include the first type of text samples obtained by inserting the invalid segments into the valid text, or may only include the originally acquired first type of text samples including the valid segments and the invalid segments, or may include both the first type of text samples obtained by inserting the invalid segments into the valid text and the originally acquired first type of text samples including the valid segments and the invalid segments.
In an alternative embodiment, the first-type text sample is obtained by inserting invalid segments and redundant segments into valid text, or the first-type text sample is originally acquired text containing valid segments, invalid segments, and redundant segments. Based on this, the training data set for training the text processing model may include only first-type text samples obtained by inserting invalid segments and redundant segments into valid text, or may include only originally acquired first-type text samples containing valid segments, invalid segments, and redundant segments, or may include both kinds of first-type text samples.
In an alternative embodiment, one implementation manner of inserting the invalid segments into the valid text may be as follows:
M positions are randomly selected in the valid text as M insertion positions. That is, in the embodiment of the present application, M invalid segments are inserted into the valid text. Optionally, an insertion position may be the position of a sentence break identifier, and may be before or after the sentence break identifier. By way of example, M may be any value between 1 and 6. However, this is merely an example and does not limit the present application.
M continuous segments are randomly selected from an open corpus as the M invalid segments. The open corpus may be an open-source open-domain chat corpus, such as short message chat records; it may contain corpora from various fields, is usually common everyday conversation data, and belongs to a different field from the valid text.
The M invalid segments are inserted into the M insertion positions in one-to-one correspondence. That is, one invalid segment is inserted at each insertion position, and different invalid segments are inserted at different insertion positions.
After the invalid segments are inserted into the valid text, the starting sentence and the ending sentence of each valid segment may be marked; for example, the starting sentence of a valid segment is marked as 1 and the ending sentence as 2. Once the valid segments are marked completely, the inserted invalid segments are excluded accordingly, so the inserted invalid segments do not need to be marked.
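The construction of such a training sample can be sketched as follows. For simplicity the sketch returns the label as a flat list of boundary indices (start, end, start, end) rather than per-sentence marks of 1 and 2; the two forms carry the same information. The invalid segments are passed in directly instead of being sampled from a corpus, and the function name is hypothetical.

```python
import random
from typing import List, Tuple


def insert_invalid_segments(
    valid: List[str],
    invalid_segments: List[List[str]],
    rng: random.Random,
) -> Tuple[List[str], List[int]]:
    """Insert each invalid segment at a distinct random sentence boundary.

    Returns the mixed text and the boundary label
    [start_1, end_1, start_2, end_2, ...] of the resulting valid segments.
    """
    # Position i means "before valid[i]"; len(valid) means "after the last sentence".
    positions = sorted(rng.sample(range(len(valid) + 1), len(invalid_segments)))
    pending = dict(zip(positions, invalid_segments))
    mixed: List[str] = []
    is_valid: List[bool] = []
    for i in range(len(valid) + 1):
        if i in pending:
            mixed.extend(pending[i])
            is_valid.extend([False] * len(pending[i]))
        if i < len(valid):
            mixed.append(valid[i])
            is_valid.append(True)
    boundaries: List[int] = []
    for idx, flag in enumerate(is_valid):
        if flag and (idx == 0 or not is_valid[idx - 1]):
            boundaries.append(idx)  # starting sentence of a valid segment
        if flag and (idx == len(is_valid) - 1 or not is_valid[idx + 1]):
            boundaries.append(idx)  # ending sentence of a valid segment
    return mixed, boundaries


valid_text = [f"v{i}" for i in range(6)]
noise = [["n0", "n1"], ["n2"]]
mixed, boundaries = insert_invalid_segments(valid_text, noise, random.Random(0))
print(mixed)
print(boundaries)
```

Reading the mixed text back through the returned boundaries recovers the original valid text, which is the property the labels are meant to encode.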
In an alternative embodiment, one implementation manner of inserting the redundant segments into the valid text may be as follows:
Q positions are randomly selected in the valid text as Q insertion positions. That is, in the embodiment of the present application, Q redundant segments are inserted into the valid text. Optionally, an insertion position may be the position of a sentence break identifier, and may be before or after the sentence break identifier. By way of example, Q may be any value between 1 and 6. However, this is merely an example and does not limit the present application.
For each insertion position, a continuous segment of preset length is selected in the area adjacent to the insertion position. Optionally, a continuous valid segment of preset length may be selected for copying within a window of preset size before the insertion position. For example, 1 to 10 consecutive sentences are selected from the 20 sentences before the insertion position.
Noise is added to the selected continuous segment of preset length, and the result is inserted at the insertion position. The noise may be added by deleting, replacing, or inserting low-information words according to TF-IDF (Term Frequency-Inverse Document Frequency), or by using the EDA (Easy Data Augmentation) data augmentation method (i.e., deletion, insertion, and synonym replacement on the text segment).
The segment obtained by adding noise to the selected continuous segment of preset length is the redundant segment.
In general, the insertion positions of the invalid segments and the redundant segments in the valid text are different. Because a redundant segment is related to the information of the valid segment, is interspersed within the valid segment, and does not affect the continuity of valid segment prediction, the starting sentence and the ending sentence of the redundant segment need not be marked; instead, only each sentence in the redundant segment is marked as redundant, for example, each sentence in the redundant segment is marked as 3.
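The insertion of one redundant segment can be sketched as below. The noise function is a deliberately crude stand-in that drops one random word; it does not model the TF-IDF or EDA choices described above. The window size 20 and maximum length 10 follow the example in the text, while the function names and the label layout (one integer per sentence) are illustrative assumptions.

```python
import random
from typing import List, Tuple

REDUNDANT = 3  # mark used for each sentence of a redundant segment


def add_noise(sentence: str, rng: random.Random) -> str:
    """Stand-in noise: drop one random word (TF-IDF and EDA are not modelled)."""
    words = sentence.split()
    if len(words) > 1:
        del words[rng.randrange(len(words))]
    return " ".join(words)


def insert_redundant_segment(
    sentences: List[str],
    labels: List[int],
    position: int,
    rng: random.Random,
    window: int = 20,
    max_len: int = 10,
) -> Tuple[List[str], List[int]]:
    """Copy a noisy run of sentences from the window before `position` and insert it there."""
    if not 0 < position <= len(sentences):
        raise ValueError("position must have at least one sentence before it")
    window_start = max(0, position - window)
    length = rng.randint(1, min(max_len, position - window_start))
    start = rng.randint(window_start, position - length)
    copied = [add_noise(s, rng) for s in sentences[start:start + length]]
    new_sentences = sentences[:position] + copied + sentences[position:]
    new_labels = labels[:position] + [REDUNDANT] * length + labels[position:]
    return new_sentences, new_labels


text = ["alpha beta gamma", "delta epsilon", "zeta eta theta"]
marks = [0, 0, 0]
new_text, new_marks = insert_redundant_segment(text, marks, 2, random.Random(1))
print(new_text)
print(new_marks)
```

Calling the function Q times, once per insertion position and from the last position to the first so that earlier indices stay valid, would produce a sample with Q redundant segments.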
In an alternative embodiment, another training process of the text processing model may include:
A first-type text sample is input into the text processing model to obtain the boundary position sequence corresponding to the first-type text sample output by the text processing model. Each boundary position in the boundary position sequence corresponding to the first-type text sample indicates a starting sentence or an ending sentence of one valid segment in the first-type text sample, where the starting sentence of the K-th valid segment in the first-type text sample is determined based on the ending sentence of the (K-1)th valid segment in the first-type text sample, and the ending sentence of the K-th valid segment in the first-type text sample is determined based on the starting sentence of the K-th valid segment in the first-type text sample.
The parameters of the text processing model are updated with the goal of making the boundary position sequence corresponding to the first-type text sample approach the boundary position sequence label corresponding to the first-type text sample, so as to obtain an initial text processing model; the first-type text sample is obtained by inserting at least invalid segments into valid text.
In an alternative embodiment, the ending sentence of the (K-1)th valid segment used for determining the starting sentence of the K-th valid segment in the first-type text sample may be predicted by the text processing model, and the starting sentence of the K-th valid segment used for determining the ending sentence of the K-th valid segment may be predicted by the text processing model.
Or,
the ending sentence of the (K-1)th valid segment used when determining the starting sentence of the Kth valid segment in the first type of text sample may be determined according to the boundary position sequence label corresponding to the first type of text sample, and the starting sentence of the Kth valid segment used when determining the ending sentence of the Kth valid segment may be determined according to the boundary position sequence label corresponding to the first type of text sample.
In an alternative embodiment, the first type of text sample may be obtained by inserting invalid segments into valid text, or the first type of text sample may be obtained by inserting invalid segments and redundant segments into valid text.
The difference between the boundary position sequence corresponding to the first type of text sample and the boundary position sequence label corresponding to the first type of text sample can be calculated by using a cross entropy loss function, a back propagation gradient is obtained based on the difference, and the parameters of the text processing model are updated based on the back propagation gradient to obtain an initial text processing model.
The second type of text sample is input into the initial text processing model to obtain the boundary position sequence corresponding to the second type of text sample output by the initial text processing model; the second type of text sample is originally collected text containing at least valid segments and invalid segments.
Optionally, if the first type of text sample is obtained by inserting invalid segments into the valid text, the second type of text sample may be originally collected text containing valid segments and invalid segments.
If the first type of text sample is obtained by inserting invalid segments and redundant segments into the valid text, the second type of text sample may be originally collected text containing valid segments, invalid segments, and redundant segments.
And updating the parameters of the initial text processing model by taking the boundary position sequence corresponding to the second type of text sample approaching to the boundary position sequence label corresponding to the second type of text sample as a target.
The difference between the boundary position sequence corresponding to the second type of text sample and the boundary position sequence label corresponding to the second type of text sample can be calculated by using a cross entropy loss function, a back propagation gradient is obtained based on the difference, and the parameters of the initial text processing model are updated based on the back propagation gradient to obtain a final text processing model.
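The two-stage procedure above (first fit on constructed samples, then continue from the resulting initial model on originally collected samples) can be sketched with a deliberately tiny stand-in model. Everything below is hypothetical: a one-feature logistic classifier replaces the text processing model, the feature is an illustrative per-sentence score, and full-batch gradient descent stands in for whatever optimizer is actually used. Only the cross-entropy loss and the order of the two stages come from the description.

```python
import math


def predict(params, x):
    """Probability that a sentence with feature x is a boundary sentence."""
    w, b = params
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def cross_entropy(params, samples):
    """Mean cross-entropy between predictions and 0/1 labels."""
    total = 0.0
    for x, y in samples:
        p = predict(params, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(samples)


def update(params, samples, lr=0.5, steps=200):
    """Full-batch gradient descent on the cross-entropy loss."""
    w, b = params
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in samples:
            err = predict((w, b), x) - y  # d(loss)/d(logit)
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / len(samples)
        b -= lr * grad_b / len(samples)
    return (w, b)


# (feature, label) pairs; both data sets are made up for illustration.
first_type_samples = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]   # constructed
second_type_samples = [(0.3, 0), (0.4, 0), (0.6, 1), (0.7, 1)]  # collected

untrained = (0.0, 0.0)
initial_model = update(untrained, first_type_samples)        # stage 1
final_model = update(initial_model, second_type_samples)     # stage 2
```

The second stage starts from the parameters of the initial model rather than from scratch, which is the point of the two-stage scheme: constructed samples are cheap to produce, and the collected samples then adapt the model to real text.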
In an alternative embodiment, the processing each sentence in the text according to the text feature of each sentence in the text to obtain the boundary position sequence, and the process of determining the redundant sentence in the text may include:
and processing each sentence in the text according to the text characteristics of each sentence in the text by using a text processing model to obtain a boundary position sequence and a redundant sentence identification result.
In an alternative embodiment, the training process of the text processing model may include:
and inputting the first type of text sample into a text processing model to obtain a boundary position sequence corresponding to the first type of text sample output by the text processing model and a redundant sentence recognition result in the first type of text sample. Each boundary position in the boundary position sequence corresponding to the first type of text sample indicates a starting sentence or an ending sentence of one effective segment in the first type of text sample, wherein the starting sentence of the Kth effective segment in the first type of text sample is determined based on the ending sentence of the Kth-1 effective segment in the first type of text sample, and the ending sentence of the Kth effective segment in the first type of text sample is determined based on the starting sentence of the Kth effective segment in the first type of text sample.
Optionally, the ending sentence of the (K-1)th valid segment used when determining the starting sentence of the Kth valid segment in the first type of text sample may be predicted by the text processing model, and the starting sentence of the Kth valid segment used when determining the ending sentence of the Kth valid segment may be predicted by the text processing model.
Or,
the end sentence of the K-1 th valid segment used when determining the start sentence of the K-th valid segment in the first type of sample text may be determined according to the boundary position sequence tag corresponding to the first type of sample text, and the start sentence of the K-th valid segment used when determining the end sentence of the K-th valid segment may be determined according to the boundary position sequence tag corresponding to the first type of sample text.
The parameters of the text processing model are updated with the targets that the boundary position sequence corresponding to the first type of text sample approaches the boundary position sequence label corresponding to the first type of text sample, and that the redundant sentence identification result corresponding to the first type of text sample approaches the redundant sentence label corresponding to the first type of text sample.
Specifically, a cross-entropy loss function may be used to calculate a first difference between the boundary position sequence corresponding to the first type of text sample and the boundary position sequence label corresponding to the first type of text sample, and a second difference between the redundant sentence identification result corresponding to the first type of text sample and the redundant sentence label corresponding to the first type of text sample; a back-propagation gradient is obtained based on the first difference and the second difference, and the parameters of the text processing model are updated based on the back-propagation gradient.
The first type of text sample is text obtained by inserting invalid segments and redundant segments into the valid text; alternatively, the first type of text sample is originally collected text containing valid segments, invalid segments and redundant segments. Accordingly, the training data set for training the text processing model may include only first-type text samples obtained by inserting invalid segments and redundant segments into valid text, only originally collected first-type text samples containing valid segments, invalid segments and redundant segments, or both.
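One possible way to combine the two differences into a single objective is a weighted sum. The description only states that the gradient is obtained based on both differences, so the summation, the weight $\lambda$, and all symbols below are assumptions introduced for illustration: $X$ is the input text sample, $b_t^{*}$ is the $t$-th labelled boundary position, $r_i \in \{0, 1\}$ is the redundant sentence label of sentence $i$, and $q_i$ is the predicted probability that sentence $i$ is redundant.

```latex
\mathcal{L}_{\text{boundary}} = -\sum_{t=1}^{T} \log p_\theta\left(b_t^{*} \mid X\right),
\qquad
\mathcal{L}_{\text{redundant}} = -\sum_{i=1}^{N} \left[ r_i \log q_i + (1 - r_i) \log (1 - q_i) \right],
\qquad
\mathcal{L} = \mathcal{L}_{\text{boundary}} + \lambda \, \mathcal{L}_{\text{redundant}}
```

With $\lambda = 1$ the two cross-entropy terms contribute equally to the back-propagation gradient.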
In an alternative embodiment, another training process of the text processing model may include:
and inputting the first type of text sample into a text processing model to obtain a boundary position sequence corresponding to the first type of text sample output by the text processing model and a redundant sentence recognition result in the first type of text sample.
And updating the parameters of the text processing model by taking the boundary position sequence corresponding to the first type of text sample approaching to the boundary position sequence label corresponding to the first type of text sample and the redundant sentence identification result corresponding to the first type of text approaching to the redundant sentence label corresponding to the first type of text as targets to obtain the initial text processing model. Wherein the first type text sample is obtained by inserting invalid segments and redundant segments into the valid text.
The second type of text sample is input into the initial text processing model to obtain the boundary position sequence corresponding to the second type of text sample output by the initial text processing model and the redundant sentence recognition result in the second type of text sample. The second type of text sample is originally collected text containing valid segments, invalid segments and redundant segments.
The parameters of the initial text processing model are updated to obtain a final text processing model, with the targets that the boundary position sequence corresponding to the second type of text sample approaches the boundary position sequence label corresponding to the second type of text sample, and that the redundant sentence identification result corresponding to the second type of text sample approaches the redundant sentence label corresponding to the second type of text sample.
In an alternative embodiment, the processing each sentence in the text according to the text feature of each sentence in the text to obtain the boundary position sequence, and the process of determining the redundant sentence in the text may include:
processing each sentence in the text according to the text characteristics of each sentence in the text by using a text processing model to obtain a boundary position sequence, acquiring the effective segments in the text based on the boundary position sequence to form a target text, and performing redundant sentence recognition on the target text to obtain a redundant sentence recognition result.
In an alternative embodiment, the training process of the text processing model may include:
inputting the first type of text sample into a text processing model to obtain a boundary position sequence corresponding to the first type of text sample output by the text processing model, acquiring effective segments in the first type of text sample based on the boundary position sequence to form a target text, and performing redundant sentence recognition on the target text to obtain a redundant sentence recognition result. Each boundary position in the boundary position sequence corresponding to the first type of text sample indicates a starting sentence or an ending sentence of one effective segment in the first type of text sample, wherein the starting sentence of the Kth effective segment in the first type of text sample is determined based on the ending sentence of the Kth-1 effective segment in the first type of text sample, and the ending sentence of the Kth effective segment in the first type of text sample is determined based on the starting sentence of the Kth effective segment in the first type of text sample.
Optionally, the ending sentence of the (K-1)th valid segment used when determining the starting sentence of the Kth valid segment in the first type of text sample may be predicted by the text processing model, and the starting sentence of the Kth valid segment used when determining the ending sentence of the Kth valid segment may be predicted by the text processing model.
Or,
the end sentence of the K-1 th valid segment used when determining the start sentence of the K-th valid segment in the first type of sample text may be determined according to the boundary position sequence tag corresponding to the first type of sample text, and the start sentence of the K-th valid segment used when determining the end sentence of the K-th valid segment may be determined according to the boundary position sequence tag corresponding to the first type of sample text.
The parameters of the text processing model are updated with the targets that the boundary position sequence corresponding to the first type of text sample approaches the boundary position sequence label corresponding to the first type of text sample, and that the redundant sentence identification result corresponding to the first type of text sample approaches the redundant sentence label corresponding to the first type of text sample.
Specifically, a cross-entropy loss function may be used to calculate a first difference between the boundary position sequence corresponding to the first type of text sample and the boundary position sequence label corresponding to the first type of text sample, and a second difference between the redundant sentence identification result corresponding to the first type of text sample and the redundant sentence label corresponding to the first type of text sample; a back-propagation gradient is obtained based on the first difference and the second difference, and the parameters of the text processing model are updated based on the back-propagation gradient.
The first type of text sample is text obtained by inserting invalid segments and redundant segments into the valid text; alternatively, the first type of text sample is originally collected text containing valid segments, invalid segments and redundant segments. Accordingly, the training data set for training the text processing model may include only first-type text samples obtained by inserting invalid segments and redundant segments into valid text, only originally collected first-type text samples containing valid segments, invalid segments and redundant segments, or both.
In an alternative embodiment, another training process of the text processing model may include:
inputting the first type of text sample into a text processing model to obtain a boundary position sequence corresponding to the first type of text sample output by the text processing model, acquiring effective segments in the first type of text sample based on the boundary position sequence to form a target text, and performing redundant sentence identification on the target text to obtain a redundant sentence identification result corresponding to the first type of text sample.
And updating the parameters of the text processing model by taking the boundary position sequence corresponding to the first type of text sample approaching to the boundary position sequence label corresponding to the first type of text sample and the redundant sentence identification result corresponding to the first type of text approaching to the redundant sentence label corresponding to the first type of text as targets to obtain the initial text processing model.
Inputting the second type of text sample into the initial text processing model to obtain a boundary position sequence corresponding to the second type of text sample output by the initial text processing model, acquiring effective segments in the second type of text sample based on the boundary position sequence to form a target text, and performing redundant sentence identification on the target text to obtain a redundant sentence identification result corresponding to the second type of text sample.
And updating the parameters of the initial text processing model to obtain a final text processing model by taking the aim that the boundary position sequence corresponding to the second type of text sample approaches to the boundary position sequence label corresponding to the second type of text sample and the redundant sentence identification result corresponding to the second type of text approaches to the redundant sentence label corresponding to the second type of text.
Corresponding to the method embodiment, an embodiment of the present application further provides a text processing apparatus. A schematic structural diagram of the text processing apparatus provided in the embodiment of the present application is shown in Fig. 8, and the text processing apparatus may include:
a boundary position sequence acquisition module 801 and a target text acquisition module 802; wherein,
the boundary position sequence obtaining module 801 is configured to process each sentence in the text according to a text feature of each sentence in the text to obtain a boundary position sequence; each boundary position in the sequence of boundary positions indicates a starting sentence or an ending sentence of one valid segment, wherein the starting sentence of the Kth valid segment is determined based on the ending sentence of the (K-1)th valid segment, and the ending sentence of the Kth valid segment is determined based on the starting sentence of the Kth valid segment; K is a positive integer;
the target text obtaining module 802 is configured to obtain valid segments in the text based on the boundary position sequence to form a target text.
The text processing device provided by the embodiment of the application processes each sentence in the text according to the text characteristics of each sentence in the text to obtain a boundary position sequence; each boundary position in the boundary position sequence indicates a starting sentence or an ending sentence of one effective segment, wherein the starting sentence of the Kth effective segment is determined based on the ending sentence of the Kth-1 effective segment, and the ending sentence of the Kth effective segment is determined based on the starting sentence of the Kth effective segment; and acquiring effective segments in the text based on the boundary position sequence to form the target text. The automatic extraction of effective fragments in the text is realized, and the efficiency of regularizing the text is improved.
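Forming the target text from a boundary position sequence can be sketched as follows. The encoding is an assumption made for illustration: the sequence is taken to be a flat list of 0-based sentence indices `[start_1, end_1, start_2, end_2, ...]` with inclusive ends, and the function name is hypothetical.

```python
def extract_target_text(sentences, boundary_positions):
    """Concatenate the valid segments delimited by the boundary positions.

    `boundary_positions` alternates starting and ending sentence indices,
    one pair per valid segment, in document order.
    """
    if len(boundary_positions) % 2 != 0:
        raise ValueError("every starting sentence needs an ending sentence")
    target = []
    starts = boundary_positions[0::2]
    ends = boundary_positions[1::2]
    for start, end in zip(starts, ends):
        if not 0 <= start <= end < len(sentences):
            raise ValueError(f"invalid segment ({start}, {end})")
        target.extend(sentences[start:end + 1])  # end index is inclusive
    return target


sentences = ["greeting", "point A", "more on A", "small talk",
             "point B", "more on B", "goodbye"]
target_text = extract_target_text(sentences, [1, 2, 4, 5])
```

Sentences outside every (start, end) pair, here the greeting, the small talk and the goodbye, are treated as invalid and dropped.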
In an alternative embodiment, the boundary position sequence obtaining module 801 includes:
the candidate boundary position sequence acquisition module is used for acquiring a plurality of candidate boundary position sequences according to the text characteristics of each sentence in the text;
the score calculation module is used for calculating the score of each candidate boundary position sequence, wherein the score of each candidate boundary position sequence represents the confidence of that candidate boundary position sequence, and the higher the score, the higher the confidence;
and the boundary position sequence determining module is used for taking the candidate boundary position sequence with the highest score as the boundary position sequence.
In an optional embodiment, the candidate boundary position sequence obtaining module may be specifically configured to:
after a first class candidate boundary position sequence is obtained according to the text features of each sentence in the text, a second class candidate boundary position sequence is obtained based on the first class candidate boundary position sequence;
each candidate boundary position in the first class of candidate boundary position sequences indicates a candidate starting sentence or a candidate ending sentence of one effective fragment from 1 st to K-1 st effective fragments in the text;
each boundary position in the second type of candidate boundary position sequence indicates a candidate starting sentence or a candidate ending sentence of one effective segment from 1 st to K-1 th effective segments in the text, or indicates a candidate starting sentence of the K-th effective segment in the text.
In an optional embodiment, when the candidate boundary position sequence obtaining module obtains the second type of candidate boundary position sequence based on the first type of candidate boundary position sequence, the candidate boundary position sequence obtaining module is specifically configured to:
for the candidate ending sentences of the K-1 th effective segment indicated by the candidate boundary positions in each first-class candidate boundary position sequence, calculating the probability that each sentence in the text after the candidate ending sentence belongs to the starting sentence of the K-th effective segment according to the text characteristics of the candidate ending sentences and the text characteristics of each sentence in the text after the candidate ending sentence;
for each sentence after the candidate ending sentence, calculating a score of a new candidate boundary position sequence obtained by adding the sentence into the first candidate boundary position sequence indicating the candidate boundary position of the candidate ending sentence according to the first candidate boundary position sequence indicating the candidate boundary position of the candidate ending sentence and the probability of the sentence belonging to the starting sentence of the Kth effective segment;
and determining the second-class candidate boundary position sequences in all the new candidate boundary position sequences according to the scores of all the new candidate boundary position sequences obtained based on the candidate end sentences of the K-1 th effective segment indicated by the candidate boundary positions in the first-class candidate boundary position sequences.
In an optional embodiment, the candidate boundary position sequence obtaining module may be specifically configured to:
after a second type of candidate boundary position sequence is obtained according to the text characteristics of each sentence in the text, a first type of candidate boundary position sequence is obtained based on the second type of candidate boundary position sequence;
each candidate boundary position in the first class of candidate boundary position sequences indicates a candidate starting sentence or a candidate ending sentence of one effective segment from 1 st to K effective segments in the text;
each boundary position in the second type of candidate boundary position sequence indicates a candidate starting sentence or a candidate ending sentence of one effective segment from 1 st to K-1 th effective segments in the text, or indicates a candidate starting sentence of the K-th effective segment in the text.
In an optional embodiment, when obtaining the first type of candidate boundary position sequence based on the second type of candidate boundary position sequence, the candidate boundary position sequence obtaining module may be specifically configured to:
for the candidate starting sentence of the Kth effective segment indicated by the candidate boundary position in each second-class candidate boundary position sequence, calculating the probability that each sentence in the text after the candidate starting sentence belongs to the ending sentence of the Kth effective segment according to the text characteristics of the candidate starting sentence and the text characteristics of each sentence in the text after the candidate starting sentence;
for each sentence after the candidate starting sentence, calculating a score of a new candidate boundary position sequence obtained by adding the sentence into the second candidate boundary position sequence indicating the candidate boundary position of the candidate starting sentence according to the second candidate boundary position sequence indicating the candidate boundary position of the candidate starting sentence and the probability of the sentence belonging to the ending sentence of the Kth effective segment;
and determining the first-class candidate boundary position sequence in all the new candidate boundary position sequences according to the scores of all the new candidate boundary position sequences obtained based on the candidate starting sentences of the Kth effective segment indicated by the candidate boundary positions in the second-class candidate boundary position sequences.
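The alternating expansion of candidate boundary position sequences described above resembles one step of a beam search, which can be sketched as follows. The sketch rests on assumptions not stated in the text: a sequence score is the sum of log-probabilities of its boundary positions, a fixed number of best sequences (`beam_size`) is kept, and `prob_fn` is a hypothetical stand-in for the model that scores a sentence given the previous boundary sentence.

```python
import math


def extend_candidates(candidates, num_sentences, prob_fn, beam_size):
    """Extend each candidate sequence by one boundary position.

    candidates: list of (positions, score), where positions is a list of
        sentence indices and score is the accumulated log-probability.
    prob_fn(last, j): probability that sentence j is the next boundary,
        given that the previous boundary is sentence `last`.
    Returns the `beam_size` highest-scoring extended sequences.
    """
    extended = []
    for positions, score in candidates:
        last = positions[-1]
        # Only sentences after the previous boundary are considered.
        for j in range(last + 1, num_sentences):
            p = prob_fn(last, j)
            if p > 0.0:
                extended.append((positions + [j], score + math.log(p)))
    extended.sort(key=lambda item: item[1], reverse=True)
    return extended[:beam_size]


# One first-class sequence: segment 1 starts at sentence 1 and ends at 2.
first_class = [([1, 2], math.log(0.9))]

# Made-up probabilities that a sentence starts segment 2, given end = 2.
start_probs = {(2, 3): 0.1, (2, 4): 0.7, (2, 5): 0.2}
second_class = extend_candidates(
    first_class, num_sentences=6,
    prob_fn=lambda last, j: start_probs.get((last, j), 0.0),
    beam_size=2,
)
```

The same function covers the symmetric step: passing second-class sequences together with a `prob_fn` that scores ending sentences yields the next set of first-class sequences.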
In an optional embodiment, the text processing apparatus provided in the embodiment of the present application may further include:
a redundant sentence determining module for determining a redundant sentence in the target text;
and the deleting module is used for deleting the redundant sentences in the target text.
In an alternative embodiment, the redundant sentence determination module is specifically configured to:
for each sentence in the target text, acquiring the probability that the sentence belongs to a redundant sentence; the probability that the sentence belongs to the redundant sentence is calculated according to the text characteristics of the sentence and the text characteristics of the initial sentence of the effective segment in which the sentence is located; and if the probability that the sentence belongs to the redundant sentence is greater than the probability threshold value, determining the sentence as the redundant sentence.
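The thresholding and deletion steps can be sketched as follows. The probabilities are assumed to come from a separate scorer that compares each sentence with the starting sentence of its valid segment; that scorer is not shown, and the function name, the sample values and the threshold of 0.5 are illustrative assumptions.

```python
def remove_redundant_sentences(target_text, redundancy_probs, threshold=0.5):
    """Drop every sentence whose redundancy probability exceeds the threshold.

    Returns the remaining sentences and the indices of the removed ones.
    """
    if len(target_text) != len(redundancy_probs):
        raise ValueError("one probability is required per sentence")
    kept, removed = [], []
    for index, (sentence, prob) in enumerate(zip(target_text, redundancy_probs)):
        if prob > threshold:  # strictly greater, as in the description
            removed.append(index)
        else:
            kept.append(sentence)
    return kept, removed


target_text = ["point A", "point A again", "detail of A", "as I said, A"]
probs = [0.1, 0.9, 0.5, 0.7]  # made-up scorer outputs
cleaned, removed = remove_redundant_sentences(target_text, probs)
```

A sentence whose probability equals the threshold is kept, since the condition in the text is "greater than the probability threshold".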
In an optional embodiment, the text processing apparatus provided in the embodiment of the present application may further include:
the text characteristic acquisition module is used for acquiring word vectors of all words in the sentence and codes of positions of all the words in the sentence for each sentence in the text; obtaining a representation vector of the sentence according to the word vector of each word in the sentence and the position code of each word; acquiring the code of the position of the sentence in the text; and obtaining the text characteristics of the sentence according to the characterization vector of the sentence and the position code of the sentence.
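The feature construction above can be sketched in plain Python. Several choices below are assumptions, since the text does not fix them: positions are encoded with the sinusoidal scheme, word vectors and position codes are combined by addition, and the sentence representation is the mean of the position-aware word vectors (the text only says it is obtained from them). All function names are hypothetical.

```python
import math


def position_encoding(pos, dim):
    """Sinusoidal code for an integer position (assumed encoding scheme)."""
    code = []
    for i in range(dim):
        pair = i - (i % 2)  # even index shared by each sin/cos pair
        angle = pos / (10000 ** (pair / dim))
        code.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return code


def sentence_vector(word_vectors):
    """Mean of the word vectors after adding each word's position code."""
    dim = len(word_vectors[0])
    encoded = [
        [w + p for w, p in zip(vec, position_encoding(pos, dim))]
        for pos, vec in enumerate(word_vectors)
    ]
    return [sum(column) / len(encoded) for column in zip(*encoded)]


def text_feature(word_vectors, sentence_pos):
    """Sentence representation plus the code of the sentence's position."""
    rep = sentence_vector(word_vectors)
    return [r + p for r, p in zip(rep, position_encoding(sentence_pos, len(rep)))]


# A one-word sentence with an all-zero word vector, at position 0 in the text.
feature = text_feature([[0.0, 0.0, 0.0, 0.0]], sentence_pos=0)
```

In practice the averaging step would typically be replaced by a learned encoder; the sketch only shows where the two levels of position information enter.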
In an optional embodiment, the boundary position sequence obtaining module 801 is specifically configured to: and processing each sentence in the text according to the text characteristics of each sentence in the text by using a text processing model to obtain a boundary position sequence.
In an optional embodiment, the text processing apparatus provided in the embodiment of the present application may further include: a first model training module to:
inputting a first type of text sample into the text processing model to obtain a boundary position sequence corresponding to the first type of text sample output by the text processing model; each boundary position in the boundary position sequence corresponding to the first type of text sample indicates a starting sentence or an ending sentence of one effective segment in the first type of text sample, wherein the starting sentence of the Kth effective segment in the first type of text sample is determined based on the ending sentence of the Kth-1 effective segment in the first type of text sample, and the ending sentence of the Kth effective segment in the first type of text sample is determined based on the starting sentence of the Kth effective segment in the first type of text sample;
and updating the parameters of the text processing model by taking the boundary position sequence corresponding to the first type of text sample approaching to the boundary position sequence label corresponding to the first type of text sample as a target.
In an optional embodiment, an ending sentence of a K-1 th valid segment in the first type of text sample is determined based on the boundary position sequence tags corresponding to the first type of text sample, and a starting sentence of the K-th valid segment in the first type of text sample is determined based on the boundary position sequence tags corresponding to the first type of text sample.
In an optional embodiment, the first type of text sample is obtained by inserting an invalid segment into an effective text, or the first type of text sample is an originally acquired text containing the effective segment and the invalid segment;
or,
the first type of text sample is obtained by inserting invalid segments and redundant segments into valid text, or the first type of text sample is originally acquired text containing valid segments, invalid segments and redundant segments.
In an optional embodiment, the text processing apparatus provided in the embodiment of the present application may further include: a second model training module to:
inputting a first type of text sample into the text processing model to obtain a boundary position sequence corresponding to the first type of text sample output by the text processing model; each boundary position in the boundary position sequence corresponding to the first type of text sample indicates a starting sentence or an ending sentence of one effective segment in the first type of text sample, wherein the starting sentence of the Kth effective segment in the first type of text sample is determined based on the ending sentence of the Kth-1 effective segment in the first type of text sample, and the ending sentence of the Kth effective segment in the first type of text sample is determined based on the starting sentence of the Kth effective segment in the first type of text sample;
taking the boundary position sequence corresponding to the first type of text sample approaching the boundary position sequence label corresponding to the first type of text sample as a target, and updating the parameters of the text processing model to obtain an initial text processing model; the first type of text sample is obtained by inserting at least invalid segments into valid text;
inputting a second type of text sample into the initial text processing model to obtain a boundary position sequence output by the initial text processing model and corresponding to the second type of text sample; the second type of text sample is originally acquired text at least comprising valid segments and invalid segments;
and updating the parameters of the initial text processing model by taking the boundary position sequence corresponding to the second type of text sample approaching to the boundary position sequence label corresponding to the second type of text sample as a target.
The text processing device provided by the embodiment of the application can be applied to text processing equipment, such as a PC terminal, a cloud platform, a server cluster and the like. Optionally, Fig. 9 shows a block diagram of a hardware structure of the text processing equipment. Referring to Fig. 9, the hardware structure of the text processing equipment may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, etc.;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory (non-volatile memory) or the like, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
processing each sentence in the text according to the text characteristics of each sentence in the text to obtain a boundary position sequence; each boundary position in the sequence of boundary positions indicates a starting sentence or an ending sentence of one valid segment, wherein the starting sentence of the Kth valid segment is determined based on the ending sentence of the (K-1)th valid segment, and the ending sentence of the Kth valid segment is determined based on the starting sentence of the Kth valid segment; K is a positive integer;
and acquiring effective segments in the text based on the boundary position sequence to form a target text.
Optionally, for the refined functions and extended functions of the program, reference may be made to the description above.
Embodiments of the present application further provide a storage medium storing a program suitable for execution by a processor, the program being used for:
processing each sentence in the text according to the text characteristics of each sentence in the text to obtain a boundary position sequence; each boundary position in the boundary position sequence indicates a starting sentence or an ending sentence of one valid segment, wherein the starting sentence of the Kth valid segment is determined based on the ending sentence of the (K-1)th valid segment, and the ending sentence of the Kth valid segment is determined based on the starting sentence of the Kth valid segment; K is a positive integer;
and acquiring the valid segments in the text based on the boundary position sequence to form a target text.
Optionally, for the detailed functions and extended functions of the program, reference may be made to the description above.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part thereof that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (17)
1. A method of text processing, comprising:
processing each sentence in the text according to the text characteristics of each sentence in the text to obtain a boundary position sequence; each boundary position in the boundary position sequence indicates a starting sentence or an ending sentence of one valid segment, wherein the starting sentence of the Kth valid segment is determined based on the ending sentence of the (K-1)th valid segment, and the ending sentence of the Kth valid segment is determined based on the starting sentence of the Kth valid segment; K is a positive integer;
and acquiring the valid segments in the text based on the boundary position sequence to form a target text.
2. The method according to claim 1, wherein the processing each sentence in the text according to the text feature of each sentence in the text to obtain the boundary position sequence comprises:
acquiring a plurality of candidate boundary position sequences according to the text characteristics of each sentence in the text;
calculating a score for each candidate boundary position sequence, wherein the score of a candidate boundary position sequence represents the confidence of that candidate boundary position sequence, and a higher score indicates a higher confidence;
and taking the candidate boundary position sequence with the highest score as the boundary position sequence.
3. The method according to claim 2, wherein the obtaining a plurality of candidate boundary position sequences according to the text features of each sentence in the text comprises:
after a first class candidate boundary position sequence is obtained according to the text features of each sentence in the text, a second class candidate boundary position sequence is obtained based on the first class candidate boundary position sequence;
each candidate boundary position in the first-class candidate boundary position sequence indicates a candidate starting sentence or a candidate ending sentence of one effective segment among the 1st to (K-1)th effective segments in the text;
each candidate boundary position in the second-class candidate boundary position sequence indicates a candidate starting sentence or a candidate ending sentence of one effective segment among the 1st to (K-1)th effective segments in the text, or indicates a candidate starting sentence of the Kth effective segment in the text.
4. The method of claim 3, wherein obtaining the second class of candidate boundary position sequences based on the first class of candidate boundary position sequences comprises:
for the candidate ending sentence of the (K-1)th effective segment indicated by a candidate boundary position in each first-class candidate boundary position sequence, calculating, according to the text characteristics of the candidate ending sentence and the text characteristics of each sentence after the candidate ending sentence in the text, the probability that each sentence after the candidate ending sentence is the starting sentence of the Kth effective segment;
for each sentence after the candidate ending sentence, calculating the score of the new candidate boundary position sequence obtained by adding the sentence to the first-class candidate boundary position sequence whose candidate boundary position indicates the candidate ending sentence, according to that first-class candidate boundary position sequence and the probability that the sentence is the starting sentence of the Kth effective segment;
and determining the second-class candidate boundary position sequences among all the new candidate boundary position sequences according to the scores of all the new candidate boundary position sequences obtained based on the candidate ending sentences of the (K-1)th effective segment indicated by the candidate boundary positions in the first-class candidate boundary position sequences.
5. The method according to claim 2, wherein the obtaining a plurality of candidate boundary position sequences according to the text features of each sentence in the text comprises:
after a second type of candidate boundary position sequence is obtained according to the text characteristics of each sentence in the text, a first type of candidate boundary position sequence is obtained based on the second type of candidate boundary position sequence;
each candidate boundary position in the first-class candidate boundary position sequence indicates a candidate starting sentence or a candidate ending sentence of one effective segment among the 1st to Kth effective segments in the text;
each candidate boundary position in the second-class candidate boundary position sequence indicates a candidate starting sentence or a candidate ending sentence of one effective segment among the 1st to (K-1)th effective segments in the text, or indicates a candidate starting sentence of the Kth effective segment in the text.
6. The method according to claim 5, wherein obtaining the first class of candidate boundary position sequences based on the second class of candidate boundary position sequences comprises:
for the candidate starting sentence of the Kth effective segment indicated by a candidate boundary position in each second-class candidate boundary position sequence, calculating, according to the text characteristics of the candidate starting sentence and the text characteristics of each sentence after the candidate starting sentence in the text, the probability that each sentence after the candidate starting sentence is the ending sentence of the Kth effective segment;
for each sentence after the candidate starting sentence, calculating the score of the new candidate boundary position sequence obtained by adding the sentence to the second-class candidate boundary position sequence whose candidate boundary position indicates the candidate starting sentence, according to that second-class candidate boundary position sequence and the probability that the sentence is the ending sentence of the Kth effective segment;
and determining the first-class candidate boundary position sequences among all the new candidate boundary position sequences according to the scores of all the new candidate boundary position sequences obtained based on the candidate starting sentences of the Kth effective segment indicated by the candidate boundary positions in the second-class candidate boundary position sequences.
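The alternating expansion recited in claims 2 to 6 resembles a beam search over boundary positions. The sketch below is one possible reading under stated assumptions, not the claimed implementation: `start_prob` and `end_prob` are hypothetical stand-ins for the model's probabilities (here driven by a toy `VALID` mask instead of text characteristics), scores are summed log-probabilities, the number of segments is fixed in advance, and a segment is allowed to end at its own starting sentence (set `allow_same=False` to follow "each sentence after the candidate starting sentence" literally).

```python
import math
from typing import Callable, List, Tuple

Candidate = Tuple[List[int], float]  # (boundary position sequence, log-score)


def expand(beam: List[Candidate], n_sentences: int,
           prob: Callable[[int, int], float],
           allow_same: bool, beam_width: int) -> List[Candidate]:
    """Append one boundary to every sequence and keep the best `beam_width`."""
    candidates: List[Candidate] = []
    for seq, score in beam:
        last = seq[-1] if seq else -1
        first = max(0, last if allow_same else last + 1)
        for j in range(first, n_sentences):
            p = prob(last, j)
            if p > 0.0:
                candidates.append((seq + [j], score + math.log(p)))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]


def decode(n_sentences: int,
           start_prob: Callable[[int, int], float],
           end_prob: Callable[[int, int], float],
           num_segments: int, beam_width: int = 3) -> List[int]:
    """Alternate start and end expansions, then return the top-scoring sequence."""
    beam: List[Candidate] = [([], 0.0)]
    for _ in range(num_segments):
        # Sequences ending in an end sentence -> sequences ending in a start sentence.
        beam = expand(beam, n_sentences, start_prob, False, beam_width)
        # Sequences ending in a start sentence -> sequences ending in an end sentence.
        # Assumption: a one-sentence segment is permitted (allow_same=True).
        beam = expand(beam, n_sentences, end_prob, True, beam_width)
    return max(beam, key=lambda c: c[1])[0] if beam else []


# Toy stand-in for the model: which sentences are "valid" (hypothetical data).
VALID = [False, False, True, True, False, True]


def toy_start_prob(prev_end: int, j: int) -> float:
    opens = VALID[j] and (j == 0 or not VALID[j - 1])
    return 0.9 if opens else 0.05


def toy_end_prob(start: int, j: int) -> float:
    closes = all(VALID[start:j + 1]) and (j == len(VALID) - 1 or not VALID[j + 1])
    return 0.9 if closes else 0.05


print(decode(len(VALID), toy_start_prob, toy_end_prob, num_segments=2))  # [2, 3, 5, 5]
```

Keeping several candidates per step, rather than only the single best, is what lets a later boundary correct an earlier, locally attractive but globally poor choice.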
7. The method of claim 1, further comprising:
determining a redundant sentence in the target text;
and deleting redundant sentences in the target text.
8. The method of claim 7, wherein the determining redundant sentences in the target text comprises:
for each sentence in the target text, acquiring the probability that the sentence is a redundant sentence; the probability that the sentence is a redundant sentence is calculated according to the text characteristics of the sentence and the text characteristics of the starting sentence of the effective segment in which the sentence is located;
and if the probability that the sentence belongs to the redundant sentence is greater than the probability threshold value, determining the sentence as the redundant sentence.
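The thresholding in claims 7 and 8 can be sketched as follows. This is a minimal illustration under assumptions: `redundancy_prob` is a hypothetical callable standing in for the model, the toy version below uses word overlap with the segment's starting sentence instead of learned text characteristics, and the threshold value 0.5 is arbitrary.

```python
from typing import Callable, List, Sequence


def remove_redundant(segments: Sequence[Sequence[str]],
                     redundancy_prob: Callable[[str, str], float],
                     threshold: float = 0.5) -> List[str]:
    """Drop every sentence whose redundancy probability exceeds `threshold`.

    Each probability is computed from the sentence and the starting sentence
    of the segment that contains it. Segments are assumed to be non-empty.
    """
    kept: List[str] = []
    for segment in segments:
        start = segment[0]
        kept.extend(s for s in segment if redundancy_prob(s, start) <= threshold)
    return kept


def toy_redundancy_prob(sentence: str, segment_start: str) -> float:
    """Toy stand-in: a later sentence that only repeats words of the start is redundant."""
    if sentence == segment_start:
        return 0.0
    repeated = set(sentence.split()) <= set(segment_start.split())
    return 1.0 if repeated else 0.0


segments = [
    ["the meeting starts at nine", "starts at nine", "bring the report"],
    ["see you then"],
]
print(remove_redundant(segments, toy_redundancy_prob))
```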
9. The method of claim 1, wherein the text characteristics of each sentence in the text are obtained by:
for each sentence in the text, acquiring a word vector of each word in the sentence and a code of the position of each word in the sentence;
obtaining a representation vector of the sentence according to the word vector of each word in the sentence and the position code of each word;
acquiring the code of the position of the sentence in the text;
and obtaining the text characteristics of the sentence according to the representation vector of the sentence and the position code of the sentence.
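The two-level feature construction in claim 9 can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the sinusoidal position code, combination by addition, and mean pooling in place of a learned sentence encoder are not specified by the claim, and the names `position_code` and `sentence_feature` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def position_code(pos: int, dim: int) -> Vector:
    """Sinusoidal position code (an assumed choice, not fixed by the claim)."""
    code: Vector = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        code.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return code


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def sentence_feature(word_vectors: List[Vector], sentence_pos: int) -> Vector:
    """Build a sentence's text feature from word vectors and two position codes."""
    dim = len(word_vectors[0])
    # Word level: combine each word vector with the code of its position in the sentence.
    encoded = [add(vec, position_code(i, dim)) for i, vec in enumerate(word_vectors)]
    # Mean pooling stands in for the sentence encoder that yields the representation vector.
    representation = [sum(col) / len(encoded) for col in zip(*encoded)]
    # Sentence level: combine the representation vector with the sentence's position in the text.
    return add(representation, position_code(sentence_pos, dim))


# Hypothetical 4-dimensional word vectors for a two-word sentence at position 3.
feature = sentence_feature([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]], sentence_pos=3)
print(len(feature))
```

Encoding position at both levels lets the downstream model distinguish identical sentences that occur at different places in the text.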
10. The method according to any one of claims 1 to 9, wherein the processing each sentence in the text according to the text feature of each sentence in the text to obtain the boundary position sequence comprises:
and processing each sentence in the text according to the text characteristics of each sentence in the text by using a text processing model to obtain a boundary position sequence.
11. The method of claim 10, wherein the training process of the text processing model comprises:
inputting a first type of text sample into the text processing model to obtain a boundary position sequence, output by the text processing model, corresponding to the first type of text sample; each boundary position in the boundary position sequence corresponding to the first type of text sample indicates a starting sentence or an ending sentence of one valid segment in the first type of text sample, wherein the starting sentence of the Kth valid segment in the first type of text sample is determined based on the ending sentence of the (K-1)th valid segment in the first type of text sample, and the ending sentence of the Kth valid segment in the first type of text sample is determined based on the starting sentence of the Kth valid segment in the first type of text sample;
and updating the parameters of the text processing model with the objective that the boundary position sequence corresponding to the first type of text sample approaches the boundary position sequence label corresponding to the first type of text sample.
12. The method according to claim 11, wherein the ending sentence of the (K-1)th valid segment in the first type of text sample is determined based on the boundary position sequence label corresponding to the first type of text sample, and the starting sentence of the Kth valid segment in the first type of text sample is determined based on the boundary position sequence label corresponding to the first type of text sample.
13. The method according to claim 11, wherein the first type of text sample is obtained by inserting invalid segments into valid text, or the first type of text sample is originally acquired text containing valid segments and invalid segments;
or,
the first type of text sample is obtained by inserting invalid segments and redundant segments into valid text, or the first type of text sample is originally acquired text containing valid segments, invalid segments and redundant segments.
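Constructing a training sample by inserting invalid segments into valid text, as recited in claim 13, also yields the boundary position sequence label for free. The sketch below shows one way this could be done; the function names, the choice of random insertion gaps, and the flat `[start, end, ...]` label layout with 0-based inclusive indices are assumptions for illustration only.

```python
import random
from typing import List, Sequence, Tuple


def boundary_labels(is_valid: Sequence[bool]) -> List[int]:
    """Turn a per-sentence valid/invalid mask into [start_1, end_1, start_2, ...]."""
    labels: List[int] = []
    for i, flag in enumerate(is_valid):
        if flag and (i == 0 or not is_valid[i - 1]):
            labels.append(i)  # a valid run starts here
        if flag and (i == len(is_valid) - 1 or not is_valid[i + 1]):
            labels.append(i)  # a valid run ends here
    return labels


def make_sample(valid: Sequence[str],
                invalid_segments: Sequence[Sequence[str]],
                n_insertions: int,
                rng: random.Random) -> Tuple[List[str], List[int]]:
    """Insert invalid segments at random gaps of valid text and label the result.

    Requires n_insertions <= len(valid) + 1 and non-empty invalid segments.
    """
    gaps = set(rng.sample(range(len(valid) + 1), n_insertions))
    sentences: List[str] = []
    is_valid: List[bool] = []
    for i in range(len(valid) + 1):
        if i in gaps:
            noise = rng.choice(invalid_segments)
            sentences.extend(noise)
            is_valid.extend([False] * len(noise))
        if i < len(valid):
            sentences.append(valid[i])
            is_valid.append(True)
    return sentences, boundary_labels(is_valid)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample, labels = make_sample(
    ["the meeting starts at nine", "bring the report", "see you then"],
    [["um"], ["uh", "er"]],
    n_insertions=2,
    rng=rng,
)
print(sample, labels)
```

Because the insertion points are known, no manual annotation is needed for samples built this way, which is presumably why the claims distinguish them from originally acquired text.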
14. The method of claim 10, wherein the training process of the text processing model comprises:
inputting a first type of text sample into the text processing model to obtain a boundary position sequence, output by the text processing model, corresponding to the first type of text sample; each boundary position in the boundary position sequence corresponding to the first type of text sample indicates a starting sentence or an ending sentence of one valid segment in the first type of text sample, wherein the starting sentence of the Kth valid segment in the first type of text sample is determined based on the ending sentence of the (K-1)th valid segment in the first type of text sample, and the ending sentence of the Kth valid segment in the first type of text sample is determined based on the starting sentence of the Kth valid segment in the first type of text sample;
updating the parameters of the text processing model, with the objective that the boundary position sequence corresponding to the first type of text sample approaches the boundary position sequence label corresponding to the first type of text sample, to obtain an initial text processing model; the first type of text sample is obtained by inserting at least invalid segments into valid text;
inputting a second type of text sample into the initial text processing model to obtain a boundary position sequence, output by the initial text processing model, corresponding to the second type of text sample; the second type of text sample is originally acquired text containing at least valid segments and invalid segments;
and updating the parameters of the initial text processing model with the objective that the boundary position sequence corresponding to the second type of text sample approaches the boundary position sequence label corresponding to the second type of text sample.
15. A text processing apparatus, comprising:
a boundary position sequence acquisition module, configured to process each sentence in the text according to the text characteristics of each sentence in the text to obtain a boundary position sequence; each boundary position in the boundary position sequence indicates a starting sentence or an ending sentence of one valid segment, wherein the starting sentence of the Kth valid segment is determined based on the ending sentence of the (K-1)th valid segment, and the ending sentence of the Kth valid segment is determined based on the starting sentence of the Kth valid segment; K is a positive integer;
and a target text acquisition module, configured to acquire the valid segments in the text based on the boundary position sequence to form a target text.
16. A text processing apparatus comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the text processing method according to any one of claims 1 to 14.
17. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the text processing method according to any one of claims 1 to 14.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011632673.0A CN112818077B (en) | 2020-12-31 | 2020-12-31 | Text processing method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011632673.0A CN112818077B (en) | 2020-12-31 | 2020-12-31 | Text processing method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112818077A (en) | 2021-05-18 |
CN112818077B (en) | 2023-05-30 |
Family ID=75856482
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011632673.0A (Active; CN112818077B (en)) | Text processing method, device, equipment and storage medium | 2020-12-31 | 2020-12-31 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112818077B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107305541A (en) * | 2016-04-20 | 2017-10-31 | 科大讯飞股份有限公司 | Speech recognition text segmentation method and device |
US20180365323A1 (en) * | 2017-06-16 | 2018-12-20 | Elsevier, Inc. | Systems and methods for automatically generating content summaries for topics |
CN109299179A (en) * | 2018-10-15 | 2019-02-01 | 西门子医疗系统有限公司 | Structural data extraction element, method and storage medium |
CN109977219A (en) * | 2019-03-19 | 2019-07-05 | 国家计算机网络与信息安全管理中心 | Text snippet automatic generation method and device based on heuristic rule |
CN110888976A (en) * | 2019-11-14 | 2020-03-17 | 北京香侬慧语科技有限责任公司 | Text abstract generation method and device |
CN111666759A (en) * | 2020-04-17 | 2020-09-15 | 北京百度网讯科技有限公司 | Method and device for extracting key information of text, electronic equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
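QING LI et al.: "Personalized text snippet extraction using statistical language models", 《PATTERN RECOGNITION》 * |
徐永东 (Xu Yongdong): "Research on Key Technologies of Multi-Document Automatic Summarization" (多文档自动文摘关键技术研究), China Doctoral Dissertations Full-text Database, Information Science and Technology * |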
Also Published As
Publication number | Publication date |
---|---|
CN112818077B (en) | 2023-05-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111581976B (en) | Medical term standardization method, device, computer equipment and storage medium | |
CN111177319B (en) | Method and device for determining risk event, electronic equipment and storage medium | |
CN109271641B (en) | Text similarity calculation method and device and electronic equipment | |
WO2022142011A1 (en) | Method and device for address recognition, computer device, and storage medium | |
CN108538286A (en) | A kind of method and computer of speech recognition | |
CN111931491B (en) | Domain dictionary construction method and device | |
CN111949802A (en) | Construction method, device and equipment of knowledge graph in medical field and storage medium | |
US20160188569A1 (en) | Generating a Table of Contents for Unformatted Text | |
CN112287680B (en) | Entity extraction method, device and equipment of inquiry information and storage medium | |
CN112287069A (en) | Information retrieval method and device based on voice semantics and computer equipment | |
CN109885831B (en) | Keyword extraction method, device, equipment and computer readable storage medium | |
CN116402166B (en) | Training method and device of prediction model, electronic equipment and storage medium | |
CN111177375A (en) | Electronic document classification method and device | |
CN113190675A (en) | Text abstract generation method and device, computer equipment and storage medium | |
CN112188311A (en) | Method and apparatus for determining video material of news | |
CN114330335A (en) | Keyword extraction method, device, equipment and storage medium | |
CN113407677A (en) | Method, apparatus, device and storage medium for evaluating quality of consultation session | |
CN116150621A (en) | Training method, device and equipment for text model | |
CN114692594A (en) | Text similarity recognition method and device, electronic equipment and readable storage medium | |
CN113836261A (en) | Patent text novelty/creativity prediction method and device | |
CN115952854B (en) | Training method of text desensitization model, text desensitization method and application | |
CN112818077B (en) | Text processing method, device, equipment and storage medium | |
CN111507109A (en) | Named entity identification method and device of electronic medical record | |
CN113553410B (en) | Long document processing method, processing device, electronic equipment and storage medium | |
CN115858776A (en) | Variant text classification recognition method, system, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||