CN111199150A - Text segmentation method, related device and readable storage medium - Google Patents

Text segmentation method, related device and readable storage medium Download PDF

Info

Publication number
CN111199150A
Authority
CN
China
Prior art keywords
text
segmentation
unit
word
text unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911398383.1A
Other languages
Chinese (zh)
Other versions
CN111199150B (en)
Inventor
闫莉
孔常青
万根顺
高建清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201911398383.1A priority Critical patent/CN111199150B/en
Publication of CN111199150A publication Critical patent/CN111199150A/en
Application granted granted Critical
Publication of CN111199150B publication Critical patent/CN111199150B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a text segmentation method, a related device and a readable storage medium. After a text to be segmented is obtained, the segmentation features of each text unit in the text are obtained, the segmentation boundaries of the text are determined according to the segmentation features of each text unit, and the text is finally segmented based on those boundaries. With this scheme, segmentation of the text to be segmented can be realized.

Description

Text segmentation method, related device and readable storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a text segmentation method, a related device, and a readable storage medium.
Background
With the rapid development of statistical natural language processing technology, text segmentation has increasingly become an important research direction. Text segmentation determines the segmentation boundaries of an unsegmented long text and splits the text into segments at those boundaries. Compared with the unsegmented long text, the resulting segments are shorter and better suited to users' reading habits; meanwhile, each segment has a clear, focused theme, which helps users quickly extract key information and reduces reading effort.
Therefore, it is desirable to provide a text segmentation method.
Disclosure of Invention
In view of the foregoing problems, the present application provides a text segmentation method, a related device and a readable storage medium. The specific scheme is as follows:
a text segmentation method, comprising:
acquiring a text to be segmented;
acquiring the segmentation characteristics of each text unit in the text to be segmented;
determining a segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit;
and segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
Optionally, the obtaining the segmentation feature of each text unit in the text to be segmented includes:
and acquiring the word sequence and clue word characteristics of each text unit in the text to be segmented, wherein the word sequence and clue word characteristics of each text unit are used as the segmentation characteristics of each text unit.
Optionally, the obtaining the word sequence and clue word features of each text unit in the text to be segmented includes:
performing word segmentation on each text unit to obtain a word sequence of each text unit;
determining clue words from the word sequence based on a predetermined set of clue words;
acquiring position information of the clue words in corresponding text units;
and generating clue word characteristics of each text unit according to the position information of the clue words in each text unit.
Optionally, the determining the segmentation boundary of the text to be segmented according to the segmentation feature of each text unit includes:
inputting the segmentation characteristics of each text unit into a text segmentation model to obtain an output result of whether each text unit is a segmentation boundary of the text to be segmented; the text segmentation model is obtained by taking segmentation features of each text unit in a training text as a training sample and taking segmentation boundary identification marking information of the training text as a sample label for training.
Optionally, the text segmentation model includes:
a word encoding layer, an attention layer, a fusion layer, a sentence encoding layer and an output layer.
Optionally, the inputting the segmentation feature of each text unit into the text segmentation model to obtain an output result of whether the starting position of each text unit is the segmentation boundary of the text to be segmented includes:
acquiring a segment length feature of each text unit by using a text segmentation model, wherein the segment length feature is used for representing segment length information from a last segmentation boundary of each text unit to each text unit;
performing word coding on the segmentation characteristics of each text unit by using a word coding layer of the text segmentation model to obtain semantic representation of each text unit;
performing attention calculation on the semantic representation of each text unit by using an attention layer of the text segmentation model to obtain the semantic representation of a sentence of each text unit;
fusing the semantic representation of the sentence of each text unit and the segment length feature of each text unit by utilizing a fusion layer of the text segmentation model to obtain the complete word representation of the sentence of each text unit;
sentence coding is carried out on the complete word representation of the sentence of each text unit by utilizing a sentence coding layer of the text segmentation model, and the sentence representation of each text unit is obtained;
and calculating the sentence representation of each text unit and the sentence representation at the previous moment by using an output layer of the text segmentation model to obtain an output result of whether each text unit is the segmentation boundary of the text to be segmented.
Optionally, the performing word encoding on the segmentation features of each text unit by using a word encoding layer of the text segmentation model to obtain a semantic representation of each text unit includes:
performing word coding on the word sequence in the segmentation characteristics of each text unit to obtain a word meaning representation of each text unit;
obtaining a clue word meaning representation of each text unit based on the word meaning representation of each text unit and the clue word features in the segmentation features of each text unit; the word meaning representation and the clue word meaning representation serve as the semantic representation.
Optionally, the performing attention calculation on the semantic representation of each text unit by using the attention layer of the text segmentation model to obtain the sentence semantic representation of each text unit includes:
performing attention calculation on the word meaning representation to obtain a first sentence meaning representation of each text unit;
and performing attention calculation on the clue word semantic representations to obtain a second sentence semantic representation of each text unit, wherein the first sentence semantic representation and the second sentence semantic representation are used as the sentence semantic representations.
A text segmentation apparatus comprising:
the segmentation text acquisition unit is used for acquiring a text to be segmented;
the segmentation feature acquisition unit is used for acquiring the segmentation features of each text unit in the text to be segmented;
the segmentation boundary determining unit is used for determining the segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit;
and the segmentation unit is used for segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
Optionally, the segmentation feature obtaining unit includes:
and the word sequence and clue word characteristic acquisition unit is used for acquiring the word sequence and clue word characteristics of each text unit in the text to be segmented, and the word sequence and clue word characteristics of each text unit are used as the segmentation characteristics of each text unit.
Optionally, the word sequence and clue feature obtaining unit includes:
the word segmentation unit is used for segmenting each text unit to obtain a word sequence of each text unit;
a clue word determining unit, configured to determine clue words from the word sequence based on a predetermined clue word set;
the clue word position information acquisition unit is used for acquiring the position information of the clue words in the corresponding text units;
and the clue word characteristic generating unit is used for generating clue word characteristics of each text unit according to the position information of the clue words in each text unit.
Optionally, the segmentation boundary determining unit includes:
the model application unit is used for inputting the segmentation characteristics of each text unit into a text segmentation model to obtain an output result of whether each text unit is a segmentation boundary of the text to be segmented; the text segmentation model is obtained by taking segmentation features of each text unit in a training text as a training sample and taking segmentation boundary identification marking information of the training text as a sample label for training.
Optionally, the text segmentation model includes:
a word encoding layer, an attention layer, a fusion layer, a sentence encoding layer and an output layer.
Optionally, the model application unit includes:
the segment length feature acquisition unit is used for acquiring the segment length feature of each text unit by using a text segmentation model, wherein the segment length feature is used for representing the segment length information from the last segmentation boundary of each text unit to each text unit;
the word coding unit is used for carrying out word coding on the segmentation characteristics of each text unit by utilizing a word coding layer of the text segmentation model to obtain semantic representation of each text unit;
the attention calculation unit is used for performing attention calculation on the semantic representation of each text unit by utilizing an attention layer of the text segmentation model to obtain the semantic representation of the sentence of each text unit;
the fusion unit is used for fusing the semantic representation of the sentence of each text unit and the segment length feature of each text unit by utilizing the fusion layer of the text segmentation model to obtain the complete word representation of the sentence of each text unit;
the sentence coding unit is used for coding the complete word representation of the sentence of each text unit by utilizing a sentence coding layer of the text segmentation model to obtain the sentence representation of each text unit;
and the calculation unit is used for calculating the sentence representation of each text unit and the sentence representation at the previous moment by utilizing the output layer of the text segmentation model to obtain an output result of whether each text unit is the segmentation boundary of the text to be segmented.
Optionally, the word encoding unit includes:
the first word coding subunit is used for carrying out word coding on the word sequence in the segmentation characteristics of each text unit to obtain a word meaning representation of each text unit;
the second word coding subunit is used for obtaining a clue word meaning representation of each text unit based on the word meaning representation of each text unit and the clue word features in the segmentation features of each text unit; the word meaning representation and the clue word meaning representation serve as the semantic representation.
Optionally, the attention calculation unit comprises:
the first attention calculation unit is used for carrying out attention calculation on the word meaning representation to obtain a first sentence meaning representation of each text unit;
and the second attention calculation unit is used for performing attention calculation on the clue word semantic representations to obtain a second sentence semantic representation of each text unit, and the first sentence semantic representation and the second sentence semantic representation are used as the sentence semantic representations.
A text segmentation device comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the text segmentation method.
A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the text segmentation method as described above.
Through the above technical scheme, the application discloses a text segmentation method, a related device and a readable storage medium: after a text to be segmented is obtained, the segmentation features of each text unit in the text are obtained, the segmentation boundaries of the text are determined according to the segmentation features of each text unit, and the text is finally segmented based on those boundaries. With this scheme, segmentation of the text to be segmented can be realized.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flowchart of a text segmentation method disclosed in an embodiment of the present application;
FIG. 2 is a diagram of a text segmentation model according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a text segmentation apparatus disclosed in an embodiment of the present application;
fig. 4 is a block diagram of a hardware structure of a text segmentation apparatus according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The text segmentation method disclosed in this application can be applied to the post-processing modules of systems such as speech recognition systems, human-machine question-answering systems, and information retrieval systems. The method is described through the following embodiments.
Referring to fig. 1, fig. 1 is a schematic flowchart of a text segmentation method disclosed in an embodiment of the present application, where the method includes:
s101: and acquiring a text to be segmented.
In this application, the text to be segmented may be any unsegmented text, for example, an unsegmented lecture transcript obtained by performing speech recognition on a user's lecture audio with a speech recognition system, an unsegmented e-book, or the like.
In this application, the text to be segmented may be uploaded by a user, or it may be obtained from the output of other natural language processing systems (such as a speech recognition system, a human-machine question-answering system, or an information retrieval system); this application places no limitation on its source.
S102: and acquiring the segmentation characteristics of each text unit in the text to be segmented.
In this application, each text unit of the text to be segmented may be a sentence delimited by ending punctuation (e.g., a period, exclamation mark, or question mark) in the text, or a clause or phrase delimited by other punctuation marks; this application places no limitation on the granularity of text units.
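As a minimal illustration of this kind of text unit splitting (the punctuation set and the helper name are assumptions, not part of the disclosure), sentence-level text units could be obtained as follows:

```python
import re

# Assumed helper: split unsegmented text into sentence-level text units
# using Chinese/Western ending punctuation (period, exclamation mark, question mark).
def split_into_text_units(text: str) -> list[str]:
    parts = re.split(r'(?<=[。！？.!?])', text)
    return [p.strip() for p in parts if p.strip()]

units = split_into_text_units("首先介绍背景。接下来介绍方法！效果如何？")
print(units)  # ['首先介绍背景。', '接下来介绍方法！', '效果如何？']
```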
In the present application, the segmentation feature of each text unit may be any feature that can be used to determine the segmentation boundary of the text to be segmented, such as a word sequence of each text unit in the text to be segmented, a clue word feature of each text unit in the text to be segmented, a segment length segmentation threshold of the text to be segmented, and the like, and the present application is not limited in any way.
It should be noted that the word sequence of a text unit is the sequence of words it contains. Clue words are a class of words that strongly indicate text segmentation, typically words that often appear at the beginning of a paragraph, such as "first", "next", and "last". The clue word feature may be any feature representing information such as the content, number, and position of the clue words of each text unit in the text to be segmented. The segment length segmentation threshold is a threshold that limits the length of the segments obtained after segmentation.
As a preferred embodiment, the obtaining of the segmentation feature of each text unit in the text to be segmented may include obtaining a word sequence and a clue word feature of each text unit in the text to be segmented, where the word sequence and the clue word feature of each text unit serve as the segmentation feature of each text unit.
It should be noted that, in the following embodiments of the present application, the flow of text segmentation is described on the basis that the word sequence and the clue word feature of each text unit are taken as the segmentation feature of each text unit. However, it is within the scope of the present application to combine other segmentation features with the word sequence and clue word features of each text unit as the segmentation features of each text unit.
S103: and determining the segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit.
In this application, determining the segmentation boundaries of the text to be segmented according to the segmentation features of each text unit means determining, for each text unit and according to its segmentation features, whether that text unit is a segmentation boundary of the text to be segmented; more precisely, whether the starting position of the text unit is a segmentation boundary. In this way, one or more segmentation boundaries of the text to be segmented are determined.
S104: and segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
In this application, after the segmentation boundaries of the text to be segmented are determined, the text units on the two sides of each segmentation boundary are assigned to different paragraphs.
This embodiment discloses a text segmentation method: after the text to be segmented is obtained, the segmentation features of each text unit in the text are obtained, the segmentation boundaries are determined according to those features, and the text is finally segmented based on the determined boundaries. With this scheme, segmentation of the text to be segmented is realized.
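The way the determined boundaries are finally turned into paragraphs (S103/S104) can be pictured with the following minimal sketch, in which the function name and inputs are illustrative:

```python
def segment_text(units, is_boundary):
    """Sketch of S103/S104: group text units into paragraphs given per-unit boundary decisions.

    `units` are the text units of the text to be segmented, and `is_boundary[i]` says
    whether the starting position of unit i is a segmentation boundary.
    """
    paragraphs, current = [], []
    for unit, boundary in zip(units, is_boundary):
        if boundary and current:      # a boundary starts a new paragraph
            paragraphs.append(current)
            current = []
        current.append(unit)
    if current:
        paragraphs.append(current)
    return paragraphs

print(segment_text(["A.", "B.", "C.", "D."], [True, False, True, False]))
# [['A.', 'B.'], ['C.', 'D.']]
```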
In this application, a specific implementation manner for obtaining a word sequence and clue word features of each text unit in a text to be segmented is disclosed, and the implementation manner includes:
s201: and performing word segmentation on each text unit to obtain a word sequence of each text unit.
In the present application, an existing word segmentation system may be adopted to perform word segmentation on each text unit to obtain a word sequence of each text unit.
S202: clue words are determined from the sequence of words based on a predetermined set of clue words.
In this application, the words in the word sequence may be looked up one by one in a preset clue word dictionary: a word that is found in the dictionary is determined to be a clue word, and if none of the words is found in the clue word dictionary, the text unit is determined to contain no clue word.
However, in some cases the word sequence of a text unit contains a large number of words, and words appearing near the end of a text unit are unlikely to be clue words. To improve the efficiency of clue word determination, only the first few words of the word sequence may be looked up in the preset clue word dictionary: if one of these words is found, it is determined to be a clue word, and if none of the first few words is found in the clue word dictionary, the text unit is determined to contain no clue word.
In this application, the training text can be determined in advance. For example, news articles, e-books and the like can be collected from the web as training text; such text carries natural paragraph information, is easy to obtain, and is available at large scale. Alternatively, unsegmented text can be manually annotated with paragraph boundaries to obtain the training text. Once the training text is determined, a clue word dictionary is constructed from it.
The clue word dictionary is built as follows: for the first word of each paragraph in the training text, content words with concrete meaning (such as nouns, adjectives and numerals) are removed and the remaining prepositions, conjunctions and adverbs are retained; the frequency of each retained word in the training text is then counted; the retained words are sorted in descending order of frequency; and a preset number of the top-ranked words form the clue word dictionary.
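A rough sketch of this dictionary construction is shown below; it assumes a part-of-speech-tagged training corpus is available, and the tag set and frequency cutoff are illustrative assumptions:

```python
from collections import Counter

# Assumption: `paragraphs` is a list of tagged paragraphs from the training text,
# each paragraph being a list of (word, pos_tag) pairs.
FUNCTION_POS = {"p", "c", "d"}  # prepositions, conjunctions, adverbs; tag names depend on the tagger

def build_clue_word_dictionary(paragraphs, top_k=50):
    counter = Counter()
    for para in paragraphs:
        first_word, pos = para[0]        # first word of each paragraph
        if pos in FUNCTION_POS:          # drop content words, keep function words
            counter[first_word] += 1
    # sort retained words by frequency and keep the top_k as the clue word dictionary
    return {w for w, _ in counter.most_common(top_k)}
```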
S203: and acquiring the position information of the clue words in the corresponding text units.
In existing clue-word-based text segmentation methods, the same clue word is described with the same word representation.
However, the same clue word can have different meanings in different contexts, and its segmentation-guiding strength differs accordingly. For example, the clue word "last" provides clearly different segmentation guidance in the sentence "Lastly, the natural language understanding task is very challenging." than in the sentence "The last lecture guest is XX." If the clue word "last" were given a single, unified word representation in all sentences, the semantic difference between these two occurrences could not be reflected, which would hurt the accuracy of text segmentation.
To address this problem, this application obtains the position information of clue words in their corresponding text units, so that the clue word feature of each text unit is generated from the position information of its clue words. Based on a clue word's position information, the word meaning at that position can be taken from the word meanings of all words in the whole text unit as the meaning of the clue word. In this way, the same clue word takes different meanings at different positions in different text units, so the same clue word can be described with different word representations.
S204: and generating clue word characteristics of each text unit according to the position information of the clue words in each text unit.
In the present application, the position information of the clue word in each text unit can be determined as the clue word feature of each text unit.
It should be noted that if a text unit contains no clue word, the clue word feature of that text unit is set to a preset value; for example, the clue word feature of a text unit without clue words may be set to {-1}.
For example, suppose only the first 3 words of the word sequence are looked up in the preset clue word dictionary. For the text unit "The product I am going to introduce next is a text segmentation system", the clue word "next" appears at position 1 of its word sequence (with 0 as the starting position), so the clue word feature of this text unit is {1}. For the text unit "At the end of the speech, everyone is reminded to leave the venue in order", the word "last" is not among the first 3 words of its word sequence, so no corresponding clue word is found and the clue word feature is {-1}.
Based on this method, the same clue word can be described with different word representations, so that text segmentation takes into account not only the clue word itself but also its context, which improves the accuracy of text segmentation.
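Under these conventions (looking up only the first k words, counting positions from 0, and using -1 when no clue word is found), the clue word feature of a text unit could be computed as in the following sketch; the function name and example words are illustrative:

```python
def clue_word_feature(word_sequence, clue_words, first_k=3):
    """Return the position of the first clue word among the first_k words, or -1 if none is found."""
    for position, word in enumerate(word_sequence[:first_k]):
        if word in clue_words:
            return position
    return -1

# Example matching the one above: "接下来" ("next") at position 1 gives feature {1}.
print(clue_word_feature(["我", "接下来", "要", "介绍", "的", "产品"], {"接下来", "首先", "最后"}))  # 1
```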
In this application, a specific implementation manner for determining a segmentation boundary of a text to be segmented according to a segmentation feature of each text unit is also disclosed, and the manner may be as follows:
and inputting the segmentation characteristics of each text unit into a text segmentation model to obtain an output result of whether each text unit is a segmentation boundary of the text to be segmented.
The text segmentation model is obtained by training with the segmentation features of each text unit in a training text as training samples and the segmentation boundary identification marking information of the training text as sample labels. The segmentation features of each text unit in the training text are the word sequence and clue word features of that text unit. The input of the text segmentation model is therefore the word sequences and clue word features of the training text, and the output indicates whether each text unit is a segmentation boundary.
In the present application, the output result of whether each text unit is a segmentation boundary may have various representations. As an implementation manner, the output result of whether each text unit is a segmentation boundary may be represented as a probability that each text unit is a segmentation boundary, and when the probability is greater than a preset threshold, the text unit is a segmentation boundary, otherwise, the text unit is not a segmentation boundary. As another possible implementation, the output result of whether each text unit is a segmentation boundary may be represented as a classification result, the text unit being a segmentation boundary when the classification result is a first numerical value, and the text unit not being a segmentation boundary when the classification result is a second numerical value.
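For the first representation, for example, the boundary decisions could be read off the per-unit probabilities with a simple threshold, as in this sketch (the threshold value here is illustrative):

```python
def boundaries_from_probabilities(probs, threshold=0.5):
    """probs[i] is the predicted probability that text unit i is a segmentation boundary."""
    return [p > threshold for p in probs]

print(boundaries_from_probabilities([0.1, 0.8, 0.3, 0.9]))  # [False, True, False, True]
```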
In this application, the training text used for training the text segmentation model may be all or part of the training text described in S202, or may be a training text redetermined by using the method for determining the training text described in S202, which is not limited in this application.
In this application, the segmentation boundary identification marking information of the training text may be the paragraph segmentation marks of the training text. These marks may be annotated manually or obtained by recognizing the existing paragraph structure of the training text; this application places no limitation on how they are obtained.
In this application, a specific implementation manner of a text segmentation model is also disclosed, as shown in fig. 2, fig. 2 is a schematic diagram of a text segmentation model disclosed in this application embodiment, and as can be seen from fig. 2, the text segmentation model includes: a word encoding layer, an attention layer, a fusion layer, a sentence encoding layer and an output layer.
Based on the text segmentation model shown in fig. 2, the present application further discloses a specific implementation manner for inputting the segmentation characteristics of each text unit into the text segmentation model to obtain an output result of whether the initial position of each text unit is the segmentation boundary of the text to be segmented, which specifically includes the following steps:
s301: and acquiring a segment length feature of each text unit by using a text segmentation model, wherein the segment length feature is used for representing segment length information from a last segmentation boundary of each text unit to each text unit.
In this application, the segment length feature of each text unit represents the segment length information from the last segmentation boundary predicted by the text segmentation model to the current text unit; the segment length information may be expressed as the number of text units, words, or characters between the last segmentation boundary and the current text unit.
Since the segment length information is a discrete value with a large range, in this application it can be constrained to the range of 0 to 1 through a nonlinear sigmoid mapping.
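As a small illustration, the raw count can be squashed into (0, 1) with a sigmoid; the scaling applied to the count before the sigmoid is an assumption, since only the sigmoid mapping itself is stated above:

```python
import math

def segment_length_feature(units_since_last_boundary: int, scale: float = 10.0) -> float:
    """Constrain the discrete segment length to (0, 1) with a sigmoid; `scale` is an assumed hyperparameter."""
    return 1.0 / (1.0 + math.exp(-units_since_last_boundary / scale))

print(segment_length_feature(5))  # ~0.62
```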
It should be noted that the text segmentation model in this application has a sequential structure: when predicting whether the current text unit is a segmentation boundary, the results for all text units preceding it in the text are already available. Therefore, before word coding is performed on the segmentation features of the current text unit, for example immediately after the previous text unit has been processed, the segment length feature of the current text unit can be obtained and stored in the text segmentation model for use by the fusion layer.
Of course, in this application the segment length feature of the current text unit may be obtained at any time before the fusion layer uses it; this application places no limitation on the timing. However, since the segment length feature is used only at the fusion layer, if it is obtained before the fusion layer needs it, it has to be stored.
In addition, a dedicated module for obtaining the segment length feature of the current text unit may be added to the text segmentation model, or such a module may be added to the word coding layer, the attention layer, or the fusion layer; this application places no limitation on this.
S302: and performing word coding on the segmentation characteristics of each text unit by using a word coding layer of the text segmentation model to obtain the semantic representation of each text unit.
In this application, the word coding layer of the text segmentation model can perform word coding on the word sequence in the segmentation features of each text unit to obtain a word meaning representation of each text unit, and then obtain a clue word meaning representation of each text unit based on the word meaning representation and the clue word feature in the segmentation features of that text unit; the word meaning representation and the clue word meaning representation serve as the semantic representation.
Specifically, in this application, the word sequence W_i = {w_{i,1}, w_{i,2}, ..., w_{i,m}} of a text unit can be processed with a word embedding method to obtain the word vectors of the word sequence, and the word meaning representations H_i = {h_{i,1}, h_{i,2}, ..., h_{i,m}} of the word sequence are then obtained with a bidirectional LSTM structure, where i is the index of the text unit and m is the number of words in the i-th text unit.
The word meaning representation h_{i,t} of the word at time t is obtained by concatenating the hidden-layer output generated at time t after the forward LSTM reads the current text unit in order with the hidden-layer output generated at time m-t after the backward LSTM reads the current text unit in reverse order.
In this application, based on the position information of the clue word in each text unit, the word meaning representation at the corresponding position can be extracted from the word meaning representations of that text unit as its clue word semantic representation. Obtained in this way, the clue word meaning representation of each text unit carries information about both the clue word itself and its context.
In particular, from the word meaning representations H_i of each text unit, the word meaning representation at the position indicated by the clue word feature is extracted as the clue word meaning representation, denoted h_i^clue. If the clue word feature of the current text unit is {-1}, h_i^clue is set to u_pad, where u_pad is a parameter trained together with the text segmentation model.
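A minimal PyTorch sketch of such a word encoding layer is given below; the dimensions, the batched tensor layout, and all class and variable names are illustrative assumptions, while the use of a trainable u_pad vector for units without clue words follows the description above:

```python
import torch
import torch.nn as nn

class WordEncodingLayer(nn.Module):
    """Embeds the word sequence, runs a bidirectional LSTM, and picks out the clue word representation."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.u_pad = nn.Parameter(torch.zeros(2 * hidden_dim))  # trainable vector used when no clue word exists

    def forward(self, word_ids, clue_position):
        # word_ids: (batch, m) word indices of one text unit; clue_position: (batch,) position or -1
        h = self.bilstm(self.embedding(word_ids))[0]             # (batch, m, 2*hidden_dim) word meaning representations
        batch = torch.arange(word_ids.size(0))
        clue = torch.where(clue_position.unsqueeze(1) >= 0,
                           h[batch, clue_position.clamp(min=0)],  # word representation at the clue position
                           self.u_pad.expand(word_ids.size(0), -1))
        return h, clue
```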
S303: and performing attention calculation on the semantic representation of each text unit by using an attention layer of the text segmentation model to obtain the semantic representation of the sentence of each text unit.
The attention layer is used for compressing the semantic representation of the text unit to obtain a sentence representation with a fixed length. In the application, attention calculation can be performed on the word meaning representation to obtain a first sentence meaning representation of each text unit; and performing attention calculation on the clue word semantic representations to obtain a second sentence semantic representation of each text unit, wherein the first sentence semantic representation and the second sentence semantic representation are used as the sentence semantic representations.
In particular, different word representations can be given different attention weights and summed with those weights to obtain a sentence representation. The attention mechanism consists of a query, keys, and values: the correlation between the query and each key is computed as the attention weight, and weighting the values with these weights focuses attention on the important content. This scheme introduces global vectors u_w and u_clue as queries, i.e. u_w and u_clue are shared by all text units of different texts. They can be read as simple queries over all words of a text unit, asking respectively "which words in the current sentence are important?" and "which clue words in the current sentence are important?". In this scheme the keys and the values are the same, namely the representations h_{i,t} of the word sequence (or of the clue words) in the current text unit.
The first sentence semantic representation s_i^w is computed as follows:
u_{i,t} = tanh(W_a h_{i,t} + b_a)
α_{i,t} = exp(u_{i,t}^T u_w) / Σ_k exp(u_{i,k}^T u_w)
s_i^w = Σ_t α_{i,t} h_{i,t}
where h_{i,t} is the word meaning representation of the word at time t, u_w and u_clue are the global vectors, and W_a and b_a are model training parameters. The second sentence semantic representation s_i^clue is calculated in the same way from the clue word meaning representations, with u_clue as the query.
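Following the formulas above, an attention pooling layer could be sketched as follows; one instance with query u_w is applied to the word representations and another with query u_clue to the clue word representations, and the dimensions are assumed:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Compresses variable-length word representations into a fixed-length sentence representation."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)               # W_a, b_a
        self.query = nn.Parameter(torch.randn(dim))   # global query vector u_w (or u_clue)

    def forward(self, h):
        # h: (batch, m, dim) word meaning representations of one text unit
        u = torch.tanh(self.proj(h))                        # u_{i,t} = tanh(W_a h_{i,t} + b_a)
        alpha = torch.softmax(u.matmul(self.query), dim=1)  # attention weights over the m positions
        return (alpha.unsqueeze(-1) * h).sum(dim=1)         # weighted sum -> (batch, dim)
```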
S304: and fusing the semantic representation of the sentence of each text unit and the segment length feature of each text unit by utilizing the fusion layer of the text segmentation model to obtain the complete word representation of the sentence of each text unit.
In this application, the fusion layer provides auxiliary information from the clue word feature and the segment length feature when the semantic information alone is ambiguous. At the same time, as the paragraph length grows, the text segmentation model receives a corresponding segmentation incentive, which keeps the spacing of paragraphs across the whole text relatively uniform. While the segment length feature constrains the paragraph spacing, the clue word feature guides the model to place boundaries at sentence boundaries with clear clue word information, yielding a more reasonable segmentation result.
In the application, when the fusion layer fuses semantic representations of sentences of each text unit and segment length features of each text unit, an adopted fusion strategy can be a Gate structure, and the calculation is as follows:
g_i = σ(W_g · x_i + b_g)
s̃_i = g_i ⊙ x_i
where x_i denotes the concatenation of the sentence semantic representations s_i^w and s_i^clue of the i-th text unit with its segment length feature, W_g and b_g are model training parameters, σ is the sigmoid function, ⊙ is element-wise multiplication, and s̃_i is the complete word representation of the sentence of the text unit.
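One way to realize such a Gate structure is sketched below; this particular gating form and the dimensions are assumptions consistent with the description rather than the exact formulation of the patent:

```python
import torch
import torch.nn as nn

class GateFusion(nn.Module):
    """Fuses the two sentence semantic representations with the segment length feature via a gate."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim + 1, 2 * dim + 1)  # W_g, b_g over [s_w; s_clue; length]

    def forward(self, s_w, s_clue, seg_len):
        # s_w, s_clue: (batch, dim); seg_len: (batch, 1) segment length feature in (0, 1)
        x = torch.cat([s_w, s_clue, seg_len], dim=-1)
        g = torch.sigmoid(self.gate(x))   # gate values in (0, 1)
        return g * x                      # gated "complete" sentence-level representation
```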
S305: and carrying out sentence coding on the complete word representation of the sentence of each text unit by using a sentence coding layer of the text segmentation model to obtain the sentence representation of each text unit.
The complete word representation s̃_i of the sentence of each text unit obtained in S304 is related only to the current text unit. An LSTM can therefore be used in this application to model the interrelationships between text units and learn the semantic transitions between them, from which the segmentation boundaries are obtained: passing the complete word representations s̃_i through an LSTM structure yields the sentence representation d_i of each text unit.
In different scenarios, different LSTM structures may be used, for example, in a real-time speech recognition scenario, future sentence information cannot be obtained, and thus the forward LSTM structure is used to extract deep semantic information. In an off-line scenario, for example, in a question-answering system, a bidirectional LSTM structure can be adopted to obtain richer sentence representations.
S306: and calculating the sentence representation of each text unit and the sentence representation at the previous moment by using an output layer of the text segmentation model to obtain an output result of whether each text unit is the segmentation boundary of the text to be segmented.
In this application, the output layer of the text segmentation model computes, from the sentence representation of each text unit and the sentence representation at the previous moment, an output result through a softmax function indicating whether each text unit is a segmentation boundary of the text to be segmented.
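Putting S305 and S306 together, the sentence coding layer and output layer could be sketched as follows, using a forward LSTM as in the real-time case; the two-class softmax head over the concatenation of the current and previous sentence representations is an assumption about the exact output form:

```python
import torch
import torch.nn as nn

class SentenceLevelSegmenter(nn.Module):
    """Encodes per-unit representations across the text and classifies each unit as boundary / not boundary."""
    def __init__(self, dim, hidden_dim=128):
        super().__init__()
        self.sentence_lstm = nn.LSTM(dim, hidden_dim, batch_first=True)  # forward LSTM for the real-time case
        self.classifier = nn.Linear(2 * hidden_dim, 2)                   # uses current and previous sentence representation

    def forward(self, sentence_inputs):
        # sentence_inputs: (batch, n_units, dim) complete representations from the fusion layer
        d, _ = self.sentence_lstm(sentence_inputs)                        # (batch, n_units, hidden_dim)
        prev = torch.cat([torch.zeros_like(d[:, :1]), d[:, :-1]], dim=1)  # sentence representation at the previous moment
        logits = self.classifier(torch.cat([d, prev], dim=-1))
        return torch.softmax(logits, dim=-1)  # per-unit probabilities of (not boundary, boundary)
```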
The following describes a text segmentation apparatus disclosed in an embodiment of the present application, and the text segmentation apparatus described below and the text segmentation method described above may be referred to correspondingly.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a text segmentation apparatus disclosed in the embodiment of the present application. As shown in fig. 3, the text segmentation apparatus may include:
a segmented text acquisition unit 11, configured to acquire a text to be segmented;
a segmentation feature obtaining unit 12, configured to obtain a segmentation feature of each text unit in the text to be segmented;
a segmentation boundary determining unit 13, configured to determine a segmentation boundary of the text to be segmented according to the segmentation feature of each text unit;
a dividing unit 14, configured to divide the text to be divided based on a dividing boundary of the text to be divided.
Optionally, the segmentation feature obtaining unit includes:
and the word sequence and clue word characteristic acquisition unit is used for acquiring the word sequence and clue word characteristics of each text unit in the text to be segmented, and the word sequence and clue word characteristics of each text unit are used as the segmentation characteristics of each text unit.
Optionally, the word sequence and clue feature obtaining unit includes:
the word segmentation unit is used for segmenting each text unit to obtain a word sequence of each text unit;
a clue word determining unit, configured to determine clue words from the word sequence based on a predetermined clue word set;
the clue word position information acquisition unit is used for acquiring the position information of the clue words in the corresponding text units;
and the clue word characteristic generating unit is used for generating clue word characteristics of each text unit according to the position information of the clue words in each text unit.
Optionally, the segmentation boundary determining unit includes:
the model application unit is used for inputting the segmentation characteristics of each text unit into a text segmentation model to obtain an output result of whether each text unit is a segmentation boundary of the text to be segmented; the text segmentation model is obtained by taking segmentation features of each text unit in a training text as a training sample and taking segmentation boundary identification marking information of the training text as a sample label for training.
Optionally, the text segmentation model includes:
a word encoding layer, an attention layer, a fusion layer, a sentence encoding layer and an output layer.
Optionally, the model application unit includes:
the segment length feature acquisition unit is used for acquiring the segment length feature of each text unit by using a text segmentation model, wherein the segment length feature is used for representing the segment length information from the last segmentation boundary of each text unit to each text unit;
the word coding unit is used for carrying out word coding on the segmentation characteristics of each text unit by utilizing a word coding layer of the text segmentation model to obtain semantic representation of each text unit;
the attention calculation unit is used for performing attention calculation on the semantic representation of each text unit by utilizing an attention layer of the text segmentation model to obtain the semantic representation of the sentence of each text unit;
the fusion unit is used for fusing the semantic representation of the sentence of each text unit and the segment length feature of each text unit by utilizing the fusion layer of the text segmentation model to obtain the complete word representation of the sentence of each text unit;
the sentence coding unit is used for coding the complete word representation of the sentence of each text unit by utilizing a sentence coding layer of the text segmentation model to obtain the sentence representation of each text unit;
and the calculation unit is used for calculating the sentence representation of each text unit and the sentence representation at the previous moment by utilizing the output layer of the text segmentation model to obtain an output result of whether each text unit is the segmentation boundary of the text to be segmented.
Optionally, the word encoding unit includes:
the first word coding subunit is used for carrying out word coding on the word sequence in the segmentation characteristics of each text unit to obtain a word meaning representation of each text unit;
the second word coding subunit is used for obtaining a clue word meaning representation of each text unit based on the word meaning representation of each text unit and the clue word features in the segmentation features of each text unit; the word meaning representation and the clue word meaning representation serve as the semantic representation.
Optionally, the attention calculation unit comprises:
the first attention calculation unit is used for carrying out attention calculation on the word meaning representation to obtain a first sentence meaning representation of each text unit;
and the second attention calculation unit is used for performing attention calculation on the clue word semantic representations to obtain a second sentence semantic representation of each text unit, and the first sentence semantic representation and the second sentence semantic representation are used as the sentence semantic representations.
Fig. 4 is a block diagram of a hardware structure of a text segmentation apparatus disclosed in an embodiment of the present application, and referring to fig. 4, the hardware structure of the text segmentation apparatus may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
the memory 3 may include a high-speed RAM and may further include a non-volatile memory, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring a text to be segmented;
acquiring the segmentation characteristics of each text unit in the text to be segmented;
determining a segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit;
and segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a storage medium, where a program suitable for execution by a processor may be stored, where the program is configured to:
acquiring a text to be segmented;
acquiring the segmentation characteristics of each text unit in the text to be segmented;
determining a segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit;
and segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A method of text segmentation, comprising:
acquiring a text to be segmented;
acquiring the segmentation characteristics of each text unit in the text to be segmented;
determining a segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit;
and segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
2. The method according to claim 1, wherein the obtaining the segmentation feature of each text unit in the text to be segmented comprises:
and acquiring the word sequence and clue word characteristics of each text unit in the text to be segmented, wherein the word sequence and clue word characteristics of each text unit are used as the segmentation characteristics of each text unit.
3. The method of claim 2, wherein obtaining the word sequence and clue word features of each text unit in the text to be segmented comprises:
performing word segmentation on each text unit to obtain a word sequence of each text unit;
determining clue words from the word sequence based on a predetermined set of clue words;
acquiring position information of the clue words in corresponding text units;
and generating clue word characteristics of each text unit according to the position information of the clue words in each text unit.
4. The method according to claim 2, wherein the determining the segmentation boundary of the text to be segmented according to the segmentation feature of each text unit comprises:
inputting the segmentation characteristics of each text unit into a text segmentation model to obtain an output result of whether each text unit is a segmentation boundary of the text to be segmented; the text segmentation model is obtained by taking segmentation features of each text unit in a training text as a training sample and taking segmentation boundary identification marking information of the training text as a sample label for training.
5. The method of claim 4, wherein the text segmentation model comprises:
a word encoding layer, an attention layer, a fusion layer, a sentence encoding layer and an output layer.
6. The method according to claim 5, wherein the inputting the segmentation feature of each text unit into the text segmentation model to obtain an output result of whether the starting position of each text unit is the segmentation boundary of the text to be segmented comprises:
acquiring a segment length feature of each text unit by using a text segmentation model, wherein the segment length feature is used for representing segment length information from a last segmentation boundary of each text unit to each text unit;
performing word coding on the segmentation characteristics of each text unit by using a word coding layer of the text segmentation model to obtain semantic representation of each text unit;
performing attention calculation on the semantic representation of each text unit by using an attention layer of the text segmentation model to obtain the semantic representation of a sentence of each text unit;
fusing the semantic representation of the sentence of each text unit and the segment length feature of each text unit by utilizing a fusion layer of the text segmentation model to obtain the complete word representation of the sentence of each text unit;
sentence coding is carried out on the complete word representation of the sentence of each text unit by utilizing a sentence coding layer of the text segmentation model, and the sentence representation of each text unit is obtained;
and calculating the sentence representation of each text unit and the sentence representation at the previous moment by using an output layer of the text segmentation model to obtain an output result of whether each text unit is the segmentation boundary of the text to be segmented.
7. The method according to claim 6, wherein the word encoding the segmentation features of each text unit by using the word encoding layer of the text segmentation model to obtain the semantic representation of each text unit comprises:
performing word coding on the word sequence in the segmentation characteristics of each text unit to obtain a word meaning representation of each text unit;
obtaining a clue word meaning representation of each text unit based on the word meaning representation of each text unit and the clue word features in the segmentation features of each text unit; the word meaning representation and the clue word meaning representation serve as the semantic representation.
8. The method according to claim 7, wherein said performing attention calculation on semantic representation of each text unit by using attention layer of the text segmentation model to obtain sentence semantic representation of each text unit comprises:
performing attention calculation on the word meaning representation to obtain a first sentence meaning representation of each text unit;
and performing attention calculation on the clue word semantic representations to obtain a second sentence semantic representation of each text unit, wherein the first sentence semantic representation and the second sentence semantic representation are used as the sentence semantic representations.
9. A text segmentation apparatus, comprising:
the segmentation text acquisition unit is used for acquiring a text to be segmented;
the segmentation feature acquisition unit is used for acquiring the segmentation features of each text unit in the text to be segmented;
the segmentation boundary determining unit is used for determining the segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit;
and the segmentation unit is used for segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
10. A text segmentation device comprising a memory and a processor;
the memory is used for storing programs;
the processor, configured to execute the program, implementing the steps of the text segmentation method according to any one of claims 1 to 8.
11. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the text segmentation method according to any one of claims 1 to 8.
CN201911398383.1A 2019-12-30 2019-12-30 Text segmentation method, related device and readable storage medium Active CN111199150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911398383.1A CN111199150B (en) 2019-12-30 2019-12-30 Text segmentation method, related device and readable storage medium


Publications (2)

Publication Number Publication Date
CN111199150A true CN111199150A (en) 2020-05-26
CN111199150B CN111199150B (en) 2024-04-16

Family

ID=70744535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911398383.1A Active CN111199150B (en) 2019-12-30 2019-12-30 Text segmentation method, related device and readable storage medium

Country Status (1)

Country Link
CN (1) CN111199150B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004157337A (en) * 2002-11-06 2004-06-03 Nippon Telegr & Teleph Corp <Ntt> Method, device and program for topic boundary determination
US20110252010A1 (en) * 2008-12-31 2011-10-13 Alibaba Group Holding Limited Method and System of Selecting Word Sequence for Text Written in Language Without Word Boundary Markers
US9141867B1 (en) * 2012-12-06 2015-09-22 Amazon Technologies, Inc. Determining word segment boundaries
US20140214402A1 (en) * 2013-01-25 2014-07-31 Cisco Technology, Inc. Implementation of unsupervised topic segmentation in a data communications environment
US20150134320A1 (en) * 2013-11-14 2015-05-14 At&T Intellectual Property I, L.P. System and method for translating real-time speech using segmentation based on conjunction locations
CN107229609A (en) * 2016-03-25 2017-10-03 佳能株式会社 Method and apparatus for splitting text
CN107480143A (en) * 2017-09-12 2017-12-15 山东师范大学 Dialogue topic dividing method and system based on context dependence
WO2019214145A1 (en) * 2018-05-10 2019-11-14 平安科技(深圳)有限公司 Text sentiment analyzing method, apparatus and storage medium
CN109829151A (en) * 2018-11-27 2019-05-31 国网浙江省电力有限公司 A kind of text segmenting method based on layering Di Li Cray model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘耀; 帅远华; 龚幸伟; 黄毅: "基于领域本体的文本分割方法研究" (Research on a text segmentation method based on domain ontology), no. 01, pages 128-132 *

Also Published As

Publication number Publication date
CN111199150B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CN110634487B (en) Bilingual mixed speech recognition method, device, equipment and storage medium
CN108536654A (en) Identify textual presentation method and device
CN107679032A (en) Voice changes error correction method and device
JP6677419B2 (en) Voice interaction method and apparatus
CN112784696B (en) Lip language identification method, device, equipment and storage medium based on image identification
CN114580382A (en) Text error correction method and device
CN113239666B (en) Text similarity calculation method and system
CN112017643B (en) Speech recognition model training method, speech recognition method and related device
CN113158687B (en) Semantic disambiguation method and device, storage medium and electronic device
CN113553848A (en) Long text classification method, system, electronic equipment and computer readable storage medium
CN115859164A (en) Method and system for identifying and classifying building entities based on prompt
CN113076720B (en) Long text segmentation method and device, storage medium and electronic device
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN112151019A (en) Text processing method and device and computing equipment
JP5278425B2 (en) Video segmentation apparatus, method and program
CN112069816A (en) Chinese punctuation adding method, system and equipment
CN111199150B (en) Text segmentation method, related device and readable storage medium
CN115129843A (en) Dialog text abstract extraction method and device
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN111090720B (en) Hot word adding method and device
CN114417891A (en) Reply sentence determination method and device based on rough semantics and electronic equipment
CN115033683A (en) Abstract generation method, device, equipment and storage medium
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
CN113012685B (en) Audio recognition method and device, electronic equipment and storage medium
CN112634878A (en) Speech recognition post-processing method and system and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant