CN111199150A - Text segmentation method, related device and readable storage medium - Google Patents

Text segmentation method, related device and readable storage medium Download PDF

Info

Publication number
CN111199150A
Authority
CN
China
Prior art keywords
text
segmentation
unit
word
text unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911398383.1A
Other languages
Chinese (zh)
Other versions
CN111199150B (en)
Inventor
闫莉
孔常青
万根顺
高建清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201911398383.1A priority Critical patent/CN111199150B/en
Publication of CN111199150A publication Critical patent/CN111199150A/en
Application granted granted Critical
Publication of CN111199150B publication Critical patent/CN111199150B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a text segmentation method, a related device and a readable storage medium. After a text to be segmented is obtained, the segmentation features of each text unit in the text are obtained, the segmentation boundaries of the text are determined according to the segmentation features of each text unit, and the text is finally segmented based on those boundaries. With this scheme, segmentation of the text to be segmented can be realized.

Description

Text segmentation method, related device and readable storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a text segmentation method, a related device, and a readable storage medium.
Background
With the rapid development of statistical natural language processing technology, text segmentation has increasingly become an important research direction. Text segmentation determines the segmentation boundaries of an unsegmented long text and splits the text into segments at those boundaries. Compared with the unsegmented long text, the resulting segments are shorter and better suited to users' reading habits; meanwhile, each segment has a clear, focused theme, which helps users quickly extract key information and reduces reading effort.
Therefore, it is desirable to provide a text segmentation method.
Disclosure of Invention
In view of the foregoing problems, the present application provides a text segmentation method, a related device and a readable storage medium. The specific scheme is as follows:
a text segmentation method, comprising:
acquiring a text to be segmented;
acquiring the segmentation characteristics of each text unit in the text to be segmented;
determining a segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit;
and segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
Optionally, the obtaining the segmentation feature of each text unit in the text to be segmented includes:
and acquiring the word sequence and clue word characteristics of each text unit in the text to be segmented, wherein the word sequence and clue word characteristics of each text unit are used as the segmentation characteristics of each text unit.
Optionally, the obtaining the word sequence and clue word features of each text unit in the text to be segmented includes:
performing word segmentation on each text unit to obtain a word sequence of each text unit;
determining clue words from the word sequence based on a predetermined set of clue words;
acquiring position information of the clue words in corresponding text units;
and generating clue word characteristics of each text unit according to the position information of the clue words in each text unit.
Optionally, the determining the segmentation boundary of the text to be segmented according to the segmentation feature of each text unit includes:
inputting the segmentation characteristics of each text unit into a text segmentation model to obtain an output result of whether each text unit is a segmentation boundary of the text to be segmented; the text segmentation model is obtained by taking segmentation features of each text unit in a training text as a training sample and taking segmentation boundary identification marking information of the training text as a sample label for training.
Optionally, the text segmentation model includes:
a word encoding layer, an attention layer, a fusion layer, a sentence encoding layer and an output layer.
Optionally, the inputting the segmentation feature of each text unit into the text segmentation model to obtain an output result of whether the starting position of each text unit is the segmentation boundary of the text to be segmented includes:
acquiring a segment length feature of each text unit by using a text segmentation model, wherein the segment length feature is used for representing segment length information from a last segmentation boundary of each text unit to each text unit;
performing word coding on the segmentation characteristics of each text unit by using a word coding layer of the text segmentation model to obtain semantic representation of each text unit;
performing attention calculation on the semantic representation of each text unit by using an attention layer of the text segmentation model to obtain the semantic representation of a sentence of each text unit;
fusing the semantic representation of the sentence of each text unit and the segment length feature of each text unit by utilizing a fusion layer of the text segmentation model to obtain the complete word representation of the sentence of each text unit;
sentence coding is carried out on the complete word representation of the sentence of each text unit by utilizing a sentence coding layer of the text segmentation model, and the sentence representation of each text unit is obtained;
and calculating the sentence representation of each text unit and the sentence representation at the previous moment by using an output layer of the text segmentation model to obtain an output result of whether each text unit is the segmentation boundary of the text to be segmented.
Optionally, the performing word encoding on the segmentation features of each text unit by using a word encoding layer of the text segmentation model to obtain a semantic representation of each text unit includes:
performing word coding on the word sequence in the segmentation characteristics of each text unit to obtain a word meaning representation of each text unit;
obtaining a clue word meaning representation of each text unit based on the word meaning representation of each text unit and the clue word features in the segmentation features of each text unit; the word meaning representation and the clue word meaning representation serve as the semantic representation.
Optionally, the performing attention calculation on the semantic representation of each text unit by using the attention layer of the text segmentation model to obtain the sentence semantic representation of each text unit includes:
performing attention calculation on the word meaning representation to obtain a first sentence meaning representation of each text unit;
and performing attention calculation on the clue word semantic representations to obtain a second sentence semantic representation of each text unit, wherein the first sentence semantic representation and the second sentence semantic representation are used as the sentence semantic representations.
A text segmentation apparatus comprising:
the segmentation text acquisition unit is used for acquiring a text to be segmented;
the segmentation feature acquisition unit is used for acquiring the segmentation features of each text unit in the text to be segmented;
the segmentation boundary determining unit is used for determining the segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit;
and the segmentation unit is used for segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
Optionally, the segmentation feature obtaining unit includes:
and the word sequence and clue word characteristic acquisition unit is used for acquiring the word sequence and clue word characteristics of each text unit in the text to be segmented, and the word sequence and clue word characteristics of each text unit are used as the segmentation characteristics of each text unit.
Optionally, the word sequence and clue feature obtaining unit includes:
the word segmentation unit is used for segmenting each text unit to obtain a word sequence of each text unit;
a clue word determining unit, configured to determine clue words from the word sequence based on a predetermined clue word set;
the clue word position information acquisition unit is used for acquiring the position information of the clue words in the corresponding text units;
and the clue word characteristic generating unit is used for generating clue word characteristics of each text unit according to the position information of the clue words in each text unit.
Optionally, the segmentation boundary determining unit includes:
the model application unit is used for inputting the segmentation characteristics of each text unit into a text segmentation model to obtain an output result of whether each text unit is a segmentation boundary of the text to be segmented; the text segmentation model is obtained by taking segmentation features of each text unit in a training text as a training sample and taking segmentation boundary identification marking information of the training text as a sample label for training.
Optionally, the text segmentation model includes:
a word encoding layer, an attention layer, a fusion layer, a sentence encoding layer and an output layer.
Optionally, the model application unit includes:
the segment length feature acquisition unit is used for acquiring the segment length feature of each text unit by using a text segmentation model, wherein the segment length feature is used for representing the segment length information from the last segmentation boundary of each text unit to each text unit;
the word coding unit is used for carrying out word coding on the segmentation characteristics of each text unit by utilizing a word coding layer of the text segmentation model to obtain semantic representation of each text unit;
the attention calculation unit is used for performing attention calculation on the semantic representation of each text unit by utilizing an attention layer of the text segmentation model to obtain the semantic representation of the sentence of each text unit;
the fusion unit is used for fusing the semantic representation of the sentence of each text unit and the segment length feature of each text unit by utilizing the fusion layer of the text segmentation model to obtain the complete word representation of the sentence of each text unit;
the sentence coding unit is used for coding the complete word representation of the sentence of each text unit by utilizing a sentence coding layer of the text segmentation model to obtain the sentence representation of each text unit;
and the calculation unit is used for calculating the sentence representation of each text unit and the sentence representation at the previous moment by utilizing the output layer of the text segmentation model to obtain an output result of whether each text unit is the segmentation boundary of the text to be segmented.
Optionally, the word encoding unit includes:
the first word coding subunit is used for carrying out word coding on the word sequence in the segmentation characteristics of each text unit to obtain a word meaning representation of each text unit;
the second word coding subunit is used for obtaining a clue word meaning representation of each text unit based on the word meaning representation of each text unit and the clue word features in the segmentation features of each text unit; the word meaning representation and the clue word meaning representation serve as the semantic representation.
Optionally, the attention calculation unit comprises:
the first attention calculation unit is used for carrying out attention calculation on the word meaning representation to obtain a first sentence meaning representation of each text unit;
and the second attention calculation unit is used for performing attention calculation on the clue word semantic representations to obtain a second sentence semantic representation of each text unit, and the first sentence semantic representation and the second sentence semantic representation are used as the sentence semantic representations.
A text segmentation device comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the text segmentation method.
A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the text segmentation method as described above.
Through the above technical scheme, the application discloses a text segmentation method, a related device and a readable storage medium: after a text to be segmented is obtained, the segmentation features of each text unit in the text are obtained, the segmentation boundaries of the text are determined according to the segmentation features of each text unit, and the text is finally segmented based on those boundaries. With this scheme, segmentation of the text to be segmented can be realized.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flowchart of a text segmentation method disclosed in an embodiment of the present application;
FIG. 2 is a diagram of a text segmentation model according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a text segmentation apparatus disclosed in an embodiment of the present application;
fig. 4 is a block diagram of a hardware structure of a text segmentation apparatus according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The text segmentation method disclosed in this application can be applied to the post-processing modules of systems such as speech recognition systems, human-machine question-answering systems, and information retrieval systems. The method is described through the following embodiments.
Referring to fig. 1, fig. 1 is a schematic flowchart of a text segmentation method disclosed in an embodiment of the present application, where the method includes:
s101: and acquiring a text to be segmented.
In this application, the text to be segmented may be any unsegmented text, for example, an unsegmented lecture transcript obtained by performing speech recognition on a user's lecture audio with a speech recognition system, an unsegmented e-book, or the like.
In this application, the text to be segmented may be uploaded by a user, or it may be obtained from the output of other natural language processing systems (such as a speech recognition system, a human-machine question-answering system, or an information retrieval system); this application places no limitation on its source.
S102: and acquiring the segmentation characteristics of each text unit in the text to be segmented.
In this application, each text unit of the text to be segmented may be a sentence delimited by ending punctuation (e.g., a period, exclamation mark, or question mark) in the text, or a clause or phrase delimited by other punctuation marks; this application places no limitation on the granularity of text units.
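As a minimal illustration of this kind of text unit splitting (the punctuation set and the helper name are assumptions, not part of the disclosure), sentence-level text units could be obtained as follows:

```python
import re

# Assumed helper: split unsegmented text into sentence-level text units
# using Chinese/Western ending punctuation (period, exclamation mark, question mark).
def split_into_text_units(text: str) -> list[str]:
    parts = re.split(r'(?<=[。！？.!?])', text)
    return [p.strip() for p in parts if p.strip()]

units = split_into_text_units("首先介绍背景。接下来介绍方法！效果如何？")
print(units)  # ['首先介绍背景。', '接下来介绍方法！', '效果如何？']
```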
In the present application, the segmentation feature of each text unit may be any feature that can be used to determine the segmentation boundary of the text to be segmented, such as a word sequence of each text unit in the text to be segmented, a clue word feature of each text unit in the text to be segmented, a segment length segmentation threshold of the text to be segmented, and the like, and the present application is not limited in any way.
It should be noted that the word sequence of a text unit is the sequence of words it contains. Clue words are a class of words that strongly indicate text segmentation, typically words that often appear at the beginning of a paragraph, such as "first", "next", and "last". The clue word feature may be any feature representing information such as the content, number, and position of the clue words of each text unit in the text to be segmented. The segment length segmentation threshold is a threshold that limits the length of the segments obtained after segmentation.
As a preferred embodiment, the obtaining of the segmentation feature of each text unit in the text to be segmented may include obtaining a word sequence and a clue word feature of each text unit in the text to be segmented, where the word sequence and the clue word feature of each text unit serve as the segmentation feature of each text unit.
It should be noted that, in the following embodiments of the present application, the flow of text segmentation is described on the basis that the word sequence and the clue word feature of each text unit are taken as the segmentation feature of each text unit. However, it is within the scope of the present application to combine other segmentation features with the word sequence and clue word features of each text unit as the segmentation features of each text unit.
S103: and determining the segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit.
In this application, determining the segmentation boundaries of the text to be segmented according to the segmentation features of each text unit means determining, for each text unit and according to its segmentation features, whether that text unit is a segmentation boundary of the text to be segmented; more precisely, whether the starting position of the text unit is a segmentation boundary. In this way, one or more segmentation boundaries of the text to be segmented are determined.
S104: and segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
In this application, after the segmentation boundaries of the text to be segmented are determined, the text units on the two sides of each segmentation boundary are assigned to different paragraphs.
This embodiment discloses a text segmentation method: after the text to be segmented is obtained, the segmentation features of each text unit in the text are obtained, the segmentation boundaries are determined according to those features, and the text is finally segmented based on the determined boundaries. With this scheme, segmentation of the text to be segmented is realized.
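The way the determined boundaries are finally turned into paragraphs (S103/S104) can be pictured with the following minimal sketch, in which the function name and inputs are illustrative:

```python
def segment_text(units, is_boundary):
    """Sketch of S103/S104: group text units into paragraphs given per-unit boundary decisions.

    `units` are the text units of the text to be segmented, and `is_boundary[i]` says
    whether the starting position of unit i is a segmentation boundary.
    """
    paragraphs, current = [], []
    for unit, boundary in zip(units, is_boundary):
        if boundary and current:      # a boundary starts a new paragraph
            paragraphs.append(current)
            current = []
        current.append(unit)
    if current:
        paragraphs.append(current)
    return paragraphs

print(segment_text(["A.", "B.", "C.", "D."], [True, False, True, False]))
# [['A.', 'B.'], ['C.', 'D.']]
```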
In this application, a specific implementation manner for obtaining a word sequence and clue word features of each text unit in a text to be segmented is disclosed, and the implementation manner includes:
s201: and performing word segmentation on each text unit to obtain a word sequence of each text unit.
In the present application, an existing word segmentation system may be adopted to perform word segmentation on each text unit to obtain a word sequence of each text unit.
S202: clue words are determined from the sequence of words based on a predetermined set of clue words.
In this application, the words in the word sequence may be looked up one by one in a preset clue word dictionary: a word that is found in the dictionary is determined to be a clue word, and if none of the words is found in the clue word dictionary, the text unit is determined to contain no clue word.
However, in some cases the word sequence of a text unit contains a large number of words, and words appearing near the end of a text unit are unlikely to be clue words. To improve the efficiency of clue word determination, only the first few words of the word sequence may be looked up in the preset clue word dictionary: if one of these words is found, it is determined to be a clue word, and if none of the first few words is found in the clue word dictionary, the text unit is determined to contain no clue word.
In this application, the training text can be determined in advance. For example, news articles, e-books and the like can be collected from the web as training text; such text carries natural paragraph information, is easy to obtain, and is available at large scale. Alternatively, unsegmented text can be manually annotated with paragraph boundaries to obtain the training text. Once the training text is determined, a clue word dictionary is constructed from it.
The clue word dictionary is built as follows: for the first word of each paragraph in the training text, content words with concrete meaning (such as nouns, adjectives and numerals) are removed and the remaining prepositions, conjunctions and adverbs are retained; the frequency of each retained word in the training text is then counted; the retained words are sorted in descending order of frequency; and a preset number of the top-ranked words form the clue word dictionary.
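A rough sketch of this dictionary construction is shown below; it assumes a part-of-speech-tagged training corpus is available, and the tag set and frequency cutoff are illustrative assumptions:

```python
from collections import Counter

# Assumption: `paragraphs` is a list of tagged paragraphs from the training text,
# each paragraph being a list of (word, pos_tag) pairs.
FUNCTION_POS = {"p", "c", "d"}  # prepositions, conjunctions, adverbs; tag names depend on the tagger

def build_clue_word_dictionary(paragraphs, top_k=50):
    counter = Counter()
    for para in paragraphs:
        first_word, pos = para[0]        # first word of each paragraph
        if pos in FUNCTION_POS:          # drop content words, keep function words
            counter[first_word] += 1
    # sort retained words by frequency and keep the top_k as the clue word dictionary
    return {w for w, _ in counter.most_common(top_k)}
```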
S203: and acquiring the position information of the clue words in the corresponding text units.
In existing clue-word-based text segmentation methods, the same clue word is described with the same word representation.
However, the same clue word can have different meanings in different contexts, and its segmentation-guiding strength differs accordingly. For example, the clue word "last" provides clearly different segmentation guidance in the sentence "Lastly, the natural language understanding task is very challenging." than in the sentence "The last lecture guest is XX." If the clue word "last" were given a single, unified word representation in all sentences, the semantic difference between these two occurrences could not be reflected, which would hurt the accuracy of text segmentation.
To address this problem, this application obtains the position information of clue words in their corresponding text units, so that the clue word feature of each text unit is generated from the position information of its clue words. Based on a clue word's position information, the word meaning at that position can be taken from the word meanings of all words in the whole text unit as the meaning of the clue word. In this way, the same clue word takes different meanings at different positions in different text units, so the same clue word can be described with different word representations.
S204: and generating clue word characteristics of each text unit according to the position information of the clue words in each text unit.
In the present application, the position information of the clue word in each text unit can be determined as the clue word feature of each text unit.
It should be noted that if a text unit contains no clue word, the clue word feature of that text unit is set to a preset value; for example, the clue word feature of a text unit without clue words may be set to {-1}.
For example, suppose only the first 3 words of the word sequence are looked up in the preset clue word dictionary. For the text unit "The product I am going to introduce next is a text segmentation system", the clue word "next" appears at position 1 of its word sequence (with 0 as the starting position), so the clue word feature of this text unit is {1}. For the text unit "At the end of the speech, everyone is reminded to leave the venue in order", the word "last" is not among the first 3 words of its word sequence, so no corresponding clue word is found and the clue word feature is {-1}.
Based on this method, the same clue word can be described with different word representations, so that text segmentation takes into account not only the clue word itself but also its context, which improves the accuracy of text segmentation.
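Under these conventions (looking up only the first k words, counting positions from 0, and using -1 when no clue word is found), the clue word feature of a text unit could be computed as in the following sketch; the function name and example words are illustrative:

```python
def clue_word_feature(word_sequence, clue_words, first_k=3):
    """Return the position of the first clue word among the first_k words, or -1 if none is found."""
    for position, word in enumerate(word_sequence[:first_k]):
        if word in clue_words:
            return position
    return -1

# Example matching the one above: "接下来" ("next") at position 1 gives feature {1}.
print(clue_word_feature(["我", "接下来", "要", "介绍", "的", "产品"], {"接下来", "首先", "最后"}))  # 1
```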
In this application, a specific implementation manner for determining a segmentation boundary of a text to be segmented according to a segmentation feature of each text unit is also disclosed, and the manner may be as follows:
and inputting the segmentation characteristics of each text unit into a text segmentation model to obtain an output result of whether each text unit is a segmentation boundary of the text to be segmented.
The text segmentation model is obtained by training with the segmentation features of each text unit in a training text as training samples and the segmentation boundary identification marking information of the training text as sample labels. The segmentation features of each text unit in the training text are the word sequence and clue word features of that text unit. The input of the text segmentation model is therefore the word sequences and clue word features of the training text, and the output indicates whether each text unit is a segmentation boundary.
In the present application, the output result of whether each text unit is a segmentation boundary may have various representations. As an implementation manner, the output result of whether each text unit is a segmentation boundary may be represented as a probability that each text unit is a segmentation boundary, and when the probability is greater than a preset threshold, the text unit is a segmentation boundary, otherwise, the text unit is not a segmentation boundary. As another possible implementation, the output result of whether each text unit is a segmentation boundary may be represented as a classification result, the text unit being a segmentation boundary when the classification result is a first numerical value, and the text unit not being a segmentation boundary when the classification result is a second numerical value.
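For the first representation, for example, the boundary decisions could be read off the per-unit probabilities with a simple threshold, as in this sketch (the threshold value here is illustrative):

```python
def boundaries_from_probabilities(probs, threshold=0.5):
    """probs[i] is the predicted probability that text unit i is a segmentation boundary."""
    return [p > threshold for p in probs]

print(boundaries_from_probabilities([0.1, 0.8, 0.3, 0.9]))  # [False, True, False, True]
```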
In this application, the training text used for training the text segmentation model may be all or part of the training text described in S202, or may be a training text redetermined by using the method for determining the training text described in S202, which is not limited in this application.
In this application, the segmentation boundary identification marking information of the training text may be the paragraph segmentation marks of the training text. These marks may be annotated manually or obtained by recognizing the existing paragraph structure of the training text; this application places no limitation on how they are obtained.
In this application, a specific implementation manner of a text segmentation model is also disclosed, as shown in fig. 2, fig. 2 is a schematic diagram of a text segmentation model disclosed in this application embodiment, and as can be seen from fig. 2, the text segmentation model includes: a word encoding layer, an attention layer, a fusion layer, a sentence encoding layer and an output layer.
Based on the text segmentation model shown in fig. 2, the present application further discloses a specific implementation manner for inputting the segmentation characteristics of each text unit into the text segmentation model to obtain an output result of whether the initial position of each text unit is the segmentation boundary of the text to be segmented, which specifically includes the following steps:
s301: and acquiring a segment length feature of each text unit by using a text segmentation model, wherein the segment length feature is used for representing segment length information from a last segmentation boundary of each text unit to each text unit.
In this application, the segment length feature of each text unit represents the segment length information from the last segmentation boundary predicted by the text segmentation model to the current text unit; the segment length information may be expressed as the number of text units, words, or characters between the last segmentation boundary and the current text unit.
Since the segment length information is a discrete value with a large range, in this application it can be constrained to the range of 0 to 1 through a nonlinear sigmoid mapping.
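As a small illustration, the raw count can be squashed into (0, 1) with a sigmoid; the scaling applied to the count before the sigmoid is an assumption, since only the sigmoid mapping itself is stated above:

```python
import math

def segment_length_feature(units_since_last_boundary: int, scale: float = 10.0) -> float:
    """Constrain the discrete segment length to (0, 1) with a sigmoid; `scale` is an assumed hyperparameter."""
    return 1.0 / (1.0 + math.exp(-units_since_last_boundary / scale))

print(segment_length_feature(5))  # ~0.62
```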
It should be noted that the text segmentation model in this application has a sequential structure: when predicting whether the current text unit is a segmentation boundary, the results for all text units preceding it in the text are already available. Therefore, before word coding is performed on the segmentation features of the current text unit, for example immediately after the previous text unit has been processed, the segment length feature of the current text unit can be obtained and stored in the text segmentation model for use by the fusion layer.
Of course, in this application the segment length feature of the current text unit may be obtained at any time before the fusion layer uses it; this application places no limitation on the timing. However, since the segment length feature is used only at the fusion layer, if it is obtained before the fusion layer needs it, it has to be stored.
In addition, a dedicated module for obtaining the segment length feature of the current text unit may be added to the text segmentation model, or such a module may be added to the word coding layer, the attention layer, or the fusion layer; this application places no limitation on this.
S302: and performing word coding on the segmentation characteristics of each text unit by using a word coding layer of the text segmentation model to obtain the semantic representation of each text unit.
In this application, the word coding layer of the text segmentation model can perform word coding on the word sequence in the segmentation features of each text unit to obtain a word meaning representation of each text unit, and then obtain a clue word meaning representation of each text unit based on the word meaning representation and the clue word feature in the segmentation features of that text unit; the word meaning representation and the clue word meaning representation serve as the semantic representation.
Specifically, in this application, the word sequence W_i = {w_{i,1}, w_{i,2}, ..., w_{i,m}} of a text unit can be processed with a word embedding method to obtain the word vectors of the word sequence, and the word meaning representations H_i = {h_{i,1}, h_{i,2}, ..., h_{i,m}} of the word sequence are then obtained with a bidirectional LSTM structure, where i is the index of the text unit and m is the number of words in the i-th text unit.
The word meaning representation h_{i,t} of the word at time t is obtained by concatenating the hidden-layer output generated at time t after the forward LSTM reads the current text unit in order with the hidden-layer output generated at time m-t after the backward LSTM reads the current text unit in reverse order.
In this application, based on the position information of the clue word in each text unit, the word meaning representation at the corresponding position can be extracted from the word meaning representations of that text unit as its clue word semantic representation. Obtained in this way, the clue word meaning representation of each text unit carries information about both the clue word itself and its context.
In particular, from the word meaning representations H_i of each text unit, the word meaning representation at the position indicated by the clue word feature is extracted as the clue word meaning representation, denoted h_i^clue. If the clue word feature of the current text unit is {-1}, h_i^clue is set to u_pad, where u_pad is a parameter trained together with the text segmentation model.
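A minimal PyTorch sketch of such a word encoding layer is given below; the dimensions, the batched tensor layout, and all class and variable names are illustrative assumptions, while the use of a trainable u_pad vector for units without clue words follows the description above:

```python
import torch
import torch.nn as nn

class WordEncodingLayer(nn.Module):
    """Embeds the word sequence, runs a bidirectional LSTM, and picks out the clue word representation."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.u_pad = nn.Parameter(torch.zeros(2 * hidden_dim))  # trainable vector used when no clue word exists

    def forward(self, word_ids, clue_position):
        # word_ids: (batch, m) word indices of one text unit; clue_position: (batch,) position or -1
        h = self.bilstm(self.embedding(word_ids))[0]             # (batch, m, 2*hidden_dim) word meaning representations
        batch = torch.arange(word_ids.size(0))
        clue = torch.where(clue_position.unsqueeze(1) >= 0,
                           h[batch, clue_position.clamp(min=0)],  # word representation at the clue position
                           self.u_pad.expand(word_ids.size(0), -1))
        return h, clue
```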
S303: and performing attention calculation on the semantic representation of each text unit by using an attention layer of the text segmentation model to obtain the semantic representation of the sentence of each text unit.
The attention layer is used for compressing the semantic representation of the text unit to obtain a sentence representation with a fixed length. In the application, attention calculation can be performed on the word meaning representation to obtain a first sentence meaning representation of each text unit; and performing attention calculation on the clue word semantic representations to obtain a second sentence semantic representation of each text unit, wherein the first sentence semantic representation and the second sentence semantic representation are used as the sentence semantic representations.
In particular, different word representations can be given different attention weights and summed with those weights to obtain a sentence representation. The attention mechanism consists of a query, keys, and values: the correlation between the query and each key is computed as the attention weight, and weighting the values with these weights focuses attention on the important content. This scheme introduces global vectors u_w and u_clue as queries, i.e. u_w and u_clue are shared by all text units of different texts. They can be read as simple queries over all words of a text unit, asking respectively "which words in the current sentence are important?" and "which clue words in the current sentence are important?". In this scheme the keys and the values are the same, namely the representations h_{i,t} of the word sequence (or of the clue words) in the current text unit.
The first sentence semantic representation s_i^w is computed as follows:
u_{i,t} = tanh(W_a h_{i,t} + b_a)
α_{i,t} = exp(u_{i,t}^T u_w) / Σ_k exp(u_{i,k}^T u_w)
s_i^w = Σ_t α_{i,t} h_{i,t}
where h_{i,t} is the word meaning representation of the word at time t, u_w and u_clue are the global vectors, and W_a and b_a are model training parameters. The second sentence semantic representation s_i^clue is calculated in the same way from the clue word meaning representations, with u_clue as the query.
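Following the formulas above, an attention pooling layer could be sketched as follows; one instance with query u_w is applied to the word representations and another with query u_clue to the clue word representations, and the dimensions are assumed:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Compresses variable-length word representations into a fixed-length sentence representation."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)               # W_a, b_a
        self.query = nn.Parameter(torch.randn(dim))   # global query vector u_w (or u_clue)

    def forward(self, h):
        # h: (batch, m, dim) word meaning representations of one text unit
        u = torch.tanh(self.proj(h))                        # u_{i,t} = tanh(W_a h_{i,t} + b_a)
        alpha = torch.softmax(u.matmul(self.query), dim=1)  # attention weights over the m positions
        return (alpha.unsqueeze(-1) * h).sum(dim=1)         # weighted sum -> (batch, dim)
```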
S304: and fusing the semantic representation of the sentence of each text unit and the segment length feature of each text unit by utilizing the fusion layer of the text segmentation model to obtain the complete word representation of the sentence of each text unit.
In this application, the fusion layer provides auxiliary information from the clue word feature and the segment length feature when the semantic information alone is ambiguous. At the same time, as the paragraph length grows, the text segmentation model receives a corresponding segmentation incentive, which keeps the spacing of paragraphs across the whole text relatively uniform. While the segment length feature constrains the paragraph spacing, the clue word feature guides the model to place boundaries at sentence boundaries with clear clue word information, yielding a more reasonable segmentation result.
In the application, when the fusion layer fuses semantic representations of sentences of each text unit and segment length features of each text unit, an adopted fusion strategy can be a Gate structure, and the calculation is as follows:
g_i = σ(W_g · x_i + b_g)
s̃_i = g_i ⊙ x_i
where x_i denotes the concatenation of the sentence semantic representations s_i^w and s_i^clue of the i-th text unit with its segment length feature, W_g and b_g are model training parameters, σ is the sigmoid function, ⊙ is element-wise multiplication, and s̃_i is the complete word representation of the sentence of the text unit.
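One way to realize such a Gate structure is sketched below; this particular gating form and the dimensions are assumptions consistent with the description rather than the exact formulation of the patent:

```python
import torch
import torch.nn as nn

class GateFusion(nn.Module):
    """Fuses the two sentence semantic representations with the segment length feature via a gate."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim + 1, 2 * dim + 1)  # W_g, b_g over [s_w; s_clue; length]

    def forward(self, s_w, s_clue, seg_len):
        # s_w, s_clue: (batch, dim); seg_len: (batch, 1) segment length feature in (0, 1)
        x = torch.cat([s_w, s_clue, seg_len], dim=-1)
        g = torch.sigmoid(self.gate(x))   # gate values in (0, 1)
        return g * x                      # gated "complete" sentence-level representation
```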
S305: and carrying out sentence coding on the complete word representation of the sentence of each text unit by using a sentence coding layer of the text segmentation model to obtain the sentence representation of each text unit.
The complete word representation s̃_i of the sentence of each text unit obtained in S304 is related only to the current text unit. An LSTM can therefore be used in this application to model the interrelationships between text units and learn the semantic transitions between them, from which the segmentation boundaries are obtained: passing the complete word representations s̃_i through an LSTM structure yields the sentence representation d_i of each text unit.
In different scenarios, different LSTM structures may be used, for example, in a real-time speech recognition scenario, future sentence information cannot be obtained, and thus the forward LSTM structure is used to extract deep semantic information. In an off-line scenario, for example, in a question-answering system, a bidirectional LSTM structure can be adopted to obtain richer sentence representations.
S306: and calculating the sentence representation of each text unit and the sentence representation at the previous moment by using an output layer of the text segmentation model to obtain an output result of whether each text unit is the segmentation boundary of the text to be segmented.
In this application, the output layer of the text segmentation model computes, from the sentence representation of each text unit and the sentence representation at the previous moment, an output result through a softmax function indicating whether each text unit is a segmentation boundary of the text to be segmented.
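Putting S305 and S306 together, the sentence coding layer and output layer could be sketched as follows, using a forward LSTM as in the real-time case; the two-class softmax head over the concatenation of the current and previous sentence representations is an assumption about the exact output form:

```python
import torch
import torch.nn as nn

class SentenceLevelSegmenter(nn.Module):
    """Encodes per-unit representations across the text and classifies each unit as boundary / not boundary."""
    def __init__(self, dim, hidden_dim=128):
        super().__init__()
        self.sentence_lstm = nn.LSTM(dim, hidden_dim, batch_first=True)  # forward LSTM for the real-time case
        self.classifier = nn.Linear(2 * hidden_dim, 2)                   # uses current and previous sentence representation

    def forward(self, sentence_inputs):
        # sentence_inputs: (batch, n_units, dim) complete representations from the fusion layer
        d, _ = self.sentence_lstm(sentence_inputs)                        # (batch, n_units, hidden_dim)
        prev = torch.cat([torch.zeros_like(d[:, :1]), d[:, :-1]], dim=1)  # sentence representation at the previous moment
        logits = self.classifier(torch.cat([d, prev], dim=-1))
        return torch.softmax(logits, dim=-1)  # per-unit probabilities of (not boundary, boundary)
```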
The following describes a text segmentation apparatus disclosed in an embodiment of the present application, and the text segmentation apparatus described below and the text segmentation method described above may be referred to correspondingly.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a text segmentation apparatus disclosed in the embodiment of the present application. As shown in fig. 3, the text segmentation apparatus may include:
a segmented text acquisition unit 11, configured to acquire a text to be segmented;
a segmentation feature obtaining unit 12, configured to obtain a segmentation feature of each text unit in the text to be segmented;
a segmentation boundary determining unit 13, configured to determine a segmentation boundary of the text to be segmented according to the segmentation feature of each text unit;
a dividing unit 14, configured to divide the text to be divided based on a dividing boundary of the text to be divided.
Optionally, the segmentation feature obtaining unit includes:
and the word sequence and clue word characteristic acquisition unit is used for acquiring the word sequence and clue word characteristics of each text unit in the text to be segmented, and the word sequence and clue word characteristics of each text unit are used as the segmentation characteristics of each text unit.
Optionally, the word sequence and clue feature obtaining unit includes:
the word segmentation unit is used for segmenting each text unit to obtain a word sequence of each text unit;
a clue word determining unit, configured to determine clue words from the word sequence based on a predetermined clue word set;
the clue word position information acquisition unit is used for acquiring the position information of the clue words in the corresponding text units;
and the clue word characteristic generating unit is used for generating clue word characteristics of each text unit according to the position information of the clue words in each text unit.
Optionally, the segmentation boundary determining unit includes:
the model application unit is used for inputting the segmentation characteristics of each text unit into a text segmentation model to obtain an output result of whether each text unit is a segmentation boundary of the text to be segmented; the text segmentation model is obtained by taking segmentation features of each text unit in a training text as a training sample and taking segmentation boundary identification marking information of the training text as a sample label for training.
Optionally, the text segmentation model includes:
a word encoding layer, an attention layer, a fusion layer, a sentence encoding layer and an output layer.
Optionally, the model application unit includes:
the segment length feature acquisition unit is used for acquiring the segment length feature of each text unit by using a text segmentation model, wherein the segment length feature is used for representing the segment length information from the last segmentation boundary of each text unit to each text unit;
the word coding unit is used for carrying out word coding on the segmentation characteristics of each text unit by utilizing a word coding layer of the text segmentation model to obtain semantic representation of each text unit;
the attention calculation unit is used for performing attention calculation on the semantic representation of each text unit by utilizing an attention layer of the text segmentation model to obtain the semantic representation of the sentence of each text unit;
the fusion unit is used for fusing the semantic representation of the sentence of each text unit and the segment length feature of each text unit by utilizing the fusion layer of the text segmentation model to obtain the complete word representation of the sentence of each text unit;
the sentence coding unit is used for coding the complete word representation of the sentence of each text unit by utilizing a sentence coding layer of the text segmentation model to obtain the sentence representation of each text unit;
and the calculation unit is used for calculating the sentence representation of each text unit and the sentence representation at the previous moment by utilizing the output layer of the text segmentation model to obtain an output result of whether each text unit is the segmentation boundary of the text to be segmented.
Optionally, the word encoding unit includes:
the first word coding subunit is used for carrying out word coding on the word sequence in the segmentation characteristics of each text unit to obtain a word meaning representation of each text unit;
the second word coding subunit is used for obtaining a clue word meaning representation of each text unit based on the word meaning representation of each text unit and the clue word features in the segmentation features of each text unit; the word meaning representation and the clue word meaning representation serve as the semantic representation.
Optionally, the attention calculation unit comprises:
the first attention calculation unit is used for carrying out attention calculation on the word meaning representation to obtain a first sentence meaning representation of each text unit;
and the second attention calculation unit is used for performing attention calculation on the clue word semantic representations to obtain a second sentence semantic representation of each text unit, and the first sentence semantic representation and the second sentence semantic representation are used as the sentence semantic representations.
Fig. 4 is a block diagram of a hardware structure of a text segmentation apparatus disclosed in an embodiment of the present application, and referring to fig. 4, the hardware structure of the text segmentation apparatus may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
the memory 3 may include a high-speed RAM and may further include a non-volatile memory, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring a text to be segmented;
acquiring the segmentation characteristics of each text unit in the text to be segmented;
determining a segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit;
and segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a storage medium, where a program suitable for execution by a processor may be stored, where the program is configured to:
acquiring a text to be segmented;
acquiring the segmentation characteristics of each text unit in the text to be segmented;
determining a segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit;
and segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A method of text segmentation, comprising:
acquiring a text to be segmented;
acquiring the segmentation characteristics of each text unit in the text to be segmented;
determining a segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit;
and segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
2. The method according to claim 1, wherein the obtaining the segmentation feature of each text unit in the text to be segmented comprises:
and acquiring the word sequence and clue word characteristics of each text unit in the text to be segmented, wherein the word sequence and clue word characteristics of each text unit are used as the segmentation characteristics of each text unit.
3. The method of claim 2, wherein obtaining the word sequence and clue word features of each text unit in the text to be segmented comprises:
performing word segmentation on each text unit to obtain a word sequence of each text unit;
determining clue words from the word sequence based on a predetermined set of clue words;
acquiring position information of the clue words in corresponding text units;
and generating clue word characteristics of each text unit according to the position information of the clue words in each text unit.
4. The method according to claim 2, wherein the determining the segmentation boundary of the text to be segmented according to the segmentation feature of each text unit comprises:
inputting the segmentation characteristics of each text unit into a text segmentation model to obtain an output result of whether each text unit is a segmentation boundary of the text to be segmented; the text segmentation model is obtained by taking segmentation features of each text unit in a training text as a training sample and taking segmentation boundary identification marking information of the training text as a sample label for training.
5. The method of claim 4, wherein the text segmentation model comprises:
a word encoding layer, an attention layer, a fusion layer, a sentence encoding layer and an output layer.
6. The method according to claim 5, wherein the inputting the segmentation feature of each text unit into the text segmentation model to obtain an output result of whether the starting position of each text unit is the segmentation boundary of the text to be segmented comprises:
acquiring a segment length feature of each text unit by using a text segmentation model, wherein the segment length feature is used for representing segment length information from a last segmentation boundary of each text unit to each text unit;
performing word coding on the segmentation characteristics of each text unit by using a word coding layer of the text segmentation model to obtain semantic representation of each text unit;
performing attention calculation on the semantic representation of each text unit by using an attention layer of the text segmentation model to obtain the semantic representation of a sentence of each text unit;
fusing the semantic representation of the sentence of each text unit and the segment length feature of each text unit by utilizing a fusion layer of the text segmentation model to obtain the complete word representation of the sentence of each text unit;
sentence coding is carried out on the complete word representation of the sentence of each text unit by utilizing a sentence coding layer of the text segmentation model, and the sentence representation of each text unit is obtained;
and calculating the sentence representation of each text unit and the sentence representation at the previous moment by using an output layer of the text segmentation model to obtain an output result of whether each text unit is the segmentation boundary of the text to be segmented.
7. The method according to claim 6, wherein the word encoding the segmentation features of each text unit by using the word encoding layer of the text segmentation model to obtain the semantic representation of each text unit comprises:
performing word coding on the word sequence in the segmentation characteristics of each text unit to obtain a word meaning representation of each text unit;
obtaining a clue word meaning representation of each text unit based on the word meaning representation of each text unit and the clue word features in the segmentation features of each text unit; the word meaning representation and the clue word meaning representation serve as the semantic representation.
8. The method according to claim 7, wherein said performing attention calculation on semantic representation of each text unit by using attention layer of the text segmentation model to obtain sentence semantic representation of each text unit comprises:
performing attention calculation on the word meaning representation to obtain a first sentence meaning representation of each text unit;
and performing attention calculation on the clue word semantic representations to obtain a second sentence semantic representation of each text unit, wherein the first sentence semantic representation and the second sentence semantic representation are used as the sentence semantic representations.
9. A text segmentation apparatus, comprising:
the segmentation text acquisition unit is used for acquiring a text to be segmented;
the segmentation feature acquisition unit is used for acquiring the segmentation features of each text unit in the text to be segmented;
the segmentation boundary determining unit is used for determining the segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit;
and the segmentation unit is used for segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
10. A text segmentation device comprising a memory and a processor;
the memory is used for storing programs;
the processor, configured to execute the program, implementing the steps of the text segmentation method according to any one of claims 1 to 8.
11. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the text segmentation method according to any one of claims 1 to 8.
CN201911398383.1A 2019-12-30 2019-12-30 Text segmentation method, related device and readable storage medium Active CN111199150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911398383.1A CN111199150B (en) 2019-12-30 2019-12-30 Text segmentation method, related device and readable storage medium


Publications (2)

Publication Number Publication Date
CN111199150A true CN111199150A (en) 2020-05-26
CN111199150B CN111199150B (en) 2024-04-16

Family

ID=70744535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911398383.1A Active CN111199150B (en) 2019-12-30 2019-12-30 Text segmentation method, related device and readable storage medium

Country Status (1)

Country Link
CN (1) CN111199150B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004157337A (en) * 2002-11-06 2004-06-03 Nippon Telegr & Teleph Corp <Ntt> Method, device and program for topic boundary determination
US20110252010A1 (en) * 2008-12-31 2011-10-13 Alibaba Group Holding Limited Method and System of Selecting Word Sequence for Text Written in Language Without Word Boundary Markers
US9141867B1 (en) * 2012-12-06 2015-09-22 Amazon Technologies, Inc. Determining word segment boundaries
US20140214402A1 (en) * 2013-01-25 2014-07-31 Cisco Technology, Inc. Implementation of unsupervised topic segmentation in a data communications environment
US20150134320A1 (en) * 2013-11-14 2015-05-14 At&T Intellectual Property I, L.P. System and method for translating real-time speech using segmentation based on conjunction locations
CN107229609A (en) * 2016-03-25 2017-10-03 佳能株式会社 Method and apparatus for splitting text
CN107480143A (en) * 2017-09-12 2017-12-15 山东师范大学 Dialogue topic dividing method and system based on context dependence
WO2019214145A1 (en) * 2018-05-10 2019-11-14 平安科技(深圳)有限公司 Text sentiment analyzing method, apparatus and storage medium
CN109829151A (en) * 2018-11-27 2019-05-31 国网浙江省电力有限公司 A kind of text segmenting method based on layering Di Li Cray model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘耀; 帅远华; 龚幸伟; 黄毅: "基于领域本体的文本分割方法研究" (Research on a text segmentation method based on domain ontology), no. 01, pages 128-132 *

Also Published As

Publication number Publication date
CN111199150B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CN110634487B (en) Bilingual mixed speech recognition method, device, equipment and storage medium
CN108536654A (en) Identify textual presentation method and device
CN107679032A (en) Voice changes error correction method and device
JP6677419B2 (en) Voice interaction method and apparatus
CN112784696B (en) Lip language identification method, device, equipment and storage medium based on image identification
CN114580382A (en) Text error correction method and device
CN113239666B (en) Text similarity calculation method and system
CN112017643B (en) Speech recognition model training method, speech recognition method and related device
CN113158687B (en) Semantic disambiguation method and device, storage medium and electronic device
CN113553848A (en) Long text classification method, system, electronic equipment and computer readable storage medium
CN115859164A (en) Method and system for identifying and classifying building entities based on prompt
CN113076720B (en) Long text segmentation method and device, storage medium and electronic device
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN112151019A (en) Text processing method and device and computing equipment
JP5278425B2 (en) Video segmentation apparatus, method and program
CN112069816A (en) Chinese punctuation adding method, system and equipment
CN111199150B (en) Text segmentation method, related device and readable storage medium
CN115129843A (en) Dialog text abstract extraction method and device
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN111090720B (en) Hot word adding method and device
CN114417891A (en) Reply sentence determination method and device based on rough semantics and electronic equipment
CN115033683A (en) Abstract generation method, device, equipment and storage medium
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
CN113012685B (en) Audio recognition method and device, electronic equipment and storage medium
CN112634878A (en) Speech recognition post-processing method and system and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant