CN111931482A - Text segmentation method and device - Google Patents

Text segmentation method and device

Info

Publication number
CN111931482A
Authority
CN
China
Prior art keywords
text
segmentation
model
audio
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011003293.0A
Other languages
Chinese (zh)
Other versions
CN111931482B (en)
Inventor
王雪志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd
Priority to CN202011003293.0A
Publication of CN111931482A
Application granted
Publication of CN111931482B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/04Training, enrolment or model building

Abstract

The invention discloses a text segmentation method and device. The text segmentation method comprises: inputting text into a first model to obtain labels for the segmentation points of the text, wherein the first model is a trained model capable of labeling the segmentation points of text; inputting the audio corresponding to the text into a second model to obtain speaker information for the audio, wherein the second model is a trained model capable of identifying speaker information in audio; fusing the labels of the segmentation points of the text with the speaker information of the audio to obtain segmentation labels; and segmenting the text based on the segmentation labels and outputting the segmented text. By innovatively introducing speaker information and a post-processing algorithm, segmentation accuracy is effectively improved and some abnormal segmentation points are eliminated, while user experience improves at low resource cost.

Description

Text segmentation method and device
Technical Field
The invention belongs to the field of neural networks, and particularly relates to a text segmentation method and a text segmentation device.
Background
At present, two main technologies for segmenting unstructured long texts from speech recognition exist on the market. The first computes the similarity between sentences within a window and finds inflection points of the similarity as the basis for segmentation; this is an unsupervised approach. The second converts text segmentation into a sequence labeling task, labels each sentence, and trains in a supervised manner.
A representative unsupervised approach to segmenting unstructured long text is TextTiling (a window-based text segmentation model). The technique has three steps: 1. split the text; 2. compute similarities; 3. select segmentation points. Splitting the text means specifying a sentence length and cutting the text into fixed-length pseudo-sentences. Similarity is then computed over a fixed span of text on either side of each candidate point: TextTiling takes the K words before and after the point, each K-word group forming a block, and the two blocks around a candidate point form a window; the similarity of the two blocks in the window is taken as the similarity of the text before and after that point. A sliding window computes this similarity for every candidate point. Selecting segmentation points is mainly a matter of setting a threshold: a candidate point is chosen as a segmentation position when the dip in similarity (the depth score) at that point exceeds the threshold.
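For illustration only (this sketch is not part of the patent), the windowed block-similarity computation described above can be written as follows in Python; the bag-of-words cosine similarity, the block size k, the depth-score rule, and the threshold are assumptions chosen for clarity.

import re
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def texttiling_boundaries(words, k=20, threshold=0.1):
    """Score each gap by the similarity of the k-word blocks on either side;
    gaps whose similarity valley is deep enough become boundaries."""
    sims = []
    for gap in range(k, len(words) - k):
        left = Counter(words[gap - k:gap])    # block before the gap
        right = Counter(words[gap:gap + k])   # block after the gap
        sims.append((gap, cosine(left, right)))
    # Depth score: how far the similarity at a gap dips below its neighbors.
    boundaries = []
    for i in range(1, len(sims) - 1):
        gap, s = sims[i]
        depth = (sims[i - 1][1] - s) + (sims[i + 1][1] - s)
        if depth > threshold:
            boundaries.append(gap)
    return boundaries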
The supervised approach converts the unstructured-text segmentation task into a sequence labeling task. Each punctuation-delimited sentence is featurized as a sentence vector, then a neural network labels each sentence with one of two tags: segmented or not segmented. With the development of neural networks, this approach has been adopted rapidly in recent years.
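As a hedged sketch of this conversion (the patent does not specify the encoder), each punctuation-delimited sentence might be turned into a vector and paired with a binary tag like so; the averaged word-embedding featurizer and the label scheme are assumptions.

import numpy as np

SEGMENTED, NOT_SEGMENTED = 1, 0  # binary tag per sentence

def sentence_vector(sentence: str, embeddings: dict, dim: int = 100) -> np.ndarray:
    # Featurize a sentence as the mean of its word embeddings.
    vecs = [embeddings[w] for w in sentence.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def build_sequence(sentences, gold_boundaries, embeddings):
    """Return (features, labels): one vector and one binary tag per sentence."""
    X = np.stack([sentence_vector(s, embeddings) for s in sentences])
    y = np.array([SEGMENTED if i in gold_boundaries else NOT_SEGMENTED
                  for i in range(len(sentences))])
    return X, y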
Disclosure of Invention
An embodiment of the present invention provides a text segmentation method and apparatus, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a text segmentation method, comprising: inputting text into a first model to obtain labels for the segmentation points of the text, wherein the first model is a trained model capable of labeling the segmentation points of text; inputting the audio corresponding to the text into a second model to obtain speaker information for the audio, wherein the second model is a trained model capable of identifying speaker information in audio; fusing the labels of the segmentation points of the text with the speaker information of the audio to obtain segmentation labels; and segmenting the text based on the segmentation labels and outputting the segmented text.
In a second aspect, an embodiment of the present invention provides a text segmentation apparatus, comprising: a first input module configured to input text into a first model to obtain labels for the segmentation points of the text, wherein the first model is a trained model capable of labeling the segmentation points of text; a second input module configured to input the audio corresponding to the text into a second model to obtain speaker information for the audio, wherein the second model is a trained model capable of identifying speaker information in audio; a fusion module configured to fuse the labels of the segmentation points of the text with the speaker information of the audio to obtain segmentation labels; and an output module configured to segment the text based on the segmentation labels and output the segmented text.
In a third aspect, there is provided a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the text segmentation method of the first aspect.
In a fourth aspect, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of the first aspect.
The method provided by the embodiments of the present application departs from prior speech-recognition text segmentation, which is mostly based on text alone, by innovatively introducing speaker information, which effectively improves segmentation accuracy. A post-processing algorithm is also introduced, so that some abnormal segmentation points can be eliminated. The method can be applied in many scenarios, such as long-speech transcription, noticeably improving user experience while consuming few resources.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below show some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a text segmentation method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another text segmentation method provided by an embodiment of the invention;
FIG. 3 is a flow chart of semantic segmentation according to an embodiment of the present invention;
FIG. 4 is a flow chart of the beta version of semantic segmentation according to a specific embodiment of the present invention;
FIG. 5 is a block diagram of a text segmentation apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. It is apparent that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the given embodiments without creative effort fall within the protection scope of the present invention.
Referring to FIG. 1, a flow diagram of one embodiment of a text segmentation method of the present invention is shown.
As shown in fig. 1, in step 101, the text is input into a first model to obtain labels for the segmentation points of the text, where the first model is a trained model capable of labeling the segmentation points of text;
in step 102, the audio corresponding to the text is input into a second model to obtain the speaker information of the audio, where the second model is a trained model capable of identifying speaker information in audio;
in step 103, the labels of the segmentation points of the text and the speaker information of the audio are fused to obtain segmentation labels;
in step 104, the text is segmented based on the segmentation labels and the segmented text is output.
In this embodiment, for step 101, the text segmentation apparatus inputs the text into a first model to obtain labels for the segmentation points of the text, where the first model is a trained model capable of labeling segmentation points, for example by labeling each sentence in the text to indicate whether it is a segmentation point. In effect, the task of segmenting unstructured text is converted into a sequence labeling task: each punctuation-delimited sentence is featurized as a sentence vector and then labeled by the first model, the labels falling into two classes: segmented and not segmented;
for step 102, the text segmentation apparatus inputs the audio corresponding to the text into a second model to obtain the speaker information of the audio, where the second model is a trained model capable of identifying speaker information in audio. For example, if a piece of audio records an interactive conversation between user A and user B, the second model can identify by voiceprint recognition which segments were spoken by user A and which by user B, and label them as speaker A and speaker B according to the recognition result;
for step 103, the text segmentation apparatus fuses the labels of the segmentation points of the text with the speaker information of the audio to obtain segmentation labels. For example, time information in the speaker information is converted into corresponding sentence information, and the segmentation labels are then derived; for instance, turning sentences between the two speakers A and B are labeled as segment sentences;
for step 104, the text segmentation apparatus segments the text based on the segmentation labels and outputs the segmented text.
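A minimal end-to-end sketch of steps 101-104 (an illustration, not the patent's implementation) follows; the model interfaces predict_boundaries and diarize are hypothetical stand-ins for the trained first and second models, and the inline fusion rule is deliberately simplified. A fuller fusion sketch is given after the discussion of fig. 2 below.

def segment_text(sentences, sentence_start_times, audio, first_model, second_model):
    # Step 101: the first model tags each sentence (1 = segmentation point, 0 = not).
    point_labels = first_model.predict_boundaries(sentences)

    # Step 102: the second model returns timed speaker turns for the audio.
    speaker_spans = second_model.diarize(audio)  # [(start_sec, end_sec, speaker_id), ...]

    def speaker_at(t):
        # Speaker whose time span contains timestamp t (None if no span matches).
        for start, end, speaker in speaker_spans:
            if start <= t < end:
                return speaker
        return None

    # Step 103 (simplified fusion): boundary where the text model fires
    # or where the speaker changes between consecutive sentences.
    speakers = [speaker_at(t) for t in sentence_start_times]
    seg_labels = [bool(label) or (i > 0 and speakers[i] != speakers[i - 1])
                  for i, label in enumerate(point_labels)]

    # Step 104: cut the text at the fused labels and emit the paragraphs.
    paragraphs, current = [], []
    for sentence, boundary in zip(sentences, seg_labels):
        current.append(sentence)
        if boundary:
            paragraphs.append("".join(current))
            current = []
    if current:
        paragraphs.append("".join(current))
    return paragraphs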
In the scheme of this embodiment, speaker information is innovatively introduced, departing from prior speech-recognition text segmentation that is mostly based on text alone, which effectively improves segmentation accuracy. The scheme can be applied in many scenarios, such as long-speech transcription, noticeably improving user experience while consuming few resources.
Please refer to fig. 2, which shows a flowchart of another text segmentation method according to an embodiment of the present invention. This flowchart mainly refines step 103, "fusing the labels of the segmentation points of the text and the speaker information of the audio to obtain segmentation labels".
As shown in fig. 2, in step 201, the time information corresponding to a given speaker in the speaker information is converted into sentence information corresponding to that speaker;
in step 202, the segmentation label of each sentence is obtained from the labels of the segmentation points and the sentence information corresponding to that speaker.
In this embodiment, for step 201, the text segmentation apparatus converts the time information corresponding to a speaker in the speaker information into sentence information corresponding to that speaker, so that each sentence carries a speaker label. For example, if the first sentence is spoken by speaker A, it is labeled speaker A; if the second sentence is also spoken by speaker A, it is likewise labeled speaker A; and if the third sentence is spoken by speaker B, it is labeled speaker B;
for step 202, the text segmentation apparatus obtains the segmentation label of each sentence from the labels of the segmentation points and the per-sentence speaker information; for example, when a sentence is a turning sentence between speaker A and speaker B, it is labeled as a segment sentence.
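A sketch of this two-step post-processing (illustrative, assuming per-sentence timing is available from the recognizer): step 201 maps each sentence to the speaker whose time span overlaps it most, and step 202 forces a turning sentence between two speakers to be a segment sentence while keeping the text model's label elsewhere.

def fuse(point_labels, speaker_spans, sentence_spans):
    """point_labels: 0/1 text-model tag per sentence.
    speaker_spans: [(start, end, speaker_id)] from the speaker model.
    sentence_spans: [(start, end)] timing of each sentence within the audio."""
    def overlap(a, b):
        # Length of the intersection of two time intervals.
        return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

    # Step 201: convert time information into per-sentence speaker labels by
    # assigning each sentence the speaker whose span overlaps it the most.
    sentence_speakers = [
        max(speaker_spans, key=lambda s: overlap((s[0], s[1]), span))[2]
        for span in sentence_spans
    ]

    # Step 202: final label per sentence; a turning sentence between two
    # speakers is forced to be a segment sentence, otherwise the text label holds.
    seg_labels = []
    for i, label in enumerate(point_labels):
        turning = i > 0 and sentence_speakers[i] != sentence_speakers[i - 1]
        seg_labels.append(1 if turning else label)
    return seg_labels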
In the scheme of this embodiment, speaker information is innovatively introduced and a post-processing algorithm is added, so that abnormal segmentation points can be eliminated.
In some alternative embodiments, the first model includes an attention mechanism (Attention). The attention mechanism originates from the study of human vision: in cognitive science, because of bottlenecks in information processing, humans selectively attend to a portion of all available information while ignoring the rest. This is commonly referred to as the attention mechanism. Different parts of the human retina have different information-processing abilities, i.e., acuity, and only the fovea has the strongest acuity. To make reasonable use of limited visual processing resources, a human selects a specific part of the visual field and focuses on it; when reading, for example, a person usually attends to and processes only a few of the words to be read. In summary, the attention mechanism has two main aspects: deciding which part of the input to focus on, and allocating limited information-processing resources to the important parts.
Stated informally, a neural attention mechanism gives a neural network the ability to focus on a subset of its inputs (or features): it selects particular inputs. Attention can be applied to any type of input regardless of its shape. Under limited computing power, the attention mechanism is a resource allocation scheme, the principal means of addressing information overload, that allocates computing resources to the more important tasks.
Attention generally falls into two categories. One is top-down conscious attention, called focused attention: attention with a predetermined purpose that depends on the task and is actively and consciously directed at an object. The other is bottom-up unconscious attention, called saliency-based attention: attention driven by external stimuli that requires no active intervention and is task independent. If a stimulus differs from the information surrounding it, an unconscious "winner-take-all" or gating mechanism can divert attention to it. Whether intended or not, most human brain activities, such as memorizing information, reading, or thinking, require attention.
In the solution of this embodiment, introducing an attention mechanism, with corresponding changes, allows the influence of context information on segmentation to be distributed more effectively.
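As a concrete, hedged illustration of the mechanism (the patent does not give its exact formulation), scaled dot-product attention over a sequence of sentence vectors can be written as:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d) arrays. Each output row is a weighted mixture of V,
    letting every position attend to (focus limited resources on) every other."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over positions
    return weights @ V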
In the method of the above embodiment, the first model is a long short-term memory (LSTM) recurrent neural network model that includes an attention mechanism. The LSTM is a recurrent neural network designed specifically to solve the long-term dependence problem of general RNNs (recurrent neural networks). All RNNs have the form of a chain of repeating neural network modules; in a standard RNN, the repeating module has a very simple structure, e.g., a single tanh layer.
In the scheme of this embodiment, the LSTM algorithm can remember longer context information, improving segmentation accuracy.
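A minimal sketch of an attention-LSTM sentence tagger of the kind described (assumptions: PyTorch, a bidirectional LSTM over pre-computed sentence vectors, and multi-head self-attention; the patent discloses neither framework nor dimensions):

import torch
import torch.nn as nn

class AttentionLstmTagger(nn.Module):
    """Tags each sentence vector in a document: segmentation point or not."""
    def __init__(self, input_dim=100, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                            bidirectional=True)         # remembers long context
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads=4,
                                          batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, 2)  # segmented / not segmented

    def forward(self, sentence_vectors):                # (batch, n_sentences, input_dim)
        h, _ = self.lstm(sentence_vectors)
        ctx, _ = self.attn(h, h, h)                     # each sentence attends to all others
        return self.classifier(ctx)                     # (batch, n_sentences, 2) logits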
In some alternative embodiments, the second model is a speaker classification model (Speaker-model). Speaker classification is a technique for automatically recognizing the identity of a speaker from parameters of the speech waveform that reflect the speaker's physiological and behavioral characteristics.
In the scheme of this embodiment, the speaker classification model provides additional information that improves segmentation accuracy and matches real application scenarios.
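The patent does not disclose the second model's internals, so the following is only a heavily hedged sketch of online speaker labeling by voiceprint embeddings; embed_window stands in for any embedding extractor, and the window size and threshold are illustrative.

import numpy as np

def label_speakers(frames, embed_window, window=150, threshold=0.75):
    """Assign a speaker id to each audio window by comparing its voiceprint
    embedding against the running centroid of each speaker seen so far."""
    centroids, labels = [], []
    for start in range(0, len(frames) - window, window):
        e = embed_window(frames[start:start + window])
        e = e / np.linalg.norm(e)
        sims = [float(c @ e) for c in centroids]
        if sims and max(sims) > threshold:
            k = int(np.argmax(sims))                     # known speaker: update centroid
            centroids[k] = (centroids[k] + e) / np.linalg.norm(centroids[k] + e)
        else:
            k = len(centroids)                           # new speaker enrolled
            centroids.append(e)
        labels.append(k)
    return labels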
In some optional embodiments, the text and the audio comprise: text produced by speech recognition and the audio corresponding to that text.
In the method of the above embodiment, fusing the labels of the segmentation points of the speech-recognition text with the speaker information of the audio to obtain segmentation labels is implemented with a DNN model that outputs a segmentation label for each sentence.
In the scheme of this embodiment, the work of fusing text and speaker information during semantic segmentation is handed to the neural network, avoiding excessive human intervention.
It should be noted that, although the above embodiments number the steps in an explicit order, such as step 101 and step 102, in practical application scenarios some steps may be executed in parallel, and the order of some steps is not fixed by the numbering; this application places no limitation here.
The following describes some problems encountered by the inventors in implementing the present invention, and one specific embodiment of the finally adopted solution, to help those skilled in the art better understand this disclosure.
In the process of implementing the present invention, the inventors identified the following defects in these related techniques:
the unsupervised mode needs to calculate the similarity of each segmentation point, the effect depends on the similarity calculation algorithm seriously, and the segmentation information is not only contained in sentences adjacent to the segmentation points but also contained in farther sentence information.
Supervised sequence labeling is a good way to segment long speech texts, but text information alone is not enough, and simply using a single model for segmentation is unsuitable in real production.
The unsupervised method generally computes similarity only between sentences adjacent to a candidate point, but the adjacency range is limited; if more distant sentences strongly influence the segmentation, that information cannot be exploited, causing segmentation errors.
For segmenting long speech-recognition texts, speaker information has a significant impact on segmentation in addition to the effect of the text itself. Existing supervised approaches to speech-text segmentation use only text information; other features remain to be added.
Using only supervised or unsupervised approaches limits performance; a good interactive experience also requires adding post-processing algorithms.
In the course of implementing the invention, the inventors also considered why this solution is not readily apparent:
currently, there are few systems for speech recognition segmentation in the market, and most of them perform semantic segmentation in an unsupervised and supervised manner. The unsupervised segmentation mode has limited information, and the segmentation effect of the information depending on the long context is poor. The supervised mode basically uses the text characteristic without adding other characteristics. Still other segmentation schemes do not incorporate post-processing algorithms.
The long-text segmentation of this patent is semantic segmentation in a supervised form. The supervised algorithm introduces the attention mechanism with corresponding changes, and the features come from both text and speaker information. Beyond the supervised algorithm and the speaker-information features, a post-processing algorithm is also introduced to eliminate some abnormal segmentation points.
The technical innovation points of the invention are as follows:
the flow chart of semantic segmentation, it can be seen from fig. 3 that the input of the system is a text of an article recognized by speech and audio corresponding to the article. After the article text passes through the attention-lstm-model, each sentence of the article is labeled to indicate whether the sentence is a segmentation point. After the audio of the article passes through the spoke-model, the audio is divided to obtain the speaker information of the audio. And inputting the label of whether each sentence is a segmentation point and the speaker information of the audio frequency into a post-processing module for fusion. The post-processing module converts the time information corresponding to the speaker into sentence information corresponding to the speaker in the first step, so that each sentence is labeled by one speaker. And the post-processing module obtains the final segmentation label of each sentence according to the segmentation information of each sentence and the speaker information. The speaker information is used for marking a sentence into a segment sentence when the sentence is a turning sentence of two speakers. After passing through the post-processing module, the segmentation label of each sentence is used as the basis of the final segmentation, and then the article is segmented and output.
A beta version was formed by the inventors in the process of implementing the invention:
the beta version is generated in the process of realizing semantic segmentation, as shown in FIG. 4. In FIG. 4, it can be seen that the beta version has exchanged the post-processing module for a DNN model as opposed to the formal version. The DNN model directly outputs segmentation information for each word. The version has the advantages that the work of fusing the text and the speaker information in the semantic segmentation is handed to the neural network, and excessive human intervention is avoided. The disadvantage is that the delay of the system increases after introducing an additional DNN model, which is difficult to get online in situations where speech recognition resources are limited. So that the beta version is not formally used
Deeper effects found by the inventors in the process of implementing the invention:
the system changes the situation that the original speech recognition text segmentation is mostly based on the text, introduces speaker information innovatively, and can effectively improve the accuracy of segmentation.
The system can be deployed in many application scenarios, such as long-speech transcription, and noticeably improves user experience.
At the same time, the system is fast, consuming few resources while improving user experience.
Adopting the attention mechanism lets the influence of context information on segmentation be distributed more effectively.
The LSTM algorithm can remember longer context information and improves segmentation accuracy.
The Speaker-model provides additional information for segmentation accuracy and matches the actual application scenario.
The introduction of the post-processing module fuses text information and speaker information with little added latency.
Referring to fig. 5, a block diagram of a text segmenting apparatus according to an embodiment of the present invention is shown.
As shown in fig. 5, the apparatus includes a first input module 510, a second input module 520, a fusion module 530, and an output module 540.
The first input module 510 is configured to input the text into a first model to obtain a label of a segmentation point of the text, where the first model is a trained model capable of labeling segmentation points of the text; a second input module 520, configured to input the audio corresponding to the text into a second model to obtain speaker information of the audio, where the second model is a trained model capable of recognizing the speaker information in the audio; a fusion module 530 configured to perform fusion processing on the labels of the segmentation points of the text and the speaker information of the audio to obtain segmentation labels; an output module 540 configured to segment and output the text based on the segmentation label.
It should be understood that the modules depicted in fig. 5 correspond to various steps in the methods described with reference to fig. 1 and 2. Thus, the operations and features described above for the method and the corresponding technical effects are also applicable to the modules in fig. 5, and are not described again here.
It should be noted that the modules in the embodiments of the present application do not limit the scheme of the present application. For example, the first input module may be described as a module that inputs the text into a first model to obtain labels for the segmentation points of the text, where the first model is a trained model capable of labeling segmentation points of text. In addition, the relevant functional modules may also be implemented by a hardware processor; for example, the first input module may be implemented by a processor, which is not detailed here.
In other embodiments, the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions may execute the text segmentation method in any of the above method embodiments;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
inputting the text into a first model to obtain labels of the segmentation points of the text, wherein the first model is a trained model capable of labeling the segmentation points of the text;
inputting the audio corresponding to the text into a second model to obtain the speaker information of the audio, wherein the second model is a trained model capable of identifying the speaker information in the audio;
fusing the labels of the segmentation points of the text with the speaker information of the audio to obtain segmentation labels;
and segmenting the text based on the segmentation labels and outputting the segmented text.
The non-volatile computer-readable storage medium may include a program storage area and a data storage area: the program storage area may store an operating system and an application program required by at least one function, and the data storage area may store data created by the use of the text segmentation apparatus, and the like. Further, the non-volatile computer-readable storage medium may include high-speed random access memory and non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium optionally includes memory located remotely from the processor and connected to the text segmentation apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any of the above text segmentation methods.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 6, the electronic device includes one or more processors 610 and a memory 620, with one processor 610 shown as an example in fig. 6. The device for the text segmentation method may further include an input device 630 and an output device 640. The processor 610, the memory 620, the input device 630, and the output device 640 may be connected by a bus or by other means; fig. 6 takes a bus connection as an example. The memory 620 is a non-volatile computer-readable storage medium as described above. The processor 610 executes various functional applications of the server and performs data processing by running the non-volatile software programs, instructions, and modules stored in the memory 620, thereby implementing the above method embodiments of the text segmentation apparatus. The input device 630 may receive input numeric or character information and generate key-signal inputs related to user settings and function control of the text segmentation apparatus. The output device 640 may include a display device such as a display screen.
The product can execute the method provided by the embodiments of the present invention and has the corresponding functional modules and beneficial effects of the method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present invention.
As an embodiment, the electronic device is applied to a text segmentation apparatus, and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
inputting the text into a first model to obtain labels of the segmentation points of the text, wherein the first model is a trained model capable of labeling the segmentation points of the text;
inputting the audio corresponding to the text into a second model to obtain the speaker information of the audio, wherein the second model is a trained model capable of identifying the speaker information in the audio;
fusing the labels of the segmentation points of the text with the speaker information of the audio to obtain segmentation labels;
and segmenting the text based on the segmentation labels and outputting the segmented text.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: such devices are characterized by mobile communication capability and primarily aim to provide voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, low-end phones, and the like.
(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices, etc.
(3) Portable entertainment devices: such devices can display and play multimedia content. They include audio and video players, handheld game consoles, e-book readers, intelligent toys, and portable vehicle-mounted navigation devices.
(4) Servers: similar in architecture to general-purpose computers, but with higher requirements on processing capability, stability, reliability, security, scalability, manageability, and the like, because they must provide highly reliable services.
(5) Other electronic devices with data interaction functions.
The apparatus embodiments described above are merely illustrative. Units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of text segmentation, comprising:
inputting the text into a first model to obtain labels of the segmentation points of the text, wherein the first model is a trained model capable of labeling the segmentation points of the text;
inputting the audio corresponding to the text into a second model to obtain the speaker information of the audio, wherein the second model is a trained model capable of identifying the speaker information in the audio;
fusing the labels of the segmentation points of the text with the speaker information of the audio to obtain segmentation labels;
and segmenting the text based on the segmentation labels and outputting the segmented text.
2. The method of claim 1, wherein fusing the labels of the segmentation points of the text and the speaker information of the audio to obtain segmentation labels comprises:
converting time information corresponding to a certain speaker in the speaker information into sentence information corresponding to the certain speaker;
and obtaining the segmentation label of each sentence according to the label of the segmentation point and the sentence information corresponding to the certain speaker.
3. The method of claim 1, wherein the first model includes an attention mechanism.
4. The method of claim 3, wherein the first model is a long-short memory cycle neural network model that includes an attention mechanism.
5. The method of claim 1, wherein the second model is a speaker classification model.
6. The method of any of claims 1-5, wherein the text and the audio comprise: text after speech recognition and the audio corresponding to the text after speech recognition.
7. The method of claim 6, wherein fusing the labels of the segmentation points of the speech-recognition text and the speaker information of the audio to obtain segmentation labels is implemented using a DNN model that outputs a segmentation label for each sentence.
8. A text segmentation apparatus comprising:
the first input module is configured to input the text into a first model to obtain labels of the segmentation points of the text, wherein the first model is a trained model capable of labeling the segmentation points of the text;
the second input module is configured to input the audio corresponding to the text into a second model to obtain the speaker information of the audio, wherein the second model is a trained model capable of identifying the speaker information in the audio;
the fusion module is configured to fuse the labels of the segmentation points of the text with the speaker information of the audio to obtain segmentation labels;
and the output module is configured to segment and output the text based on the segmentation labels.
9. A computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the method of any of claims 1 to 7.
10. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1 to 7.
CN202011003293.0A 2020-09-22 2020-09-22 Text segmentation method and device Active CN111931482B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011003293.0A CN111931482B (en) 2020-09-22 2020-09-22 Text segmentation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011003293.0A CN111931482B (en) 2020-09-22 2020-09-22 Text segmentation method and device

Publications (2)

Publication Number Publication Date
CN111931482A true CN111931482A (en) 2020-11-13
CN111931482B CN111931482B (en) 2021-09-24

Family

ID=73334008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011003293.0A Active CN111931482B (en) 2020-09-22 2020-09-22 Text segmentation method and device

Country Status (1)

Country Link
CN (1) CN111931482B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1296587A (en) * 1998-02-02 2001-05-23 蓝道尔·C·沃克 Text processor
CN1279462A (en) * 1999-06-30 2001-01-10 国际商业机器公司 Method and device for parallelly having speech recognition, classification and segmentation of speaker
CN1906610A (en) * 2003-12-05 2007-01-31 皇家飞利浦电子股份有限公司 System and method for integrative analysis of intrinsic and extrinsic audio-visual data
CN101656069A (en) * 2009-09-17 2010-02-24 陈拙夫 Chinese voice information communication system and communication method thereof
CN107305541A (en) * 2016-04-20 2017-10-31 科大讯飞股份有限公司 Speech recognition text segmentation method and device
CN108597541A (en) * 2018-04-28 2018-09-28 南京师范大学 A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying
CN111324738A (en) * 2020-05-15 2020-06-23 支付宝(杭州)信息技术有限公司 Method and system for determining text label

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712794A (en) * 2020-12-25 2021-04-27 苏州思必驰信息科技有限公司 Speech recognition marking training combined system and device
CN112749544A (en) * 2020-12-28 2021-05-04 苏州思必驰信息科技有限公司 Training method and system for paragraph segmentation model
CN112749544B (en) * 2020-12-28 2024-04-30 思必驰科技股份有限公司 Training method and system of paragraph segmentation model
CN112951275A (en) * 2021-02-26 2021-06-11 北京百度网讯科技有限公司 Voice quality inspection method and device, electronic equipment and medium
CN112951275B (en) * 2021-02-26 2022-12-23 北京百度网讯科技有限公司 Voice quality inspection method and device, electronic equipment and medium
CN113076720A (en) * 2021-04-29 2021-07-06 新声科技(深圳)有限公司 Long text segmentation method and device, storage medium and electronic device
CN113076720B (en) * 2021-04-29 2022-01-28 新声科技(深圳)有限公司 Long text segmentation method and device, storage medium and electronic device
CN113673255A (en) * 2021-08-25 2021-11-19 北京市律典通科技有限公司 Text function region splitting method and device, computer equipment and storage medium
CN113673255B (en) * 2021-08-25 2023-06-30 北京市律典通科技有限公司 Text function area splitting method and device, computer equipment and storage medium
CN116016690A (en) * 2022-12-02 2023-04-25 国家工业信息安全发展研究中心 Automatic reverse analysis method and system for industrial private protocol

Also Published As

Publication number Publication date
CN111931482B (en) 2021-09-24

Similar Documents

Publication Publication Date Title
CN111931482B (en) Text segmentation method and device
CN110188343B (en) Multi-mode emotion recognition method based on fusion attention network
CN110349572B (en) Voice keyword recognition method and device, terminal and server
US10565983B2 (en) Artificial intelligence-based acoustic model training method and apparatus, device and storage medium
US20190080683A1 (en) Method and device for recognizing text segmentation position
CN111164601A (en) Emotion recognition method, intelligent device and computer readable storage medium
US20170069310A1 (en) Clustering user utterance intents with semantic parsing
CN108710704B (en) Method and device for determining conversation state, electronic equipment and storage medium
US11144800B2 (en) Image disambiguation method and apparatus, storage medium, and electronic device
CN112164391A (en) Statement processing method and device, electronic equipment and storage medium
CN111081280B (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN111145732B (en) Processing method and system after multi-task voice recognition
CN112786029B (en) Method and apparatus for training VAD using weakly supervised data
CN110503944B (en) Method and device for training and using voice awakening model
CN110930980A (en) Acoustic recognition model, method and system for Chinese and English mixed speech
WO2021134417A1 (en) Interactive behavior prediction method, intelligent device, and computer readable storage medium
US11947920B2 (en) Man-machine dialogue method and system, computer device and medium
CN112989935A (en) Video generation method, device, equipment and storage medium
CN112800919A (en) Method, device and equipment for detecting target type video and storage medium
CN110633475A (en) Natural language understanding method, device and system based on computer scene and storage medium
CN111159358A (en) Multi-intention recognition training and using method and device
CN111598979B (en) Method, device and equipment for generating facial animation of virtual character and storage medium
CN113505198A (en) Keyword-driven generating type dialogue reply method and device and electronic equipment
US20210158823A1 (en) Method, apparatus, and medium for processing speech signal
CN112818227A (en) Content recommendation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Ltd.

GR01 Patent grant
GR01 Patent grant