CN112749544A - Training method and system for paragraph segmentation model - Google Patents
- Publication number: CN112749544A (application CN202011583136.1A)
- Authority: CN (China)
- Prior art keywords: model, data, segmentation, field, punctuation
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/205 — Physics; Computing; Electric digital data processing; Handling natural language data; Natural language analysis; Parsing
- G06N3/04 — Physics; Computing; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology
- G06N3/08 — Physics; Computing; Computing arrangements based on specific computational models; Computing arrangements based on biological models; Neural networks; Learning methods
Abstract
An embodiment of the invention provides a method for training a paragraph segmentation model. The method comprises the following steps: pre-training a neural network model of the paragraph segmentation model with general segmentation data; and training the coding layer related to feature extraction in the pre-trained paragraph segmentation model on domain segmentation data to obtain a paragraph segmentation model adapted to the domain. An embodiment of the invention also provides a training system for the paragraph segmentation model. To address the problem that adapting to a specific domain requires a large amount of precisely labeled training data, the embodiment trains on a large amount of easily obtained general segmentation data and then fine-tunes on a small amount of precisely labeled domain data, so that the cost of domain adaptation can be effectively reduced. To address the problem of sensitivity to the output of the upstream punctuation model, the robustness of the segmentation model is improved, its dependence on upstream punctuation is reduced, and the output of the upstream punctuation model can be corrected.
Description
Technical Field
The invention relates to the field of intelligent speech, and in particular to a method and system for training a paragraph segmentation model.
Background
Paragraph segmentation is becoming increasingly important. For example, when a recording of a teacher's lecture is converted into text, the result is one large, unbroken block of characters. Paragraph segmentation splits this block into multiple paragraphs, so that paragraph boundaries are easier to see when the user reviews the text.
The existing methods on the market include paragraph segmentation based on traditional machine learning methods such as the SVM (Support Vector Machine), and paragraph segmentation based on neural networks such as the LSTM (Long Short-Term Memory network).
Paragraph segmentation is essentially a classification task: the model must predict, for each sentence in the document, whether a paragraph break should follow it, thereby completing the paragraph segmentation of the text.
The SVM-based paragraph segmentation method mainly learns a hyperplane that separates segmenting sentences from non-segmenting sentences in a high-dimensional space.
The LSTM-based paragraph segmentation method uses the encoder of a deep learning model, typified by the LSTM, to extract text features and predicts from those features whether each sentence requires a line break.
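The per-sentence classification view described above can be illustrated with a short sketch; the function name and list-based representation are hypothetical, not part of the patent:

```python
def apply_segmentation(sentences, break_flags):
    """Turn per-sentence break predictions into paragraphs.

    `sentences` is a list of sentence strings; `break_flags[i]` is True
    when the classifier predicts a paragraph break AFTER sentence i.
    This is an illustrative sketch, not the patented model itself.
    """
    paragraphs, current = [], []
    for sentence, brk in zip(sentences, break_flags):
        current.append(sentence)
        if brk:
            # close the current paragraph at a predicted break
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Any classifier — SVM or LSTM — that emits one boolean per sentence can drive this assembly step.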
In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:
1. The cost of domain adaptation is high. Text carrying paragraph segmentation information usually consists of well-formatted news articles, which are considerable in scale and easy to obtain. A model trained on such data segments poorly in a new domain, and a large amount of text in that domain must be labeled manually for retraining. Because the trained model contains no general knowledge of text processing, it can only learn from scratch on a large amount of manually annotated data.
2. Sensitivity to the output of the upstream punctuation model. The upstream punctuation model performs poorly on text from some domains; in particular, a low F1 for sentence-final punctuation such as periods greatly degrades the downstream segmentation model. In other words, the segmentation model's robustness is poor.
Disclosure of Invention
Embodiments of the invention aim to solve at least the problems in the prior art that domain adaptation is costly and that the segmentation model is sensitive to the output of the upstream punctuation model.
In a first aspect, an embodiment of the present invention provides a method for training a paragraph segmentation model, including:
pre-training a neural network model of the paragraph segmentation model by using general segmentation data;
and training the coding layer related to feature extraction in the pre-trained paragraph segmentation model based on the domain segmentation data to obtain a paragraph segmentation model adapted to the domain.
In a second aspect, an embodiment of the present invention provides a training system for a paragraph segmentation model, including:
the model pre-training program module is used for pre-training the neural network model of the paragraph segmentation model by using the general segmentation data;
and the segmentation model training program module is used for training the coding layer related to feature extraction in the pre-trained paragraph segmentation model based on the domain segmentation data to obtain a paragraph segmentation model adapted to the domain.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for training a paragraph segmentation model according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method for training a paragraph segmentation model according to any embodiment of the present invention.
The embodiments of the invention have the following beneficial effects. To address the problem that a specific domain requires a large amount of precisely labeled training data, a pre-trained model such as BERT is trained on a large amount of easily obtained general segmentation data and finally fine-tuned on a small amount of precisely labeled domain data, which effectively reduces the cost of domain adaptation. To address sensitivity to the output of the upstream punctuation model, new segmentation training data is constructed by combining segmentation information with the upstream punctuation output, and the distribution of punctuation marks appearing before segmentation labels is counted in order to introduce new sentence-splitting punctuation. This improves the robustness of the segmentation model, reduces its dependence on upstream punctuation, and allows the upstream punctuation output to be corrected.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flowchart of a method for training a paragraph segmentation model according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a paragraph segmentation general procedure of a method for training a paragraph segmentation model according to an embodiment of the present invention;
FIG. 3 is a structural data diagram of a method for training a paragraph segmentation model according to an embodiment of the present invention;
FIG. 4 is a data diagram of error correction effects of a segmentation model on punctuation model results of a training method of a paragraph segmentation model according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a training system for a paragraph segmentation model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a method for training a paragraph segmentation model according to an embodiment of the present invention, including the following steps:
s11: pre-training a neural network model of the paragraph segmentation model by using common segmentation data;
s12: and training the coding layer related to feature extraction in the pre-trained paragraph segmentation model based on the field segmentation data to obtain the paragraph segmentation model in the adaptation field.
The coding layer related to feature extraction in the neural network model of the paragraph segmentation model and the coding layer related to feature extraction of the domain-adapted paragraph segmentation model are shared, and are used for learning and extracting lexical, syntactic, and grammatical features.
In this embodiment, adapting an existing segmentation model to a new domain requires a large amount of data annotation, mainly because conventional training schemes do not exploit the fact that the low-level feature-extraction part of a Natural Language Processing (NLP) model can be shared across tasks.
For step S11, general corpora are relatively easy to obtain. Taking the Transformer, the currently mainstream neural network in NLP, as an example, the network generally has several layers: the lower coding layers learn general linguistic knowledge such as lexical, syntactic, and grammatical knowledge for feature extraction, while the upper coding layers learn task-specific knowledge. Therefore, when a Transformer model has been trained on massive data for one task, its lower coding layers can be reused for other NLP tasks with little data, reducing training overhead. Using this method, the neural network model of the paragraph segmentation model is pre-trained with massive general segmentation data.
As one embodiment, the neural network model includes a BERT model.
The encoder of the Transformer has a self-attention mechanism and supports bidirectional training, so it can obtain semantic representations at the sentence level, above the word level. To adapt to transfer learning across multiple tasks, BERT designs more general input and output layers. The BERT model is further chosen because its fine-tuning cost is low.
For step S12, only a small amount of domain segmentation data is needed to fine-tune the coding layer related to feature extraction (e.g., the lower coding layers described above) in the paragraph segmentation model trained in step S11, so the cost of domain adaptation can be effectively reduced.
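The two-phase schedule of steps S11 and S12 can be sketched with a toy model. This is an illustration only, not the patented BERT implementation; the layer names, layer counts, and the choice of which layers are updated in each phase are assumptions:

```python
class ToySegmenter:
    """Stand-in for a BERT-style model: `lower_*` layers extract general
    linguistic features, `upper_*` layers handle the segmentation task."""
    def __init__(self, n_lower=8, n_upper=4):
        self.step_counts = {f"lower_{i}": 0 for i in range(n_lower)}
        self.step_counts.update({f"upper_{i}": 0 for i in range(n_upper)})

    def update(self, layer_names):
        # record one gradient update per named layer
        for name in layer_names:
            self.step_counts[name] += 1


def two_phase_training(model, general_batches, domain_batches):
    every_layer = list(model.step_counts)
    feature_layers = [n for n in every_layer if n.startswith("lower_")]
    for _ in general_batches:      # phase 1: pre-train on general data
        model.update(every_layer)
    for _ in domain_batches:       # phase 2: fine-tune the feature encoder
        model.update(feature_layers)
    return model
```

After training, the lower (feature-extraction) layers have seen both phases, while the upper layers were only updated during pre-training — mirroring the shared-encoder idea of the embodiment.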
According to this embodiment, to address the problem that a specific domain requires a large amount of precisely labeled training data, a pre-trained model such as BERT is trained on a large amount of easily obtained general segmentation data and finally fine-tuned on a small amount of precisely labeled domain data, which effectively reduces the cost of domain adaptation.
As one implementation, in this embodiment the domain segmentation data is generated from an upstream punctuation model and manually labeled segmentation data, comprising:
inputting the raw domain data into the upstream punctuation model to obtain domain punctuation data;
receiving manually labeled segmentation data in which the raw domain data has been segmented by hand;
determining a sentence-ending symbol set based on the punctuation types in the manually labeled segmentation data, and segmenting the raw domain data to obtain manually segmented domain punctuation data;
and generating domain segmentation data carrying both punctuation information and segmentation information based on the domain punctuation data and the manually segmented domain punctuation data.
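The steps above can be sketched as a minimal pipeline, assuming the upstream punctuation output and the manual labels arrive as plain Python values; all names and representations here are illustrative:

```python
import re


def depunctuate(text):
    """Strip existing punctuation so the upstream model can re-predict it."""
    return re.sub(r"[，。？！、,.?!]", "", text)


def merge_punctuation_and_breaks(punctuated_sentences, break_after):
    """Combine upstream-punctuated sentences with manual paragraph labels
    into training examples carrying both punctuation and segmentation
    information.  `break_after` is the set of sentence indices followed
    by a manually labeled paragraph break."""
    return [
        {"text": s, "break": i in break_after}
        for i, s in enumerate(punctuated_sentences)
    ]
```

In a real system `punctuated_sentences` would come from running the upstream punctuation model on the depunctuated raw domain text.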
As an embodiment, before inputting the raw domain data into the upstream punctuation model, the method further comprises: performing punctuation removal processing on the raw domain data.
In this embodiment, existing segmentation models depend heavily on the upstream punctuation output, mainly because existing techniques assume by default that the upstream punctuation output is of high quality, so the division into sentence units relies entirely on upstream sentence-final punctuation such as periods. However, in real business scenarios such as spoken dialogue, the output quality of the upstream punctuation model is poor; in particular, its prediction of sentence-final marks is inaccurate. Statistics show that in such scenarios the positions predicted by the punctuation model are generally accurate, but the predicted punctuation types are wrong. We tried both completely and partially decoupling the segmentation model from the punctuation model and, from a practical standpoint, finally selected partial decoupling as the final scheme. The specific flow of partial decoupling is shown in fig. 2:
Prepare some domain data, and determine whether punctuation removal is needed according to whether the domain data carries punctuation: if the domain data has no punctuation, it is input directly into the upstream punctuation model; if it has punctuation, the punctuation is first removed. Since the subsequent steps restore punctuation, the prepared domain data does not need to carry punctuation at this step.
After punctuation removal, the domain data is input into the upstream punctuation model to obtain domain data carrying the upstream punctuation output; at the same time, manually labeled segmentation data is obtained by segmenting the data by hand.
Combining the manually labeled segmentation data with the output of the upstream punctuation model, the punctuation types appearing before the manual segmentation marks are counted to form a sentence-ending symbol set, which is used to divide the input text into sentences and to construct in-domain training data carrying both punctuation information and segmentation information.
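The punctuation statistics described above can be sketched as follows; representing manual breaks as character offsets and thresholding on a `min_count` are assumptions made for illustration:

```python
from collections import Counter


def sentence_end_symbols(punctuated_text, break_positions, min_count=1):
    """Count which punctuation marks occur immediately before a manual
    paragraph break, and keep the frequent ones as the sentence-ending
    symbol set.  `break_positions` are character offsets of the breaks
    in `punctuated_text`."""
    counts = Counter(
        punctuated_text[p - 1]
        for p in break_positions
        if p > 0 and not punctuated_text[p - 1].isalnum()
    )
    return {mark for mark, n in counts.items() if n >= min_count}
```

Raising `min_count` filters out marks that precede breaks only incidentally, keeping the symbol set to the genuinely frequent sentence-enders.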
Whether a first round of fine-tuning on the general segmentation corpus is needed is chosen according to business needs: generally, if the model being trained serves only one specific domain, training on the general segmentation corpus can be skipped; otherwise, the general corpus is used by default. After this pre-training, a second round of fine-tuning based on the pre-trained model is performed with the domain segmentation data, as already described in steps S11 and S12 and not repeated here.
After the paragraph segmentation model is trained, it can receive a large block of text input by a user and segment it into paragraphs. If the final punctuation of the last sentence of a segmented paragraph is not a conventional ending mark such as a period or question mark (for example, it is a comma), it is uniformly changed to a period. The segmented text therefore better conforms to punctuation usage and is returned to the user.
According to this embodiment, to address sensitivity to the output of the upstream punctuation model, new segmentation training data is constructed by combining segmentation information with the upstream punctuation output, and the distribution of punctuation marks appearing before segmentation labels is counted in order to introduce new sentence-splitting punctuation. For example, the statistics show that besides sentence-final marks such as periods, commas also frequently appear before segmentation marks; we therefore treat the comma as a sentence-splitting punctuation mark, and train and predict on sentences divided according to this new punctuation set. If the model finally predicts that a segment break is required at a particular comma, we change the comma to a period and insert the paragraph break. This improves the robustness of the segmentation model, reduces its dependence on upstream punctuation, and allows the upstream punctuation output to be corrected.
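The comma-to-period correction at predicted breaks might look like the following sketch; the ending-mark set and the per-sentence representation are assumptions:

```python
def correct_break_punctuation(sentences, break_flags,
                              enders=("。", "？", "！")):
    """When the model predicts a break after a sentence whose trailing
    punctuation is not a conventional ending mark (e.g. a comma),
    rewrite that mark to a period before emitting the paragraph break."""
    out = []
    for s, brk in zip(sentences, break_flags):
        if brk and s and s[-1] not in enders:
            s = s[:-1] + "。"   # correct the upstream punctuation output
        out.append(s)
    return out
```

This is how the segmentation model's decision can correct the upstream punctuation model's output, as described above.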
The method was tested. Objective evaluation (F1 value):
The paragraph segmentation model trained without the method: 24
The paragraph segmentation model trained with the method: 94
Subjective evaluation (manual score, out of 42):
The paragraph segmentation model trained without the method: 20.3
The paragraph segmentation model trained with the method: 33.3
Conclusion: the segmentation quality is clearly improved, and only a small amount of labeled corpus is needed.
The paragraph segmentation model is partially decoupled from the punctuation model:
as can be seen from FIG. 3, the segmented model, which is not decoupled from the upstream punctuation model, is greatly affected by the output of the upstream punctuation, and its F1 value at the segment sharply decreases from 88 to 36 when the punctuation changes from an artificial punctuation to a system punctuation.
And the performance is stable all the time by adopting a segmented model of a partial decoupling scheme, and F1 values of an artificial punctuation mark and a system punctuation mark at the segmentation position are respectively 92 and 94.
And (4) conclusion: the partial decoupling scheme can obviously improve the robustness of the model and can bring the benefit of improving the segment quality.
The deeper effect is as shown in fig. 4, the segmented model evaluates the error correction effect of the punctuation model result, and the conclusion is that: the scheme can also additionally improve the performance of the punctuation model, and further improve the text reading experience of a user.
On the other hand, complete decoupling is considered as our alternative:
slicing the text to obtain a plurality of sliced texts;
judging whether the slice text needs to be segmented or not based on a paragraph segmentation model;
and if segmentation is needed, inputting the sliced text into the upstream punctuation model and determining the exact segmentation position within the slice based on the output of the upstream punctuation model.
In this embodiment, the principle of complete decoupling is as follows: the input of the segmentation model is consistent with the input of the punctuation model. The text to be segmented is sliced according to a fixed window size, and the model then predicts whether each slice needs to be segmented. Combining the output of the punctuation model within each window, the exact position of the segment break can be determined.
Further, if the window size is chosen appropriately, the number of punctuation marks at the end of each slice, according to the punctuation model's output, can be kept to at most one, so the exact segmentation position can be determined more accurately.
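The complete-decoupling flow above can be sketched as follows; the callable standing in for the segmentation model and the offset-based punctuation representation are assumptions:

```python
def segment_fully_decoupled(text, window, needs_break, punct_positions):
    """Fully decoupled sketch: slice `text` into fixed-size windows, ask
    the segmentation model (`needs_break`, a callable on a slice) whether
    each slice should be segmented, and use upstream punctuation offsets
    only to pick the exact break point inside a slice."""
    breaks = []
    for start in range(0, len(text), window):
        chunk = text[start:start + window]
        if not needs_break(chunk):
            continue
        in_chunk = [p for p in punct_positions
                    if start < p <= start + len(chunk)]
        # break at the last punctuation inside the window, else at its end
        breaks.append(in_chunk[-1] if in_chunk else start + len(chunk))
    return breaks
```

Note that the yes/no segmentation decision never consults the punctuation model — it only refines where, not whether, to break.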
It can be seen from this embodiment that this segmentation approach is completely decoupled from the upstream punctuation model: the segmentation decision is made entirely by the segmentation model, and the upstream punctuation model only supplies the exact break position when needed. The segmentation result is therefore not affected by upstream punctuation at all, and the robustness of the model is significantly improved.
Fig. 5 is a schematic structural diagram of a training system for a paragraph segmentation model according to an embodiment of the present invention, which can execute the method for training a paragraph segmentation model according to any of the above embodiments and is configured in a terminal.
The present embodiment provides a training system 10 for a paragraph segmentation model, which includes: a model pre-training program module 11 and a segmentation model training program module 12.
The model pre-training program module 11 is configured to pre-train a neural network model of the paragraph segmentation model by using general segmentation data; the segmentation model training program module 12 is configured to train the coding layer related to feature extraction in the pre-trained paragraph segmentation model based on the domain segmentation data to obtain a paragraph segmentation model adapted to the domain.
Further, the coding layer related to feature extraction in the neural network model of the paragraph segmentation model and the coding layer related to feature extraction of the domain-adapted paragraph segmentation model are shared, and are used for learning and extracting lexical, syntactic, and grammatical features.
An embodiment of the invention also provides a non-volatile computer storage medium storing computer-executable instructions that can execute the method for training a paragraph segmentation model in any of the method embodiments above.
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
pre-training a neural network model of the paragraph segmentation model by using general segmentation data;
and training the coding layer related to feature extraction in the pre-trained paragraph segmentation model based on the domain segmentation data to obtain a paragraph segmentation model adapted to the domain.
As a non-volatile computer-readable storage medium, it may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the method for training a paragraph segmentation model in any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for training a paragraph segmentation model according to any embodiment of the present invention.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices such devices may display and play multimedia content. The devices comprise audio and video players, handheld game consoles, electronic books, intelligent toys and portable vehicle-mounted navigation devices.
(4) Other electronic devices with data processing capabilities.
As used herein, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A method for training a paragraph segmentation model, comprising:
pre-training a neural network model of the paragraph segmentation model by using general segmentation data;
and training the coding layer related to feature extraction in the pre-trained paragraph segmentation model based on the domain segmentation data to obtain a paragraph segmentation model adapted to the domain.
2. The method according to claim 1, wherein the coding layer related to feature extraction in the neural network model of the paragraph segmentation model and the coding layer related to feature extraction of the field-adapted paragraph segmentation model are shared for learning and extracting lexical, syntactic and grammatical features.
3. The method of claim 1, wherein the neural network model comprises a BERT model.
4. The method of claim 1, wherein the field segmentation data is generated from an upstream punctuation model and manually annotated segmentation data, comprising:
inputting original field data into the upstream punctuation model to obtain segmented field punctuation data;
receiving segmented manual annotation data obtained by manually annotating the original field data;
determining a sentence-ending symbol set based on the punctuation types in the segmented manual annotation data, and segmenting the original field data accordingly to obtain segmented manual field punctuation data;
and generating field segmentation data carrying both punctuation information and segmentation information based on the field punctuation data and the manual field punctuation data.
5. The method of claim 4, wherein prior to said inputting the original field data into the upstream punctuation model, the method further comprises: removing punctuation from the original field data.
6. The method of any of claims 1-5, wherein the field segmentation data has a smaller data volume than the general segmentation data.
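Claims 4 and 5 together describe how the field segmentation data is produced: strip punctuation from the original field data, re-punctuate it with the upstream punctuation model, segment the original data on the sentence-ending symbols found in the manual annotations, and merge the two results. A minimal Python sketch of this pipeline, in which `punctuation_model` and `manual_annotations` are hypothetical stand-ins for the upstream model and the human-labelled data (neither name is from the patent):

```python
import re

def generate_field_segmentation_data(raw_texts, punctuation_model, manual_annotations):
    """Illustrative sketch of the data-generation pipeline in claims 4-5."""
    # Claim 5: first remove existing punctuation from the original field data.
    stripped = [re.sub(r"[,.!?;:\u3001\u3002\uff0c\uff01\uff1f\uff1b\uff1a]", "", t)
                for t in raw_texts]

    # Claim 4, step 1: the upstream punctuation model re-punctuates the
    # stripped text, yielding segmented field punctuation data.
    punctuated = [punctuation_model(t) for t in stripped]

    # Claim 4, steps 2-3: collect the sentence-ending symbols that occur in
    # the manual annotations, then segment the original data on that set.
    enders = {ch for ann in manual_annotations for ch in ann if ch in "。！？.!?"}
    pattern = "[" + re.escape("".join(sorted(enders))) + "]" if enders else r"[.!?]"
    manual_segmented = [[piece for piece in re.split(f"({pattern})", t) if piece]
                        for t in raw_texts]

    # Claim 4, step 4: pair punctuation information with segmentation
    # information to form the field segmentation data.
    return list(zip(punctuated, manual_segmented))
```

The upstream punctuation model is treated here as an opaque callable; in the patent it is a separately trained model whose output supplies the punctuation information, while the sentence-ending symbol set derived from the manual annotations supplies the segmentation information.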
7. A system for training a paragraph segmentation model, comprising:
a model pre-training program module, configured to pre-train a neural network model of the paragraph segmentation model by using general segmentation data;
and a segmentation model training program module, configured to train the coding layer related to feature extraction in the pre-trained paragraph segmentation model based on field segmentation data, to obtain a field-adapted paragraph segmentation model.
8. The system of claim 7, wherein the coding layer related to feature extraction in the neural network model of the paragraph segmentation model and the coding layer related to feature extraction of the field-adapted paragraph segmentation model are shared, for learning and extracting lexical, syntactic and grammatical features.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-6.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
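The two-stage procedure of claims 1-3 and 7-8 — pre-train the whole model on general segmentation data, then adapt it to a field by updating only the shared feature-extraction coding layer — can be sketched as follows. This is an illustrative PyTorch sketch, not the patented implementation: a toy transformer encoder stands in for BERT, and all class, function, and variable names are assumptions rather than terms from the patent.

```python
import torch
import torch.nn as nn

class ParagraphSegmenter(nn.Module):
    """Toy stand-in for the patented model: a shared feature-extraction
    encoder (BERT in the patent, claim 3) plus a token-level segmentation head."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Embedding(vocab, dim),
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
        )
        self.head = nn.Linear(dim, 2)  # paragraph boundary / non-boundary

    def forward(self, x):
        return self.head(self.encoder(x))

def train(model, batches, params):
    """One pass over the given batches, optimizing only `params`."""
    opt = torch.optim.Adam(params, lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for x, y in batches:
        opt.zero_grad()
        loss_fn(model(x).reshape(-1, 2), y.reshape(-1)).backward()
        opt.step()

torch.manual_seed(0)
# Dummy stand-ins for the two corpora; per claim 6 the field data
# is smaller than the general data.
general_batches = [(torch.randint(0, 1000, (4, 16)), torch.randint(0, 2, (4, 16)))] * 3
field_batches = [(torch.randint(0, 1000, (4, 16)), torch.randint(0, 2, (4, 16)))]

model = ParagraphSegmenter()
# Stage 1 (claim 1): pre-train the whole model on general segmentation data.
train(model, general_batches, model.parameters())
# Stage 2 (claims 1-2): field adaptation updates only the shared
# feature-extraction encoder; the head keeps its pre-trained weights.
for p in model.head.parameters():
    p.requires_grad = False
train(model, field_batches, model.encoder.parameters())
```

Freezing the head while updating the encoder reflects claim 2's point that the coding layer is shared between the general model and the field-adapted model: the adaptation reshapes the lexical, syntactic and grammatical features while the downstream segmentation decision layer is left unchanged.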
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011583136.1A CN112749544B (en) | 2020-12-28 | 2020-12-28 | Training method and system of paragraph segmentation model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112749544A true CN112749544A (en) | 2021-05-04 |
CN112749544B CN112749544B (en) | 2024-04-30 |
Family
ID=75646287
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011583136.1A Active CN112749544B (en) | 2020-12-28 | 2020-12-28 | Training method and system of paragraph segmentation model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112749544B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110222654A (en) * | 2019-06-10 | 2019-09-10 | 北京百度网讯科技有限公司 | Text segmenting method, device, equipment and storage medium |
CN110427482A (en) * | 2019-07-31 | 2019-11-08 | 腾讯科技(深圳)有限公司 | A kind of abstracting method and relevant device of object content |
CN111553147A (en) * | 2020-03-27 | 2020-08-18 | 南京工业大学 | BERT model based on N-gram and semantic segmentation method |
CN111930937A (en) * | 2020-06-28 | 2020-11-13 | 山东师范大学 | BERT-based intelligent government affair text multi-classification method and system |
CN111931482A (en) * | 2020-09-22 | 2020-11-13 | 苏州思必驰信息科技有限公司 | Text segmentation method and device |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113641793A (en) * | 2021-08-16 | 2021-11-12 | 国网安徽省电力有限公司电力科学研究院 | Retrieval system for long text matching optimization aiming at power standard |
CN113641793B (en) * | 2021-08-16 | 2024-05-07 | 国网安徽省电力有限公司电力科学研究院 | Retrieval system for long text matching optimization aiming at electric power standard |
CN113673255A (en) * | 2021-08-25 | 2021-11-19 | 北京市律典通科技有限公司 | Text function region splitting method and device, computer equipment and storage medium |
CN113673255B (en) * | 2021-08-25 | 2023-06-30 | 北京市律典通科技有限公司 | Text function area splitting method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province; Applicant after: Sipic Technology Co.,Ltd.; Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province; Applicant before: AI SPEECH Co.,Ltd. ||
GR01 | Patent grant |