CN112580326A - Punctuation mark model and training system thereof - Google Patents


Info

Publication number
CN112580326A
Authority
CN
China
Prior art keywords
punctuation
training
corpus
punctuation mark
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910925842.0A
Other languages
Chinese (zh)
Inventor
沈大框
陈培华
陈成才
Current Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Original Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Xiaoi Robot Technology Co Ltd
Priority to CN201910925842.0A
Publication of CN112580326A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods


Abstract

A punctuation mark model and its training system. The system includes: a corpus acquiring unit that acquires a first training corpus containing punctuation marks and a second training corpus containing punctuation marks; a corpus processing unit that removes the punctuation marks from the second training corpus and labels corresponding label combinations at the positions of the word segmentation units before and after each removed punctuation mark to obtain a third training corpus; a pre-training unit that inputs the first training corpus into a preset neural network model with a time sequence for pre-training to obtain a pre-trained language sub-model; and a transfer learning training unit that inputs the third training corpus into an initial punctuation mark model for transfer learning training to obtain a trained punctuation mark model. With this scheme, a large amount of training data does not need to be labeled manually, the recall rate is improved, and the obtained punctuation mark labeling model has good generalization ability and universality.

Description

Punctuation mark model and training system thereof
Technical Field
The embodiment of the invention relates to the technical field of computer natural language processing, in particular to a punctuation mark marking model and a training system thereof.
Background
Existing punctuation recovery schemes usually adopt sequence labeling and are mainly applied to restoring punctuation in text obtained by speech recognition. They can generally only add simple punctuation marks, such as commas and periods. Punctuation added in this way has low accuracy and poor generalization ability on the one hand; on the other hand, the restored punctuation lacks richness, giving readers a poor reading experience.
With the continuous development of deep learning, punctuation marks in text obtained by speech recognition can be predicted by a trained neural network model, which improves accuracy. However, training such a neural network model requires a large amount of manually labeled training data, and the recall rate of models obtained by existing training methods is low. In addition, the punctuation mark models currently adopted have weak generalization ability and poor universality.
Disclosure of Invention
In view of this, an aspect of the embodiments of the present invention provides a training system for a punctuation mark model, which does not need to manually mark a large amount of training data and can improve the recall rate.
In another aspect of the embodiments of the present invention, a punctuation mark model is provided, which has better generalization ability and universality.
The embodiment of the invention provides a training system of a punctuation mark model, which comprises: a training corpus acquiring unit adapted to acquire a first training corpus containing punctuation marks and a second training corpus containing punctuation marks; a corpus processing unit adapted to remove punctuation marks from the second training corpus and label corresponding label combinations at the word segmentation units before and after the removed punctuation marks to obtain a third training corpus, wherein the third training corpus comprises a punctuation-free text set and a label sequence set, the label sequence set comprising the label combinations; a pre-training unit adapted to input the first training corpus into a preset neural network model with a time sequence for pre-training to obtain a pre-trained language sub-model; and a transfer learning training unit adapted to input the third training corpus into an initial punctuation mark model for transfer learning training to obtain a trained punctuation mark model.

Optionally, the corpus processing unit includes: an identification subunit adapted to identify punctuation marks of the second training corpus according to a preset punctuation mark set; a removing subunit adapted to remove the punctuation marks in the second training corpus to obtain a corresponding punctuation-free text set; and a labeling subunit adapted to label, according to a preset punctuation mark position label set, label combinations with corresponding position relations at the word segmentation units before and after each removed punctuation mark to obtain a corresponding label sequence set.
Optionally, the labeling subunit includes at least one of: the first labeling module is suitable for labeling label combinations with corresponding position relations at the word segmentation units before and after each removed punctuation mark according to the punctuation mark position label set when the word segmentation units are both before and after the removed punctuation mark; and the second labeling module is suitable for adding a vacancy filling symbol before or after the removed punctuation mark to fill the position of the vacancy word segmentation unit when a word segmentation unit is not arranged before or after the removed punctuation mark, and labeling label combinations with corresponding position relations at the word segmentation unit position after or before the punctuation mark and the vacancy filling symbol according to the punctuation mark position label set.
Optionally, the set of punctuation marks comprises: a start position label arranged at the word segmentation unit before each punctuation mark and an end position label arranged at the word segmentation unit after each punctuation mark; the labeling subunit further comprises at least one of: the third labeling module is suitable for labeling corresponding initial position labels at word segmentation units before other punctuations except the last punctuation when a plurality of continuously removed punctuations are the same punctuation and only one word segmentation unit exists between the punctuations; and the fourth labeling module is suitable for adding an interval symbol after each word segmentation unit when no word segmentation unit exists between two continuous removed punctuations, labeling a starting position label corresponding to the former punctuation at the word segmentation unit before the removed former punctuation, and labeling a starting position label and an ending position label corresponding to the latter punctuation at the interval symbol before the removed latter punctuation and the word segmentation unit after the removed latter punctuation.
Optionally, the set of punctuation marks further comprises: no blank position label corresponding to the punctuation mark; the labeling subunit further comprises: and the fifth marking module is suitable for marking the blank position label at the word segmentation unit when no punctuation mark exists before and after one word segmentation unit.
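To make the labeling scheme concrete, the following is a minimal sketch of the corpus-processing step: punctuation is stripped from a sentence, a start-position label is attached to the word segmentation unit before each removed mark and an end-position label to the unit after it, a blank label marks units with no adjacent punctuation, and a vacancy filling symbol fills a missing unit at the head or tail. All tag strings ("B-"/"E-" prefixes, "O", "<pad>") and the tiny punctuation set are illustrative assumptions, not fixed by the embodiment, and the interval-symbol handling for consecutive marks is simplified.

```python
# Hypothetical corpus-processing sketch: strip punctuation and emit paired
# start/end position labels, a blank "O" label elsewhere, and a "<pad>"
# vacancy filling symbol when a mark has no neighbouring unit.
PUNCT = {",": "comma", "。": "period"}  # illustrative punctuation mark set
PAD = "<pad>"                           # assumed vacancy filling symbol

def strip_and_label(chars):
    """chars: units (e.g. Chinese characters) interleaved with punctuation.
    Returns (punctuation-free text, label sequence)."""
    text, labels = [], []
    pending = None  # end-position label waiting for the next real unit

    def emit(unit, tag):
        text.append(unit)
        labels.append(tag)

    for ch in chars:
        if ch in PUNCT:
            name = PUNCT[ch]
            if not text or pending is not None:
                # no real unit directly before this mark: fill with a pad
                emit(PAD, pending or "O")
                pending = None
            # start-position label attaches to the unit before the mark
            if labels[-1] == "O":
                labels[-1] = "B-" + name
            else:
                labels[-1] += "|B-" + name  # combine with an earlier label
            pending = "E-" + name
        else:
            emit(ch, pending or "O")  # end label of the previous mark, or blank
            pending = None
    if pending is not None:
        emit(PAD, pending)  # mark at end of text: pad the tail
    return "".join(text), labels
```

In this sketch the label combination ("B-comma" on the preceding unit, "E-comma" on the following unit) appears in pairs, matching the paired-label property the scheme relies on.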
Optionally, the pre-training unit comprises at least one of: a first pre-training subunit adapted to acquire a random first training text from the first training corpus and input it into the neural network model, where the neural network model predicts the next word segmentation unit of the first training text given the preceding context, and the pre-trained language sub-model is obtained when the probability of accurate prediction reaches a preset first threshold; and a second pre-training subunit adapted to acquire from the first training corpus a second training text in which a preset proportion of the text is randomly covered and input it into the neural network model, where the neural network model predicts the covered portion given the surrounding context, and the pre-trained language sub-model is obtained when the probability of accurate prediction reaches a preset second threshold.
Optionally, the pre-training unit further comprises: a third pre-training subunit adapted to acquire from the first training corpus a third training text in which a sentence is randomly covered, input it into the neural network model, have the neural network model predict the covered sentence given the preceding context, and obtain the pre-trained language sub-model when the probability of accurate prediction reaches a preset third threshold.
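The second pre-training objective above, predicting a randomly covered preset-proportion portion from its context, amounts to generating masked training examples. A minimal data-preparation sketch, assuming a 15% mask ratio and a "[MASK]" placeholder token; both values and the function name are assumptions, not fixed by the embodiment:

```python
import random

def make_masked_example(tokens, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Randomly cover a preset proportion of a training text; the model is
    then pre-trained to predict the covered units from the remaining context.
    Returns (masked token list, {position: original token})."""
    rng = random.Random(seed)  # deterministic for illustration
    n = max(1, int(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]   # remember the covered unit as the target
        masked[p] = mask_token   # cover it in the model input
    return masked, targets
```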
Optionally, the punctuation mark model further comprises a preset conditional random field sub-model, and the transfer learning training unit includes: an extraction subunit adapted to extract a section of punctuation-free text from the punctuation-free text set obtained by the corpus processing unit, together with the corresponding first label sequence of that text in the label sequence set; a transfer training subunit adapted to input the punctuation-free text and the first label sequence into the punctuation mark marking model, where the pre-trained language sub-model acquires sentence characteristic information corresponding to the punctuation-free text, and the sentence characteristic information is input into the conditional random field sub-model to predict a second label sequence corresponding to the punctuation-free text; an error calculation subunit adapted to determine the error between the second label sequence and the first label sequence; and a fine-tuning subunit adapted to fine-tune the parameters of the punctuation mark marking model when the error is larger than a preset error threshold, until the error between the second label sequence and the first label sequence is no larger than the error threshold, thereby obtaining the trained punctuation mark marking model.
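The error-calculation and fine-tuning loop described above can be sketched independently of any particular network. Here `model_step` stands in for one fine-tuning update of the punctuation mark marking model and is a hypothetical callable; the per-position disagreement measure and the stopping rule are illustrative assumptions:

```python
def sequence_error(pred, gold):
    """Fraction of positions where the predicted (second) label sequence
    disagrees with the reference (first) label sequence."""
    assert len(pred) == len(gold)
    wrong = sum(p != g for p, g in zip(pred, gold))
    return wrong / len(gold)

def fine_tune(model_step, batches, err_threshold=0.05, max_epochs=100):
    """Keep fine-tuning until the error on every batch falls to or below the
    threshold. `model_step(text, gold)` is assumed to update the model's
    parameters and return its current predicted label sequence for `text`."""
    for _ in range(max_epochs):
        worst = 0.0
        for text, gold in batches:
            pred = model_step(text, gold)
            worst = max(worst, sequence_error(pred, gold))
        if worst <= err_threshold:
            return True   # converged: trained model obtained
    return False          # did not converge within the epoch budget
```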
Optionally, the first corpus comprises at least one of: text data containing punctuation marks captured from the Internet; text data obtained by speech recognition of material captured from the Internet and then labeled with punctuation; and manually entered text data containing punctuation marks. The second corpus comprises at least one of: text data taken from the first corpus; text data containing punctuation in a preset field captured from the Internet; and manually entered text data containing punctuation in a preset field.
The embodiment of the invention also provides a punctuation mark marking model, which comprises the following steps: the input layer is suitable for extracting corresponding word segmentation unit characteristic information from the text to be added with punctuations; the coding layer is suitable for capturing sentence characteristic information corresponding to the word segmentation unit characteristic information through a pre-trained language sub-model, wherein the pre-trained language sub-model is a neural network model with a time sequence; the conditional random field network layer is suitable for predicting positions to be added with punctuation marks and corresponding punctuation mark types in the texts to be added with the punctuation marks based on the sentence characteristic information, and labeling by adopting label combinations corresponding to the punctuation mark types at word segmentation units before and after the positions to be added with the punctuation marks; and the output layer is suitable for outputting the corresponding punctuation mark marking result.
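At inference time, the label sequence predicted by the conditional random field network layer must be turned back into punctuated text. A minimal sketch, assuming the end-position tag names used in the earlier examples ("E-comma", "E-period") and a "<pad>" vacancy symbol, none of which are fixed by the embodiment:

```python
END2PUNCT = {"E-comma": ",", "E-period": "。"}  # assumed tag-to-mark mapping

def restore_punctuation(units, labels):
    """Given a punctuation-free unit sequence and its predicted label
    sequence, insert a mark before each unit that carries an end-position
    label; pad symbols only carry labels and are dropped from the output."""
    out = []
    for u, lab in zip(units, labels):
        for part in lab.split("|"):      # a unit may carry a combined label
            if part in END2PUNCT:
                out.append(END2PUNCT[part])
        if u != "<pad>":
            out.append(u)
    return "".join(out)
```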
By adopting the training scheme of the punctuation mark model of the embodiment of the invention, the acquired first training corpus containing punctuation marks is input into a preset neural network model with a time sequence for pre-training to obtain a pre-trained language sub-model. Meanwhile, the punctuation marks of the acquired second training corpus are removed, and corresponding label combinations are labeled at the word segmentation units before and after each removed punctuation mark to obtain a third training corpus. The third training corpus is then input into an initial punctuation mark model for transfer learning training, yielding a trained punctuation mark model.

Labeling with label combinations takes full account of the information of the word segmentation units both before and after each punctuation mark, so the label distribution of the training corpus is more balanced. This avoids the label imbalance caused by labeling only the unit in front of a punctuation mark, avoids an excess of invalid labels, and improves the accuracy of punctuation prediction. Moreover, after pre-training, the neural network model with a time sequence can deeply capture contextual information in text, preparing it for the transfer learning training of the punctuation mark model. Semi-supervised learning is thus realized: a large amount of data does not need to be labeled manually, the transfer learning training makes the whole punctuation mark model converge quickly, the recall rate is improved, and good results can be achieved with a small data set, so a universal and powerful punctuation mark model with good generalization ability and universality can be obtained.
Further, when there are word segmentation units both before and after a removed punctuation mark, label combinations with the corresponding position relations can be labeled at the units before and after the removed mark according to the punctuation mark position label set. When there is no word segmentation unit before or after a removed punctuation mark, a vacancy filling symbol can be added before or after the removed mark to fill the position of the missing unit; the filled vacancy filling symbol is treated as a word segmentation unit when the label combination is applied, and the label combination with the corresponding position relation is then labeled at the unit and the vacancy filling symbol after or before the punctuation mark according to the punctuation mark position label set. With this scheme, the label combinations are labeled at the word segmentation units before and after each punctuation mark according to the punctuation mark position label set, so the position labels corresponding to a punctuation mark appear in pairs. When there is no word segmentation unit before or after a removed mark, adding the vacancy filling symbol to fill the missing position avoids the problem of a punctuation mark appearing at the head or tail of the text with no unit to label, so the label combination corresponding to a punctuation mark always appears in pairs.
Further, the punctuation mark position label set comprises a start position label placed at the word segmentation unit before each punctuation mark and an end position label placed at the word segmentation unit after each punctuation mark. When several consecutively removed punctuation marks are the same mark and only one word segmentation unit lies between them, only the corresponding start position labels are labeled at the units before each mark except the last one. When there is no word segmentation unit between two consecutive removed marks, an interval symbol may be added after each word segmentation unit; the start position label corresponding to the former mark is labeled at the unit before the removed former mark, and the start and end position labels corresponding to the latter mark are labeled at the interval symbol before the removed latter mark and at the unit after it, respectively. This scheme optimizes the labeling mode, so that punctuation marks separated by few word segmentation units, or marks that tend to appear consecutively, can be predicted more accurately during model training.
Further, the punctuation mark position label set may also include a blank position label indicating that no punctuation mark is present; when no punctuation mark exists before or after a word segmentation unit, the blank position label is marked at that unit according to the punctuation mark position label set. Through the blank position label, each word segmentation unit in the third training corpus can carry a corresponding punctuation mark position label, which further improves the balance of the label distribution and facilitates fine-tuning during later model training.
Furthermore, by inputting the first training corpus into the preset neural network model with a time sequence and predicting the next word segmentation unit of the first training text given the preceding context, or predicting a covered preset-proportion portion given the surrounding context, the pre-trained language sub-model's ability to represent the features of adjacent sentences in the text corpus can be enhanced. By inputting the first training corpus into the preset neural network model with a time sequence and predicting a covered sentence given the preceding text, the pre-trained language sub-model's ability to understand the relationship between preceding and following paragraphs can be further enhanced.

By adopting the punctuation mark model of the embodiment of the invention, the sentence characteristic information corresponding to the word segmentation unit characteristic information can be captured by the pre-trained language sub-model of the coding layer; the conditional random field network layer then predicts the positions where punctuation marks should be added in the text and the corresponding punctuation mark types, and labels the word segmentation units before and after each such position with the label combination corresponding to the punctuation mark type.
As can be seen from the above, the punctuation mark model comprises a pre-trained language sub-model, which can fully consider the information of the word segmentation units before and after each punctuation mark, so that the conditional random field network layer can predict punctuation mark types more accurately. Labeling with label combinations also makes the labels more uniformly distributed, which avoids label imbalance and an excess of invalid labels, improving the accuracy of punctuation labeling. Moreover, the punctuation mark model obtained by the training system can be used both to add punctuation to text to be punctuated and to correct the punctuation of existing text; it has high accuracy, strong universality, and a wide application range.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of this specification, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of this specification; a person skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 is a schematic structural diagram of a training system for a punctuation mark labeling model according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of obtaining a third corpus according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of another example of obtaining a third corpus according to an embodiment of the present invention.
FIG. 4 is a schematic structural diagram of a punctuation mark model according to an embodiment of the present invention.
Detailed Description
As described above, although the accuracy of punctuation mark addition is greatly improved when a neural network is used, this approach requires a large amount of training data to be prepared in advance for model training. The training data is usually a punctuation-free corpus generated by Automatic Speech Recognition (ASR), which must be labeled manually before training, a time-consuming and labor-intensive process; and when the trained neural network model is then used to predict punctuation for punctuation-free text obtained by speech recognition, the recall rate of the prediction results is often low. In addition, the training data of current neural network models considers only the information of the word before each punctuation mark, so the label distribution of the training data is unbalanced, and neural network models trained in this way have poor generalization ability and are not universal.
In view of the above problems, in the embodiments of the present specification, an obtained first corpus containing punctuation marks is input into a preset neural network model with a time sequence for pre-training, so as to obtain a pre-trained language sub-model, and meanwhile, a third corpus can be obtained by removing the punctuation marks of an obtained second corpus containing punctuation marks and labeling corresponding label combinations at the word segmentation units before and after the removed punctuation marks; and inputting the third training corpus into an initial punctuation mark model for transfer learning training, so as to obtain a trained punctuation mark model.
The word segmentation unit is the minimum sentence component unit of the language of the second corpus: for Chinese the minimum sentence component unit is the character, while for English it is the word. The concrete form of the word segmentation unit is therefore determined by the language of the second corpus.
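A minimal illustration of language-dependent word segmentation units; the language codes and the whitespace handling are assumptions for the sketch:

```python
def segment(text, language):
    """Split text into word segmentation units: a character for Chinese,
    a whitespace-delimited word for English."""
    if language == "zh":
        return [ch for ch in text if not ch.isspace()]
    if language == "en":
        return text.split()
    raise ValueError("unsupported language: " + language)
```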
Labeling with the label combinations of the embodiments of this specification takes full account of the information of the word segmentation units before and after each punctuation mark, so the label distribution of the training corpus is more balanced. This avoids the label imbalance caused by labeling only the unit in front of a punctuation mark, avoids an excess of invalid labels, and improves the accuracy of punctuation prediction. Moreover, after pre-training, the neural network model with a time sequence can deeply capture contextual information in text, preparing it for the transfer learning training of the punctuation mark model. Semi-supervised learning can thus be realized: a large amount of data does not need to be labeled manually, the transfer learning training makes the whole punctuation mark model converge quickly, the recall rate is improved, and good results can be achieved with a small data set, so a universal and powerful punctuation mark model with good generalization ability and universality can be obtained.
In order to make the embodiments of the present disclosure more clearly understood and implemented by those skilled in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without any creative effort shall fall within the protection scope of the present specification.
Referring to fig. 1, a schematic structural diagram of a training system of a punctuation mark model, in an embodiment of the present invention, a training system 10 of a punctuation mark model may specifically include:
a corpus acquiring unit 11 adapted to acquire a first corpus including punctuation marks and a second corpus including punctuation marks;
the corpus processing unit 12 is adapted to remove punctuation marks from the second corpus, and label corresponding label combinations at the word segmentation units before and after the removed punctuation marks to obtain a third corpus, where the third corpus includes: a punctuation-free text set and a label sequence set, the label sequence set comprising the label combinations;
the pre-training unit 13 is adapted to input the first training corpus into a preset neural network model with a time sequence for pre-training, so as to obtain a pre-trained language sub-model;
and the transfer learning training unit 14 is adapted to input the third training corpus into the initial punctuation mark model for transfer learning training, so as to obtain a trained punctuation mark model.
In a specific implementation, the wider the sources of the first training corpus containing punctuation marks and the more standard its punctuation usage, the more beneficial it is for the subsequent pre-training of the neural network model, and the universality and accuracy of the pre-trained language sub-model can be improved. Therefore, on the one hand, to broaden the sources of the first corpus, the corpus acquiring unit 11 obtains corpora from as many different fields as possible; on the other hand, to make the punctuation usage in the first corpus more standard, the corpus acquiring unit 11 may obtain proofread corpora.
For example, the first corpus may include corpora biased toward written language, such as news corpora, Baidu Baike (encyclopedia) corpora, and thesis corpora, as well as corpora biased toward spoken language, such as chat corpora, microblog corpora, and comment corpora.
In a specific implementation, the pre-training unit 13 pre-trains the preset neural network model with a time sequence using the first training corpus. The resulting pre-trained language sub-model can deeply capture contextual information in text, preparing it for the subsequent transfer learning training of the punctuation mark model.
In a specific implementation, the second corpus containing punctuation marks may be of the same kind as the first corpus described above, and the corpus acquiring unit 11 may obtain corpora, or proofread corpora, from a number of different fields, so as to increase the universality and accuracy of the punctuation mark model. To save acquisition time, the corpus acquiring unit 11 may also obtain the second corpus directly from the first corpus. For example, the second corpus may include corpora biased toward written language, such as news corpora, Baidu Baike (encyclopedia) corpora, and thesis corpora, as well as corpora biased toward spoken language, such as chat corpora, microblog corpora, and comment corpora. The sources of the second corpus can thus be quite extensive.
In a specific implementation, in order to accurately predict punctuation marks of a punctuation-free text in one or more specific fields, the corpus acquiring unit 11 may also acquire a corpus for the specific field or fields, and acquire a second corpus. For example, to be able to predict punctuation of punctuation-free text in the financial domain, the second corpus may include corpus that contains punctuation from the financial domain.
In this embodiment of the present invention, the corpus processing unit 12 may perform punctuation mark removal and label labeling on the second corpus, and associate the obtained punctuation-free text set and the label sequence set in the third corpus, so that when a part or all of the punctuation-free text set is obtained, a corresponding part or all of the label sequences can be obtained.
In the embodiment of the invention, the label sequence set can contain all label combinations for labeling the front and rear word segmentation units of the removed punctuation marks, and the information of the front and rear word segmentation units of the punctuation marks of the second training corpus can be fully considered by labeling the label combinations, so that the label distribution of the third training corpus is more balanced, the problem of label distribution imbalance caused by labeling only the labels at the word segmentation units in front of the removed punctuation marks can be avoided, the condition of excessive labeled invalid labels can be avoided, and the accuracy of punctuation mark prediction of a punctuation mark labeling model can be improved.
It can be understood that the language types corresponding to the first corpus, the second corpus and the third corpus may include, but are not limited to, Chinese, and may be adjusted according to the language of the text to be labeled by the actual punctuation mark labeling model. In a specific implementation, a corpus of the corresponding language type may be selected according to the language of the text to be labeled or processed subsequently.
It can also be understood that the timing for obtaining the first corpus and the second corpus is not limited to the above sequence of steps: the first corpus may be obtained at any time before the pre-training of the preset neural network model with time sequence, and the second corpus may be obtained at any time before the third corpus is input into the initial punctuation mark labeling model for transfer learning training. For example, the first corpus and the second corpus may be obtained simultaneously.
In specific implementation, the punctuation mark labeling model may comprise a pre-trained language submodel, so that the number of third training corpora required by the punctuation mark labeling model is greatly reduced; accordingly the number of second training corpora can also be reduced, saving labor cost and labeling time and shortening the training time of the punctuation mark labeling model. Meanwhile, transfer learning training allows the whole punctuation mark labeling model to converge quickly even with a reduced number of third training corpora, and improves the recall rate.
In a specific implementation, as shown in fig. 1, the corpus processing unit 12 may include:
the identifying subunit 121 is adapted to identify punctuation marks of the second corpus according to a preset punctuation mark set;
a removing subunit 122, adapted to remove punctuation marks in the second corpus to obtain a corresponding punctuation-free text set;
and the labeling subunit 123 is adapted to label, according to the preset punctuation mark position label set, a label combination with a corresponding position relationship at the word segmentation unit before and after each removed punctuation mark, so as to obtain a corresponding label sequence set.
In a specific implementation, in order to achieve automatic acquisition of the third corpus, a punctuation mark set and a punctuation mark position label set may be preset. After the second corpus containing punctuation marks is obtained, all word segmentation units and punctuation marks in the second corpus can be converted into tokens in byte format, which avoids errors caused by word segmentation and also avoids the problems of high learning cost and sparse word vector representation caused by an oversized dictionary.
Specifically, according to the preset punctuation mark set, the identifying subunit 121 may identify the punctuation marks of the second corpus; the removing subunit 122 may then remove the punctuation marks in the second corpus to obtain a corresponding punctuation-free text set; and the labeling subunit 123 may label, according to the preset punctuation mark position label set, a label combination with the corresponding position relationship at the word segmentation units before and after each removed punctuation mark, to obtain a corresponding label sequence set. In this way, the punctuation-free text set and the corresponding label sequence set of the third training corpus are obtained for the subsequent training of the punctuation mark labeling model.
The punctuation mark set may include the commonly used punctuation marks of the language types corresponding to the first corpus, the second corpus, and the third corpus, and the punctuation mark position label set may include the label combinations corresponding to the punctuation mark set. For example, if the language type corresponding to the first corpus, the second corpus, and the third corpus is Chinese, the punctuation mark set may include common punctuation marks such as the comma, period, and caesura sign, and the punctuation mark position label set may include a label combination corresponding to the comma, a label combination corresponding to the period, a label combination corresponding to the caesura sign, and so on. The label combinations appear in pairs and are labeled at the positions of the word segmentation units before and after each removed punctuation mark, so that the label distribution of the third training corpus is more balanced. The punctuation mark set and the punctuation mark position label set may be set according to the specific application scenario, which is not limited in this embodiment of the specification.
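As a concrete illustration of how a punctuation mark set and its position label set can be associated, the sketch below pairs each mark with its start/end label combination. This is a hypothetical Python helper, not the patented implementation; the abbreviated label forms (per, com, cae, ...) follow the ones used in the examples later in this text.

```python
# Hypothetical association between a punctuation mark set P and a position-label
# set Q; the abbreviations mirror those used in the examples of this text.
PUNCT_TO_LABELS = {
    "。": ("B-per", "I-per"),  # period
    "，": ("B-com", "I-com"),  # comma
    "、": ("B-cae", "I-cae"),  # caesura (pause) sign
    "？": ("B-que", "I-que"),  # question mark
    "！": ("B-exc", "I-exc"),  # exclamation mark
    "《": ("B-ope", "I-ope"),  # opening title mark
    "》": ("B-clo", "I-clo"),  # closing title mark
}

def label_combination(mark: str) -> tuple:
    """Return the (start, end) position-label pair for a punctuation mark."""
    return PUNCT_TO_LABELS[mark]
```

The returned pair is meant to be written at the word segmentation units immediately before and after the removed mark, which is what keeps the label distribution balanced.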
In a specific implementation, the training system of the punctuation mark labeling model may further include a corpus cleaning unit (not shown). Before the pre-training of the language submodel and the training of the punctuation mark labeling model, data cleaning may be performed on the obtained first corpus and second corpus, mainly to remove punctuation marks that are obviously used irregularly, for example: a punctuation mark at the beginning of a sentence that cannot start a sentence, punctuation marks that should not appear consecutively within a sentence, and a non-sentence-end punctuation mark at the end of a sentence. Where such irregular usage is found, these punctuation marks may be deleted. This prevents the language submodel and the punctuation mark labeling model from learning irregular punctuation usage patterns.
In a specific implementation, as shown in fig. 1, the labeling subunit 123 may include at least one of the following:
the first labeling module 1231 is adapted to label, according to the set of punctuation mark positions, label combinations with corresponding position relationships at the preceding and following word segmentation units of each removed punctuation mark when there are word segmentation units in both the preceding and following removed punctuation marks, so that the punctuation mark position labels corresponding to the punctuation marks appear in pairs. For example, the second corpus is S: { t1t2t3...tnN is a determined positive integer, which represents the number of tokens (tokens) contained in the set S, and if the text contains m punctuations and m is less than n, the third text corpus is the set R: { t1t2...tn-m}, the firstA labeling module 1231 may begin t after the removed punctuation markiProcessing and marking a label combination to obtain a corresponding label sequence set L of { L }1L2L3...Ln-m}。
The second labeling module 1232 is adapted to, when there is no segmentation unit before or after the removed punctuation mark, add a vacancy-filling symbol before or after the removed punctuation mark to fill the position of the vacant segmentation unit, and label combinations of corresponding positional relationships are labeled at the segmentation unit and the vacancy-filling symbol after or before the punctuation mark according to the punctuation mark position label set.
Since the label combination cannot be labeled in pairs when there is no word segmentation unit before or after the removed punctuation mark, the gap filling symbol can be regarded as a word segmentation unit, so that the label combination can be labeled in pairs.
In the embodiment of the present invention, the second labeling module 1232 may use the null-position-filling symbol in the following cases:
first, punctuation marks are located at the end of a sentence. For example, the second corpus is: { today is wednesday. And when the punctuation mark is removed, the second labeling module 1232 may add a vacancy completion mark EOS for the sentence end after the removed sentence mark to obtain a punctuation-free text set { today is wednesday EOS }, and then, the second labeling module 1232 may label the label combination of the sentence mark at the position of the character "three" before the removed sentence mark and the position of the vacancy completion mark "EOS" after the removed sentence mark.
It should be understood that "{ }" is used only to delimit the content range of the examples and is not an essential part of the corpus content; those skilled in the art can use other unambiguous symbols to delimit the corpus content. The word segmentation units and punctuation marks inside "{ }" may be tokens converted into byte format or into other formats; subsequent uses of "{ }" are the same as described above.
Second, the punctuation mark is located at the beginning of the sentence. For example, the second corpus is: {《Reader》 is a good magazine.}. When the left title mark is removed, the second labeling module 1232 may add a sentence-beginning gap-filling symbol BOS before the removed left title mark, obtaining the punctuation-free text set {BOS Reader is a good magazine EOS}. Then, the second labeling module 1232 may label the label combination of the left title mark at the position of the gap-filling symbol "BOS" before the removed left title mark and at the position of the character "Read" after the left title mark.
It is understood that the foregoing embodiments are merely illustrative, and not restrictive on the specific implementation, and in the specific implementation process, the punctuation-free text set and the tag sequence set obtained by the second corpus are often massive.
By adopting the above scheme, label combinations with corresponding position relationships are labeled at the word segmentation units before and after each punctuation mark according to the punctuation mark position label set, so that the punctuation mark position labels corresponding to each punctuation mark appear in pairs. When there is no word segmentation unit before or after a removed punctuation mark, a corresponding gap-filling symbol is added to fill the position of the missing word segmentation unit, which avoids the problem that a punctuation mark cannot be labeled for lack of a word segmentation unit and ensures that the punctuation mark position labels corresponding to each punctuation mark always appear in pairs.
In a specific implementation, the set of punctuation mark position labels may comprise: a start position label for the word segmentation unit before each punctuation mark, and an end position label for the word segmentation unit after each punctuation mark. The set of punctuation mark position labels can correspond to the set of punctuation marks.
In an embodiment of the present invention, a punctuation mark set P and a punctuation mark position label set Q are preset, where P = {。？！、，；：《》“”} and Q = {B-period, I-period, B-question mark, I-question mark, B-exclamation mark, I-exclamation mark, B-caesura sign, I-caesura sign, B-comma, I-comma, B-semicolon, I-semicolon, B-colon, I-colon, B-opening title mark, I-opening title mark, B-closing title mark, I-closing title mark, B-upper quotation mark, I-upper quotation mark, B-lower quotation mark, I-lower quotation mark}. The punctuation mark set P and the punctuation mark position label set Q are associated, and the corresponding position relationship between each punctuation mark in the set P and the word segmentation units before and after it is represented by the punctuation mark position label set Q. When the punctuation marks in the second training corpus are identified through the punctuation mark set P, the punctuation mark position labels corresponding to the identified punctuation marks can be obtained through the association between the punctuation mark set P and the punctuation mark position label set Q, and then the corresponding label combinations can be labeled at the word segmentation units before and after each identified punctuation mark once it is removed.
It should be understood that, in this embodiment, the punctuation mark position labels are represented as "B-full English name of the punctuation mark" and "I-full English name of the punctuation mark", but other representations may be adopted. For example, considering that the full English name of a punctuation mark is long to write, it may be replaced by an abbreviation, or even directly by the punctuation mark itself, which is not limited in this embodiment of the present invention.
The correspondence among the punctuation marks, the punctuation mark set, and the punctuation mark position label set is shown in table 1.

TABLE 1

Punctuation mark: start position label, end position label
。 (period): B-per, I-per
？ (question mark): B-que, I-que
！ (exclamation mark): B-exc, I-exc
、 (caesura sign): B-cae, I-cae
， (comma): B-com, I-com
： (colon): B-col, I-col
《 (opening title mark): B-ope, I-ope
》 (closing title mark): B-clo, I-clo
“ (upper quotation mark): B-upp, I-upp
” (lower quotation mark): B-low, I-low
Continuing with the above example, the second labeling module 1232 may label the start position label "B-per" of the period at the character "three" before the removed period in the punctuation-free text set {Today is Wednesday EOS}, and label the end position label "I-per" of the period at the gap-filling symbol "EOS" after the removed period. In the same way, the second labeling module 1232 may label the start position label "B-ope" of the left title mark at the gap-filling symbol "BOS" before the removed left title mark in the punctuation-free text set {BOS Reader is a good magazine EOS}, and label the end position label "I-ope" of the left title mark at the character "Read" after the left title mark.
In a specific implementation, the set of punctuation mark position labels may further include: a blank position label corresponding to no punctuation mark. As shown in fig. 1, the labeling subunit 123 may further include: a fifth labeling module 1235, adapted to label the blank position label at a word segmentation unit when there is no punctuation mark before or after the word segmentation unit. Through the blank position labels, each word segmentation unit in the text corpus can have a corresponding punctuation mark position label, which further improves the balance of the label distribution and facilitates fine tuning during later punctuation mark labeling model training.
In a specific implementation, as shown in fig. 1, the labeling subunit 123 may further include at least one of the following:
a third labeling module 1233, adapted to label, when a plurality of consecutive removed punctuations are the same punctuation and there is only one word segmentation unit between the punctuations, only corresponding start position labels at word segmentation units before other punctuations except for the last punctuation;
a fourth labeling module 1234, adapted to, when there is no segmentation unit between two consecutive removed punctuations, add an interval symbol after each segmentation unit, label a start position label corresponding to a previous punctuation symbol at a segmentation unit before the removed previous punctuation symbol, and label a start position label and an end position label corresponding to a next punctuation symbol at an interval symbol before the removed next punctuation symbol and a word segmentation unit after the removed next punctuation symbol, respectively. The inventor finds in research that due to the rules of language usage, several special cases may arise:
first, a plurality of punctuations that have been removed in succession are the same punctuation and there is only one word segmentation unit between the punctuations. For example, according to the method for labeling a Chinese punctuation mark, although characters are present before and after a pause sign, when only one character is present between pause signs, a problem of overlapping occurs in a label combination added with pause signs. For example, the second corpus is: { cattle, sheep, horse all eat grass. And if there is only one character among a plurality of consecutive pause numbers, the third labeling module 1233 may label the start position label "B-cae" of the pause number at the characters "cow" and "sheep", and label the end position label "I-cae" of the pause number at the character "horse".
In an embodiment of the present invention, as shown in fig. 2, a schematic diagram of obtaining a third corpus according to the present invention is shown, and in an embodiment of the present invention, the obtained second corpus 21 is:
{ day, month, day three parties for a big battle, ready to begin! Is it impossible for a member of the group to win? }
Then, according to the preset punctuation mark set P, the identifying subunit 121 may identify the punctuation marks in the second corpus, and the removing subunit 122 may remove them. Because there is no text after the question mark at the sentence end, the second labeling module 1232 may add a sentence-end gap-filling symbol EOS after the question mark, obtaining the punctuation-free text set 221 {the three-party large-decision in the day and month is ready to begin a chance that people cannot win the match EOS}. In addition, the second labeling module 1232 may label, according to the preset punctuation mark position label set Q, a label combination with the corresponding position relationship at the characters before and after each removed punctuation mark.
Since consecutive caesura signs appear between "day", "month", and "day", with only one character between them, the third labeling module 1233 may label "B-cae" at "day" and "month" and label "I-cae" at the last "day". The first labeling module 1231 may label "B-com" and "I-com" respectively at the characters "decide" and "quasi" before and after the first comma, label "B-exc" and "I-exc" respectively at the characters "start" and "zan" before and after the exclamation mark, and label "B-com" and "I-com" respectively at the characters "start" and "pair" before and after the second comma. The second labeling module 1232 may label "B-que" and "I-que" respectively at the character "Do" before the question mark and at the gap-filling symbol "EOS" after the question mark. The fifth labeling module 1235 may label the blank position label O at the characters with no punctuation mark before or after them, finally obtaining the corresponding label sequence set 222 {B-cae B-cae I-cae O B-com I-com O B-exc I-exc O B-com I-com B-que I-que}. The obtained punctuation-free text set 221 is associated with the label sequence set 222, thereby obtaining the third corpus 22.
Second, the removed punctuation marks are two consecutive punctuation marks with no word segmentation unit between them. Under the above labeling method, the label combinations of the two punctuation marks would again overlap. During research the inventor observed that this situation follows a general rule arising from punctuation usage. Therefore, an interval symbol SEP can be added after each word segmentation unit: a start position label corresponding to the former punctuation mark is labeled at the word segmentation unit before the removed former punctuation mark, and a start position label and an end position label corresponding to the latter punctuation mark are labeled respectively at the interval symbol before the removed latter punctuation mark and at the word segmentation unit after it. The interval symbol can be regarded as a word segmentation unit, so that label combinations can still be labeled in pairs; this labeling method is effective for two consecutive punctuation marks.
For example, the second corpus is: {我喜欢看《读者》。} (I like to read "Reader".), where the two consecutive symbols "》。" appear in the text. The fourth labeling module 1234 may add an interval symbol SEP after each character, and, since there is a period at the end of the sentence, the second labeling module 1232 also needs to add a sentence-end gap-filling symbol EOS to fill the missing character, resulting in the punctuation-free text set {我 SEP1 喜 SEP2 欢 SEP3 看 SEP4 读 SEP5 者 SEP6 EOS}. Then, the fourth labeling module 1234 may label the start position label "B-clo" of the right title mark at "者", label the start position label "B-per" of the period at "SEP6", and label the end position label "I-per" of the period at "EOS". For the word segmentation units with no punctuation mark before or after, the fifth labeling module 1235 may label the blank position label O at those units; and for the interval symbols with no punctuation mark before or after, the interval label S is labeled at the interval symbols.
It is to be understood that the "SEP" followed by the number sequence is used for descriptive purposes only for purposes of distinction, and that the "SEP" followed by the number sequence may not be used in a particular implementation.
In an embodiment of the present invention, as shown in fig. 3, a schematic diagram of obtaining the third corpus according to the present invention is shown, where the obtained second corpus 31 is:
{领导说：“实现盈利是首要目标”。} (The leader says: "Achieving profitability is the primary goal.")

Then, according to the preset punctuation mark set P, the identifying subunit 121 may identify the punctuation marks in the second corpus, and the removing subunit 122 may remove them. Here the colon followed by the upper quotation mark, and the lower quotation mark followed by the period, are both cases where two punctuation marks appear consecutively, so the fourth labeling module may add an interval symbol SEP after each character; and because there is no character after the sentence-end period, the second labeling module 1232 may add a sentence-end gap-filling symbol EOS after the period. In addition, to facilitate the subsequent training of the punctuation mark labeling model, the second labeling module 1232 may further add a sentence-beginning gap-filling symbol BOS, so that the symbols in the punctuation-free text set are complete, obtaining the punctuation-free text set 321 {BOS 领 SEP1 导 SEP2 说 SEP3 实 SEP4 现 SEP5 盈 SEP6 利 SEP7 是 SEP8 首 SEP9 要 SEP10 目 SEP11 标 SEP12 EOS}.

The fourth labeling module 1234 may label the start position label "B-col" of the colon at the character "说", label the start position label "B-upp" of the upper quotation mark at the interval symbol "SEP3", and label the end position label "I-upp" of the upper quotation mark at the character "实". The period and the lower quotation mark can be handled similarly, and the interval label S is labeled at the interval symbols with no punctuation mark before or after, finally obtaining the corresponding label sequence set 322 {O O S O S B-col B-upp I-upp S O S O S O S O S O S O S O S B-low B-per I-per}. The obtained punctuation-free text set 321 and the label sequence set 322 are associated, thereby obtaining the third corpus 32.
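The interval-symbol scheme for consecutive punctuation marks can be sketched as follows. This is a hypothetical Python helper under the assumptions of this text, not the patented implementation; for brevity it assumes the text does not begin with a punctuation mark (sentence-beginning BOS padding is omitted).

```python
# Sketch of the interval-symbol scheme: a SEP follows every character; the
# first of two adjacent marks keeps only its start label on the preceding
# character, while the second puts its start label on the intervening SEP and
# its end label on the following unit (EOS when the sentence ends there).
PUNCTS = {"。": "per", "：": "col", "“": "upp", "”": "low",
          "《": "ope", "》": "clo"}

def label_with_seps(text):
    units, labels = [], []
    pending = None      # end label owed to the next unit
    run = 0             # marks seen since the last character
    for ch in text:
        if ch in PUNCTS:
            abbr = PUNCTS[ch]
            target = -1 if run else -2  # the SEP between marks, else the char
            labels[target] = "B-" + abbr
            pending = "I-" + abbr
            run += 1
        else:
            units += [ch, "SEP"]
            labels += [pending or "O", "S"]
            pending = None
            run = 0
    if pending:                         # sentence ends with a mark: pad EOS
        units.append("EOS")
        labels.append(pending)
    return units, labels
```

On {我喜欢看《读者》。} this yields "B-clo" at 者, "B-per" at the SEP between 》 and 。, and "I-per" at EOS, matching the example above.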
By adopting this scheme, the label labeling mode can be optimized, so that punctuation marks which are separated from word segmentation units by short intervals and which tend to appear consecutively can be predicted more accurately during model training.
It is understood that the foregoing embodiments are merely illustrative, and not restrictive on the specific implementation, and in the specific implementation process, the punctuation-free text set and the tag sequence set obtained by the second corpus are often massive.
In order to make those skilled in the art better understand and implement the embodiments of the present invention, a specific example of obtaining a trained punctuation mark labeling model through transfer learning is given below.
In an embodiment of the present invention, the punctuation mark model may include a pre-trained language submodel and a pre-set conditional random field submodel, and the pre-trained language submodel may be connected to the pre-set conditional random field model. As shown in fig. 1, the transfer learning training unit 14 may include:
an extracting subunit 141, adapted to extract a section of punctuation-free text in the punctuation-free text set obtained by the corpus processing unit 12, and a corresponding first tag sequence of the punctuation-free text in the tag sequence set;
a migration training subunit 142, adapted to input the punctuation mark model with the punctuation-free text and the first tag sequence, obtain, by the pre-trained language sub-model, sentence characteristic information corresponding to the punctuation-free text, input the sentence characteristic information into the conditional random field sub-model, and predict a second tag sequence corresponding to the punctuation-free text;
an error calculation subunit 143 adapted to determine an error between the second tag sequence and the first tag sequence;
a fine tuning subunit 144, configured to, when it is determined that the error is greater than a preset error threshold, perform fine tuning on parameters of the punctuation mark model until the error between the second tag sequence and the first tag sequence is not greater than the error threshold, so as to obtain a trained punctuation mark model.
In an implementation, the extracting subunit 141 may extract the third corpus obtained by the corpus processing unit 12 in batches, so that the migration training subunit 142 may perform iterative training on each batch; each batch may include a piece of punctuation-free text, i.e., a sentence list, where the size of the list is determined by actual conditions. Depending on the labeling method adopted when removing punctuation marks and labeling label combinations from the second corpus, the punctuation-free text obtained by the corpus processing unit 12 may include at least one of the sentence-end gap-filling symbol EOS, the sentence-beginning gap-filling symbol BOS, and the interval symbol SEP.
In a specific implementation, since the set of punctuation-free texts is associated with the set of tag sequences, a corresponding tag sequence can be obtained as a first tag sequence according to a segment of punctuation-free texts contained in each batch.
In a specific implementation, the migration training subunit 142 inputs the punctuation-free text and the first tag sequence into the punctuation mark labeling model. The model converts each word segmentation unit in the punctuation-free text into a byte-format token, denoted x = (x1, x2, ..., xn), where xi is an index value in the dictionary. Word vector embedding (word embedding) is then performed on each word segmentation unit according to its dictionary index value, one word segmentation unit corresponding to one word vector, denoted e = (e1, e2, ..., en), where ei is a dense vector of dimension k. The dictionary is adjusted according to the language type of the word segmentation units; the word vector matrix is obtained by random initialization or from pre-trained word vectors and exists in the whole neural network as learnable parameters, and the dimension k of the word vector ei is set according to the specific situation.

Then, the vectorized punctuation-free text passes through the pre-trained language submodel of the punctuation mark labeling model, which converts the static word vector representation e = (e1, e2, ..., en) into a dynamic sentence representation, recorded as l = (l1, l2, ..., ln), where li is a dense vector of dimension l whose size is set according to the specific situation, thereby deeply strengthening the representation of the sentence.
The sentence representation l = (l1, l2, ..., ln) is input into the conditional random field (CRF) submodel of the punctuation mark labeling model. The CRF submodel is preset with a state transition matrix [A]i,j and a score matrix [fθ(l)]t,y, where [A]i,j represents the transition probability of a state transition from the i-th state to the j-th state over two time steps, and [fθ(l)]t,[y]t represents the score that the output at the t-th position is [y]t after l is input, θ containing all the parameters of the entire pre-trained language submodel that require transfer learning. At this time, a feature function preset in the conditional random field submodel may be used:

s(l, y) = Σ_{t=1..n} ( [A]_{[y](t-1),[y]t} + [fθ(l)]_{t,[y]t} )

which represents, for the input l = (l1, l2, ..., ln), the score that the predicted output tag sequence is y = ([y]1, [y]2, ..., [y]n). This feature function s(l, y) can avoid the occurrence of tag sequences that do not conform to the labeling rules. When the score s(l, y) is maximal, the second tag sequence ŷ is obtained.

In an embodiment of the present invention, the Viterbi method may be used to calculate the path with the largest score s(l, y), i.e., ŷ = argmax_y s(l, y).
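For illustration, a minimal Viterbi decoder over CRF scores can be sketched as below. This is a generic textbook implementation, not the patent's own code; `emissions` stands in for the score matrix [fθ(l)] and `transitions` for [A].

```python
def viterbi(emissions, transitions):
    """Return the tag path maximising s(l, y): emissions[t][y] plays the role
    of [f_theta(l)]_{t,y}, transitions[i][j] that of [A]_{i,j}."""
    n, k = len(emissions), len(emissions[0])
    score = list(emissions[0])                 # best score ending in each tag
    back = [[0] * k for _ in range(n)]         # backpointers
    for t in range(1, n):
        new = []
        for j in range(k):
            best_i = max(range(k), key=lambda i: score[i] + transitions[i][j])
            back[t][j] = best_i
            new.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new
    tag = max(range(k), key=lambda j: score[j])
    path = [tag]
    for t in range(n - 1, 0, -1):              # follow backpointers
        path.append(back[t][path[-1]])
    return path[::-1]
```

With k tags and n positions this runs in O(n·k²) time, instead of enumerating all kⁿ candidate tag sequences.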
In a specific implementation, assume that the first tag sequence is y*. The error calculation subunit 143 can compare ŷ and y*, and an error between the two can be obtained.
In an implementation, after iterative training on the third corpus, the error calculating subunit 143 may obtain all the second tag sequences ŷ and the corresponding first tag sequences y*. When the error between at least one pair of ŷ and y* is greater than the preset error threshold, the fine tuning subunit 144 may fine-tune the parameters of the punctuation mark labeling model, and the migration training subunit 142 then performs the transfer learning training of the punctuation mark labeling model again, until the error between ŷ and y* is not larger than the preset error threshold, thereby obtaining the punctuation mark labeling model trained by transfer learning.
In an embodiment of the present invention, the fine tuning subunit 144 may first obtain, after l is input into the punctuation mark labeling model, the scores s(l, y) corresponding to all possible tag sequences y as well as the score s(l, y*) corresponding to the first tag sequence y*. Secondly, the score of the first tag sequence is normalized to obtain the probability of the first tag sequence:

P(y* | l) = exp(s(l, y*)) / Σ_y exp(s(l, y))

Then, the negative logarithm of the probability of the first tag sequence is taken to obtain the loss function:

loss = -log P(y* | l) = log Σ_y exp(s(l, y)) - s(l, y*)
Since a smaller value of the loss function represents higher accuracy, the parameters of the punctuation mark labeling model corresponding to the minimum value of the loss function need to be obtained. The fine tuning subunit 144 may fine-tune the parameters of the punctuation mark labeling model by a gradient descent method or a back propagation method, the range including the parameters of the pre-trained language submodel and the parameters of the conditional random field submodel, so as to reduce the value of the loss function. By adopting a gradient descent or back propagation method, the punctuation mark labeling model can converge quickly, shortening the training time and improving the accuracy of the punctuation mark labeling model.
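The normalization and loss above can be illustrated with a brute-force sketch. This is a hypothetical helper for small examples only; a real CRF computes the partition sum with the forward algorithm rather than by enumerating all tag sequences.

```python
import itertools
import math

def path_score(emissions, transitions, tags):
    """s(l, y): sum of emission scores plus transition scores along a tag path."""
    s = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        s += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return s

def crf_nll(emissions, transitions, gold):
    """-log P(y* | l) = log sum_y exp(s(l, y)) - s(l, y*), by enumeration."""
    n, k = len(emissions), len(emissions[0])
    log_z = math.log(sum(math.exp(path_score(emissions, transitions, y))
                         for y in itertools.product(range(k), repeat=n)))
    return log_z - path_score(emissions, transitions, gold)
```

Gradient descent on this quantity with respect to θ and [A] is the fine-tuning step described above; the loss is always non-negative and shrinks as the gold sequence's score grows relative to all alternatives.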
In a specific implementation, the punctuation mark labeling model may further comprise a preset bidirectional long short-term memory (BiLSTM) submodel, located between the pre-trained language submodel and the conditional random field submodel.

After the sentence characteristic information is generated by the pre-trained language submodel, it is input into the BiLSTM submodel to further enhance its representation capability. After sentence characteristic information processed by the two layers of neural networks with time sequence is input into the conditional random field submodel, the final prediction result varies over a smaller range and a better prediction effect is achieved; in addition, the accuracy and generalization capability of the model can be increased when the data volume is large.
In a specific implementation, as shown in fig. 1, the pre-training unit 13 includes at least one of:
the first pre-training subunit 131 is adapted to obtain a random first training text from the first training corpus and input it into the neural network model; the neural network model predicts the next word segmentation unit of the first training text given the preceding text, and training is judged complete when the prediction accuracy reaches a preset first threshold. The pre-trained language sub-model is thus obtained through a Language Model (LM) training mode.
The pre-trained language sub-model can estimate the likelihood that subsequently input punctuation-free text forms a sentence. Assume the subsequently input punctuation-free text is {ω1 ω2 ω3 ... ωn}, where n is the number of word segmentation units in the punctuation-free text. The pre-trained language sub-model expresses the probability that this punctuation-free text is a sentence as:

P(S) = Π_{i=1}^{n} P(ωi | ω1 ... ω(i−1))

where i = 1, 2, ..., n. The greater the probability P(S), the more likely the punctuation-free text is a sentence. This training mode strengthens the pre-trained language sub-model's ability to characterize the features of context sentences in the text corpus.
The word segmentation unit corresponds to the minimum sentence component unit of the language of the input first training corpus: for example, the minimum sentence component unit of Chinese is the character and that of English is the word, so the concrete form of the word segmentation unit is determined by the language of the first training corpus.
In a specific implementation, when the first training text is an empty set, the neural network model predicts the first word: according to the known prior probabilities, it takes the word segmentation unit with the highest first-word probability as the prediction result. For example, if "今" ("now") has the highest first-word probability, "今" is taken as the first word.
When there is already preceding text, for example the character "今" ("now"), the probability P(ω2|ω1) of the next word is predicted from it. When P(ω2|ω1) is maximal, the probability that the two words form a sentence, P(ω1ω2) = P(ω1)P(ω2|ω1), is maximal; assuming "天" ("day") has the greatest probability, "今天" ("today") is obtained.

As another example, given the preceding text "今天天气很" ("today the weather is very"), the probability P(ω6|ω1ω2ω3ω4ω5) of the next word is predicted. When P(ω6|ω1ω2ω3ω4ω5) is maximal, the probability that the six words form a sentence,

P(ω1ω2ω3ω4ω5ω6) = P(ω1)P(ω2|ω1)P(ω3|ω1ω2)P(ω4|ω1ω2ω3)P(ω5|ω1ω2ω3ω4)P(ω6|ω1ω2ω3ω4ω5),

is maximal; assuming "好" ("good") has the greatest probability, "今天天气很好" ("today the weather is very good") is obtained.
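The chain-rule factorization above can be sketched with a toy bigram model. The corpus, counts, and the greedy next-word choice are illustrative assumptions, not the patent's neural network:

```python
from collections import Counter

# Tiny illustrative corpus of pre-segmented sentences.
corpus = [["today", "weather", "is", "very", "good"],
          ["today", "weather", "is", "very", "hot"],
          ["today", "weather", "is", "good"]]

unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((a, b) for s in corpus for a, b in zip(s, s[1:]))
total = sum(unigrams.values())

def p_next(prev, w):
    """Conditional probability P(w | prev) estimated from counts."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_sentence(words):
    """Chain rule: P(w1) * P(w2|w1) * ... * P(wn|w(n-1))."""
    p = unigrams[words[0]] / total
    for prev, w in zip(words, words[1:]):
        p *= p_next(prev, w)
    return p

def greedy_next(prev):
    """Pick the word maximizing P(w | prev), as in the example above."""
    cands = {b: c for (a, b), c in bigrams.items() if a == prev}
    return max(cands, key=cands.get)

p = p_sentence(["today", "weather", "is", "very", "good"])  # = 1/14
```

A neural language model replaces the counted conditionals with learned ones conditioned on the full prefix, but the factorization is the same.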
A second pre-training subunit 132 is adapted to obtain from the first training corpus a second training text in which a preset proportion of the content is randomly masked, and input it into the neural network model; the neural network model predicts the masked portion given its context, and training is judged complete when the prediction accuracy reaches a preset second threshold. The pre-trained language sub-model is thus obtained through a Masked Language Model (MLM) training mode. In a specific implementation, the second pre-training subunit 132 may randomly mask a percentage of the content of the first training corpus, which contains punctuation marks, and then predict the masked content through a normalized exponential (softmax) function. Because the randomly masked content may include both word segmentation units and punctuation marks, a language sub-model pre-trained in the MLM mode characterizes the sentence feature information of the sentence context more strongly.
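A minimal sketch of the random masking step. The 15% ratio and the `[MASK]` token are illustrative assumptions; the patent only specifies "a preset proportion":

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, ratio=0.15, seed=0):
    """Randomly replace a preset proportion of tokens with [MASK];
    returns the masked sequence plus the positions/answers to predict."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * ratio))
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = list(tokens)
    answers = {}
    for i in positions:
        answers[i] = masked[i]          # ground truth for the loss
        masked[i] = MASK
    return masked, answers

# Note the input keeps its punctuation, so marks can also be masked.
masked, answers = mask_tokens(list("今天天气很好，出门走走。"))
```

The model is then trained to recover `answers` from `masked`, with a softmax over the vocabulary at each masked position.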
In a specific implementation, as shown in fig. 1, the pre-training unit 13 further includes the following sub-units:
a third pre-training subunit 133 is adapted to obtain from the first training corpus a third training text in which one sentence is randomly masked, and input it into the neural network model; the neural network model predicts the masked sentence given the surrounding text, and training is judged complete when the prediction accuracy reaches a preset third threshold. The pre-trained language sub-model is thus obtained through a Next Sentence Prediction (NSP) training mode. A language sub-model pre-trained in the NSP mode has a stronger ability to understand the relationship between preceding and following paragraphs.
In an embodiment of the present invention, the NSP mode adopts a classification method. The third pre-training subunit 133 may obtain a text A from the first training corpus, randomly mask one sentence of text A, then obtain a sentence B from the first training corpus and predict whether sentence B is the masked sentence of text A, outputting "1" or "TRUE" if so and "0" or "FALSE" otherwise. Whether the prediction is accurate is judged from the output, the prediction accuracy is counted, and training is judged complete when the accuracy reaches the preset third threshold. For example, within a training period, 50% of the time the obtained sentence B is the actually masked sentence of text A, and the remaining 50% of the time sentence B is a sentence randomly extracted from the first training corpus.
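The 50/50 pair construction described above can be sketched as follows. The sentence names and the seed are illustrative; the real subunit samples from the first training corpus:

```python
import random

def make_nsp_pairs(paragraphs, seed=0):
    """Build (sentence_a, sentence_b, label) examples: half the time
    sentence_b is the real next sentence (label 1, i.e. TRUE), otherwise
    a random sentence from the whole corpus (label 0, i.e. FALSE)."""
    rng = random.Random(seed)
    flat = [s for p in paragraphs for s in p]
    pairs = []
    for p in paragraphs:
        for a, true_next in zip(p, p[1:]):
            if rng.random() < 0.5:
                pairs.append((a, true_next, 1))
            else:
                pairs.append((a, rng.choice(flat), 0))
    return pairs

paragraphs = [["s1", "s2", "s3"], ["t1", "t2"]]
pairs = make_nsp_pairs(paragraphs)
```

The classifier is then trained to output 1/TRUE for the true-next pairs and 0/FALSE for the randomly sampled ones.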
In a specific implementation, any one of the above three pre-training subunits may be used alone to train the pre-trained language sub-model, or the pre-trained language sub-model may be obtained by combining the first pre-training subunit with the third pre-training subunit, or the second pre-training subunit with the third pre-training subunit.
In a specific implementation, the first corpus may be obtained in various manners, for example, the first corpus may be obtained in one or more of the following manners:
1. text data containing punctuation marks is captured in the internet.
2. And capturing language data in the Internet, and carrying out voice recognition and labeling on the language data to obtain corresponding text data.
3. Text data containing punctuation is manually entered.
The second corpus may also be obtained in various manners, for example, the second corpus may be obtained in one or more of the following manners:
1. Text data is captured from the first training corpus.
2. Text data containing punctuation marks in a preset field is captured from the Internet.
3. Text data containing punctuation marks in a preset field is manually entered.
The embodiment of the present invention further provides a punctuation mark labeling model obtained with the above training system. To enable those skilled in the art to better understand and implement the embodiment of the present invention, a detailed description is given below through specific embodiments with reference to the accompanying drawings.
Referring to the structural diagram of the punctuation mark model shown in fig. 4, in the embodiment of the present invention, as shown in fig. 4, the punctuation mark model 40 may include an input layer 41, an encoding layer 42, a conditional random field network layer 44 and an output layer 45, wherein:
and the input layer 41 is suitable for extracting corresponding word segmentation unit characteristic information from the text to be added with punctuations.
In one embodiment, word vector embedding is performed on each word segmentation unit according to a dictionary, one word segmentation unit corresponding to one word vector (i.e., one square of the input layer 41 in fig. 4). The embedded text is denoted E = (e1, e2, ..., en), where each ei is a k-dimensional dense vector. The dictionary is adjusted according to the language of the word segmentation units. The word vector matrix is obtained either by random initialization or by using pre-trained word vectors, exists in the whole neural network as a learnable parameter, and the dimensionality of the word vectors is set according to the specific situation.
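A minimal sketch of the dictionary-based lookup with randomly initialized vectors. The dimension k = 8 and the toy vocabulary are illustrative; in the model the vectors are learned parameters:

```python
import random

def build_embeddings(vocab, k=8, seed=0):
    """Randomly initialized word-vector table: one k-dimensional dense
    vector per dictionary entry (learned during training in practice)."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-0.1, 0.1) for _ in range(k)] for w in vocab}

def embed(tokens, table, k=8):
    """Map each word segmentation unit to its vector; out-of-dictionary
    units fall back to a zero vector here."""
    zero = [0.0] * k
    return [table.get(t, zero) for t in tokens]

table = build_embeddings(["BOS", "EOS", "你", "好", "世", "界"], k=8)
E = embed(["BOS", "你", "好", "世", "界", "EOS"], table)  # 6 x 8 matrix
```

This yields the static representation E = (e1, ..., en) that the encoding layer then converts into a dynamic, context-dependent one.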
The text to be punctuated may include at least one of a vacancy completion symbol EOS at the sentence end, a vacancy completion symbol BOS at the sentence head, and an interval symbol SEP, as specifically determined by the labeling mode of the third training corpus used in training the punctuation mark labeling model.
And the coding layer 42 is adapted to capture sentence characteristic information corresponding to the word segmentation unit characteristic information through a pre-trained language sub-model, wherein the pre-trained language sub-model is a neural network model with a time sequence, and the specific training process of the pre-trained language sub-model can refer to the description about the LM training mode, the MLM training mode and the NSP training mode.
In a specific implementation, the pre-trained language sub-model converts the static word-vector representation of the vectorized punctuation-free text, E = (e1, e2, ..., en), into a dynamic sentence representation, denoted L = (l1, l2, ..., ln), where each li is a dense vector of dimension l and the size of l is set according to the specific situation, thereby deeply strengthening the representation of the sentence. L = (l1, l2, ..., ln) is the input to the Conditional Random Field (CRF) network layer 44.
And the conditional random field network layer 44 is adapted to predict a position to be added with a punctuation mark and a corresponding punctuation mark type in the punctuation-free text based on the sentence characteristic information, and label combinations corresponding to the punctuation mark types are adopted for labeling at word segmentation units before and after the position to be added with the punctuation mark.
In one embodiment, the conditional random field network layer 44 is preset with a state transition matrix [A]i,j and a score matrix [fθ(l)]. [A]i,j represents the transition probability from the i-th state to the j-th state between two time steps, and the score-matrix entry [fθ(l)]t,yt represents the score of outputting the tag yt at the t-th position after L = (l1, l2, ..., ln) is input, where θ contains all the parameters of the whole pre-trained language sub-model that require transfer learning. At this time, a feature function preset in the conditional random field model may be used:

s(l, y; θ, [A]) = Σ_{t=1}^{n} ( [A]_{y(t−1),y(t)} + [fθ(l)]_{t,y(t)} )

which represents the score of the predicted output tag sequence y = (y1, y2, ..., yn) for the input L = (l1, l2, ..., ln). This feature function avoids the occurrence of tag sequences that do not conform to the labeling rules. For convenience in the subsequent formulas it is abbreviated as s(l, y). When the score s(l, y) is maximal, the tag sequence y* is output, where y* contains the information of the positions where punctuation marks are to be added and of the corresponding punctuation mark types.
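The score s(l, y), a sum of transition and emission terms along the tag sequence, can be sketched directly. The toy scores and the two-label set are illustrative:

```python
def crf_score(emissions, transitions, tags):
    """s(l, y) = sum over t of ( A[y(t-1)][y(t)] + f[t][y(t)] );
    the first step has no predecessor, so only its emission counts here."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score

# Illustrative per-position scores f[t][y] and transitions A[p][y].
emissions = [{"O": 1.0, "B": 0.2}, {"O": 0.3, "B": 0.9}]
transitions = {"O": {"O": 0.5, "B": 0.1}, "B": {"O": 0.0, "B": -1.0}}
s = crf_score(emissions, transitions, ["O", "B"])  # 1.0 + 0.1 + 0.9 = 2.0
```

Forbidden transitions (tag sequences that violate the labeling rules) get large negative entries in the transition matrix, which is how the feature function suppresses them.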
And the output layer 45 is suitable for outputting the corresponding punctuation mark marking result.
In a specific implementation, the tag sequence y* is output as the punctuation mark labeling result corresponding to the text to be punctuated.
In the embodiment of the present invention, the path with the largest s(l, y) score can be calculated with the Viterbi algorithm, i.e., y* = argmax_y s(l, y).
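A minimal dynamic-programming sketch of that Viterbi decoding. The toy emission/transition scores and labels are illustrative:

```python
def viterbi(emissions, transitions, labels):
    """y* = argmax_y s(l, y) by dynamic programming, instead of
    enumerating every candidate label sequence."""
    # best[t][y] = best score of any sequence ending in label y at step t
    best = [{y: emissions[0][y] for y in labels}]
    back = []                                   # back-pointers per step
    for t in range(1, len(emissions)):
        best.append({})
        back.append({})
        for y in labels:
            prev = max(labels, key=lambda p: best[t - 1][p] + transitions[p][y])
            back[-1][y] = prev
            best[t][y] = best[t - 1][prev] + transitions[prev][y] + emissions[t][y]
    # trace back from the best final label
    last = max(labels, key=lambda y: best[-1][y])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

emissions = [{"O": 1.0, "B": 0.2}, {"O": 0.3, "B": 0.9}]
transitions = {"O": {"O": 0.5, "B": 0.1}, "B": {"O": 0.0, "B": -1.0}}
path = viterbi(emissions, transitions, ["O", "B"])
```

This runs in O(n·|Q|²) time rather than the O(|Q|ⁿ) of exhaustive search, which is what makes decoding practical for the 15-label tag set used below.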
In a specific implementation, as shown in fig. 4, the punctuation mark labeling model may further include a bidirectional long short-term memory network layer 43 disposed between the encoding layer 42 and the conditional random field network layer 44. Each li in L = (l1, l2, ..., ln) corresponds in turn to two adjacent blocks in the bidirectional long short-term memory network layer 43 of fig. 4; for example, l1 corresponds to blocks 431 and 432. The sentence feature information is strengthened through the bidirectional operation and output to the conditional random field network layer.
In an embodiment of the present invention, assume the dimension of the word vectors is set to 100 and the text to be punctuated is input as { BOS 你 好 世 界 EOS } ("BOS hello world EOS"). The input layer 41 of the punctuation mark labeling model 40 obtains the initialized sentence word-vector representation {1 2 3 4 5 6}, in one-to-one correspondence with the six input symbols and concretely represented as a matrix in R6×100. The encoding layer 42 captures the sentence feature information of the context in the text through a pre-trained language sub-model obtained in the MLM + NSP manner. The conditional random field network layer 44 predicts the tag sequence from the sentence feature information; assuming the preset punctuation mark position tag set Q contains 15 tag types, the conditional random field network layer 44 predicts a matrix in R6×15. The correspondence between the matrix values and the punctuation mark position tag set Q is as follows:
1[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0] O
2[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0] O
3[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0] B-COM
4[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0] I-COM
5[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1] B-EXC
6[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0] I-EXC
thus, the following predicted tag sequence is obtained:
{O O B-COM I-COM B-EXC I-EXC}
Then, according to the punctuation mark labeling result output by the punctuation mark labeling model, the punctuation mark of each predicted punctuation mark type is added at the corresponding to-be-added position of the text to be punctuated, obtaining the punctuated text: { hello, world! }.
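The final decoding step, mapping the predicted tag sequence back to punctuated text, can be sketched as follows. The tag-to-mark table (COM for comma, EXC for exclamation mark) and the four-character Chinese text 你好世界 underlying the "hello world" example are assumptions consistent with the worked example above:

```python
PUNCT = {"COM": ",", "EXC": "!", "PER": "."}  # tag type -> mark (illustrative)

def decode(tokens, tags):
    """Insert a punctuation mark after every token tagged B-X (the
    matching I-X marks the token that follows the inserted mark),
    dropping the BOS/EOS vacancy completion symbols."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tok in ("BOS", "EOS"):
            continue
        out.append(tok)
        if tag.startswith("B-"):
            out.append(PUNCT[tag[2:]])
    return "".join(out)

text = decode(["BOS", "你", "好", "世", "界", "EOS"],
              ["O", "O", "B-COM", "I-COM", "B-EXC", "I-EXC"])
```

Because each mark is labeled on both its neighbors (B-X before, I-X after), the decoder only needs the B-X tags to place marks, while the I-X tags keep the training labels evenly distributed.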
With the punctuation mark labeling model of the embodiment of the present invention, the pre-trained language sub-model fully considers the information of the word segmentation units before and after each punctuation mark, so the conditional random field network layer predicts punctuation mark types more accurately. Labeling with tag combinations makes the distribution of the labeled tags more uniform, avoiding tag-distribution imbalance and an excess of labeled invalid tags, thereby improving the accuracy of punctuation labeling. Moreover, the punctuation mark labeling model obtained by the training system can be used both to add punctuation to text to be punctuated and to correct the punctuation of existing text, and has high accuracy, strong universality and a wide application range.
It is to be understood that the terms "first," "second," "third," and the like are used herein merely to distinguish one item from another and do not denote any order, size, or relative importance.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A training system for punctuation marking models, comprising:
the training corpus acquiring unit is suitable for acquiring a first training corpus containing punctuation marks and a second training corpus containing punctuation marks;
and the corpus processing unit is suitable for removing punctuation marks from the second training corpus and labeling corresponding label combinations at word segmentation units before and after the removed punctuation marks to obtain a third training corpus, wherein the third training corpus comprises: a punctuation-free text set and a label sequence set, the label sequence set comprising the label combinations;
the pre-training unit is suitable for inputting the first training corpus into a preset neural network model with a time sequence for pre-training to obtain a pre-trained language sub-model;
and the transfer learning training unit is suitable for inputting the third training corpus into an initial punctuation mark model for transfer learning training to obtain a trained punctuation mark model.
2. The system for training punctuation marking models according to claim 1, wherein the corpus processing unit comprises:
the identification subunit is suitable for identifying punctuation marks of the second training corpus according to a preset punctuation mark set;
the removing subunit is suitable for removing punctuation marks in the second training corpus to obtain a corresponding punctuation-free text set;
and the labeling subunit is suitable for labeling label combinations with corresponding position relations at the positions of the word segmentation units before and after each removed punctuation mark according to a preset punctuation mark position label set to obtain a corresponding label sequence set.
3. The system for training punctuation marking models according to claim 2, wherein said marking subunit comprises at least one of:
the first labeling module is suitable for labeling label combinations with corresponding position relations at the word segmentation units before and after each removed punctuation mark according to the punctuation mark position label set when the word segmentation units are both before and after the removed punctuation mark;
and the second labeling module is suitable for adding a vacancy filling symbol before or after the removed punctuation mark to fill the position of the vacancy word segmentation unit when a word segmentation unit is not arranged before or after the removed punctuation mark, and labeling label combinations with corresponding position relations at the word segmentation unit position after or before the punctuation mark and the vacancy filling symbol according to the punctuation mark position label set.
4. A system for training a punctuation mark model according to claim 2 or 3, wherein said set of punctuation mark position tags comprises: a start position label arranged at the word segmentation unit before each punctuation mark and an end position label arranged at the word segmentation unit after each punctuation mark;
the labeling subunit further comprises at least one of:
the third labeling module is suitable for labeling corresponding initial position labels at word segmentation units before other punctuations except the last punctuation when a plurality of continuously removed punctuations are the same punctuation and only one word segmentation unit exists between the punctuations;
and the fourth labeling module is suitable for adding an interval symbol after each word segmentation unit when no word segmentation unit exists between two continuous removed punctuations, labeling a starting position label corresponding to the former punctuation at the word segmentation unit before the removed former punctuation, and labeling a starting position label and an ending position label corresponding to the latter punctuation at the interval symbol before the removed latter punctuation and the word segmentation unit after the removed latter punctuation.
5. The system for training a punctuation mark model according to claim 4, wherein the set of punctuation mark position tags further comprises: no blank position label corresponding to the punctuation mark;
the labeling subunit further comprises:
and the fifth marking module is suitable for marking the blank position label at the word segmentation unit when no punctuation mark exists before and after one word segmentation unit.
6. The system for training a punctuation mark model according to claim 1, wherein said pre-training unit comprises at least one of:
the first pre-training subunit is suitable for acquiring a random first training text from the first training corpus and inputting the random first training text into the neural network model, the neural network model predicts the next word segmentation unit of the first training text under the given condition, and when the prediction accuracy probability reaches a preset first threshold value, the pre-training language sub-model is obtained;
and the second pre-training subunit is suitable for acquiring a second training text which randomly covers a part with a preset proportion from the first training corpus and inputting the second training text into the neural network model, the neural network model predicts the covered part with the preset proportion under the condition of a given context, and when the probability of accurate prediction reaches a preset second threshold value, the pre-training language submodel is obtained.
7. The system for training punctuation marking models according to claim 6, wherein said pre-training unit further comprises:
and the third pre-training subunit is suitable for acquiring a third training text for randomly covering a sentence from the first training corpus, inputting the third training text into the neural network model, predicting the covered sentence by the neural network model under the given condition, and obtaining the pre-trained language sub-model when the accurate prediction probability reaches a preset third threshold value.
8. The system for training a punctuation mark model according to claim 1, wherein said punctuation mark model further comprises a pre-set conditional random field model;
the transfer learning training unit includes:
the extraction subunit is suitable for extracting a section of punctuation-free text of the punctuation-free text set obtained by the corpus processing unit and a corresponding first label sequence of the punctuation-free text in the label sequence set;
the transfer training subunit is suitable for inputting the punctuation mark marking model with the punctuation-free text and the first label sequence, acquiring sentence characteristic information corresponding to the punctuation-free text by the pre-trained language submodel, inputting the sentence characteristic information into the conditional random field submodel, and predicting a second label sequence corresponding to the punctuation-free text;
an error calculation subunit adapted to determine an error between the second tag sequence and the first tag sequence;
and the fine tuning subunit is suitable for performing fine tuning on the parameters of the punctuation mark model when the error is determined to be larger than a preset error threshold value until the error between the second label sequence and the first label sequence is not larger than the error threshold value, so as to obtain the trained punctuation mark model.
9. The system for training punctuation marking models according to claim 1, wherein said first corpus comprises at least one of:
text data containing punctuation marks captured in the internet;
text data obtained by performing speech recognition and labeling on speech data captured from the Internet;
manually entered text data containing punctuation marks;
the second corpus comprises at least one of:
text data captured in the first corpus;
text data containing punctuations in a preset field is captured in the Internet;
manually inputting text data containing punctuations in a preset field.
10. A punctuation mark model, comprising:
the input layer is suitable for extracting corresponding word segmentation unit characteristic information from the text to be added with punctuations;
the coding layer is suitable for capturing sentence characteristic information corresponding to the word segmentation unit characteristic information through a pre-trained language sub-model, wherein the pre-trained language sub-model is a neural network model with a time sequence;
the conditional random field network layer is suitable for predicting positions to be added with punctuation marks and corresponding punctuation mark types in the texts to be added with the punctuation marks based on the sentence characteristic information, and labeling by adopting label combinations corresponding to the punctuation mark types at word segmentation units before and after the positions to be added with the punctuation marks;
and the output layer is suitable for outputting the corresponding punctuation mark marking result.
CN201910925842.0A 2019-09-27 2019-09-27 Punctuation mark model and training system thereof Pending CN112580326A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910925842.0A CN112580326A (en) 2019-09-27 2019-09-27 Punctuation mark model and training system thereof


Publications (1)

Publication Number Publication Date
CN112580326A true CN112580326A (en) 2021-03-30

Family

ID=75110102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910925842.0A Pending CN112580326A (en) 2019-09-27 2019-09-27 Punctuation mark model and training system thereof

Country Status (1)

Country Link
CN (1) CN112580326A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449489A * 2021-07-22 2021-09-28 深圳追一科技有限公司 Punctuation mark marking method, punctuation mark marking device, computer equipment and storage medium
CN113449489B * 2021-07-22 2023-08-08 深圳追一科技有限公司 Punctuation mark labeling method, punctuation mark labeling device, computer equipment and storage medium
CN114065709A * 2021-11-26 2022-02-18 深圳视界信息技术有限公司 Punctuation mark adding method and device, electronic equipment and storage medium
CN114049885A * 2022-01-12 2022-02-15 阿里巴巴达摩院(杭州)科技有限公司 Punctuation mark recognition model construction method and punctuation mark recognition model construction device
CN114049885B * 2022-01-12 2022-04-22 阿里巴巴达摩院(杭州)科技有限公司 Punctuation mark recognition model construction method and punctuation mark recognition model construction device
CN115129877A * 2022-07-12 2022-09-30 北京有竹居网络技术有限公司 Method and device for generating punctuation mark prediction model and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination