CN112580326A - Punctuation mark model and training system thereof - Google Patents


Info

Publication number
CN112580326A
Authority
CN
China
Prior art keywords
punctuation
training
corpus
punctuation mark
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910925842.0A
Other languages
Chinese (zh)
Inventor
沈大框
陈培华
陈成才
Current Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Original Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Xiaoi Robot Technology Co Ltd
Priority to CN201910925842.0A
Publication of CN112580326A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods


Abstract

A punctuation mark model and its training system. The system includes: a corpus acquiring unit that acquires a first training corpus containing punctuation marks and a second training corpus containing punctuation marks; a corpus processing unit that removes the punctuation marks from the second training corpus and labels corresponding label combinations at the positions of the word segmentation units before and after each removed punctuation mark to obtain a third training corpus; a pre-training unit that inputs the first training corpus into a preset neural network model with a time sequence for pre-training to obtain a pre-trained language sub-model; and a transfer learning training unit that inputs the third training corpus into an initial punctuation mark model for transfer learning training to obtain a trained punctuation mark model. With this scheme, a large amount of training data does not need to be labeled manually, the recall rate is improved, and the obtained punctuation mark labeling model has good generalization ability and universality.

Description

Punctuation mark model and training system thereof
Technical Field
The embodiment of the invention relates to the technical field of computer natural language processing, in particular to a punctuation mark marking model and a training system thereof.
Background
Existing punctuation recovery schemes usually adopt sequence labeling and are mainly applied to restoring punctuation in text obtained by speech recognition. They can generally only add simple punctuation marks, such as commas and periods. Punctuation added in this way has low accuracy and poor generalization ability on the one hand; on the other hand, the restored punctuation lacks richness, giving readers a poor reading experience.
With the continuous development of deep learning, punctuation marks in text obtained by speech recognition can be predicted by a trained neural network model, which improves accuracy. However, training such a neural network model requires a large amount of manually labeled training data, and the recall rate of models obtained by existing training methods is low. In addition, the punctuation mark models currently adopted have weak generalization ability and poor universality.
Disclosure of Invention
In view of this, an aspect of the embodiments of the present invention provides a training system for a punctuation mark model, which does not need to manually mark a large amount of training data and can improve the recall rate.
In another aspect of the embodiments of the present invention, a punctuation mark model is provided, which has better generalization ability and universality.
The embodiment of the invention provides a training system of a punctuation mark model, which comprises: a training corpus acquiring unit adapted to acquire a first training corpus containing punctuation marks and a second training corpus containing punctuation marks; a corpus processing unit adapted to remove punctuation marks from the second training corpus and label corresponding label combinations at the word segmentation units before and after the removed punctuation marks to obtain a third training corpus, wherein the third training corpus comprises a punctuation-free text set and a label sequence set, the label sequence set comprising the label combinations; a pre-training unit adapted to input the first training corpus into a preset neural network model with a time sequence for pre-training to obtain a pre-trained language sub-model; and a transfer learning training unit adapted to input the third training corpus into an initial punctuation mark model for transfer learning training to obtain a trained punctuation mark model.

Optionally, the corpus processing unit includes: an identification subunit adapted to identify punctuation marks of the second training corpus according to a preset punctuation mark set; a removing subunit adapted to remove the punctuation marks in the second training corpus to obtain a corresponding punctuation-free text set; and a labeling subunit adapted to label, according to a preset punctuation mark position label set, label combinations with corresponding position relations at the word segmentation units before and after each removed punctuation mark to obtain a corresponding label sequence set.
Optionally, the labeling subunit includes at least one of: the first labeling module is suitable for labeling label combinations with corresponding position relations at the word segmentation units before and after each removed punctuation mark according to the punctuation mark position label set when the word segmentation units are both before and after the removed punctuation mark; and the second labeling module is suitable for adding a vacancy filling symbol before or after the removed punctuation mark to fill the position of the vacancy word segmentation unit when a word segmentation unit is not arranged before or after the removed punctuation mark, and labeling label combinations with corresponding position relations at the word segmentation unit position after or before the punctuation mark and the vacancy filling symbol according to the punctuation mark position label set.
Optionally, the set of punctuation marks comprises: a start position label arranged at the word segmentation unit before each punctuation mark and an end position label arranged at the word segmentation unit after each punctuation mark; the labeling subunit further comprises at least one of: the third labeling module is suitable for labeling corresponding initial position labels at word segmentation units before other punctuations except the last punctuation when a plurality of continuously removed punctuations are the same punctuation and only one word segmentation unit exists between the punctuations; and the fourth labeling module is suitable for adding an interval symbol after each word segmentation unit when no word segmentation unit exists between two continuous removed punctuations, labeling a starting position label corresponding to the former punctuation at the word segmentation unit before the removed former punctuation, and labeling a starting position label and an ending position label corresponding to the latter punctuation at the interval symbol before the removed latter punctuation and the word segmentation unit after the removed latter punctuation.
Optionally, the set of punctuation marks further comprises: no blank position label corresponding to the punctuation mark; the labeling subunit further comprises: and the fifth marking module is suitable for marking the blank position label at the word segmentation unit when no punctuation mark exists before and after one word segmentation unit.
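To make the labeling scheme concrete, the following is a minimal sketch of the corpus-processing step: punctuation is stripped from a sentence, a start-position label is attached to the word segmentation unit before each removed mark and an end-position label to the unit after it, a blank label marks units with no adjacent punctuation, and a vacancy filling symbol fills a missing unit at the head or tail. All tag strings ("B-"/"E-" prefixes, "O", "<pad>") and the tiny punctuation set are illustrative assumptions, not fixed by the embodiment, and the interval-symbol handling for consecutive marks is simplified.

```python
# Hypothetical corpus-processing sketch: strip punctuation and emit paired
# start/end position labels, a blank "O" label elsewhere, and a "<pad>"
# vacancy filling symbol when a mark has no neighbouring unit.
PUNCT = {",": "comma", "。": "period"}  # illustrative punctuation mark set
PAD = "<pad>"                           # assumed vacancy filling symbol

def strip_and_label(chars):
    """chars: units (e.g. Chinese characters) interleaved with punctuation.
    Returns (punctuation-free text, label sequence)."""
    text, labels = [], []
    pending = None  # end-position label waiting for the next real unit

    def emit(unit, tag):
        text.append(unit)
        labels.append(tag)

    for ch in chars:
        if ch in PUNCT:
            name = PUNCT[ch]
            if not text or pending is not None:
                # no real unit directly before this mark: fill with a pad
                emit(PAD, pending or "O")
                pending = None
            # start-position label attaches to the unit before the mark
            if labels[-1] == "O":
                labels[-1] = "B-" + name
            else:
                labels[-1] += "|B-" + name  # combine with an earlier label
            pending = "E-" + name
        else:
            emit(ch, pending or "O")  # end label of the previous mark, or blank
            pending = None
    if pending is not None:
        emit(PAD, pending)  # mark at end of text: pad the tail
    return "".join(text), labels
```

In this sketch the label combination ("B-comma" on the preceding unit, "E-comma" on the following unit) appears in pairs, matching the paired-label property the scheme relies on.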
Optionally, the pre-training unit comprises at least one of: a first pre-training subunit adapted to acquire a random first training text from the first training corpus and input it into the neural network model, where the neural network model predicts the next word segmentation unit of the first training text given the preceding context, and the pre-trained language sub-model is obtained when the probability of accurate prediction reaches a preset first threshold; and a second pre-training subunit adapted to acquire from the first training corpus a second training text in which a preset proportion of the text is randomly covered and input it into the neural network model, where the neural network model predicts the covered portion given the surrounding context, and the pre-trained language sub-model is obtained when the probability of accurate prediction reaches a preset second threshold.
Optionally, the pre-training unit further comprises: a third pre-training subunit adapted to acquire from the first training corpus a third training text in which a sentence is randomly covered, input it into the neural network model, have the neural network model predict the covered sentence given the preceding context, and obtain the pre-trained language sub-model when the probability of accurate prediction reaches a preset third threshold.
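The second pre-training objective above, predicting a randomly covered preset-proportion portion from its context, amounts to generating masked training examples. A minimal data-preparation sketch, assuming a 15% mask ratio and a "[MASK]" placeholder token; both values and the function name are assumptions, not fixed by the embodiment:

```python
import random

def make_masked_example(tokens, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Randomly cover a preset proportion of a training text; the model is
    then pre-trained to predict the covered units from the remaining context.
    Returns (masked token list, {position: original token})."""
    rng = random.Random(seed)  # deterministic for illustration
    n = max(1, int(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]   # remember the covered unit as the target
        masked[p] = mask_token   # cover it in the model input
    return masked, targets
```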
Optionally, the punctuation mark model further comprises a preset conditional random field sub-model, and the transfer learning training unit includes: an extraction subunit adapted to extract a section of punctuation-free text from the punctuation-free text set obtained by the corpus processing unit, together with the corresponding first label sequence of that text in the label sequence set; a transfer training subunit adapted to input the punctuation-free text and the first label sequence into the punctuation mark marking model, where the pre-trained language sub-model acquires sentence characteristic information corresponding to the punctuation-free text, and the sentence characteristic information is input into the conditional random field sub-model to predict a second label sequence corresponding to the punctuation-free text; an error calculation subunit adapted to determine the error between the second label sequence and the first label sequence; and a fine-tuning subunit adapted to fine-tune the parameters of the punctuation mark marking model when the error is larger than a preset error threshold, until the error between the second label sequence and the first label sequence is no larger than the error threshold, thereby obtaining the trained punctuation mark marking model.
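The error-calculation and fine-tuning loop described above can be sketched independently of any particular network. Here `model_step` stands in for one fine-tuning update of the punctuation mark marking model and is a hypothetical callable; the per-position disagreement measure and the stopping rule are illustrative assumptions:

```python
def sequence_error(pred, gold):
    """Fraction of positions where the predicted (second) label sequence
    disagrees with the reference (first) label sequence."""
    assert len(pred) == len(gold)
    wrong = sum(p != g for p, g in zip(pred, gold))
    return wrong / len(gold)

def fine_tune(model_step, batches, err_threshold=0.05, max_epochs=100):
    """Keep fine-tuning until the error on every batch falls to or below the
    threshold. `model_step(text, gold)` is assumed to update the model's
    parameters and return its current predicted label sequence for `text`."""
    for _ in range(max_epochs):
        worst = 0.0
        for text, gold in batches:
            pred = model_step(text, gold)
            worst = max(worst, sequence_error(pred, gold))
        if worst <= err_threshold:
            return True   # converged: trained model obtained
    return False          # did not converge within the epoch budget
```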
Optionally, the first corpus comprises at least one of: text data containing punctuation marks captured from the Internet; text data obtained by speech recognition of material captured from the Internet and then labeled with punctuation; and manually entered text data containing punctuation marks. The second corpus comprises at least one of: text data taken from the first corpus; text data containing punctuation in a preset field captured from the Internet; and manually entered text data containing punctuation in a preset field.
The embodiment of the invention also provides a punctuation mark marking model, which comprises the following steps: the input layer is suitable for extracting corresponding word segmentation unit characteristic information from the text to be added with punctuations; the coding layer is suitable for capturing sentence characteristic information corresponding to the word segmentation unit characteristic information through a pre-trained language sub-model, wherein the pre-trained language sub-model is a neural network model with a time sequence; the conditional random field network layer is suitable for predicting positions to be added with punctuation marks and corresponding punctuation mark types in the texts to be added with the punctuation marks based on the sentence characteristic information, and labeling by adopting label combinations corresponding to the punctuation mark types at word segmentation units before and after the positions to be added with the punctuation marks; and the output layer is suitable for outputting the corresponding punctuation mark marking result.
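At inference time, the label sequence predicted by the conditional random field network layer must be turned back into punctuated text. A minimal sketch, assuming the end-position tag names used in the earlier examples ("E-comma", "E-period") and a "<pad>" vacancy symbol, none of which are fixed by the embodiment:

```python
END2PUNCT = {"E-comma": ",", "E-period": "。"}  # assumed tag-to-mark mapping

def restore_punctuation(units, labels):
    """Given a punctuation-free unit sequence and its predicted label
    sequence, insert a mark before each unit that carries an end-position
    label; pad symbols only carry labels and are dropped from the output."""
    out = []
    for u, lab in zip(units, labels):
        for part in lab.split("|"):      # a unit may carry a combined label
            if part in END2PUNCT:
                out.append(END2PUNCT[part])
        if u != "<pad>":
            out.append(u)
    return "".join(out)
```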
By adopting the training scheme of the punctuation mark model of the embodiment of the invention, the acquired first training corpus containing punctuation marks is input into a preset neural network model with a time sequence for pre-training to obtain a pre-trained language sub-model. Meanwhile, the punctuation marks of the acquired second training corpus are removed, and corresponding label combinations are labeled at the word segmentation units before and after each removed punctuation mark to obtain a third training corpus. The third training corpus is then input into an initial punctuation mark model for transfer learning training, yielding a trained punctuation mark model.

Labeling with label combinations takes full account of the information of the word segmentation units both before and after each punctuation mark, so the label distribution of the training corpus is more balanced. This avoids the label imbalance caused by labeling only the unit in front of a punctuation mark, avoids an excess of invalid labels, and improves the accuracy of punctuation prediction. Moreover, after pre-training, the neural network model with a time sequence can deeply capture contextual information in text, preparing it for the transfer learning training of the punctuation mark model. Semi-supervised learning is thus realized: a large amount of data does not need to be labeled manually, the transfer learning training makes the whole punctuation mark model converge quickly, the recall rate is improved, and good results can be achieved with a small data set, so a universal and powerful punctuation mark model with good generalization ability and universality can be obtained.
Further, when there are word segmentation units both before and after a removed punctuation mark, label combinations with the corresponding position relations can be labeled at the units before and after the removed mark according to the punctuation mark position label set. When there is no word segmentation unit before or after a removed punctuation mark, a vacancy filling symbol can be added before or after the removed mark to fill the position of the missing unit; the filled vacancy filling symbol is treated as a word segmentation unit when the label combination is applied, and the label combination with the corresponding position relation is then labeled at the unit and the vacancy filling symbol after or before the punctuation mark according to the punctuation mark position label set. With this scheme, the label combinations are labeled at the word segmentation units before and after each punctuation mark according to the punctuation mark position label set, so the position labels corresponding to a punctuation mark appear in pairs. When there is no word segmentation unit before or after a removed mark, adding the vacancy filling symbol to fill the missing position avoids the problem of a punctuation mark appearing at the head or tail of the text with no unit to label, so the label combination corresponding to a punctuation mark always appears in pairs.
Further, the punctuation mark position label set comprises a start position label placed at the word segmentation unit before each punctuation mark and an end position label placed at the word segmentation unit after each punctuation mark. When several consecutively removed punctuation marks are the same mark and only one word segmentation unit lies between them, only the corresponding start position labels are labeled at the units before each mark except the last one. When there is no word segmentation unit between two consecutive removed marks, an interval symbol may be added after each word segmentation unit; the start position label corresponding to the former mark is labeled at the unit before the removed former mark, and the start and end position labels corresponding to the latter mark are labeled at the interval symbol before the removed latter mark and at the unit after it, respectively. This scheme optimizes the labeling mode, so that punctuation marks separated by few word segmentation units, or marks that tend to appear consecutively, can be predicted more accurately during model training.
Further, the punctuation mark position label set may also include a blank position label indicating that no punctuation mark is present; when no punctuation mark exists before or after a word segmentation unit, the blank position label is marked at that unit according to the punctuation mark position label set. Through the blank position label, each word segmentation unit in the third training corpus can carry a corresponding punctuation mark position label, which further improves the balance of the label distribution and facilitates fine-tuning during later model training.
Furthermore, by inputting the first training corpus into the preset neural network model with a time sequence and predicting the next word segmentation unit of the first training text given the preceding context, or predicting a covered preset-proportion portion given the surrounding context, the pre-trained language sub-model's ability to represent the features of adjacent sentences in the text corpus can be enhanced. By inputting the first training corpus into the preset neural network model with a time sequence and predicting a covered sentence given the preceding text, the pre-trained language sub-model's ability to understand the relationship between preceding and following paragraphs can be further enhanced.

By adopting the punctuation mark model of the embodiment of the invention, the sentence characteristic information corresponding to the word segmentation unit characteristic information can be captured by the pre-trained language sub-model of the coding layer; the conditional random field network layer then predicts the positions where punctuation marks should be added in the text and the corresponding punctuation mark types, and labels the word segmentation units before and after each such position with the label combination corresponding to the punctuation mark type.
As can be seen from the above, the punctuation mark model comprises a pre-trained language sub-model, which can fully consider the information of the word segmentation units before and after each punctuation mark, so that the conditional random field network layer can predict punctuation mark types more accurately. Labeling with label combinations also makes the labels more uniformly distributed, which avoids label imbalance and an excess of invalid labels, improving the accuracy of punctuation labeling. Moreover, the punctuation mark model obtained by the training system can be used both to add punctuation to text to be punctuated and to correct the punctuation of existing text; it has high accuracy, strong universality, and a wide application range.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of this specification, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of this specification; a person skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 is a schematic structural diagram of a training system for a punctuation mark labeling model according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of obtaining a third corpus according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of another example of obtaining a third corpus according to an embodiment of the present invention.
FIG. 4 is a schematic structural diagram of a punctuation mark model according to an embodiment of the present invention.
Detailed Description
As described above, although the accuracy of punctuation mark addition is greatly improved when a neural network is used, this approach requires a large amount of training data to be prepared in advance for model training. The training data is usually a punctuation-free corpus generated by Automatic Speech Recognition (ASR), which must be labeled manually before training, a time-consuming and labor-intensive process; and when the trained neural network model is then used to predict punctuation for punctuation-free text obtained by speech recognition, the recall rate of the prediction results is often low. In addition, the training data of current neural network models considers only the information of the word before each punctuation mark, so the label distribution of the training data is unbalanced, and neural network models trained in this way have poor generalization ability and are not universal.
In view of the above problems, in the embodiments of the present specification, an obtained first corpus containing punctuation marks is input into a preset neural network model with a time sequence for pre-training, so as to obtain a pre-trained language sub-model, and meanwhile, a third corpus can be obtained by removing the punctuation marks of an obtained second corpus containing punctuation marks and labeling corresponding label combinations at the word segmentation units before and after the removed punctuation marks; and inputting the third training corpus into an initial punctuation mark model for transfer learning training, so as to obtain a trained punctuation mark model.
The word segmentation unit is the minimum sentence component unit of the language of the second corpus: for Chinese the minimum sentence component unit is the character, while for English it is the word. The concrete form of the word segmentation unit is therefore determined by the language of the second corpus.
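A minimal illustration of language-dependent word segmentation units; the language codes and the whitespace handling are assumptions for the sketch:

```python
def segment(text, language):
    """Split text into word segmentation units: a character for Chinese,
    a whitespace-delimited word for English."""
    if language == "zh":
        return [ch for ch in text if not ch.isspace()]
    if language == "en":
        return text.split()
    raise ValueError("unsupported language: " + language)
```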
Labeling with the label combinations of the embodiments of this specification takes full account of the information of the word segmentation units before and after each punctuation mark, so the label distribution of the training corpus is more balanced. This avoids the label imbalance caused by labeling only the unit in front of a punctuation mark, avoids an excess of invalid labels, and improves the accuracy of punctuation prediction. Moreover, after pre-training, the neural network model with a time sequence can deeply capture contextual information in text, preparing it for the transfer learning training of the punctuation mark model. Semi-supervised learning can thus be realized: a large amount of data does not need to be labeled manually, the transfer learning training makes the whole punctuation mark model converge quickly, the recall rate is improved, and good results can be achieved with a small data set, so a universal and powerful punctuation mark model with good generalization ability and universality can be obtained.
In order to make the embodiments of the present disclosure more clearly understood and implemented by those skilled in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without any creative effort shall fall within the protection scope of the present specification.
Referring to fig. 1, a schematic structural diagram of a training system of a punctuation mark model, in an embodiment of the present invention, a training system 10 of a punctuation mark model may specifically include:
a corpus acquiring unit 11 adapted to acquire a first corpus including punctuation marks and a second corpus including punctuation marks;
the corpus processing unit 12 is adapted to remove punctuation marks from the second corpus, and label corresponding label combinations at the word segmentation units before and after the removed punctuation marks to obtain a third corpus, where the third corpus includes: a punctuation-free text set and a label sequence set, the label sequence set comprising the label combinations;
the pre-training unit 13 is adapted to input the first training corpus into a preset neural network model with a time sequence for pre-training, so as to obtain a pre-trained language sub-model;
and the transfer learning training unit 14 is adapted to input the third training corpus into the initial punctuation mark model for transfer learning training, so as to obtain a trained punctuation mark model.
In a specific implementation, the wider the sources of the first training corpus containing punctuation marks and the more standard its punctuation usage, the more beneficial it is for the subsequent pre-training of the neural network model, and the universality and accuracy of the pre-trained language sub-model can be improved. Therefore, on the one hand, to broaden the sources of the first corpus, the corpus acquiring unit 11 obtains corpora from as many different fields as possible; on the other hand, to make the punctuation usage in the first corpus more standard, the corpus acquiring unit 11 may obtain proofread corpora.
For example, the first corpus may include corpora biased toward written language, such as news corpora, Baidu Baike (encyclopedia) corpora, and thesis corpora, as well as corpora biased toward spoken language, such as chat corpora, microblog corpora, and comment corpora.
In a specific implementation, the pre-training unit 13 pre-trains the preset neural network model with a time sequence using the first training corpus. The resulting pre-trained language sub-model can deeply capture contextual information in text, preparing it for the subsequent transfer learning training of the punctuation mark model.
In a specific implementation, the second corpus containing punctuation marks may be of the same kind as the first corpus described above, and the corpus acquiring unit 11 may obtain corpora, or proofread corpora, from a number of different fields, so as to increase the universality and accuracy of the punctuation mark model. To save acquisition time, the corpus acquiring unit 11 may also obtain the second corpus directly from the first corpus. For example, the second corpus may include corpora biased toward written language, such as news corpora, Baidu Baike (encyclopedia) corpora, and thesis corpora, as well as corpora biased toward spoken language, such as chat corpora, microblog corpora, and comment corpora. The sources of the second corpus can thus be quite extensive.
In a specific implementation, in order to accurately predict punctuation marks of a punctuation-free text in one or more specific fields, the corpus acquiring unit 11 may also acquire a corpus for the specific field or fields, and acquire a second corpus. For example, to be able to predict punctuation of punctuation-free text in the financial domain, the second corpus may include corpus that contains punctuation from the financial domain.
In this embodiment of the present invention, the corpus processing unit 12 may perform punctuation mark removal and label labeling on the second corpus, and associate the obtained punctuation-free text set and the label sequence set in the third corpus, so that when a part or all of the punctuation-free text set is obtained, a corresponding part or all of the label sequences can be obtained.
In the embodiment of the invention, the label sequence set can contain all label combinations for labeling the front and rear word segmentation units of the removed punctuation marks, and the information of the front and rear word segmentation units of the punctuation marks of the second training corpus can be fully considered by labeling the label combinations, so that the label distribution of the third training corpus is more balanced, the problem of label distribution imbalance caused by labeling only the labels at the word segmentation units in front of the removed punctuation marks can be avoided, the condition of excessive labeled invalid labels can be avoided, and the accuracy of punctuation mark prediction of a punctuation mark labeling model can be improved.
It can be understood that the language types corresponding to the first corpus, the second corpus and the third corpus may include, but are not limited to, Chinese, and may be adjusted according to the language of the text to be labeled by the actual punctuation mark labeling model. In a specific implementation, a corpus of the corresponding language type may be selected according to the language of the text to be labeled or processed subsequently.
It can also be understood that the timing for obtaining the first corpus and the second corpus is not limited to the above sequence of steps: the first corpus may be obtained at any time before the pre-training of the preset neural network model with time sequence, and the second corpus may be obtained at any time before the third corpus is input into the initial punctuation mark labeling model for transfer learning training. For example, the first corpus and the second corpus may be obtained simultaneously.
In specific implementation, the punctuation mark labeling model may comprise a pre-trained language submodel, so that the number of third training corpora required by the punctuation mark labeling model is greatly reduced; accordingly the number of second training corpora can also be reduced, saving labor cost and labeling time and shortening the training time of the punctuation mark labeling model. Meanwhile, transfer learning training allows the whole punctuation mark labeling model to converge quickly even with a reduced number of third training corpora, and improves the recall rate.
In a specific implementation, as shown in fig. 1, the corpus processing unit 12 may include:
the identifying subunit 121 is adapted to identify punctuation marks of the second corpus according to a preset punctuation mark set;
a removing subunit 122, adapted to remove punctuation marks in the second corpus to obtain a corresponding punctuation-free text set;
and the labeling subunit 123 is adapted to label, according to the preset punctuation mark position label set, a label combination with a corresponding position relationship at the word segmentation unit before and after each removed punctuation mark, so as to obtain a corresponding label sequence set.
In a specific implementation, in order to achieve automatic acquisition of the third corpus, a punctuation mark set and a punctuation mark position label set may be preset. After the second corpus containing punctuation marks is obtained, all word segmentation units and punctuation marks in the second corpus can be converted into tokens in byte format, which avoids errors caused by word segmentation and also avoids the problems of high learning cost and sparse word vector representation caused by an oversized dictionary.
Specifically, according to the preset punctuation mark set, the identifying subunit 121 may identify the punctuation marks of the second corpus; the removing subunit 122 may then remove the punctuation marks in the second corpus to obtain a corresponding punctuation-free text set; and the labeling subunit 123 may label, according to the preset punctuation mark position label set, a label combination with the corresponding position relationship at the word segmentation units before and after each removed punctuation mark, to obtain a corresponding label sequence set. In this way, the punctuation-free text set and the corresponding label sequence set of the third training corpus are obtained for the subsequent training of the punctuation mark labeling model.
The punctuation mark set may include the commonly used punctuation marks of the language types corresponding to the first corpus, the second corpus, and the third corpus, and the punctuation mark position label set may include the label combinations corresponding to the punctuation mark set. For example, if the language type corresponding to the first corpus, the second corpus, and the third corpus is Chinese, the punctuation mark set may include common punctuation marks such as the comma, period, and caesura sign, and the punctuation mark position label set may include a label combination corresponding to the comma, a label combination corresponding to the period, a label combination corresponding to the caesura sign, and so on. The label combinations appear in pairs and are labeled at the positions of the word segmentation units before and after each removed punctuation mark, so that the label distribution of the third training corpus is more balanced. The punctuation mark set and the punctuation mark position label set may be set according to the specific application scenario, which is not limited in this embodiment of the specification.
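As a concrete illustration of how a punctuation mark set and its position label set can be associated, the sketch below pairs each mark with its start/end label combination. This is a hypothetical Python helper, not the patented implementation; the abbreviated label forms (per, com, cae, ...) follow the ones used in the examples later in this text.

```python
# Hypothetical association between a punctuation mark set P and a position-label
# set Q; the abbreviations mirror those used in the examples of this text.
PUNCT_TO_LABELS = {
    "。": ("B-per", "I-per"),  # period
    "，": ("B-com", "I-com"),  # comma
    "、": ("B-cae", "I-cae"),  # caesura (pause) sign
    "？": ("B-que", "I-que"),  # question mark
    "！": ("B-exc", "I-exc"),  # exclamation mark
    "《": ("B-ope", "I-ope"),  # opening title mark
    "》": ("B-clo", "I-clo"),  # closing title mark
}

def label_combination(mark: str) -> tuple:
    """Return the (start, end) position-label pair for a punctuation mark."""
    return PUNCT_TO_LABELS[mark]
```

The returned pair is meant to be written at the word segmentation units immediately before and after the removed mark, which is what keeps the label distribution balanced.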
In a specific implementation, the training system of the punctuation mark labeling model may further include a corpus cleaning unit (not shown). Before the pre-training of the language submodel and the training of the punctuation mark labeling model, data cleaning may be performed on the obtained first corpus and second corpus, mainly to remove punctuation marks that are obviously used irregularly, for example: a punctuation mark at the beginning of a sentence that cannot start a sentence, punctuation marks that should not appear consecutively within a sentence, and a non-sentence-end punctuation mark at the end of a sentence. Where such irregular usage is found, these punctuation marks may be deleted. This prevents the language submodel and the punctuation mark labeling model from learning irregular punctuation usage patterns.
In a specific implementation, as shown in fig. 1, the labeling subunit 123 may include at least one of the following:
the first labeling module 1231 is adapted to label, according to the set of punctuation mark positions, label combinations with corresponding position relationships at the preceding and following word segmentation units of each removed punctuation mark when there are word segmentation units in both the preceding and following removed punctuation marks, so that the punctuation mark position labels corresponding to the punctuation marks appear in pairs. For example, the second corpus is S: { t1t2t3...tnN is a determined positive integer, which represents the number of tokens (tokens) contained in the set S, and if the text contains m punctuations and m is less than n, the third text corpus is the set R: { t1t2...tn-m}, the firstA labeling module 1231 may begin t after the removed punctuation markiProcessing and marking a label combination to obtain a corresponding label sequence set L of { L }1L2L3...Ln-m}。
The second labeling module 1232 is adapted to, when there is no segmentation unit before or after the removed punctuation mark, add a vacancy-filling symbol before or after the removed punctuation mark to fill the position of the vacant segmentation unit, and label combinations of corresponding positional relationships are labeled at the segmentation unit and the vacancy-filling symbol after or before the punctuation mark according to the punctuation mark position label set.
Since the label combination cannot be labeled in pairs when there is no word segmentation unit before or after the removed punctuation mark, the gap filling symbol can be regarded as a word segmentation unit, so that the label combination can be labeled in pairs.
In the embodiment of the present invention, the second labeling module 1232 may use the null-position-filling symbol in the following cases:
first, punctuation marks are located at the end of a sentence. For example, the second corpus is: { today is wednesday. And when the punctuation mark is removed, the second labeling module 1232 may add a vacancy completion mark EOS for the sentence end after the removed sentence mark to obtain a punctuation-free text set { today is wednesday EOS }, and then, the second labeling module 1232 may label the label combination of the sentence mark at the position of the character "three" before the removed sentence mark and the position of the vacancy completion mark "EOS" after the removed sentence mark.
It should be understood that "{ }" is used only to delimit the content range of the examples and is not an essential part of the corpus content; those skilled in the art can use other unambiguous symbols to delimit the corpus content. The word segmentation units and punctuation marks inside "{ }" may be tokens converted into byte format or into other formats; subsequent uses of "{ }" are the same as described above.
Second, the punctuation mark is located at the beginning of the sentence. For example, the second corpus is: {《Reader》 is a good magazine.}. When the left title mark is removed, the second labeling module 1232 may add a sentence-beginning gap-filling symbol BOS before the removed left title mark, obtaining the punctuation-free text set {BOS Reader is a good magazine EOS}. Then, the second labeling module 1232 may label the label combination of the left title mark at the position of the gap-filling symbol "BOS" before the removed left title mark and at the position of the character "Read" after the left title mark.
It is understood that the foregoing embodiments are merely illustrative, and not restrictive on the specific implementation, and in the specific implementation process, the punctuation-free text set and the tag sequence set obtained by the second corpus are often massive.
By adopting the above scheme, label combinations with corresponding position relationships are labeled at the word segmentation units before and after each punctuation mark according to the punctuation mark position label set, so that the punctuation mark position labels corresponding to each punctuation mark appear in pairs. When there is no word segmentation unit before or after a removed punctuation mark, a corresponding gap-filling symbol is added to fill the position of the missing word segmentation unit, which avoids the problem that a punctuation mark cannot be labeled for lack of a word segmentation unit and ensures that the punctuation mark position labels corresponding to each punctuation mark always appear in pairs.
In a specific implementation, the set of punctuation mark position labels may comprise: a start position label for the word segmentation unit before each punctuation mark, and an end position label for the word segmentation unit after each punctuation mark. The set of punctuation mark position labels can correspond to the set of punctuation marks.
In an embodiment of the present invention, a punctuation mark set P and a punctuation mark position label set Q are preset, where P = {。？！、，；：《》“”} and Q = {B-period, I-period, B-question mark, I-question mark, B-exclamation mark, I-exclamation mark, B-caesura sign, I-caesura sign, B-comma, I-comma, B-semicolon, I-semicolon, B-colon, I-colon, B-opening title mark, I-opening title mark, B-closing title mark, I-closing title mark, B-upper quotation mark, I-upper quotation mark, B-lower quotation mark, I-lower quotation mark}. The punctuation mark set P and the punctuation mark position label set Q are associated, and the corresponding position relationship between each punctuation mark in the set P and the word segmentation units before and after it is represented by the punctuation mark position label set Q. When the punctuation marks in the second training corpus are identified through the punctuation mark set P, the punctuation mark position labels corresponding to the identified punctuation marks can be obtained through the association between the punctuation mark set P and the punctuation mark position label set Q, and then the corresponding label combinations can be labeled at the word segmentation units before and after each identified punctuation mark once it is removed.
It should be understood that, in this embodiment, the punctuation mark position labels are represented as "B-full English name of the punctuation mark" and "I-full English name of the punctuation mark", but other representations may be adopted. For example, considering that the full English name of a punctuation mark is long to write, it may be replaced by an abbreviation, or even directly by the punctuation mark itself, which is not limited in this embodiment of the present invention.
The correspondence among the punctuation marks, the punctuation mark set, and the punctuation mark position label set is shown in table 1.

TABLE 1

Punctuation mark: start position label, end position label
。 (period): B-per, I-per
？ (question mark): B-que, I-que
！ (exclamation mark): B-exc, I-exc
、 (caesura sign): B-cae, I-cae
， (comma): B-com, I-com
： (colon): B-col, I-col
《 (opening title mark): B-ope, I-ope
》 (closing title mark): B-clo, I-clo
“ (upper quotation mark): B-upp, I-upp
” (lower quotation mark): B-low, I-low
Continuing with the above example, the second labeling module 1232 may label the start position label "B-per" of the period at the character "three" before the removed period in the punctuation-free text set {Today is Wednesday EOS}, and label the end position label "I-per" of the period at the gap-filling symbol "EOS" after the removed period. In the same way, the second labeling module 1232 may label the start position label "B-ope" of the left title mark at the gap-filling symbol "BOS" before the removed left title mark in the punctuation-free text set {BOS Reader is a good magazine EOS}, and label the end position label "I-ope" of the left title mark at the character "Read" after the left title mark.
In a specific implementation, the set of punctuation mark position labels may further include: a blank position label corresponding to no punctuation mark. As shown in fig. 1, the labeling subunit 123 may further include: a fifth labeling module 1235, adapted to label the blank position label at a word segmentation unit when there is no punctuation mark before or after the word segmentation unit. Through the blank position labels, each word segmentation unit in the text corpus can have a corresponding punctuation mark position label, which further improves the balance of the label distribution and facilitates fine tuning during later punctuation mark labeling model training.
In a specific implementation, as shown in fig. 1, the labeling subunit 123 may further include at least one of the following:
a third labeling module 1233, adapted to label, when a plurality of consecutive removed punctuations are the same punctuation and there is only one word segmentation unit between the punctuations, only corresponding start position labels at word segmentation units before other punctuations except for the last punctuation;
a fourth labeling module 1234, adapted to, when there is no segmentation unit between two consecutive removed punctuations, add an interval symbol after each segmentation unit, label a start position label corresponding to a previous punctuation symbol at a segmentation unit before the removed previous punctuation symbol, and label a start position label and an end position label corresponding to a next punctuation symbol at an interval symbol before the removed next punctuation symbol and a word segmentation unit after the removed next punctuation symbol, respectively. The inventor finds in research that due to the rules of language usage, several special cases may arise:
first, a plurality of punctuations that have been removed in succession are the same punctuation and there is only one word segmentation unit between the punctuations. For example, according to the method for labeling a Chinese punctuation mark, although characters are present before and after a pause sign, when only one character is present between pause signs, a problem of overlapping occurs in a label combination added with pause signs. For example, the second corpus is: { cattle, sheep, horse all eat grass. And if there is only one character among a plurality of consecutive pause numbers, the third labeling module 1233 may label the start position label "B-cae" of the pause number at the characters "cow" and "sheep", and label the end position label "I-cae" of the pause number at the character "horse".
In an embodiment of the present invention, as shown in fig. 2, a schematic diagram of obtaining a third corpus according to the present invention is shown, and in an embodiment of the present invention, the obtained second corpus 21 is:
{ day, month, day three parties for a big battle, ready to begin! Is it impossible for a member of the group to win? }
Then, according to the preset punctuation mark set P, the identifying subunit 121 may identify the punctuation marks in the second corpus, and the removing subunit 122 may remove them. Because there is no text after the question mark at the sentence end, the second labeling module 1232 may add a sentence-end gap-filling symbol EOS after the question mark, obtaining the punctuation-free text set 221 {the three-party large-decision in the day and month is ready to begin a chance that people cannot win the match EOS}. In addition, the second labeling module 1232 may label, according to the preset punctuation mark position label set Q, a label combination with the corresponding position relationship at the characters before and after each removed punctuation mark.
Since consecutive caesura signs appear between "day", "month", and "day", with only one character between them, the third labeling module 1233 may label "B-cae" at "day" and "month" and label "I-cae" at the last "day". The first labeling module 1231 may label "B-com" and "I-com" respectively at the characters "decide" and "quasi" before and after the first comma, label "B-exc" and "I-exc" respectively at the characters "start" and "zan" before and after the exclamation mark, and label "B-com" and "I-com" respectively at the characters "start" and "pair" before and after the second comma. The second labeling module 1232 may label "B-que" and "I-que" respectively at the character "Do" before the question mark and at the gap-filling symbol "EOS" after the question mark. The fifth labeling module 1235 may label the blank position label O at the characters with no punctuation mark before or after them, finally obtaining the corresponding label sequence set 222 {B-cae B-cae I-cae O B-com I-com O B-exc I-exc O B-com I-com B-que I-que}. The obtained punctuation-free text set 221 is associated with the label sequence set 222, thereby obtaining the third corpus 22.
Second, the removed punctuation marks are two consecutive punctuation marks with no word segmentation unit between them. Under the above labeling method, the label combinations of the two punctuation marks would again overlap. During research the inventor observed that this situation follows a general rule arising from punctuation usage. Therefore, an interval symbol SEP can be added after each word segmentation unit: a start position label corresponding to the former punctuation mark is labeled at the word segmentation unit before the removed former punctuation mark, and a start position label and an end position label corresponding to the latter punctuation mark are labeled respectively at the interval symbol before the removed latter punctuation mark and at the word segmentation unit after it. The interval symbol can be regarded as a word segmentation unit, so that label combinations can still be labeled in pairs; this labeling method is effective for two consecutive punctuation marks.
For example, the second corpus is: {我喜欢看《读者》。} (I like to read "Reader".), where the two consecutive symbols "》。" appear in the text. The fourth labeling module 1234 may add an interval symbol SEP after each character, and, since there is a period at the end of the sentence, the second labeling module 1232 also needs to add a sentence-end gap-filling symbol EOS to fill the missing character, resulting in the punctuation-free text set {我 SEP1 喜 SEP2 欢 SEP3 看 SEP4 读 SEP5 者 SEP6 EOS}. Then, the fourth labeling module 1234 may label the start position label "B-clo" of the right title mark at "者", label the start position label "B-per" of the period at "SEP6", and label the end position label "I-per" of the period at "EOS". For the word segmentation units with no punctuation mark before or after, the fifth labeling module 1235 may label the blank position label O at those units; and for the interval symbols with no punctuation mark before or after, the interval label S is labeled at the interval symbols.
It is to be understood that the "SEP" followed by the number sequence is used for descriptive purposes only for purposes of distinction, and that the "SEP" followed by the number sequence may not be used in a particular implementation.
In an embodiment of the present invention, as shown in fig. 3, a schematic diagram of obtaining the third corpus according to the present invention is shown, where the obtained second corpus 31 is:
{领导说：“实现盈利是首要目标”。} (The leader says: "Achieving profitability is the primary goal.")

Then, according to the preset punctuation mark set P, the identifying subunit 121 may identify the punctuation marks in the second corpus, and the removing subunit 122 may remove them. Here the colon followed by the upper quotation mark, and the lower quotation mark followed by the period, are both cases where two punctuation marks appear consecutively, so the fourth labeling module may add an interval symbol SEP after each character; and because there is no character after the sentence-end period, the second labeling module 1232 may add a sentence-end gap-filling symbol EOS after the period. In addition, to facilitate the subsequent training of the punctuation mark labeling model, the second labeling module 1232 may further add a sentence-beginning gap-filling symbol BOS, so that the symbols in the punctuation-free text set are complete, obtaining the punctuation-free text set 321 {BOS 领 SEP1 导 SEP2 说 SEP3 实 SEP4 现 SEP5 盈 SEP6 利 SEP7 是 SEP8 首 SEP9 要 SEP10 目 SEP11 标 SEP12 EOS}.

The fourth labeling module 1234 may label the start position label "B-col" of the colon at the character "说", label the start position label "B-upp" of the upper quotation mark at the interval symbol "SEP3", and label the end position label "I-upp" of the upper quotation mark at the character "实". The period and the lower quotation mark can be handled similarly, and the interval label S is labeled at the interval symbols with no punctuation mark before or after, finally obtaining the corresponding label sequence set 322 {O O S O S B-col B-upp I-upp S O S O S O S O S O S O S O S B-low B-per I-per}. The obtained punctuation-free text set 321 and the label sequence set 322 are associated, thereby obtaining the third corpus 32.
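The interval-symbol scheme for consecutive punctuation marks can be sketched as follows. This is a hypothetical Python helper under the assumptions of this text, not the patented implementation; for brevity it assumes the text does not begin with a punctuation mark (sentence-beginning BOS padding is omitted).

```python
# Sketch of the interval-symbol scheme: a SEP follows every character; the
# first of two adjacent marks keeps only its start label on the preceding
# character, while the second puts its start label on the intervening SEP and
# its end label on the following unit (EOS when the sentence ends there).
PUNCTS = {"。": "per", "：": "col", "“": "upp", "”": "low",
          "《": "ope", "》": "clo"}

def label_with_seps(text):
    units, labels = [], []
    pending = None      # end label owed to the next unit
    run = 0             # marks seen since the last character
    for ch in text:
        if ch in PUNCTS:
            abbr = PUNCTS[ch]
            target = -1 if run else -2  # the SEP between marks, else the char
            labels[target] = "B-" + abbr
            pending = "I-" + abbr
            run += 1
        else:
            units += [ch, "SEP"]
            labels += [pending or "O", "S"]
            pending = None
            run = 0
    if pending:                         # sentence ends with a mark: pad EOS
        units.append("EOS")
        labels.append(pending)
    return units, labels
```

On {我喜欢看《读者》。} this yields "B-clo" at 者, "B-per" at the SEP between 》 and 。, and "I-per" at EOS, matching the example above.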
By adopting this scheme, the label labeling mode can be optimized, so that punctuation marks which are separated from word segmentation units by short intervals and which tend to appear consecutively can be predicted more accurately during model training.
It is understood that the foregoing embodiments are merely illustrative, and not restrictive on the specific implementation, and in the specific implementation process, the punctuation-free text set and the tag sequence set obtained by the second corpus are often massive.
In order to make those skilled in the art better understand and implement the embodiments of the present invention, a specific example of obtaining a trained punctuation mark labeling model through transfer learning is given below.
In an embodiment of the present invention, the punctuation mark model may include a pre-trained language submodel and a pre-set conditional random field submodel, and the pre-trained language submodel may be connected to the pre-set conditional random field model. As shown in fig. 1, the transfer learning training unit 14 may include:
an extracting subunit 141, adapted to extract a section of punctuation-free text in the punctuation-free text set obtained by the corpus processing unit 12, and a corresponding first tag sequence of the punctuation-free text in the tag sequence set;
a migration training subunit 142, adapted to input the punctuation mark model with the punctuation-free text and the first tag sequence, obtain, by the pre-trained language sub-model, sentence characteristic information corresponding to the punctuation-free text, input the sentence characteristic information into the conditional random field sub-model, and predict a second tag sequence corresponding to the punctuation-free text;
an error calculation subunit 143 adapted to determine an error between the second tag sequence and the first tag sequence;
a fine tuning subunit 144, configured to, when it is determined that the error is greater than a preset error threshold, perform fine tuning on parameters of the punctuation mark model until the error between the second tag sequence and the first tag sequence is not greater than the error threshold, so as to obtain a trained punctuation mark model.
In an implementation, the extracting subunit 141 may extract the third corpus obtained by the corpus processing unit 12 in batches, so that the migration training subunit 142 may perform iterative training on each batch; each batch may include a piece of punctuation-free text, i.e., a sentence list, where the size of the list is determined by actual conditions. Depending on the labeling method adopted when removing punctuation marks and labeling label combinations from the second corpus, the punctuation-free text obtained by the corpus processing unit 12 may include at least one of the sentence-end gap-filling symbol EOS, the sentence-beginning gap-filling symbol BOS, and the interval symbol SEP.
In a specific implementation, since the set of punctuation-free texts is associated with the set of tag sequences, a corresponding tag sequence can be obtained as a first tag sequence according to a segment of punctuation-free texts contained in each batch.
In a specific implementation, the migration training subunit 142 inputs the punctuation-free text and the first tag sequence into the punctuation mark labeling model. The model converts each word segmentation unit in the punctuation-free text into a byte-format token, denoted x = (x1, x2, ..., xn), where xi is an index value in the dictionary. Word vector embedding (word embedding) is then performed on each word segmentation unit according to its dictionary index value, one word segmentation unit corresponding to one word vector, denoted e = (e1, e2, ..., en), where ei is a dense vector of dimension k. The dictionary is adjusted according to the language type of the word segmentation units; the word vector matrix is obtained by random initialization or from pre-trained word vectors and exists in the whole neural network as learnable parameters, and the dimension k of the word vector ei is set according to the specific situation.

Then, the vectorized punctuation-free text passes through the pre-trained language submodel of the punctuation mark labeling model, which converts the static word vector representation e = (e1, e2, ..., en) into a dynamic sentence representation, recorded as l = (l1, l2, ..., ln), where li is a dense vector of dimension l whose size is set according to the specific situation, thereby deeply strengthening the representation of the sentence.
The sentence representation l = (l1, l2, ..., ln) is input into the conditional random field (CRF) submodel of the punctuation mark labeling model. The CRF submodel is preset with a state transition matrix [A]i,j and a score matrix [fθ(l)]t,y, where [A]i,j represents the transition probability of a state transition from the i-th state to the j-th state over two time steps, and [fθ(l)]t,[y]t represents the score that the output at the t-th position is [y]t after l is input, θ containing all the parameters of the entire pre-trained language submodel that require transfer learning. At this time, a feature function preset in the conditional random field submodel may be used:

s(l, y) = Σ_{t=1..n} ( [A]_{[y](t-1),[y]t} + [fθ(l)]_{t,[y]t} )

which represents, for the input l = (l1, l2, ..., ln), the score that the predicted output tag sequence is y = ([y]1, [y]2, ..., [y]n). This feature function s(l, y) can avoid the occurrence of tag sequences that do not conform to the labeling rules. When the score s(l, y) is maximal, the second tag sequence ŷ is obtained.

In an embodiment of the present invention, the Viterbi method may be used to calculate the path with the largest score s(l, y), i.e., ŷ = argmax_y s(l, y).
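For illustration, a minimal Viterbi decoder over CRF scores can be sketched as below. This is a generic textbook implementation, not the patent's own code; `emissions` stands in for the score matrix [fθ(l)] and `transitions` for [A].

```python
def viterbi(emissions, transitions):
    """Return the tag path maximising s(l, y): emissions[t][y] plays the role
    of [f_theta(l)]_{t,y}, transitions[i][j] that of [A]_{i,j}."""
    n, k = len(emissions), len(emissions[0])
    score = list(emissions[0])                 # best score ending in each tag
    back = [[0] * k for _ in range(n)]         # backpointers
    for t in range(1, n):
        new = []
        for j in range(k):
            best_i = max(range(k), key=lambda i: score[i] + transitions[i][j])
            back[t][j] = best_i
            new.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new
    tag = max(range(k), key=lambda j: score[j])
    path = [tag]
    for t in range(n - 1, 0, -1):              # follow backpointers
        path.append(back[t][path[-1]])
    return path[::-1]
```

With k tags and n positions this runs in O(n·k²) time, instead of enumerating all kⁿ candidate tag sequences.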
In a specific implementation, assume that the first tag sequence is y*. The error calculation subunit 143 can compare ŷ and y*, and an error between the two can be obtained.
In an implementation, after iterative training on the third corpus, the error calculating subunit 143 may obtain all the second tag sequences ŷ and the corresponding first tag sequences y*. When the error between at least one pair of ŷ and y* is greater than the preset error threshold, the fine tuning subunit 144 may fine-tune the parameters of the punctuation mark labeling model, and the migration training subunit 142 then performs the transfer learning training of the punctuation mark labeling model again, until the error between ŷ and y* is not larger than the preset error threshold, thereby obtaining the punctuation mark labeling model trained by transfer learning.
In an embodiment of the present invention, the fine tuning subunit 144 may first obtain, after l is input into the punctuation mark labeling model, the scores s(l, y) corresponding to all possible tag sequences y as well as the score s(l, y*) corresponding to the first tag sequence y*. Secondly, the score of the first tag sequence is normalized to obtain the probability of the first tag sequence:

P(y* | l) = exp(s(l, y*)) / Σ_y exp(s(l, y))

Then, the negative logarithm of the probability of the first tag sequence is taken to obtain the loss function:

loss = -log P(y* | l) = log Σ_y exp(s(l, y)) - s(l, y*)
Since a smaller value of the loss function represents higher accuracy, the parameters of the punctuation mark labeling model corresponding to the minimum value of the loss function need to be obtained. The fine tuning subunit 144 may fine-tune the parameters of the punctuation mark labeling model by a gradient descent method or a back propagation method, the range including the parameters of the pre-trained language submodel and the parameters of the conditional random field submodel, so as to reduce the value of the loss function. By adopting a gradient descent or back propagation method, the punctuation mark labeling model can converge quickly, shortening the training time and improving the accuracy of the punctuation mark labeling model.
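The normalization and loss above can be illustrated with a brute-force sketch. This is a hypothetical helper for small examples only; a real CRF computes the partition sum with the forward algorithm rather than by enumerating all tag sequences.

```python
import itertools
import math

def path_score(emissions, transitions, tags):
    """s(l, y): sum of emission scores plus transition scores along a tag path."""
    s = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        s += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return s

def crf_nll(emissions, transitions, gold):
    """-log P(y* | l) = log sum_y exp(s(l, y)) - s(l, y*), by enumeration."""
    n, k = len(emissions), len(emissions[0])
    log_z = math.log(sum(math.exp(path_score(emissions, transitions, y))
                         for y in itertools.product(range(k), repeat=n)))
    return log_z - path_score(emissions, transitions, gold)
```

Gradient descent on this quantity with respect to θ and [A] is the fine-tuning step described above; the loss is always non-negative and shrinks as the gold sequence's score grows relative to all alternatives.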
In a specific implementation, the punctuation mark labeling model may further comprise a preset bidirectional long short-term memory (BiLSTM) submodel, located between the pre-trained language submodel and the conditional random field submodel.

After the sentence characteristic information is generated by the pre-trained language submodel, it is input into the BiLSTM submodel to further enhance its representation capability. After sentence characteristic information processed by the two layers of neural networks with time sequence is input into the conditional random field submodel, the final prediction result varies over a smaller range and a better prediction effect is achieved; in addition, the accuracy and generalization capability of the model can be increased when the data volume is large.
In a specific implementation, as shown in fig. 1, the pre-training unit 13 includes at least one of:
the first pre-training subunit 131 is adapted to obtain a random first training text from the first training corpus and input it into the neural network model; the neural network model predicts the next word segmentation unit of the first training text given the preceding text, and training is judged complete when the prediction accuracy reaches a preset first threshold. The pre-trained language sub-model is thus obtained through a Language Model (LM) training mode.
The pre-trained language sub-model can estimate the likelihood that subsequently input punctuation-free text forms a sentence. Assume the subsequently input punctuation-free text is {ω1 ω2 ω3 ... ωn}, where n is the number of word segmentation units in the punctuation-free text. The pre-trained language sub-model expresses the probability that this punctuation-free text is a sentence as:

P(S) = Π_{i=1}^{n} P(ωi | ω1 ... ω(i−1))

where i = 1, 2, ..., n. The greater the probability P(S), the more likely the punctuation-free text is a sentence. This training mode strengthens the pre-trained language sub-model's ability to characterize the features of context sentences in the text corpus.
The word segmentation unit corresponds to the minimum sentence component unit of the language of the input first training corpus: for example, the minimum sentence component unit of Chinese is the character and that of English is the word, so the concrete form of the word segmentation unit is determined by the language of the first training corpus.
In a specific implementation, when the first training text is an empty set, the neural network model predicts the first word: according to the known prior probabilities, it takes the word segmentation unit with the highest first-word probability as the prediction result. For example, if "今" ("now") has the highest first-word probability, "今" is taken as the first word.
When there is already preceding text, for example the character "今" ("now"), the probability P(ω2|ω1) of the next word is predicted from it. When P(ω2|ω1) is maximal, the probability that the two words form a sentence, P(ω1ω2) = P(ω1)P(ω2|ω1), is maximal; assuming "天" ("day") has the greatest probability, "今天" ("today") is obtained.

As another example, given the preceding text "今天天气很" ("today the weather is very"), the probability P(ω6|ω1ω2ω3ω4ω5) of the next word is predicted. When P(ω6|ω1ω2ω3ω4ω5) is maximal, the probability that the six words form a sentence,

P(ω1ω2ω3ω4ω5ω6) = P(ω1)P(ω2|ω1)P(ω3|ω1ω2)P(ω4|ω1ω2ω3)P(ω5|ω1ω2ω3ω4)P(ω6|ω1ω2ω3ω4ω5),

is maximal; assuming "好" ("good") has the greatest probability, "今天天气很好" ("today the weather is very good") is obtained.
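The chain-rule factorization above can be sketched with a toy bigram model. The corpus, counts, and the greedy next-word choice are illustrative assumptions, not the patent's neural network:

```python
from collections import Counter

# Tiny illustrative corpus of pre-segmented sentences.
corpus = [["today", "weather", "is", "very", "good"],
          ["today", "weather", "is", "very", "hot"],
          ["today", "weather", "is", "good"]]

unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((a, b) for s in corpus for a, b in zip(s, s[1:]))
total = sum(unigrams.values())

def p_next(prev, w):
    """Conditional probability P(w | prev) estimated from counts."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_sentence(words):
    """Chain rule: P(w1) * P(w2|w1) * ... * P(wn|w(n-1))."""
    p = unigrams[words[0]] / total
    for prev, w in zip(words, words[1:]):
        p *= p_next(prev, w)
    return p

def greedy_next(prev):
    """Pick the word maximizing P(w | prev), as in the example above."""
    cands = {b: c for (a, b), c in bigrams.items() if a == prev}
    return max(cands, key=cands.get)

p = p_sentence(["today", "weather", "is", "very", "good"])  # = 1/14
```

A neural language model replaces the counted conditionals with learned ones conditioned on the full prefix, but the factorization is the same.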
A second pre-training subunit 132 is adapted to obtain from the first training corpus a second training text in which a preset proportion of the content is randomly masked, and input it into the neural network model; the neural network model predicts the masked portion given its context, and training is judged complete when the prediction accuracy reaches a preset second threshold. The pre-trained language sub-model is thus obtained through a Masked Language Model (MLM) training mode. In a specific implementation, the second pre-training subunit 132 may randomly mask a percentage of the content of the first training corpus, which contains punctuation marks, and then predict the masked content through a normalized exponential (softmax) function. Because the randomly masked content may include both word segmentation units and punctuation marks, a language sub-model pre-trained in the MLM mode characterizes the sentence feature information of the sentence context more strongly.
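A minimal sketch of the random masking step. The 15% ratio and the `[MASK]` token are illustrative assumptions; the patent only specifies "a preset proportion":

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, ratio=0.15, seed=0):
    """Randomly replace a preset proportion of tokens with [MASK];
    returns the masked sequence plus the positions/answers to predict."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * ratio))
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = list(tokens)
    answers = {}
    for i in positions:
        answers[i] = masked[i]          # ground truth for the loss
        masked[i] = MASK
    return masked, answers

# Note the input keeps its punctuation, so marks can also be masked.
masked, answers = mask_tokens(list("今天天气很好，出门走走。"))
```

The model is then trained to recover `answers` from `masked`, with a softmax over the vocabulary at each masked position.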
In a specific implementation, as shown in fig. 1, the pre-training unit 13 further includes the following sub-units:
a third pre-training subunit 133 is adapted to obtain from the first training corpus a third training text in which one sentence is randomly masked, and input it into the neural network model; the neural network model predicts the masked sentence given the surrounding text, and training is judged complete when the prediction accuracy reaches a preset third threshold. The pre-trained language sub-model is thus obtained through a Next Sentence Prediction (NSP) training mode. A language sub-model pre-trained in the NSP mode has a stronger ability to understand the relationship between preceding and following paragraphs.
In an embodiment of the present invention, the NSP mode adopts a classification method. The third pre-training subunit 133 may obtain a text A from the first training corpus, randomly mask one sentence of text A, then obtain a sentence B from the first training corpus and predict whether sentence B is the masked sentence of text A, outputting "1" or "TRUE" if so and "0" or "FALSE" otherwise. Whether the prediction is accurate is judged from the output, the prediction accuracy is counted, and training is judged complete when the accuracy reaches the preset third threshold. For example, within a training period, 50% of the time the obtained sentence B is the actually masked sentence of text A, and the remaining 50% of the time sentence B is a sentence randomly extracted from the first training corpus.
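The 50/50 pair construction described above can be sketched as follows. The sentence names and the seed are illustrative; the real subunit samples from the first training corpus:

```python
import random

def make_nsp_pairs(paragraphs, seed=0):
    """Build (sentence_a, sentence_b, label) examples: half the time
    sentence_b is the real next sentence (label 1, i.e. TRUE), otherwise
    a random sentence from the whole corpus (label 0, i.e. FALSE)."""
    rng = random.Random(seed)
    flat = [s for p in paragraphs for s in p]
    pairs = []
    for p in paragraphs:
        for a, true_next in zip(p, p[1:]):
            if rng.random() < 0.5:
                pairs.append((a, true_next, 1))
            else:
                pairs.append((a, rng.choice(flat), 0))
    return pairs

paragraphs = [["s1", "s2", "s3"], ["t1", "t2"]]
pairs = make_nsp_pairs(paragraphs)
```

The classifier is then trained to output 1/TRUE for the true-next pairs and 0/FALSE for the randomly sampled ones.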
In a specific implementation, any one of the above three pre-training subunits may be used alone to train the pre-trained language sub-model, or the pre-trained language sub-model may be obtained by combining the first pre-training subunit with the third pre-training subunit, or the second pre-training subunit with the third pre-training subunit.
In a specific implementation, the first corpus may be obtained in various manners, for example, the first corpus may be obtained in one or more of the following manners:
1. text data containing punctuation marks is captured in the internet.
2. And capturing language data in the Internet, and carrying out voice recognition and labeling on the language data to obtain corresponding text data.
3. Text data containing punctuation is manually entered.
The second corpus may also be obtained in various manners, for example, the second corpus may be obtained in one or more of the following manners:
1. Text data is captured from the first training corpus.
2. Text data containing punctuation marks in a preset field is captured from the Internet.
3. Text data containing punctuation marks in a preset field is manually entered.
The embodiment of the present invention further provides a punctuation mark labeling model obtained with the above training system. To enable those skilled in the art to better understand and implement the embodiment of the present invention, a detailed description is given below through specific embodiments with reference to the accompanying drawings.
Referring to the structural diagram of the punctuation mark model shown in fig. 4, in the embodiment of the present invention, as shown in fig. 4, the punctuation mark model 40 may include an input layer 41, an encoding layer 42, a conditional random field network layer 44 and an output layer 45, wherein:
and the input layer 41 is suitable for extracting corresponding word segmentation unit characteristic information from the text to be added with punctuations.
In one embodiment, word vector embedding is performed on each word segmentation unit according to a dictionary, one word segmentation unit corresponding to one word vector (i.e., one square of the input layer 41 in fig. 4). The embedded text is denoted E = (e1, e2, ..., en), where each ei is a k-dimensional dense vector. The dictionary is adjusted according to the language of the word segmentation units. The word vector matrix is obtained either by random initialization or by using pre-trained word vectors, exists in the whole neural network as a learnable parameter, and the dimensionality of the word vectors is set according to the specific situation.
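A minimal sketch of the dictionary-based lookup with randomly initialized vectors. The dimension k = 8 and the toy vocabulary are illustrative; in the model the vectors are learned parameters:

```python
import random

def build_embeddings(vocab, k=8, seed=0):
    """Randomly initialized word-vector table: one k-dimensional dense
    vector per dictionary entry (learned during training in practice)."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-0.1, 0.1) for _ in range(k)] for w in vocab}

def embed(tokens, table, k=8):
    """Map each word segmentation unit to its vector; out-of-dictionary
    units fall back to a zero vector here."""
    zero = [0.0] * k
    return [table.get(t, zero) for t in tokens]

table = build_embeddings(["BOS", "EOS", "你", "好", "世", "界"], k=8)
E = embed(["BOS", "你", "好", "世", "界", "EOS"], table)  # 6 x 8 matrix
```

This yields the static representation E = (e1, ..., en) that the encoding layer then converts into a dynamic, context-dependent one.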
The text to be punctuated may include at least one of a vacancy completion symbol EOS at the sentence end, a vacancy completion symbol BOS at the sentence head, and an interval symbol SEP, as specifically determined by the labeling mode of the third training corpus used in training the punctuation mark labeling model.
And the coding layer 42 is adapted to capture sentence characteristic information corresponding to the word segmentation unit characteristic information through a pre-trained language sub-model, wherein the pre-trained language sub-model is a neural network model with a time sequence, and the specific training process of the pre-trained language sub-model can refer to the description about the LM training mode, the MLM training mode and the NSP training mode.
In a specific implementation, the pre-trained language sub-model converts the static word-vector representation of the vectorized punctuation-free text, E = (e1, e2, ..., en), into a dynamic sentence representation, denoted L = (l1, l2, ..., ln), where each li is a dense vector of dimension l and the size of l is set according to the specific situation, thereby deeply strengthening the representation of the sentence. L = (l1, l2, ..., ln) is the input to the Conditional Random Field (CRF) network layer 44.
And the conditional random field network layer 44 is adapted to predict a position to be added with a punctuation mark and a corresponding punctuation mark type in the punctuation-free text based on the sentence characteristic information, and label combinations corresponding to the punctuation mark types are adopted for labeling at word segmentation units before and after the position to be added with the punctuation mark.
In one embodiment, the conditional random field network layer 44 is preset with a state transition matrix [A]i,j and a score matrix [fθ(l)]. [A]i,j represents the transition probability from the i-th state to the j-th state between two time steps, and the score-matrix entry [fθ(l)]t,yt represents the score of outputting the tag yt at the t-th position after L = (l1, l2, ..., ln) is input, where θ contains all the parameters of the whole pre-trained language sub-model that require transfer learning. At this time, a feature function preset in the conditional random field model may be used:

s(l, y; θ, [A]) = Σ_{t=1}^{n} ( [A]_{y(t−1),y(t)} + [fθ(l)]_{t,y(t)} )

which represents the score of the predicted output tag sequence y = (y1, y2, ..., yn) for the input L = (l1, l2, ..., ln). This feature function avoids the occurrence of tag sequences that do not conform to the labeling rules. For convenience in the subsequent formulas it is abbreviated as s(l, y). When the score s(l, y) is maximal, the tag sequence y* is output, where y* contains the information of the positions where punctuation marks are to be added and of the corresponding punctuation mark types.
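The score s(l, y), a sum of transition and emission terms along the tag sequence, can be sketched directly. The toy scores and the two-label set are illustrative:

```python
def crf_score(emissions, transitions, tags):
    """s(l, y) = sum over t of ( A[y(t-1)][y(t)] + f[t][y(t)] );
    the first step has no predecessor, so only its emission counts here."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score

# Illustrative per-position scores f[t][y] and transitions A[p][y].
emissions = [{"O": 1.0, "B": 0.2}, {"O": 0.3, "B": 0.9}]
transitions = {"O": {"O": 0.5, "B": 0.1}, "B": {"O": 0.0, "B": -1.0}}
s = crf_score(emissions, transitions, ["O", "B"])  # 1.0 + 0.1 + 0.9 = 2.0
```

Forbidden transitions (tag sequences that violate the labeling rules) get large negative entries in the transition matrix, which is how the feature function suppresses them.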
And the output layer 45 is suitable for outputting the corresponding punctuation mark marking result.
In a specific implementation, the tag sequence y* is output as the punctuation mark labeling result corresponding to the text to be punctuated.
In the embodiment of the present invention, the path with the largest s(l, y) score can be calculated with the Viterbi algorithm, i.e., y* = argmax_y s(l, y).
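A minimal dynamic-programming sketch of that Viterbi decoding. The toy emission/transition scores and labels are illustrative:

```python
def viterbi(emissions, transitions, labels):
    """y* = argmax_y s(l, y) by dynamic programming, instead of
    enumerating every candidate label sequence."""
    # best[t][y] = best score of any sequence ending in label y at step t
    best = [{y: emissions[0][y] for y in labels}]
    back = []                                   # back-pointers per step
    for t in range(1, len(emissions)):
        best.append({})
        back.append({})
        for y in labels:
            prev = max(labels, key=lambda p: best[t - 1][p] + transitions[p][y])
            back[-1][y] = prev
            best[t][y] = best[t - 1][prev] + transitions[prev][y] + emissions[t][y]
    # trace back from the best final label
    last = max(labels, key=lambda y: best[-1][y])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

emissions = [{"O": 1.0, "B": 0.2}, {"O": 0.3, "B": 0.9}]
transitions = {"O": {"O": 0.5, "B": 0.1}, "B": {"O": 0.0, "B": -1.0}}
path = viterbi(emissions, transitions, ["O", "B"])
```

This runs in O(n·|Q|²) time rather than the O(|Q|ⁿ) of exhaustive search, which is what makes decoding practical for the 15-label tag set used below.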
In a specific implementation, as shown in fig. 4, the punctuation mark labeling model may further include a bidirectional long short-term memory network layer 43 disposed between the encoding layer 42 and the conditional random field network layer 44. Each li in L = (l1, l2, ..., ln) corresponds in turn to two adjacent blocks in the bidirectional long short-term memory network layer 43 of fig. 4; for example, l1 corresponds to blocks 431 and 432. The sentence feature information is strengthened through the bidirectional operation and output to the conditional random field network layer.
In an embodiment of the present invention, assume the dimension of the word vectors is set to 100 and the text to be punctuated is input as { BOS 你 好 世 界 EOS } ("BOS hello world EOS"). The input layer 41 of the punctuation mark labeling model 40 obtains the initialized sentence word-vector representation {1 2 3 4 5 6}, in one-to-one correspondence with the six input symbols and concretely represented as a matrix in R6×100. The encoding layer 42 captures the sentence feature information of the context in the text through a pre-trained language sub-model obtained in the MLM + NSP manner. The conditional random field network layer 44 predicts the tag sequence from the sentence feature information; assuming the preset punctuation mark position tag set Q contains 15 tag types, the conditional random field network layer 44 predicts a matrix in R6×15. The correspondence between the matrix values and the punctuation mark position tag set Q is as follows:
1[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0] O
2[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0] O
3[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0] B-COM
4[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0] I-COM
5[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1] B-EXC
6[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0] I-EXC
thus, the following predicted tag sequence is obtained:
{O O B-COM I-COM B-EXC I-EXC}
Then, according to the punctuation mark labeling result output by the punctuation mark labeling model, the punctuation mark of each predicted punctuation mark type is added at the corresponding to-be-added position of the text to be punctuated, obtaining the punctuated text: { hello, world! }.
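The final decoding step, mapping the predicted tag sequence back to punctuated text, can be sketched as follows. The tag-to-mark table (COM for comma, EXC for exclamation mark) and the four-character Chinese text 你好世界 underlying the "hello world" example are assumptions consistent with the worked example above:

```python
PUNCT = {"COM": ",", "EXC": "!", "PER": "."}  # tag type -> mark (illustrative)

def decode(tokens, tags):
    """Insert a punctuation mark after every token tagged B-X (the
    matching I-X marks the token that follows the inserted mark),
    dropping the BOS/EOS vacancy completion symbols."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tok in ("BOS", "EOS"):
            continue
        out.append(tok)
        if tag.startswith("B-"):
            out.append(PUNCT[tag[2:]])
    return "".join(out)

text = decode(["BOS", "你", "好", "世", "界", "EOS"],
              ["O", "O", "B-COM", "I-COM", "B-EXC", "I-EXC"])
```

Because each mark is labeled on both its neighbors (B-X before, I-X after), the decoder only needs the B-X tags to place marks, while the I-X tags keep the training labels evenly distributed.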
With the punctuation mark labeling model of the embodiment of the present invention, the pre-trained language sub-model fully considers the information of the word segmentation units before and after each punctuation mark, so the conditional random field network layer predicts punctuation mark types more accurately. Labeling with tag combinations makes the distribution of the labeled tags more uniform, avoiding tag-distribution imbalance and an excess of labeled invalid tags, thereby improving the accuracy of punctuation labeling. Moreover, the punctuation mark labeling model obtained by the training system can be used both to add punctuation to text to be punctuated and to correct the punctuation of existing text, and has high accuracy, strong universality and a wide application range.
It is to be understood that the terms "first," "second," "third," and the like are used herein merely to distinguish one item from another and do not denote any order, size, or relative importance.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A training system for punctuation marking models, comprising:
the training corpus acquiring unit is suitable for acquiring a first training corpus containing punctuation marks and a second training corpus containing punctuation marks;
and the corpus processing unit is suitable for removing punctuation marks from the second training corpus and labeling corresponding label combinations at word segmentation units before and after the removed punctuation marks to obtain a third training corpus, wherein the third training corpus comprises: a punctuation-free text set and a label sequence set, the label sequence set comprising the label combinations;
the pre-training unit is suitable for inputting the first training corpus into a preset neural network model with a time sequence for pre-training to obtain a pre-trained language sub-model;
and the transfer learning training unit is suitable for inputting the third training corpus into an initial punctuation mark model for transfer learning training to obtain a trained punctuation mark model.
2. The system for training punctuation marking models according to claim 1, wherein the corpus processing unit comprises:
the identification subunit is suitable for identifying punctuation marks of the second training corpus according to a preset punctuation mark set;
the removing subunit is suitable for removing punctuation marks in the second training corpus to obtain a corresponding punctuation-free text set;
and the labeling subunit is suitable for labeling label combinations with corresponding position relations at the positions of the word segmentation units before and after each removed punctuation mark according to a preset punctuation mark position label set to obtain a corresponding label sequence set.
3. The system for training punctuation marking models according to claim 2, wherein said marking subunit comprises at least one of:
the first labeling module is suitable for labeling label combinations with corresponding position relations at the word segmentation units before and after each removed punctuation mark according to the punctuation mark position label set when the word segmentation units are both before and after the removed punctuation mark;
and the second labeling module is suitable for adding a vacancy filling symbol before or after the removed punctuation mark to fill the position of the vacancy word segmentation unit when a word segmentation unit is not arranged before or after the removed punctuation mark, and labeling label combinations with corresponding position relations at the word segmentation unit position after or before the punctuation mark and the vacancy filling symbol according to the punctuation mark position label set.
4. A system for training a punctuation mark model according to claim 2 or 3, wherein said set of punctuation mark position tags comprises: a start position label arranged at the word segmentation unit before each punctuation mark and an end position label arranged at the word segmentation unit after each punctuation mark;
the labeling subunit further comprises at least one of:
the third labeling module is suitable for labeling corresponding initial position labels at word segmentation units before other punctuations except the last punctuation when a plurality of continuously removed punctuations are the same punctuation and only one word segmentation unit exists between the punctuations;
and the fourth labeling module is suitable for adding an interval symbol after each word segmentation unit when no word segmentation unit exists between two continuous removed punctuations, labeling a starting position label corresponding to the former punctuation at the word segmentation unit before the removed former punctuation, and labeling a starting position label and an ending position label corresponding to the latter punctuation at the interval symbol before the removed latter punctuation and the word segmentation unit after the removed latter punctuation.
5. The system for training a punctuation mark model according to claim 4, wherein the set of punctuation mark position tags further comprises: no blank position label corresponding to the punctuation mark;
the labeling subunit further comprises:
and the fifth marking module is suitable for marking the blank position label at the word segmentation unit when no punctuation mark exists before and after one word segmentation unit.
6. The system for training a punctuation mark model according to claim 1, wherein said pre-training unit comprises at least one of:
the first pre-training subunit is suitable for acquiring a random first training text from the first training corpus and inputting the random first training text into the neural network model, the neural network model predicts the next word segmentation unit of the first training text under the given condition, and when the prediction accuracy probability reaches a preset first threshold value, the pre-training language sub-model is obtained;
and the second pre-training subunit is suitable for acquiring a second training text which randomly covers a part with a preset proportion from the first training corpus and inputting the second training text into the neural network model, the neural network model predicts the covered part with the preset proportion under the condition of a given context, and when the probability of accurate prediction reaches a preset second threshold value, the pre-training language submodel is obtained.
7. The system for training punctuation marking models according to claim 6, wherein said pre-training unit further comprises:
and the third pre-training subunit is suitable for acquiring a third training text for randomly covering a sentence from the first training corpus, inputting the third training text into the neural network model, predicting the covered sentence by the neural network model under the given condition, and obtaining the pre-trained language sub-model when the accurate prediction probability reaches a preset third threshold value.
8. The system for training a punctuation mark model according to claim 1, wherein said punctuation mark model further comprises a pre-set conditional random field model;
the transfer learning training unit includes:
the extraction subunit is suitable for extracting a section of punctuation-free text of the punctuation-free text set obtained by the corpus processing unit and a corresponding first label sequence of the punctuation-free text in the label sequence set;
the transfer training subunit is suitable for inputting the punctuation mark marking model with the punctuation-free text and the first label sequence, acquiring sentence characteristic information corresponding to the punctuation-free text by the pre-trained language submodel, inputting the sentence characteristic information into the conditional random field submodel, and predicting a second label sequence corresponding to the punctuation-free text;
an error calculation subunit adapted to determine an error between the second tag sequence and the first tag sequence;
and the fine tuning subunit is suitable for performing fine tuning on the parameters of the punctuation mark model when the error is determined to be larger than a preset error threshold value until the error between the second label sequence and the first label sequence is not larger than the error threshold value, so as to obtain the trained punctuation mark model.
9. The system for training punctuation marking models according to claim 1, wherein said first corpus comprises at least one of:
text data containing punctuation marks captured in the internet;
text data obtained by performing speech recognition and labeling on speech data captured from the Internet;
manually entered text data containing punctuation marks;
the second corpus comprises at least one of:
text data captured in the first corpus;
text data containing punctuations in a preset field is captured in the Internet;
manually inputting text data containing punctuations in a preset field.
10. A punctuation mark model, comprising:
the input layer is suitable for extracting corresponding word segmentation unit characteristic information from the text to be added with punctuations;
the coding layer is suitable for capturing sentence characteristic information corresponding to the word segmentation unit characteristic information through a pre-trained language sub-model, wherein the pre-trained language sub-model is a neural network model with a time sequence;
the conditional random field network layer is suitable for predicting positions to be added with punctuation marks and corresponding punctuation mark types in the texts to be added with the punctuation marks based on the sentence characteristic information, and labeling by adopting label combinations corresponding to the punctuation mark types at word segmentation units before and after the positions to be added with the punctuation marks;
and the output layer is suitable for outputting the corresponding punctuation mark marking result.
CN201910925842.0A 2019-09-27 2019-09-27 Punctuation mark model and training system thereof Pending CN112580326A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910925842.0A CN112580326A (en) 2019-09-27 2019-09-27 Punctuation mark model and training system thereof


Publications (1)

Publication Number Publication Date
CN112580326A true CN112580326A (en) 2021-03-30

Family

ID=75110102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910925842.0A Pending CN112580326A (en) 2019-09-27 2019-09-27 Punctuation mark model and training system thereof

Country Status (1)

Country Link
CN (1) CN112580326A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449489A * 2021-07-22 2021-09-28 深圳追一科技有限公司 Punctuation mark marking method, punctuation mark marking device, computer equipment and storage medium
CN113449489B * 2021-07-22 2023-08-08 深圳追一科技有限公司 Punctuation mark labeling method, punctuation mark labeling device, computer equipment and storage medium
CN114065709A * 2021-11-26 2022-02-18 深圳视界信息技术有限公司 Punctuation mark adding method and device, electronic equipment and storage medium
CN114049885A * 2022-01-12 2022-02-15 阿里巴巴达摩院(杭州)科技有限公司 Punctuation mark recognition model construction method and punctuation mark recognition model construction device
CN114049885B * 2022-01-12 2022-04-22 阿里巴巴达摩院(杭州)科技有限公司 Punctuation mark recognition model construction method and punctuation mark recognition model construction device
CN115129877A * 2022-07-12 2022-09-30 北京有竹居网络技术有限公司 Method and device for generating punctuation mark prediction model and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination