CN112580303A - Punctuation adding system - Google Patents

Info

Publication number
CN112580303A
Authority
CN
China
Prior art keywords
punctuation
punctuation mark
mark
added
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910925835.0A
Other languages
Chinese (zh)
Inventor
沈大框
陈培华
陈成才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xiaoi Robot Technology Co Ltd
Original Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xiaoi Robot Technology Co Ltd
Priority to CN201910925835.0A
Publication of CN112580303A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

A punctuation adding system, comprising: an input unit adapted to acquire a text to which punctuation marks are to be added; a punctuation mark labeling unit adapted to use a punctuation mark labeling model that has completed transfer learning training to predict the positions at which punctuation marks are to be added in the text and the corresponding punctuation mark types, to label the word segmentation units before and after each such position with the label combination corresponding to the punctuation mark type, and to output the corresponding punctuation mark labeling result; and a punctuation mark adding unit adapted to add the corresponding punctuation marks to the text according to the labeling result. The punctuation mark labeling model that has completed transfer learning training includes a pre-trained language submodel, which is a neural network model with a time sequence. The system can improve the accuracy of punctuation mark prediction and meet the demand for punctuation mark addition.

Description

Punctuation adding system
Technical Field
The embodiment of the invention relates to the technical field of computer natural language processing, in particular to a punctuation adding system.
Background
Existing punctuation recovery schemes usually adopt sequence labeling and are mainly applied to restoring punctuation in text obtained by speech recognition. They can generally add only simple punctuation marks, such as commas and periods. On the one hand, punctuation added in this way has low accuracy and generalizes poorly; on the other hand, the range of punctuation marks is limited, which makes the text hard to read.
With the continuous development of deep learning, the punctuation marks of text obtained by speech recognition can be predicted with a trained neural network model, which improves accuracy. However, neural network models obtained by existing training methods have a low recall rate, weak generalization ability and poor universality, and cannot meet varied punctuation mark prediction requirements.
Disclosure of Invention
In view of this, an aspect of the embodiments of the present invention provides a system for adding punctuation marks, which improves the accuracy of punctuation mark prediction and meets the requirement of punctuation mark addition.
The embodiment of the invention provides a system for adding punctuation marks, which comprises:
an input unit adapted to acquire a text to which punctuation marks are to be added; a punctuation mark labeling unit adapted to use a punctuation mark labeling model that has completed transfer learning training to predict the positions at which punctuation marks are to be added in the text and the corresponding punctuation mark types, to label the word segmentation units before and after each such position with the label combination corresponding to the punctuation mark type, and to output the corresponding punctuation mark labeling result; and a punctuation mark adding unit adapted to add the corresponding punctuation marks to the text according to the labeling result. The punctuation mark labeling model that has completed transfer learning training includes a pre-trained language submodel, which is a neural network model with a time sequence.
Optionally, the punctuation mark labeling model is adapted to identify, according to a preset first punctuation mark set, the punctuation mark corresponding to the predicted punctuation mark type, and to label, according to a preset punctuation mark position label set, the word segmentation units before and after the position to be added with the label combination corresponding to that punctuation mark.
Optionally, the punctuation mark labeling model is adapted to label, according to the punctuation mark position label set, the word segmentation units before and after the position to be added with the label combination corresponding to the punctuation mark when there is a word segmentation unit both before and after that position; the punctuation mark adding unit comprises a first adding subunit adapted to add the punctuation mark between two adjacent word segmentation units when the preceding and the following unit in the text are respectively marked with the label combination of the same punctuation mark.
Optionally, the input unit includes a symbol adding subunit adapted to add a vacancy-filling symbol to the input punctuation-free text to obtain the text to which punctuation marks are to be added. The punctuation mark labeling model is adapted, when a vacancy-filling symbol lies before or after the position to be added, to label the vacancy-filling symbol and the word segmentation unit on the other side of that position with the label combination corresponding to the punctuation mark, according to the punctuation mark position label set. The punctuation mark adding unit further comprises a second adding subunit adapted to add the corresponding punctuation mark between a vacancy-filling symbol and its adjacent word segmentation unit when the two are respectively marked with the label combination of the same punctuation mark.
Optionally, the punctuation mark position label set comprises: a start position label, placed at the word segmentation unit before the position to be added of each punctuation mark, and an end position label, placed at the word segmentation unit after that position. The punctuation mark labeling model is adapted, when several consecutive positions to be added correspond to the same punctuation mark type and only one word segmentation unit lies between adjacent positions, to label only the start position label of the punctuation mark at the word segmentation unit before each position except the last. The punctuation mark adding unit further comprises a third adding subunit adapted to add the corresponding punctuation mark after each of several consecutive word segmentation units in the text that are marked only with the corresponding start position label.
Optionally, the symbol adding subunit is adapted to add an interval symbol after each word segmentation unit of the input punctuation-free text to obtain the text to which punctuation marks are to be added. The punctuation mark labeling model is adapted, when no word segmentation unit lies between two consecutive positions to be added, to label the start position label of the former punctuation mark at the word segmentation unit before its position, and to label the start and end position labels of the latter punctuation mark at the interval symbol before its position and at the word segmentation unit after its position, respectively. The punctuation mark adding unit further comprises a fourth adding subunit adapted to add the corresponding punctuation marks between the preceding and the following word segmentation unit when the preceding unit is marked with a start position label and the interval symbol before the following unit and the following unit itself are respectively marked with the start and end position labels of the same punctuation mark.
Optionally, the punctuation mark position label set further comprises a blank position label, which corresponds to no punctuation mark; the punctuation mark labeling model is adapted to label a word segmentation unit with the blank position label when there is no position to be added either before or after it.
Optionally, the punctuation mark adding unit further comprises a mark deleting subunit adapted to delete at least one of the vacancy-filling symbols and the interval symbols in the text to which punctuation marks are to be added.
Optionally, the punctuation adding system further comprises at least one of: a first language submodel generating unit adapted to acquire a random first training text from a training corpus and input it into the neural network model, the neural network model predicting the next word segmentation unit of the first training text given the preceding text, the pre-trained language submodel being obtained when the prediction accuracy reaches a preset first threshold; and a second language submodel generating unit adapted to acquire from the training corpus a second training text in which a preset proportion is randomly covered and input it into the neural network model, the neural network model predicting the covered portion given the context, the pre-trained language submodel being obtained when the prediction accuracy reaches a preset second threshold.
Optionally, the punctuation adding system further comprises: a third language submodel generating unit adapted to acquire from the training corpus a third training text in which one sentence is randomly covered and input it into the neural network model, the neural network model predicting the covered sentence from the given text, the pre-trained language submodel being obtained when the prediction accuracy reaches a preset third threshold.
With the punctuation mark adding scheme of the embodiments of the present invention, the acquired text to which punctuation marks are to be added is input into a punctuation mark labeling model that has completed transfer learning training. The model predicts the positions to be added and the corresponding punctuation mark types, labels the word segmentation units before and after each position with the label combination corresponding to the punctuation mark type, and outputs the labeling result; the corresponding punctuation marks are then added to the text according to that result.
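The three-stage flow described here (input, labeling, adding) can be illustrated with a minimal Python sketch. It is not the patented implementation: the label names `B-COM`/`E-COM`/`B-PER`/`E-PER`/`O`, the character-level splitting, and the `label_fn` stub that stands in for the trained labeling model are all assumptions made for illustration.

```python
# Hypothetical label scheme: "B-<type>" marks the unit before a punctuation
# position, "E-<type>" the unit after it, and "O" means no punctuation nearby.
MARKS = {"COM": "，", "PER": "。"}

def add_punctuation(text, label_fn):
    """Run the input -> labeling -> adding flow on a punctuation-free text.

    label_fn is a stand-in for the trained labeling model: it receives the
    list of units and must return one label per unit.
    """
    tokens = list(text)                # input unit: one unit per character (assumption)
    labels = label_fn(tokens)          # labeling unit
    if len(labels) != len(tokens):
        raise ValueError("label_fn must return one label per unit")
    out = []
    for i, tok in enumerate(tokens):   # adding unit
        out.append(tok)
        if labels[i].startswith("B-") and i + 1 < len(tokens):
            kind = labels[i][2:]
            # Insert the mark only when the pair is complete on both sides.
            if labels[i + 1] == "E-" + kind:
                out.append(MARKS[kind])
    return "".join(out)

# Stub model output for the 7 characters of the example sentence.
stub = lambda toks: ["O", "O", "O", "B-COM", "E-COM", "O", "O"]
print(add_punctuation("今天周三我休息", stub))
```

Requiring both halves of the pair before inserting a mark reflects the paired labeling the text describes; a real system would replace `stub` with the model's predictions.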
As can be seen from the above, the neural network model used in the punctuation mark adding scheme of the embodiments of the present invention includes a language submodel, which is a neural network model with a time sequence. The pre-trained language submodel can fully take into account the information of the word segmentation units before and after a punctuation mark. Because the punctuation mark labeling model includes the pre-trained language submodel, transfer learning training can reduce the amount of training data required and converge quickly; the resulting neural network model has good generalization ability and universality, predicts punctuation mark types more accurately, and improves the recall rate of punctuation mark prediction. Labeling with label combinations then makes the distribution of labels more uniform, avoiding both label imbalance and an excess of invalid labels, which improves the accuracy of punctuation mark prediction and meets the demand for punctuation mark addition.
Further, when there are word segmentation units both before and after the position to be added, those units are labeled with the label combination corresponding to the punctuation mark according to the punctuation mark position label set; when a vacancy-filling symbol lies before or after the position to be added, the vacancy-filling symbol and the word segmentation unit on the other side of the position are labeled with the label combination corresponding to the punctuation mark according to the punctuation mark position label set. This ensures that the punctuation mark position labels corresponding to a punctuation mark always appear in pairs.
Further, the punctuation mark position label set comprises a start position label, placed at the word segmentation unit before the position to be added of each punctuation mark, and an end position label, placed at the word segmentation unit after that position. When several consecutive positions to be added correspond to the same punctuation mark type and only one word segmentation unit lies between adjacent positions, only the corresponding start position label is marked at the word segmentation unit before each position except the last. When no word segmentation unit lies between two consecutive positions to be added, the start position label of the former punctuation mark is marked at the word segmentation unit before its position, and the start and end position labels of the latter punctuation mark are marked at the interval symbol before its position and at the word segmentation unit after its position, respectively. This optimizes the labeling scheme, so that punctuation marks that are separated by few word segmentation units and tend to appear in succession are labeled more accurately.
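The rule for consecutive punctuation marks of the same type can be sketched on the decoding side. The following Python fragment is one possible reading, not the patent's own code: it treats a run of identical start labels that is closed by the matching end label as "add the mark after every unit in the run". The label names (`B-PAU`, `E-PAU`, `O`) are hypothetical, and the interval-symbol case is not covered.

```python
# Hypothetical label names; "PAU" stands for the pause mark "、".
MARKS = {"PAU": "、", "COM": "，"}

def add_marks(tokens, labels):
    """Insert marks for B/E pairs and for runs of identical start labels."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if labels[i].startswith("B-") and i + 1 < len(tokens):
            kind = labels[i][2:]
            # The next unit either closes the pair (E-<kind>) or continues a
            # run of the same type (B-<kind>); both imply a mark after tok.
            if labels[i + 1] in ("B-" + kind, "E-" + kind):
                out.append(MARKS[kind])
    return out

tokens = ["苹果", "香蕉", "梨", "很", "甜"]
labels = ["B-PAU", "B-PAU", "E-PAU", "O", "O"]
print("".join(add_marks(tokens, labels)))
```

Here the middle unit carries only a start label even though it also follows a pause mark, which is the situation the paragraph above describes.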
Further, the punctuation mark position label set further comprises a blank position label: when there is no position to be added either before or after a word segmentation unit, the blank position label can be marked at that unit. The blank position label ensures that every word segmentation unit in the text to which punctuation marks are to be added has a corresponding punctuation mark position label, which further balances the label distribution and facilitates the subsequent addition of punctuation marks.
Furthermore, training the neural network model on the selected training corpus to predict the next word segmentation unit given the preceding text, or to predict the covered preset proportion given the context, strengthens the pre-trained language submodel's representation of the features of preceding and following sentences in the text corpus. Training the neural network model to predict a covered sentence from the given text further strengthens the pre-trained language submodel's ability to understand the relationship between two texts in the text corpus.
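How training examples for the first two pre-training objectives might be prepared can be sketched as follows. This is a minimal illustration under stated assumptions: the `[MASK]` placeholder, the 0.15 ratio used in the example, and both function names are hypothetical, and the model and thresholds themselves are not shown.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a covered unit

def next_unit_examples(tokens):
    """Pairs of (preceding units, next unit) for next-unit prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mask_units(tokens, ratio, rng):
    """Cover a preset proportion of units; return the masked text and targets."""
    if not tokens:
        return [], {}
    k = max(1, round(len(tokens) * ratio))       # at least one covered unit
    positions = sorted(rng.sample(range(len(tokens)), k))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}  # what the model must recover
    return masked, targets

tokens = list("今天是星期三")
print(next_unit_examples(tokens)[0])
print(mask_units(tokens, 0.15, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the covered positions reproducible, which is convenient when comparing training runs.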
Drawings
FIG. 1 is a flowchart of a method for training a punctuation mark labeling model according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of one example of obtaining a third training corpus according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of another example of obtaining a third training corpus according to an embodiment of the present invention.
FIG. 4 is a flowchart of the transfer learning training of a punctuation mark labeling model according to an embodiment of the present invention.
FIG. 5 is a schematic structural diagram of a punctuation mark labeling model according to an embodiment of the present invention.
FIG. 6 is a schematic structural diagram of a punctuation adding system according to an embodiment of the present invention.
Detailed Description
As described above, although using a neural network greatly improves the accuracy of punctuation mark addition, a large amount of training data must be prepared in advance to train the neural network model. The training data is usually punctuation-free corpus generated by Automatic Speech Recognition (ASR), which can be used for training only after time-consuming and labor-intensive manual labeling. When the trained neural network model is then used to predict punctuation marks for punctuation-free text obtained by speech recognition, the recall rate of the predictions is often low. In addition, the training data of current neural network models considers only the information of the word preceding a punctuation mark, so the label distribution of the training data is unbalanced. A model trained on such data generalizes poorly and lacks universality, and because only the preceding word is considered during training, its predictions often contain superfluous or missing punctuation marks, which degrades the user experience.
In view of the above problems, an embodiment of the present specification provides a punctuation adding system in which the acquired text to which punctuation marks are to be added is input into a punctuation mark labeling model that has completed transfer learning training. The model predicts the positions to be added and the corresponding punctuation mark types, labels the word segmentation units before and after each position with the label combination corresponding to the punctuation mark type, and outputs the labeling result, so that the corresponding punctuation marks can be added to the text according to that result. The punctuation mark labeling model includes a pre-trained language submodel, which is a neural network model with a time sequence.
In order to make the embodiments of the present disclosure more clearly understood and implemented by those skilled in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without any creative effort shall fall within the protection scope of the present specification.
Before punctuation marks are added to a text, the punctuation mark labeling model can be trained by transfer learning. Referring to the flowchart of the training method shown in FIG. 1, in an embodiment of the present invention the punctuation mark labeling model may be trained using the following steps:
S101, obtaining a first training corpus containing punctuation marks.
In a specific implementation, the broader the sources of the first training corpus and the more standard its punctuation usage, the more it benefits the subsequent pre-training of the neural network model and the more it improves the universality and accuracy of the pre-trained language submodel. Therefore, to broaden the sources of the first training corpus, corpora should be collected from as many different fields as possible; and to make its punctuation usage more standard, corrected corpora can be used.
For example, the first training corpus may include corpora biased toward written language, such as news corpora, Baidu Baike (Baidu Encyclopedia) corpora and academic paper corpora, as well as corpora biased toward spoken language, such as chat corpora, microblog corpora and comment corpora.
S102, inputting the first training corpus into a preset neural network model with a time sequence for pre-training, and obtaining a pre-trained language sub-model.
In a specific implementation, the neural network model with a time sequence may include a Transformer network model with a self-attention mechanism, a bidirectional Long Short-Term Memory (BiLSTM) network model, a Gated Recurrent Unit (GRU) network model, and the like. Pre-training the preset neural network model with the first training corpus yields a pre-trained language submodel that can deeply capture the contextual information of the text, in preparation for the subsequent transfer learning training of the punctuation mark labeling model.
S103, obtaining a second training corpus containing punctuation marks, removing the punctuation marks from the second training corpus, and labeling corresponding label combinations at the positions of word segmentation units before and after the removed punctuation marks to obtain a third training corpus.
Wherein the third corpus may include: a set of punctuation-free text and a set of sequences of labels, which may include the combination of labels.
In a specific implementation, the second training corpus containing punctuation marks may be obtained in the same way as the first training corpus: corpora, or corrected corpora, may be collected from many different fields, which improves the universality and accuracy of the punctuation mark labeling model. To save corpus acquisition time, the second training corpus may also be drawn directly from the first training corpus. For example, the second training corpus may include corpora biased toward written language, such as news corpora, Baidu Baike (Baidu Encyclopedia) corpora and academic paper corpora, as well as corpora biased toward spoken language, such as chat corpora, microblog corpora and comment corpora. The sources of the second training corpus can therefore be relatively broad.
In a specific implementation, in order to accurately predict punctuation marks of punctuation-free texts in one or more specific fields, the corpus may also be obtained for the specific field or fields, and the second corpus is obtained. For example, to be able to predict punctuation of punctuation-free text in the financial domain, the second corpus may include corpus that contains punctuation from the financial domain.
In the embodiment of the present invention, the combination of removing punctuation marks and labeling labels is performed on the second corpus, and the obtained punctuation-free text set and the label sequence set in the third corpus may be associated, so that when a part or all of the punctuation-free text set is obtained, a corresponding part or all of the label sequences can be obtained.
In the embodiment of the invention, the label sequence set can contain all label combinations for labeling the front and rear word segmentation units of the removed punctuation marks, and the information of the front and rear word segmentation units of the punctuation marks of the second training corpus can be fully considered by labeling the label combinations, so that the label distribution of the third training corpus is more balanced, the problem of label distribution imbalance caused by labeling only the labels at the word segmentation units in front of the removed punctuation marks can be avoided, the condition of excessive labeled invalid labels can be avoided, and the accuracy of punctuation mark prediction of a punctuation mark labeling model can be improved.
S104, inputting the third training corpus into an initial punctuation mark model for transfer learning training to obtain a trained punctuation mark model;
wherein the punctuation mark model may comprise: the pre-trained language submodel.
It can be understood that the language types corresponding to the first corpus, the second corpus and the third corpus may include, but are not limited to, chinese, and may be adjusted according to the language type of the text to be labeled by the actual punctuation mark labeling model. In specific implementation, the corpus of the corresponding language type may be selected according to the language type of the text to be labeled or processed subsequently.
It can also be understood that the timing for obtaining the first corpus and the second corpus is not limited to the above sequence of steps, the first corpus may be obtained before the pre-training of the preset neural network model with time sequence, and the second corpus may be obtained before the third corpus is input into the initial punctuation mark model for transfer learning training. For example, the first corpus and the second corpus may be obtained simultaneously.
In specific implementation, the punctuation mark model can comprise a pre-trained language sub-model, so that the number of the third training corpora required by the punctuation mark model is greatly reduced, the number of the second training corpora can be reduced, labor cost and marking time are saved, and training time of the punctuation mark model is reduced. Meanwhile, the whole punctuation mark labeling model can be quickly converged on the premise of reducing the number of the third training corpus by the transfer learning training, and the recall rate is improved.
In a specific implementation, to obtain the third training corpus automatically, a punctuation mark set and a punctuation mark position label set may be preset. After the second training corpus containing punctuation marks is obtained, all word segmentation units and punctuation marks in it may be converted into tokens in byte format, which avoids errors caused by word segmentation and also avoids problems such as high learning cost and sparse word vector representations caused by an overly large dictionary.
According to the preset punctuation mark set, the punctuation marks in the second training corpus can be recognized and then removed to obtain the corresponding punctuation-free text set. According to the preset punctuation mark position label set, label combinations with the corresponding positional relationship are marked at the word segmentation units before and after each removed punctuation mark to obtain the corresponding label sequence set. The punctuation-free text set and the corresponding label sequence set of the third training corpus are thus obtained and used for the subsequent training of the punctuation mark labeling model.
The punctuation mark set may include the commonly used punctuation marks of the language corresponding to the first, second and third training corpora, and the punctuation mark position label set may include the label combinations corresponding to the punctuation mark set. For example, if the language of the first, second and third training corpora is Chinese, the punctuation mark set may include common punctuation marks such as the comma, the period and the pause mark, and the punctuation mark position label set may include a label combination corresponding to the comma, a label combination corresponding to the period, a label combination corresponding to the pause mark, and so on. The labels of a combination appear in pairs and are marked at the word segmentation units before and after a removed punctuation mark, which makes the label distribution of the third training corpus more balanced. The punctuation mark set and the punctuation mark position label set may be set according to the specific application scenario, which is not limited by the embodiments of the present specification.
In a specific implementation, before the language submodel is pre-trained and the punctuation mark labeling model is trained, the acquired first and second training corpora may be cleaned, mainly to remove punctuation marks that are clearly used irregularly. Examples include a punctuation mark at the beginning of a sentence that cannot begin a sentence, punctuation marks appearing consecutively within a sentence that should not appear consecutively, and a punctuation mark at the end of a sentence that cannot end a sentence. Such irregularly used punctuation marks can be deleted, which prevents the language submodel and the punctuation mark labeling model from learning irregular punctuation usage.
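The three cleaning rules can be sketched with a small Python function. The sets of marks that "cannot begin" or "cannot end" a sentence are assumptions chosen for illustration, and collapsing every run of punctuation to its first mark is a simplification that would also remove legitimate sequences such as "？！".

```python
import re

# Hypothetical rule sets; a real cleaner would use a fuller inventory.
PUNCT = "，。、；：？！"
NO_START = "，。、；：？！"   # marks assumed unable to begin a sentence
NO_END = "，、；："          # marks assumed unable to end a sentence

def clean_punctuation(sentence):
    """Delete irregular punctuation at the start, in runs, and at the end."""
    s = sentence.lstrip(NO_START)
    # Collapse consecutive punctuation marks to the first one in the run.
    s = re.sub(f"([{PUNCT}])[{PUNCT}]+", r"\1", s)
    return s.rstrip(NO_END)

print(clean_punctuation("，今天，，是周三、"))
```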
In a specific implementation, when word segmentation units exist both before and after a removed punctuation mark, a label combination with the corresponding positional relationship is marked at the word segmentation units before and after the removed punctuation mark according to the punctuation mark position label set, so that the punctuation mark position labels corresponding to the punctuation mark appear in pairs. For example, let the second training corpus be the set $S: \{t_1 t_2 t_3 \dots t_n\}$, where $n$ is a fixed positive integer giving the number of tokens contained in $S$. If the text contains $m$ punctuation marks, with $m < n$, then the punctuation-free text set of the third training corpus is $R: \{t_1 t_2 \dots t_{n-m}\}$, and marking a label combination at the tokens $t_i$ before and after each removed punctuation mark yields the corresponding label sequence set $L: \{L_1 L_2 L_3 \dots L_{n-m}\}$.
However, when there is no word segmentation unit before or after the removed punctuation mark, the label combination cannot be marked in pairs. In that case, a vacancy completion symbol can be added before or after the removed punctuation mark to fill the position of the missing word segmentation unit, and the label combination with the corresponding positional relationship is then marked, according to the punctuation mark position label set, at the vacancy completion symbol and at the word segmentation unit on the other side of the punctuation mark. The vacancy completion symbol is thus treated as a word segmentation unit, so the label combination can still be marked in pairs.
In embodiments of the present invention, the vacancy completion symbol may be used in several cases:
First, the punctuation mark is located at the end of a sentence. For example, the second training corpus is {today is Wednesday.}. When the punctuation mark is removed, a vacancy completion symbol EOS for the sentence end is added after the removed period, giving the punctuation-free text set {today is Wednesday EOS}. The label combination of the period is then marked at the character "three" (the last character of the Chinese word for Wednesday) before the removed period and at the vacancy completion symbol "EOS" after it.
It should be understood that the braces "{ }" only delimit the content of the examples and are not an essential part of the corpus content; those skilled in the art may use other symbols that are not easily confused to delimit the corpus content. The word segmentation units and punctuation marks inside "{ }" may be tokens converted into byte format or tokens converted into other formats. The same applies to "{ }" elsewhere in this description.
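Second, the punctuation mark is located at the beginning of a sentence. For example, the second training corpus is {《Reader》 is a good magazine.}. When the punctuation marks are removed, a vacancy completion symbol BOS for the sentence beginning is added before the removed left title mark, giving the punctuation-free text set {BOS Reader is a good magazine EOS}. The label combination of the left title mark is then marked at the vacancy completion symbol "BOS" before the removed left title mark and at the character "read" after it.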
It is understood that the foregoing embodiments are merely illustrative, and not restrictive on the specific implementation, and in the specific implementation process, the punctuation-free text set and the tag sequence set obtained by the second corpus are often massive.
With this scheme, label combinations with the corresponding positional relationship are marked at the word segmentation units before and after each removed punctuation mark according to the punctuation mark position label set, so that the punctuation mark position labels corresponding to a punctuation mark appear in pairs. When there is no word segmentation unit before or after the removed punctuation mark, a corresponding vacancy completion symbol is added to fill the position of the missing word segmentation unit, which avoids the problem of a punctuation mark having no word segmentation unit at which to be labeled and ensures that the punctuation mark position labels corresponding to a punctuation mark are always paired.
In a specific implementation, the punctuation mark position label set may include: a start position label, marked at the word segmentation unit before each punctuation mark, and an end position label, marked at the word segmentation unit after each punctuation mark. The punctuation mark position label set corresponds to the punctuation mark set.
In an embodiment of the present invention, a punctuation mark set P and a punctuation mark position label set Q are preset, where P contains the punctuation marks to be recognized (for example the period, question mark, exclamation mark, comma, semicolon, and colon) and Q = {B-period, I-period, B-question mark, I-question mark, B-exclamation, I-exclamation, B-comma, I-comma, B-semicolon, I-semicolon, B-opening title mark, I-opening title mark, B-closing title mark, I-closing title mark, B-upper quotation mark, I-upper quotation mark, B-lower quotation mark, I-lower quotation mark}. The punctuation mark set P and the punctuation mark position label set Q are associated with each other, and the positional relationship between each punctuation mark in P and the word segmentation units before and after it is represented by the labels in Q. When the punctuation marks in the second training corpus are identified using the punctuation mark set P, the punctuation mark position labels corresponding to each identified punctuation mark can be obtained through the association between P and Q, and the corresponding label combination can then be marked at the word segmentation units before and after the identified punctuation mark once it has been removed.
It should be understood that in this embodiment the punctuation mark position labels are written as "B-" or "I-" followed by the full English name of the punctuation mark, but other representations may be adopted. For example, since the full English name of a punctuation mark can be long, it may be replaced by an abbreviation, or directly by the punctuation mark itself; the embodiments of the present invention do not limit this.
The correspondence among the punctuation marks, the punctuation mark set, and the punctuation mark position label set is shown in Table 1.
Continuing the above example, in the punctuation-free text set {today is Wednesday EOS}, the start position label "B-per" of the period may be marked at the character "three" before the period, and the end position label "I-per" of the period at the vacancy completion symbol "EOS" after the period. Similarly, in the punctuation-free text set {BOS Reader is a good magazine EOS}, the start position label "B-ope" of the left title mark may be marked at the vacancy completion symbol "BOS" before the left title mark, and the end position label "I-ope" of the left title mark at the character "read" after it.
TABLE 1
[Table image BDA0002218895970000111: punctuation marks, punctuation mark set, and punctuation mark position label set]
In a specific implementation, the set of punctuation mark position tags may further include: and when no punctuation mark exists before and after one word segmentation unit, marking the blank position label at the word segmentation unit. Through the blank position labels, each word segmentation unit in the text corpus can have a corresponding punctuation mark position label, the distribution balance of the labels can be further improved, and fine adjustment can be conveniently carried out during later punctuation mark model training.
The inventor finds in research that due to the rules of language usage, several special cases may arise:
First, several punctuation marks removed in succession are the same punctuation mark, and there is only one word segmentation unit between them. For example, under the above labeling method for Chinese, characters are present before and after each caesura sign, but when only one character lies between two caesura signs, the label combinations of the caesura signs overlap. For example, the second training corpus is {cattle、sheep、horse all eat grass.}. Since only one character lies between the consecutive caesura signs, the start position label "B-cae" of the caesura sign is marked at the characters "cattle" and "sheep", and the end position label "I-cae" of the caesura sign is marked at the character "horse".
In an embodiment of the present invention, fig. 2 is a schematic diagram of obtaining a third training corpus, where the obtained second training corpus 21 is:
{ day, month, day three parties for a big battle, ready to begin! Is it impossible for a member of the group to win? }
Then, according to the preset punctuation mark set P, the punctuation marks in the second training corpus are identified and removed. Because no character follows the question mark at the end of the sentence, a vacancy completion symbol EOS for the sentence end is added after the question mark, giving the punctuation-free text set 221, that is, the text of the second training corpus 21 with all punctuation marks removed and EOS appended. Label combinations with the corresponding positional relationship are then marked at the characters before and after each removed punctuation mark according to the preset punctuation mark position label set Q. Since consecutive caesura signs appear among the first three characters ("day", "month", "day" in the example), with only one character between them, "B-cae" is marked at the first two of these characters and "I-cae" at the third; "B-com" and "I-com" are marked at the characters before and after a comma; "B-exc" and "I-exc" at the characters before and after the exclamation mark; and "B-que" and "I-que" at the character before the question mark and at the symbol "EOS" after it. For characters with no punctuation mark before or after them, the blank position label O is marked, finally giving the corresponding label sequence set 222 {B-cae B-cae I-cae O O O B-com I-com O B-exc I-exc O O O O O B-com I-com B-que I-que}. The punctuation-free text set 221 is associated with the label sequence set 222 to obtain the third training corpus 22.
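The paired labeling described so far (start and end labels, the vacancy completion symbols BOS and EOS, the blank label O, and repeated caesura signs separated by a single character) can be sketched as below. This is only an illustration under assumptions: the mapping `MARK_TO_TAG` and the function name `strip_and_label` are hypothetical, only the label abbreviations come from the text, and letting a start label overwrite an end label on the same character is an inference from the "B-cae B-cae I-cae" example. Two directly adjacent marks are not handled here; that case is what the interval symbol SEP addresses.

```python
# Hypothetical mapping from a punctuation mark to the label abbreviation used in the text.
MARK_TO_TAG = {"。": "per", "？": "que", "！": "exc", "，": "com", "、": "cae"}


def strip_and_label(sentence):
    """Remove punctuation from a string of characters and return (tokens, labels)."""
    tokens, labels = [], []
    owed = None  # abbreviation of a removed mark whose end label is not yet placed
    for ch in sentence:
        if ch in MARK_TO_TAG:
            if not tokens:
                # Mark at the sentence start: fill the vacancy with BOS.
                tokens.append("BOS")
                labels.append("O")
            # The start label goes on the unit before the mark; it overwrites
            # an end label left there by the previous mark.
            labels[-1] = "B-" + MARK_TO_TAG[ch]
            owed = MARK_TO_TAG[ch]
        else:
            tokens.append(ch)
            labels.append("I-" + owed if owed else "O")
            owed = None
    if owed:
        # Mark at the sentence end: fill the vacancy with EOS.
        tokens.append("EOS")
        labels.append("I-" + owed)
    return tokens, labels
```

For the "cattle, sheep, horse" example, the middle character first receives "I-cae" and is then overwritten with "B-cae" when the second caesura sign is seen, which reproduces the label pattern given in the text.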
Second, two punctuation marks removed in succession have no word segmentation unit between them. Under the above label-combination method, the labels of the two punctuation marks would again overlap. In the course of research, the inventor observed that this situation arises from the rules of punctuation usage and follows a general pattern. To avoid the problem, an interval symbol SEP (separator) can therefore be added after each word segmentation unit: the start position label of the former punctuation mark is marked at the word segmentation unit before the removed former punctuation mark, and the start position label and end position label of the latter punctuation mark are marked at the interval symbol before the removed latter punctuation mark and at the word segmentation unit after it, respectively. The interval symbol is treated as a word segmentation unit, so the label combination can be marked in pairs; this labeling method works for two consecutive punctuation marks.
For example, the second training corpus is {I like to read 《Reader》。}, in which the two consecutive punctuation marks "》。" appear. An interval symbol SEP is added after each of the six characters and, because the sentence ends with a period, a vacancy completion symbol EOS is also added to fill the missing position, giving a punctuation-free text set of the form {character 1 SEP1 character 2 SEP2 ... character 6 SEP6 EOS}. The start position label "B-clo" of the right title mark is then marked at the last character (glossed "man"), the start position label "B-per" of the period at "SEP6", and the end position label "I-per" of the period at "EOS". For word segmentation units with no punctuation mark before or after them, the blank position label O is marked; for interval symbols with no punctuation mark before or after them, the interval label S is marked. It is to be understood that the numbers after "SEP" are used only to distinguish the symbols in this description and may be omitted in a specific implementation.
In an embodiment of the present invention, fig. 3 is a schematic diagram of obtaining a third training corpus, where the obtained second training corpus 31 is:
{The leader said: "Achieving profitability is the primary goal".}
Then, according to the preset punctuation mark set P, the punctuation marks in the second training corpus are identified and removed. The colon followed by the upper quotation mark, and the lower quotation mark followed by the period, are both cases of two punctuation marks appearing consecutively, so an interval symbol SEP is added after each character; and because no character follows the period at the end of the sentence, a vacancy completion symbol EOS for the sentence end is added after it. In addition, to facilitate the subsequent training of the punctuation mark model, a vacancy completion symbol BOS can be added at the beginning of the sentence so that the symbols in the punctuation-free text set are complete. This gives the punctuation-free text set 321, in which BOS is followed by the thirteen characters of the sentence, each followed by its interval symbol (SEP1 to SEP13), and finally by EOS. The start position label "B-col" of the colon is marked at the character "say", the start position label "B-upp" of the upper quotation mark at the interval symbol "SEP3", and the end position label "I-upp" of the upper quotation mark at the character "real". The lower quotation mark and the period are handled in the same way, and the interval label S is marked at interval symbols with no punctuation mark before or after them, finally giving the corresponding label sequence set 322 {O S B-col B-upp I-upp S O S O S B-low B-per I-per}. The punctuation-free text set 321 is associated with the label sequence set 322 to obtain the third training corpus 32.
With this scheme, the labeling method is optimized so that punctuation marks that are separated by only a few word segmentation units, or that tend to appear consecutively, can be predicted more accurately during model training. It is understood that the foregoing embodiments are merely illustrative and do not restrict the specific implementation; in a specific implementation, the punctuation-free text set and the label sequence set obtained from the second training corpus are often massive.
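The interval-symbol variant can be sketched in the same spirit. Again this is an illustration under assumptions rather than the embodiment itself: `MARK_TO_TAG` and `label_with_sep` are hypothetical names, BOS and EOS are always added, no SEP is placed after BOS (as in the example above), and the handling of a single mark between two characters (start label on the preceding character, end label on the following one, S on the SEP in between) is an inference from the examples.

```python
# Hypothetical mark-to-abbreviation mapping; only the abbreviations come from the text.
MARK_TO_TAG = {"。": "per", "，": "com", "：": "col",
               "《": "ope", "》": "clo", "“": "upp", "”": "low"}


def label_with_sep(sentence):
    """Remove punctuation, insert SEP after every character, and label the result."""
    # Group the sentence into units, each paired with the marks removed right after it.
    units = [["BOS", []]]              # BOS fills the vacancy at the sentence start
    for ch in sentence:
        if ch in MARK_TO_TAG:
            units[-1][1].append(MARK_TO_TAG[ch])
        else:
            units.append([ch, []])
    units.append(["EOS", []])          # EOS fills the vacancy at the sentence end

    tokens, labels = [], []
    owed = None                        # end label owed to the next unit, if any
    for unit, marks in units:
        if len(marks) > 2 or (unit == "BOS" and len(marks) > 1):
            raise ValueError("case not covered by this sketch")
        tokens.append(unit)
        if marks:                      # start label of the first removed mark
            labels.append("B-" + marks[0])
        elif owed:                     # end label of the mark removed before this unit
            labels.append("I-" + owed)
        else:
            labels.append("O")
        owed = marks[-1] if marks else None
        if unit not in ("BOS", "EOS"):
            tokens.append("SEP")
            # With two consecutive marks, the SEP carries the latter mark's start label.
            labels.append("B-" + marks[1] if len(marks) == 2 else "S")
    return tokens, labels
```

Applied to the "I like to read 《Reader》。" example, the last character receives "B-clo", the SEP after it receives "B-per", and EOS receives "I-per", matching the description above.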
To help those skilled in the art better understand and implement the embodiments of the present invention, a specific example of obtaining a trained punctuation mark model through transfer learning is given below.
In an embodiment of the present invention, the punctuation mark model may include a pre-trained language submodel and a preset conditional random field submodel, with the pre-trained language submodel connected to the preset conditional random field submodel. Referring to the flowchart of the punctuation mark model shown in fig. 4, the transfer learning training may be performed by the following steps:
S1041, obtaining a segment of punctuation-free text from the punctuation-free text set.
In a specific implementation, the whole third training corpus may be divided into multiple batches for iterative training, with one batch processed at a time. Each batch contains a segment of punctuation-free text, i.e. a list of sentences, whose size is determined by actual conditions. Depending on how the punctuation marks were removed from the second training corpus and the label combinations marked, the obtained punctuation-free text may include at least one of the vacancy completion symbol EOS for the sentence end, the vacancy completion symbol BOS for the sentence beginning, and the interval symbol SEP.
S1042, a first label sequence corresponding to the punctuation-free text is obtained from the label sequence set.
In a specific implementation, since the set of punctuation-free texts is associated with the set of tag sequences, a corresponding tag sequence can be obtained as a first tag sequence according to a segment of punctuation-free texts contained in each batch.
S1043, inputting the punctuation-free text and the first label sequence into the punctuation mark model, obtaining the sentence characteristic information corresponding to the punctuation-free text through the pre-trained language submodel, inputting the sentence characteristic information into the conditional random field submodel, and predicting a second label sequence corresponding to the punctuation-free text.
In a specific implementation, once a segment of punctuation-free text has been obtained, each word segmentation unit in it is first converted into a token in byte format, denoted $\{x_1, x_2, \dots, x_n\}$, where $x_i$ is an index value in the dictionary.
Word vector embedding (word embedding) is then performed on each word segmentation unit according to its index value in the dictionary, with one word segmentation unit corresponding to one word vector, denoted $\{e_1, e_2, \dots, e_n\}$, where $e_i$ is a dense vector of dimension $k$. The dictionary is adjusted according to the language of the word segmentation units. The word vector matrix is obtained either by random initialization or from pre-trained word vectors and exists in the whole neural network as a learnable parameter; the dimension of the word vector $e_i$ is set according to the specific situation.
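The index lookup and word vector embedding just described can be sketched in a few lines. The names `build_vocab`, `init_embeddings`, and `embed`, the initialization range, and the seed are all assumptions made for illustration; the text only states that the word vector matrix is randomly initialized or taken from pre-trained vectors.

```python
import random


def build_vocab(tokens):
    """Assign each distinct token an index value in a dictionary."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def init_embeddings(vocab_size, k, seed=0):
    """Randomly initialise a vocab_size x k word vector matrix (a learnable parameter)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(vocab_size)]


def embed(tokens, vocab, table):
    """Map tokens to index values x_i, then to k-dimensional word vectors e_i."""
    return [table[vocab[tok]] for tok in tokens]
```

In a real model the matrix would be updated during training; here it is only looked up.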
Then the vectorized punctuation-free text is passed through the pre-trained language submodel, which converts the static word vector representation $\{e_1, e_2, \dots, e_n\}$ into a dynamic sentence representation, denoted $[l]_1^n = \{l_1, l_2, \dots, l_n\}$, where $l_i$ is a dense vector of dimension $l$ whose size is set according to the specific situation; this deepens and strengthens the representation of the sentence.
The sentence representation $[l]_1^n = \{l_1, l_2, \dots, l_n\}$ is the input of the conditional random field (CRF) submodel, which is preset with a state transition matrix $[A]_{i,j}$ and a score matrix $[f_\theta]_{[y]_t,t}$. $[A]_{i,j}$ represents the transition probability from the $i$-th state to the $j$-th state between two time steps, and the score matrix entry $[f_\theta]_{[y]_t,t}$ represents the score of outputting $[y]_t$ at the $t$-th position after $[l]_1^n$ is input, where $\theta$ contains all the parameters of the entire pre-trained language submodel that require transfer learning. A feature function preset in the conditional random field submodel can then be used:

$$s\left([l]_1^n, [y]_1^n\right) = \sum_{t=1}^{n}\left([A]_{[y]_{t-1},[y]_t} + [f_\theta]_{[y]_t,t}\right),$$

which is the score of predicting the output label sequence $[y]_1^n$ for the input $[l]_1^n$. This feature function avoids label sequences that do not conform to the labeling rules. For convenience in the subsequent formulas, $s([l]_1^n, [y]_1^n)$ is abbreviated as $s(l, y)$. The second label sequence $\hat{y}$ is the label sequence for which the score $s(l, y)$ is maximal.
In an embodiment of the present invention, the Viterbi algorithm may be used to find the path with the largest score $s(l, y)$, that is, the second label sequence $\hat{y} = \arg\max_{y} s(l, y)$.
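As an illustration of the Viterbi step, the sketch below finds the highest-scoring label path given per-position scores and a transition matrix. It is a generic textbook version written under assumptions, not code from the embodiment: `emissions[t][j]` plays the role of the score matrix, `transitions[i][j]` the role of the state transition matrix, labels are integers, and no transition is charged for the first position.

```python
def viterbi(emissions, transitions):
    """Return (best_score, best_path) maximising the summed path score.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(transitions)
    score = list(emissions[0])          # best score of any path ending in each label
    backptr = []                        # backptr[t-1][j]: best previous label for j at t
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    # Trace the best path backwards from the best final label.
    last = max(range(n_labels), key=lambda j: score[j])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path
```

The cost is proportional to the sequence length times the square of the number of labels, instead of growing exponentially as it would if every label sequence were scored separately.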
S1044, comparing the second label sequence with the first label sequence and determining the error between them.
In a specific implementation, let the first label sequence be $\bar{y}$. Comparing the second label sequence $\hat{y}$ with the first label sequence $\bar{y}$ gives the error between the two.
S1045, determining whether the error is larger than a preset error threshold, and finely adjusting the parameters of the punctuation mark model when the error is larger than the error threshold; and when the error is determined to be not greater than the error threshold value, obtaining a trained punctuation mark model.
In a specific implementation, after iterative training over the third training corpus, the errors between all second label sequences $\hat{y}$ and their corresponding first label sequences $\bar{y}$ are obtained. When the error between at least one pair $\hat{y}$ and $\bar{y}$ is greater than the preset error threshold, the parameters of the punctuation mark model can be fine-tuned and the transfer learning training of the punctuation mark model performed again, until the error between every $\hat{y}$ and $\bar{y}$ is no greater than the preset error threshold, which gives the punctuation mark model trained by transfer learning.
In one embodiment of the present invention, first, for the input $l$, the scores $s(l, y)$ of all possible label sequences $y$ and the score $s(l, \bar{y})$ of the first label sequence $\bar{y}$ are determined. Second, the score of the first label sequence is normalized to obtain the probability of the first label sequence:

$$P(\bar{y} \mid l) = \frac{e^{s(l, \bar{y})}}{\sum_{y} e^{s(l, y)}}.$$

Then the logarithm of the probability of the first label sequence is taken to obtain the loss function:

$$\mathrm{loss} = -\log P(\bar{y} \mid l) = \log \sum_{y} e^{s(l, y)} - s(l, \bar{y}).$$
Since a smaller loss value indicates higher accuracy, the parameters of the punctuation mark model that minimize the loss function need to be found. The parameters of the punctuation mark model can be fine-tuned by gradient descent or back propagation; the parameters adjusted include those of the pre-trained language submodel and those of the conditional random field submodel, thereby reducing the value of the loss function. Using gradient descent or back propagation allows the punctuation mark model to converge quickly, shortening the training time and improving the accuracy of the punctuation mark model.
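A loss of this kind is commonly computed as the negative log-likelihood of a linear-chain CRF, with the sum over all label sequences evaluated by the forward algorithm in log space. The sketch below shows that standard computation under assumptions: it uses exponential normalization, charges no transition for the first position, and the names `sequence_score`, `log_partition`, and `crf_loss` are hypothetical. It does not perform the gradient update itself.

```python
import math


def sequence_score(emissions, transitions, labels):
    """Score s(l, y) of one label sequence: emission plus transition scores."""
    s = emissions[0][labels[0]]
    for t in range(1, len(labels)):
        s += transitions[labels[t - 1]][labels[t]] + emissions[t][labels[t]]
    return s


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(emissions, transitions):
    """log of the summed exp-scores of all label sequences (forward algorithm)."""
    n_labels = len(transitions)
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            log_sum_exp([alpha[i] + transitions[i][j] for i in range(n_labels)])
            + emissions[t][j]
            for j in range(n_labels)
        ]
    return log_sum_exp(alpha)


def crf_loss(emissions, transitions, gold):
    """Negative log-probability of the gold (first) label sequence."""
    return log_partition(emissions, transitions) - sequence_score(emissions, transitions, gold)
```

Minimizing `crf_loss` over the training data raises the score of the annotated label sequence relative to all competing sequences.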
In a more preferred embodiment, the punctuation mark model further includes a preset bidirectional long short-term memory (BiLSTM) submodel located between the pre-trained language submodel and the conditional random field submodel.
After the pre-trained language submodel generates the sentence characteristic information, that information is input into the BiLSTM submodel to further strengthen its representational power. When sentence characteristic information produced by two layers of time-series neural networks is input into the conditional random field submodel, the final prediction results vary within a smaller range and a better prediction effect is achieved; in addition, the accuracy and generalization ability of the model can increase when the data volume is large.
In the implementation, the pre-training of the preset neural network model with time series can be performed in various ways, which are exemplified by the following three methods:
Method one: a random first training text is obtained from the first training corpus and input into the neural network model; given the preceding context, the neural network model predicts the next word segmentation unit of the first training text; when the prediction accuracy reaches a preset first threshold, training is deemed complete, and the pre-trained language submodel is thus obtained through the language model (LM) training mode.
When punctuation-free text is subsequently input, the pre-trained language submodel can estimate the likelihood that the punctuation-free text is a sentence. Assume the punctuation-free text subsequently input is $\{\omega_1 \omega_2 \omega_3 \dots \omega_n\}$, where $n$ is the number of word segmentation units in the punctuation-free text. The pre-trained language submodel expresses the probability that the punctuation-free text is a sentence as

$$P(S) = P(\omega_1 \omega_2 \dots \omega_n) = \prod_{i=1}^{n} P(\omega_i \mid \omega_1 \omega_2 \dots \omega_{i-1}),$$

where $i = 1, 2, \dots, n$. The larger $P(S)$ is, the more likely the punctuation-free text is a sentence. Pre-training in this way strengthens the language submodel's ability to characterize the contextual sentence features of the text corpus.
A word segmentation unit is the smallest sentence component of the language of the input first training corpus; for example, the smallest sentence component in Chinese is the character, while in English it is the word. The concrete form of the word segmentation unit is therefore determined by the language of the first training corpus.
In a specific implementation, when the first training text is an empty set, the neural network model predicts the first word segmentation unit, taking the unit with the highest known prior probability of appearing first as the prediction result; for example, if the character "present" has the highest probability of being the first character, "present" is taken as the first character.
When preceding text already exists, for example the character "present" above, the probability $P(\omega_2 \mid \omega_1)$ of the next character is predicted from "present". When $P(\omega_2 \mid \omega_1)$ is maximal, the probability $P(\omega_1 \omega_2) = P(\omega_1) P(\omega_2 \mid \omega_1)$ that the two characters form a sentence is maximal; assuming the character "day" has the highest probability, "today" is obtained.
As another example, when the preceding text "today's weather is very" already exists, the probability $P(\omega_6 \mid \omega_1 \omega_2 \omega_3 \omega_4 \omega_5)$ of the next character is predicted from it. When $P(\omega_6 \mid \omega_1 \omega_2 \omega_3 \omega_4 \omega_5)$ is maximal, the probability that the six characters form a sentence,

$$P(\omega_1 \omega_2 \omega_3 \omega_4 \omega_5 \omega_6) = P(\omega_1) P(\omega_2 \mid \omega_1) \cdots P(\omega_6 \mid \omega_1 \omega_2 \omega_3 \omega_4 \omega_5),$$

is maximal; assuming the character "good" has the highest probability, "today's weather is very good" is obtained.
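The chain-rule factorization and the greedy choice of the next unit can be illustrated with a toy conditional probability table. Everything here is hypothetical: the table `TOY`, its probabilities, and the function names are invented for illustration, and a real language submodel would compute these conditional probabilities with a neural network rather than a lookup.

```python
# Toy conditional probabilities P(token | history); the numbers are made up.
TOY = {
    ((), "今"): 0.5,
    ((), "天"): 0.1,
    (("今",), "天"): 0.8,
    (("今",), "今"): 0.05,
}


def toy_cond_prob(history, token):
    """Look up P(token | history); unseen combinations get probability 0."""
    return TOY.get((tuple(history), token), 0.0)


def sentence_probability(tokens, cond_prob):
    """P(S) as the product over i of P(token_i | token_1 ... token_{i-1})."""
    p = 1.0
    for i, tok in enumerate(tokens):
        p *= cond_prob(tokens[:i], tok)
    return p


def predict_next(history, candidates, cond_prob):
    """Pick the candidate with the highest conditional probability."""
    return max(candidates, key=lambda tok: cond_prob(history, tok))
```

With an empty history, `predict_next` selects the unit with the highest first-position probability, mirroring the first-character prediction described above.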
Method two: a second training text, in which a preset proportion of the content is randomly masked, is obtained from the first training corpus and input into the neural network model; given the context, the neural network model predicts the masked content; when the prediction accuracy reaches a preset second threshold, training is deemed complete, and the pre-trained language submodel is thus obtained through the masked language model (MLM) training mode.
In a specific implementation, a certain percentage of the content of the first training corpus, which contains punctuation marks, is randomly masked, and the masked content is then predicted through a normalized exponential (softmax) function. Because the randomly masked content can include both word segmentation units and punctuation marks, a language submodel pre-trained in the MLM mode characterizes the sentence characteristic information of the sentence context more strongly.
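The random masking step can be sketched as follows. The function name `mask_tokens`, the "[MASK]" placeholder, and the default ratio are assumptions; the text only speaks of a preset proportion, and the prediction of the masked content by the model is not shown.

```python
import random


def mask_tokens(tokens, ratio=0.15, mask="[MASK]", seed=None):
    """Randomly mask a proportion of tokens (characters and punctuation alike).

    Returns the masked token list and a dict {position: original token}
    that serves as the prediction target.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * ratio))   # mask at least one token
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask
    return masked, targets
```

Because punctuation marks are ordinary tokens in this list, they can be masked and predicted just like characters, which is the property the paragraph above relies on.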
Method three: a third training text, in which one sentence is randomly masked, is obtained from the first training corpus and input into the neural network model; given the context, the neural network model predicts the masked sentence; when the prediction accuracy reaches a preset third threshold, training is deemed complete, and the pre-trained language submodel is thus obtained through the next sentence prediction (NSP) training mode. A language submodel pre-trained in the NSP mode has a stronger ability to understand the relationship between preceding and following paragraphs.
In an embodiment of the present invention, the NSP mode uses binary classification. A text A is obtained from the first training corpus and one of its sentences is randomly masked; a sentence B is obtained from the first training corpus, and the model predicts whether sentence B is the masked sentence of text A. If sentence B is predicted to be the masked sentence of text A, "1" or "TRUE" is output; otherwise "0" or "FALSE" is output. Whether each prediction is accurate is determined from the output, and the prediction accuracy is counted; when the accuracy reaches the preset third threshold, training is deemed complete. For example, within a training period, 50% of the time the obtained sentence B is the sentence actually masked in text A, and the remaining 50% of the time it is a sentence randomly extracted from the first training corpus.
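Constructing one such binary training example can be sketched as below. The function name `make_nsp_example` and the "[MASK]" placeholder are hypothetical, and the sketch assumes the corpus contains at least one sentence other than the masked one; the classifier that consumes the example is not shown.

```python
import random


def make_nsp_example(text_sentences, corpus_sentences, rng):
    """Build one (context, candidate, label) example for the binary task.

    One sentence of text A is masked. With probability 0.5 the candidate B is
    the masked sentence (label 1); otherwise it is another sentence drawn at
    random from the corpus (label 0).
    """
    idx = rng.randrange(len(text_sentences))
    hidden = text_sentences[idx]
    context = text_sentences[:idx] + ["[MASK]"] + text_sentences[idx + 1:]
    if rng.random() < 0.5:
        return context, hidden, 1
    # Assumes the corpus holds at least one sentence different from the hidden one.
    candidates = [s for s in corpus_sentences if s != hidden]
    return context, rng.choice(candidates), 0
```

Passing a seeded `random.Random` instance as `rng` makes the sampled examples reproducible.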
In specific implementation, the pre-trained language submodel can be obtained by training by any one of the three methods, or by training in a manner of combining the first method and the third method, or combining the second method and the third method.
In a specific implementation, the first corpus may be obtained in various manners, for example, the first corpus may be obtained in one or more of the following manners:
1. Text data containing punctuation marks is captured from the Internet.
2. Speech data is captured from the Internet, and speech recognition and labeling are performed on it to obtain the corresponding text data.
3. Text data containing punctuation marks is entered manually.
The second corpus may also be obtained in various manners, for example, the second corpus may be obtained in one or more of the following manners:
1. Text data is captured from the first training corpus.
2. Text data containing punctuation marks in a preset field is captured from the Internet.
3. Text data containing punctuation marks in a preset field is entered manually.
The embodiment of the present invention further provides a punctuation mark model obtained by the training method of the punctuation mark model, and in order to enable those skilled in the art to better understand and realize the embodiment of the present invention, detailed descriptions are provided below through specific embodiments with reference to the attached drawings.
Referring to the structural diagram of the punctuation mark model shown in fig. 5, in an embodiment of the present invention the punctuation mark model 50 may include an input layer 51, an encoding layer 52, a conditional random field network layer 54, and an output layer 55, wherein:
The input layer 51 is adapted to extract the word segmentation unit characteristic information of the text to which punctuation is to be added.
In a specific implementation, word vector embedding (word embedding) is performed on each word segmentation unit according to the dictionary, with one word segmentation unit corresponding to one word vector (i.e. one square of the input layer 51 in fig. 5), denoted $\{e_1, e_2, \dots, e_n\}$, where $e_i$ is a dense vector of dimension $k$. The dictionary is adjusted according to the language of the word segmentation units. The word vector matrix is obtained either by random initialization or from pre-trained word vectors and exists in the whole neural network as a learnable parameter; the dimension of the word vectors is set according to the specific situation. The text to which punctuation is to be added may include at least one of the vacancy completion symbol EOS for the sentence end, the vacancy completion symbol BOS for the sentence beginning, and the interval symbol SEP, depending on the labeling mode of the third training corpus used to train the punctuation mark model.
The encoding layer 52 is adapted to capture, through a pre-trained language sub-model, the sentence feature information corresponding to the word segmentation unit feature information. The pre-trained language sub-model is a neural network model with a time sequence; for its specific training process, refer to the description of the LM training mode, the MLM training mode and the NSP training mode.
In a specific implementation, the pre-trained language sub-model converts the static word vector representation $[e]_1^n = (e_1, \dots, e_n)$ of the vectorized punctuation-free text into a dynamic sentence representation, denoted $[l]_1^n = (l_1, \dots, l_n)$, where $l_i$ is a dense vector of dimension $l$ and the size of $l$ is set according to the specific situation. This deepens and strengthens the representation of the sentence. $[l]_1^n$ is the input to the conditional random field (CRF) network layer 54.
The conditional random field network layer 54 is adapted to predict, based on the sentence feature information, the positions at which punctuation marks are to be added in the punctuation-free text and the corresponding punctuation mark types, and to label the word segmentation units before and after each such position with the label combination corresponding to the punctuation mark type.
In a specific embodiment, the conditional random field network layer 54 is preconfigured with a state transition matrix $[A]_{i,j}$ and a score matrix $[f_\theta([l]_1^n)]_{[y]_t,t}$, where $[l]_1^n = (l_1, \dots, l_n)$ is the input sentence representation. $[A]_{i,j}$ represents the transition probability from the $i$-th state to the $j$-th state between two consecutive time steps, and the score matrix $[f_\theta([l]_1^n)]_{[y]_t,t}$ represents the score of outputting $[y]_t$ at the $t$-th position for the input $[l]_1^n$, where $\theta$ contains all the parameters of the pre-trained language sub-model that require transfer learning. The feature function preset in the conditional random field model may then be used:

$$s([l]_1^n, [y]_1^n) = \sum_{t=1}^{n} \left( [A]_{[y]_{t-1},[y]_t} + [f_\theta([l]_1^n)]_{[y]_t,t} \right)$$

Here $s([l]_1^n, [y]_1^n)$ is the score of the predicted output label sequence $[y]_1^n$ for the input $[l]_1^n$, with $[y]_0$ denoting the start state. This feature function prevents label sequences that do not conform to the labeling rules from occurring. For convenience in the subsequent formulas, $s([l]_1^n, [y]_1^n)$ is simplified to $s(l, y)$. When the score $s(l, y)$ is maximal, the label sequence $y^*$ is output. The label sequence $y^*$ contains the information on the positions at which punctuation marks are to be added and on the corresponding punctuation mark types.
The output layer 55 is adapted to output the corresponding punctuation mark labeling result.
In a specific implementation, the label sequence $y^*$ is output as the punctuation mark labeling result corresponding to the text to which punctuation marks are to be added.
In the embodiment of the present invention, the path with the largest score $s(l, y)$ can be calculated using the Viterbi algorithm, i.e. $y^* = \arg\max_y s(l, y)$.
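The scoring and decoding step can be illustrated with a small sketch. It assumes that the per-position scores and the transition matrix are already available as nested lists and that there is no dedicated start state (the first position contributes only its own score); the three-label set O, B, I and all numeric values are hypothetical.

```python
def sequence_score(emissions, transitions, tags):
    """s(l, y): per-position scores plus transition scores between neighbours."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score

def viterbi(emissions, transitions):
    """Return the label path with the highest score and that score."""
    n_steps, n_tags = len(emissions), len(emissions[0])
    best = list(emissions[0])   # best score of any path ending in each label
    back = []                   # back[t-1][j]: best previous label for label j at step t
    for t in range(1, n_steps):
        new_best, pointers = [], []
        for j in range(n_tags):
            i = max(range(n_tags), key=lambda p: best[p] + transitions[p][j])
            pointers.append(i)
            new_best.append(best[i] + transitions[i][j] + emissions[t][j])
        best = new_best
        back.append(pointers)
    last = max(range(n_tags), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

# Hypothetical labels: 0 = O, 1 = B, 2 = I. A large negative transition score
# rules out sequences that break the labeling rules (e.g. O followed by I).
NEG = -1e4
transitions = [[0.0, 0.0, NEG],   # from O: O or B may follow
               [NEG, NEG, 0.0],   # from B: only I may follow
               [0.0, 0.0, NEG]]   # from I: O or B may follow
emissions = [[2.0, 0.5, 0.1],
             [0.3, 1.5, 0.2],
             [0.1, 0.2, 1.0]]
path, score = viterbi(emissions, transitions)
print(path, score)
```

The dynamic program visits each position once, so decoding is linear in the sentence length instead of enumerating every label sequence.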
In a specific implementation, as shown in fig. 5, the punctuation mark labeling model may further include a bidirectional long short-term memory network layer 53, disposed between the encoding layer 52 and the conditional random field network layer 54. Each $l_i$ in $[l]_1^n = (l_1, \dots, l_n)$ corresponds, in sequence, to two adjacent blocks in the bidirectional long short-term memory network layer 53 of fig. 5; for example, $l_1$ corresponds to 531 and 531. The sentence feature information is enhanced through the bidirectional operation and then output to the conditional random field network layer 54.
In an embodiment of the present invention, assume that the dimension of the word vectors is set to 100 and that the text to which punctuation marks are to be added is input as { BOS Hello world EOS }, which consists of six word segmentation units (BOS, four single-character units translated here as "Hello world", and EOS). The input layer 51 of the punctuation mark labeling model 50 obtains the initialized sentence word vector representation { 1 2 3 4 5 6 }, in one-to-one correspondence with the six units of { BOS Hello world EOS }, specifically represented by a matrix in $\mathbb{R}^{6\times100}$. The encoding layer 52 captures the sentence feature information of the context in the text through a pre-trained language sub-model obtained in the MLM + NSP manner. The conditional random field network layer 54 predicts the label sequence according to the sentence feature information. Assuming that the preset punctuation mark position label set Q has 15 labels, the conditional random field network layer 54 predicts a matrix in $\mathbb{R}^{6\times15}$, whose values correspond to the punctuation mark position label set Q as follows:
[Table image not reproduced: correspondence between the matrix values and the labels of the punctuation mark position label set Q]
thus, the following predicted tag sequence is obtained:
{O O B-COM I-COM B-EXC I-EXC}
Then, according to the punctuation mark labeling result output by the punctuation mark labeling model, the punctuation marks corresponding to the predicted punctuation mark types are added at the corresponding positions of the text, so as to obtain the text with punctuation marks added: { hello, world! }.
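The step from the predicted label sequence to the punctuated text can be sketched as below. The four Chinese characters are an assumed reconstruction of the units behind the translated example, the mapping `PUNCT` is a hypothetical two-entry subset of the punctuation mark set, and the function name is illustrative.

```python
# Assumed subset of the punctuation mark set P, keyed by the type part of the label.
PUNCT = {"COM": ",", "EXC": "!"}

def add_punctuation(units, tags, fillers=("BOS", "EOS")):
    """Insert a mark wherever a B-x label is directly followed by the matching I-x label."""
    out = []
    for i, unit in enumerate(units):
        if unit not in fillers:          # vacancy completion symbols are dropped
            out.append(unit)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tags[i].startswith("B-") and nxt == "I-" + tags[i][2:]:
            out.append(PUNCT[tags[i][2:]])
    return "".join(out)

units = ["BOS", "你", "好", "世", "界", "EOS"]   # assumed original units
tags = ["O", "O", "B-COM", "I-COM", "B-EXC", "I-EXC"]
print(add_punctuation(units, tags))  # 你好,世界!
```

Because units and labels correspond one to one, a single left-to-right pass is enough.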
With the punctuation mark labeling model of the embodiment of the present invention, the pre-trained language sub-model fully considers the information of the word segmentation units before and after a punctuation mark, so that the conditional random field network layer can predict the punctuation mark type more accurately. Labeling with label combinations makes the labels more evenly distributed, which avoids label imbalance and an excess of invalid labels and thus improves the accuracy of punctuation mark labeling. Moreover, the punctuation mark labeling model obtained by the training method can be used to add punctuation marks to a text, with high accuracy, strong universality and a wide range of application.
After the punctuation mark labeling model has completed the transfer learning training, it can be used to add punctuation marks to a text. To enable those skilled in the art to understand and implement the embodiments of this specification more clearly, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings.
Referring to the schematic structural diagram of a punctuation mark adding system shown in fig. 6, in the embodiment of the present invention the punctuation mark adding system 60 may specifically include:
The input unit 61 is adapted to obtain a text to which punctuation marks are to be added.
In a specific implementation, the text to be added with punctuation marks may be a punctuation-free text generated by ASR, or a manually entered punctuation-free text.
The punctuation marking unit 62 is adapted to use a punctuation mark labeling model 621 that has completed the transfer learning training to predict the positions at which punctuation marks are to be added in the text and the corresponding punctuation mark types, to label the word segmentation units before and after each such position with the label combination corresponding to the punctuation mark type, and to output the corresponding punctuation mark labeling result. The punctuation mark labeling model 621 that has completed the transfer learning training comprises a pre-trained language sub-model, which is a neural network model with a time sequence.
In the specific implementation, the position to be added with the punctuation mark and the type of the punctuation mark are determined according to the language type of the text to be added with the punctuation mark. For example, if the language type of the text to which the punctuation mark is to be added is Chinese, the position to which the punctuation mark is to be added is predicted according to the punctuation mark usage rule of Chinese, and the punctuation mark type is Chinese punctuation mark type; and if the language type of the text to which the punctuation marks are to be added is English, predicting the positions to which the punctuation marks are to be added according to English punctuation mark using rules, wherein the type of the punctuation marks is an English punctuation mark type.
The punctuation mark adding unit 63 is adapted to add the corresponding punctuation marks to the text to which punctuation marks are to be added according to the punctuation mark labeling result.
In specific implementation, according to the punctuation mark result, the punctuation mark adding unit 63 performs reverse compilation on the text to which the punctuation mark is to be added, and adds the corresponding punctuation mark at the position corresponding to the marked label combination.
With the above punctuation mark adding system, the punctuation mark labeling model includes a pre-trained language sub-model, which is a neural network model with a time sequence, so that the information of the characters before and after a punctuation mark can be fully considered. Because the language sub-model is pre-trained, transfer learning training reduces the amount of training data required and converges quickly, and the resulting neural network model has better generalization ability and universality, predicts punctuation mark types more accurately, and improves the recall of punctuation mark prediction. In addition, labeling with label combinations makes the labels more evenly distributed, which avoids label imbalance and an excess of invalid labels, improves the accuracy of punctuation mark prediction, and meets the requirements of punctuation mark addition.
In a specific implementation, in order to automatically label the units before and after a position at which a punctuation mark is to be added with the label combination corresponding to the punctuation mark type, the punctuation mark labeling model 621 may identify the punctuation mark corresponding to the punctuation mark type according to a preset punctuation mark set, and label the word segmentation units before and after that position with the label combination corresponding to the punctuation mark according to a preset punctuation mark position label set.
In a specific implementation, a preset punctuation mark set P and a preset punctuation mark position label set Q may be adopted. The two sets are associated: Q represents, for each punctuation mark in P, the positional relationship between the position at which that mark is to be added and the characters before and after it.
That is, the punctuation mark labeling model 621 first identifies the punctuation mark corresponding to the punctuation mark type according to the preset punctuation mark set, then obtains the punctuation mark position labels corresponding to the identified punctuation mark from the preset punctuation mark position label set, and labels the word segmentation units before and after the position to be added with the label combination corresponding to that punctuation mark.
In a specific implementation, the punctuation mark position label set may comprise a start position label, placed at the word segmentation unit before the position at which a punctuation mark is to be added, and an end position label, placed at the word segmentation unit after that position.
In a specific implementation, the punctuation mark labeling model 621 is adapted to label, according to the punctuation mark position label set, the preceding and following segmentation units of the position to be added of the punctuation mark with the label combinations corresponding to the punctuation marks when the preceding and following segmentation units of the position to be added of the punctuation mark have segmentation units, that is, the punctuation mark labeling model 621 may adopt the starting position label and the ending position label corresponding to the punctuation mark, and label is respectively carried out on the preceding and following segmentation units of the position to be added of the punctuation mark.
In addition, the punctuation mark position label set may further include a blank position label that corresponds to no punctuation mark. The punctuation mark labeling model 621 is accordingly adapted to label a word segmentation unit with the blank position label when there is no position to be added either before or after that unit.
The blank position labels can ensure that each word segmentation unit in the text to be added with the punctuation marks has the corresponding punctuation mark position label, thereby further improving the distribution balance of the labels and facilitating the subsequent addition of the punctuation marks.
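The relationship between the two sets can be sketched as follows: each punctuation mark type contributes one start position label and one end position label, and a single blank position label is added. The seven type names and marks are a hypothetical choice (the source gives only the total of 15 labels in one embodiment), and `build_position_label_set` is an illustrative name.

```python
def build_position_label_set(punct_types):
    """Derive Q from P: one blank label plus a start/end label pair per mark type."""
    labels = ["O"]                       # blank position label
    for name in punct_types:
        labels += ["B-" + name, "I-" + name]
    return labels

# Hypothetical punctuation mark set P: type name -> mark.
P = {"com": ",", "per": "。", "exc": "!", "que": "?",
     "cae": "、", "col": ":", "sem": ";"}
Q = build_position_label_set(P)
print(len(Q))  # 15
```

With seven mark types the set has 2 x 7 + 1 = 15 labels, which matches the size used in the embodiment above under this assumed composition.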
In a specific implementation, as shown in fig. 6, the input unit 61 may further include a symbol adding subunit 611. The symbol adding subunit 611 may add, to the input text without punctuation marks, a symbol that is consistent with the labeling method of the third training corpus used in the punctuation mark model training, to obtain the text to which the punctuation marks are to be added, so as to ensure that the punctuation mark position tags corresponding to the punctuation marks can always appear in pairs. For example, when the third corpus may include a vacancy-filling symbol, the symbol adding subunit 611 adds a vacancy-filling symbol to the input text without punctuation marks, and obtains the text to which punctuation marks are to be added. Wherein, the gap filling symbol can be regarded as a word segmentation unit.
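The behaviour of the symbol adding subunit can be sketched as a small helper. The function name and its boolean switches are hypothetical; which symbols are actually added depends on the labeling method of the training corpus.

```python
def add_symbols(units, bos=False, eos=False, sep=False):
    """Add vacancy completion and interval symbols to a punctuation-free unit list."""
    out = ["BOS"] if bos else []
    for unit in units:
        out.append(unit)
        if sep:                  # one interval symbol after every unit
            out.append("SEP")
    if eos:
        out.append("EOS")
    return out

print(add_symbols(["a", "b"], eos=True))
print(add_symbols(["a", "b"], bos=True, eos=True, sep=True))
```

Each added symbol counts as a word segmentation unit, so it receives its own position label.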
In a specific implementation, when a vacancy completion symbol is present before or after a position at which a punctuation mark is to be added, the punctuation mark labeling model 621 is adapted to label, according to the punctuation mark position label set, the vacancy completion symbol and the word segmentation unit on the other side of that position with the label combination corresponding to the punctuation mark.
In an embodiment of the present invention, the third training corpus includes a vacancy completion symbol EOS for the end of a sentence. Assume that the punctuation-free text obtained by the input unit 61 is { today is wednesday }, where each word segmentation unit is a single Chinese character. The symbol adding subunit 611 may add the vacancy completion symbol "EOS" at the end of the sentence to obtain the text to which punctuation marks are to be added, { today is wednesday EOS }, which is then input into the punctuation mark labeling model 621 that has completed the transfer learning training. The model may predict the positions at which punctuation marks are to be added and the corresponding punctuation mark types. For example, it may predict that there is a position to be added between the character "three" (the last character of "Wednesday" in Chinese) and the vacancy completion symbol "EOS", and that the corresponding punctuation mark type is the period; according to the punctuation mark set P, the punctuation mark corresponding to the period can be identified. According to the punctuation mark position label set Q, the start position label "B-per" of the period is marked at the character "three", the end position label "I-per" of the period is marked at the vacancy completion symbol "EOS", and a blank position label "O" is marked at each character that has no position to be added before or after it. The punctuation mark labeling model 621 then outputs the corresponding labeling result, i.e. the predicted label sequence { O O O O O B-per I-per }.
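The labeling convention of this example can be reproduced from a punctuated reference sentence, as one might do when preparing labeled data. The sketch below assumes the six Chinese characters behind the translated sentence and a hypothetical three-entry subset of the punctuation mark set; it only covers marks separated by at least two units and raises an error otherwise.

```python
PUNCT_NAMES = {"。": "per", ",": "com", "!": "exc"}   # assumed subset of P

def label_reference(punctuated, eos="EOS"):
    """Turn a punctuated sentence into units plus B-x / I-x / O position labels."""
    units, tags, pending = [], [], None
    for ch in punctuated:
        if ch in PUNCT_NAMES:
            if not tags or tags[-1] != "O":
                raise ValueError("case not covered by this sketch")
            tags[-1] = "B-" + PUNCT_NAMES[ch]   # start label on the unit before the mark
            pending = PUNCT_NAMES[ch]
        else:
            units.append(ch)
            tags.append("I-" + pending if pending else "O")   # end label after the mark
            pending = None
    units.append(eos)                            # vacancy completion symbol
    tags.append("I-" + pending if pending else "O")
    return units, tags

ref_units, ref_tags = label_reference("今天是星期三。")   # assumed original sentence
print(ref_units)
print(ref_tags)  # ['O', 'O', 'O', 'O', 'O', 'B-per', 'I-per']
```

The EOS symbol gives the sentence-final period a following unit, so its labels still appear as a pair.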
In another embodiment of the present invention, the third corpus further includes a null-position-filling symbol BOS for the beginning of the sentence, and it is assumed that the punctuate-free text obtained by the input unit 61 is { the reader is a good magazine }, wherein the word segmentation unit is a word of chinese.
The symbol adding subunit 611 may add a null position padding symbol "BOS" to the beginning of the sentence, and then input the punctuation mark model 621 that completes the transfer learning training as the text to which the punctuation mark is to be added, where the punctuation mark model 621 may predict the punctuation mark to be added position and the corresponding punctuation mark type in the text to which the punctuation mark is to be added.
In addition, this can be combined with other labeling methods. For example, the symbol adding subunit 611 may further add a vacancy completion symbol "EOS" at the end of the sentence to obtain the text to which punctuation marks are to be added, { BOS the reader is a good magazine EOS }, which is then input into the punctuation mark labeling model 621 that has completed the transfer learning training. The punctuation mark labeling model 621 may predict the positions at which punctuation marks are to be added and the corresponding punctuation mark types. For example, it may predict that there is a position to be added between the vacancy completion symbol "BOS" and the character "read", and that the corresponding punctuation mark type is the left title mark; according to the punctuation mark set P, the punctuation mark "《" corresponding to the left title mark can be identified, and, according to the punctuation mark position label set Q, the start position label "B-ope" of the left title mark is marked at the vacancy completion symbol "BOS" and the end position label "I-ope" of the left title mark is marked at the character "read". By analogy, the other positions to be added are labeled, and a blank position label "O" is marked at each character that has no position to be added before or after it. The corresponding punctuation mark labeling result, i.e. the predicted label sequence { B-ope I-ope B-clo I-clo O O B-per I-per }, is then output.
For another example, when the third corpus may further include a start position label corresponding to a plurality of consecutive identical punctuations and an end position label of one punctuation, the text to which the punctuation is to be added may also adopt the above labeling method.
The punctuation mark labeling model 621 is suitable for labeling only the initial position labels corresponding to punctuation marks at the segmentation units before the positions to be added of the punctuation marks except the last punctuation mark when the punctuation marks corresponding to the positions to be added of a plurality of continuous punctuation marks have the same punctuation mark type and only one segmentation unit exists between the positions to be added of the punctuation marks.
In an embodiment of the present invention, the punctuation-free text obtained by the input unit 61 is { no match is possible between three parties in the sun and the moon }, where each word segmentation unit is a single Chinese character. It can be input directly, as the text to which punctuation marks are to be added, into the punctuation mark labeling model 621 that has completed the transfer learning training, and the punctuation mark labeling model 621 may predict the positions at which punctuation marks are to be added and the corresponding punctuation mark types.
In addition, this can be combined with other labeling methods. For example, the symbol adding subunit 611 may add a vacancy completion symbol "BOS" at the beginning of the sentence and a vacancy completion symbol "EOS" at the end of the sentence to obtain the text to which punctuation marks are to be added, { BOS is a big third party on day and month ready to start a sentence no win EOS }, which is then input into the punctuation mark labeling model 621 that has completed the transfer learning training. The model may predict the positions at which punctuation marks are to be added and the corresponding punctuation mark types. For example, it may predict that there is a position to be added between "sun" and "moon" and another between "moon" and the following character, and that the corresponding punctuation marks are of the same type, both being the pause mark; according to the punctuation mark set P, the punctuation mark "、" corresponding to the pause mark can be identified, and, according to the punctuation mark position label set Q, the start position label "B-cae" is marked at both "sun" and "moon", and the end position label "I-cae" is marked at the character following "moon". By analogy, the other positions to be added are labeled, and a blank position label "O" is marked at each character that has no position to be added before or after it. The punctuation mark labeling model 621 may then output the corresponding labeling result, i.e. the predicted label sequence { O B-cae B-cae I-cae O O O O B-com I-com O B-exc I-exc O O O O O B-com I-com B-que I-que }.
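The rule for runs of identical punctuation marks can be sketched as follows. The input is a unit list plus a mapping from unit index to the type of the mark that follows that unit; the function name, the placeholder characters and the mapping are hypothetical, and the sketch does not cover two different marks competing for the same unit.

```python
def label_positions(units, marks_after):
    """marks_after maps a unit index to the type of the mark that follows that unit."""
    tags = ["O"] * len(units)
    for i in sorted(marks_after):
        kind = marks_after[i]
        tags[i] = "B-" + kind                 # start label before every position
        if marks_after.get(i + 1) != kind:    # end label only after the last mark of a run
            tags[i + 1] = "I-" + kind
    return tags

# Hypothetical sentence: BOS 甲、乙、丙丁。 EOS
run_units = ["BOS", "甲", "乙", "丙", "丁", "EOS"]
run_tags = label_positions(run_units, {1: "cae", 2: "cae", 4: "per"})
print(run_tags)  # ['O', 'B-cae', 'B-cae', 'I-cae', 'B-per', 'I-per']
```

Only the last mark of the run gets an end position label, so the middle unit does not need to carry two labels at once.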
For example, when the third corpus may further include an interval symbol after each word segmentation unit, the symbol adding sub-unit 611 adds an interval symbol after each word segmentation unit of the input punctuation-free text, and obtains the text to which the punctuation symbol is to be added. Wherein, the interval symbol can be regarded as a word segmentation unit.
The punctuation mark labeling model 621 is adapted to label a starting position label corresponding to a previous punctuation mark at a segmentation unit before a position to be added with a previous punctuation mark when no segmentation unit exists between two continuous punctuation marks to be added with positions, and label a starting position label and an ending position label corresponding to a next punctuation mark at an interval symbol before the position to be added with the next punctuation mark and a segmentation unit after the position to be added with the next punctuation mark, respectively.
In an embodiment of the present invention, the punctuation-free text obtained by the input unit 61 is { leadership achievement profit is the root goal }, where each word segmentation unit is a single Chinese character. The symbol adding subunit 611 may add an interval symbol SEP after each character, and the result is then input, as the text to which punctuation marks are to be added, into the punctuation mark labeling model 621 that has completed the transfer learning training, which may predict the positions at which punctuation marks are to be added and the corresponding punctuation mark types.
In addition, this can be combined with other labeling methods. For example, the symbol adding subunit 611 may also add a vacancy completion symbol "BOS" at the beginning of the sentence and a vacancy completion symbol "EOS" at the end of the sentence to obtain the text to which punctuation marks are to be added, { BOS leads SEP1 to SEP2, SEP3 realizes that SEP4 realizes SEP5 filling SEP6 and SEP7 is SEP8 is SEP9, SEP10, SEP11 target SEP12 targets SEP13 EOS }, which is then input into the punctuation mark labeling model 621 that has completed the transfer learning training. The punctuation mark labeling model 621 may predict the positions at which punctuation marks are to be added and the corresponding punctuation mark types. For example, it may predict that there are two positions to be added between the characters "say" and "real", with the corresponding punctuation mark types being the colon and the opening quotation mark respectively; according to the punctuation mark set P, the punctuation mark ":" corresponding to the colon and the punctuation mark "“" corresponding to the opening quotation mark can be identified, and, according to the punctuation mark position label set Q, the start position label "B-col" of the colon is marked at the character "say", the start position label "B-upp" of the quotation mark is marked at the interval symbol "SEP3", and the end position label "I-upp" of the quotation mark is marked at the character "real". By analogy, the other positions to be added are labeled; a blank position label "O" is marked at each character that has no position to be added before or after it, and an interval label "S" is marked at each interval symbol that has no position to be added before or after it.
Then, the punctuation mark labeling model 621 can output the corresponding labeling result, i.e. the predicted label sequence { O S B-col B-upp I-upp S O S O S B-low B-per I-per }.
It is to be understood that the "SEP" followed by the number sequence is used for descriptive purposes only for purposes of distinction, and that the "SEP" followed by the number sequence may not be used in a particular implementation.
By adopting the scheme, the label marking mode can be optimized, and the marking of punctuation marks which have short intervals with word segmentation units and are easy to continuously appear is more accurate.
It should also be understood that the above-mentioned embodiments are only examples, and are not limiting to the specific implementation, and in the specific implementation process, the above-mentioned labeling methods can be randomly combined and used according to the actual situation.
As can be seen from the above, the punctuation mark to be added may be consistent with the labeling method of the third training corpus employed in the punctuation mark model training, so that after the corresponding punctuation mark labeling result is output by the punctuation mark labeling model, the text to be added with punctuation marks is reversely compiled according to the labeling method employed by the text to be added with punctuation marks, and the corresponding punctuation marks are added at the positions corresponding to the labeling label combinations, which will be described in detail below by specific embodiments.
In a specific implementation, according to the punctuation mark result, a corresponding punctuation mark may be added to the text to which the punctuation mark is to be added, as shown in fig. 6, the punctuation mark adding unit 63 may include:
the first adding subunit 631 is adapted to add the punctuation mark between two word segmentation units when label combinations corresponding to the same punctuation mark are respectively marked at the front and the rear word segmentation units in the text to which the punctuation mark is to be added.
The second adding subunit 632 is adapted to add a corresponding punctuation mark between the space filling symbol and its adjacent segmentation unit when the space filling symbol in the text to which the punctuation mark is to be added and the label combination corresponding to the same punctuation mark are respectively marked at the position of the space filling symbol and its adjacent segmentation unit.
A symbol deletion subunit 635 is adapted to delete at least one of the vacancy completion symbols and the interval symbols in the text to which punctuation marks are to be added.
For example, the text to which punctuation marks are to be added that is input into the punctuation mark labeling model 621 is { BOS the reader is a good magazine EOS }, and the labeling result output by the punctuation mark labeling model 621 is { B-ope I-ope B-clo I-clo O B-per I-per }. Because the text and the output label sequence correspond one to one, the first adding subunit 631 and the second adding subunit 632 can identify, according to the punctuation mark set P and the punctuation mark position label set Q, that the punctuation marks corresponding to the label combinations are "《", "》" and "." respectively, and then add these punctuation marks at the positions of the text that correspond to the label combinations. The symbol deletion subunit 635 may delete the vacancy completion symbols "BOS" and "EOS" in the text, obtaining the text with punctuation marks added, i.e. { 《The Reader》 is a good magazine. }.
In a specific implementation, as shown in fig. 6, the punctuation mark adding unit 63 further includes a third adding subunit 633, which is adapted to add a corresponding punctuation mark after the word segmentation unit when only corresponding start position labels are labeled at a plurality of continuous word segmentation units in the text to which the punctuation mark is to be added.
For example, the text to which punctuation marks are to be added that is input into the punctuation mark labeling model 621 is { three parties in a big decision on day and month of BOS are ready to win the match-matrix EOS }, and the labeling result output by the punctuation mark labeling model 621 is { O B-cae B-cae I-cae O O O O B-com I-com O B-exc I-exc O O O O O B-com I-com B-que I-que }. Because the text and the output label sequence correspond one to one, the first adding subunit 631, the second adding subunit 632 and the third adding subunit 633 can identify, according to the punctuation mark set P and the punctuation mark position label set Q, that the punctuation marks corresponding to the label combinations are "、", ",", "!" and "?" respectively, and then add these punctuation marks at the positions of the text that correspond to the label combinations. The symbol deletion subunit 635 may delete the vacancy completion symbols "BOS" and "EOS" in the text, obtaining the text with punctuation marks added, i.e. { date, month, and day tripartite, ready to start! Is it impossible for a member of the group to win? }.
In a specific implementation, as shown in fig. 6, the punctuation mark adding unit 63 further includes a fourth adding subunit 634, adapted to add the corresponding punctuation marks after the preceding word segmentation unit and before the following word segmentation unit when, in the text to which punctuation marks are to be added, the preceding word segmentation unit is labeled with only the start position label of one punctuation mark, and the interval symbol before the following word segmentation unit and the following word segmentation unit itself are labeled with the start position label and the end position label of another punctuation mark, respectively.
For example, the text to which punctuation marks are to be added that is input into the punctuation mark labeling model 621 is { BOS lead SEP1 leads SEP2 to say SEP3 real SEP4 now SEP5 full SEP6 and SEP7 is SEP8 is SEP9 root SEP10 and SEP11 target SEP12 marks SEP13 EOS }, and the labeling result output by the punctuation mark labeling model 621 is { O S B-col B-upp I-upp S O S O S B-low B-per I-per }. Because the text and the output label sequence correspond one to one, the fourth adding subunit 634 can identify, according to the punctuation mark set P and the punctuation mark position label set Q, that the punctuation marks corresponding to the label combinations are ":", "“", "”" and "." respectively, and then add these punctuation marks at the positions of the text that correspond to the label combinations. The symbol deletion subunit 635 may delete the vacancy completion symbols "BOS" and "EOS" and the 13 interval symbols "SEP" in the text, obtaining the text with punctuation marks added, i.e. { leadership: "achieving profitability is the primary goal". }.
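The adding and deleting steps for a text that contains interval symbols can be sketched as below. The sketch assumes that every start position label, whether it sits on a character or on an interval symbol, marks the spot where its punctuation mark is inserted; the short sentence, the mapping `PUNCT` and the function name are hypothetical and are not the example from the source.

```python
PUNCT = {"col": ":", "upp": "“", "low": "”", "per": "。"}   # assumed subset of P
FILLERS = {"BOS", "EOS", "SEP"}

def decode_with_intervals(units, tags):
    """Emit each mark at its start label, then drop BOS, EOS and SEP symbols."""
    out = []
    for unit, tag in zip(units, tags):
        if unit not in FILLERS:
            out.append(unit)
        if tag.startswith("B-"):         # start label: insert the mark here
            out.append(PUNCT[tag[2:]])
    return "".join(out)

# Hypothetical text: BOS 他 SEP 说 SEP 你 SEP 好 SEP EOS
sep_units = ["BOS", "他", "SEP", "说", "SEP", "你", "SEP", "好", "SEP", "EOS"]
sep_tags = ["O", "O", "S", "B-col", "B-upp", "I-upp", "S", "B-low", "B-per", "I-per"]
print(decode_with_intervals(sep_units, sep_tags))  # 他说:“你好”。
```

The interval symbol after "说" carries the start label of the opening quotation mark, which is how two adjacent marks are kept apart even though no character lies between them.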
It is to be understood that the numbers following "SEP" are used here only to tell the interval symbols apart in the description; in a specific implementation, "SEP" need not be followed by a number.
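As a non-limiting illustration, the interval-symbol variant can be sketched in Python. The sketch assumes that a start label B-x on any token, including "SEP" or "BOS", places the mark x directly after that token, and it does not check end labels. The `MARKS` table, the helper names and the single-letter units (stand-ins for character-level segmentation units) are hypothetical.

```python
BOS, EOS, SEP = "BOS", "EOS", "SEP"
# Hypothetical tag-suffix table standing in for the sets P and Q.
MARKS = {"col": ":", "upp": "“", "low": "”", "per": "."}


def add_fillers(units):
    """Wrap the text in BOS/EOS and put one SEP after every unit."""
    tokens = [BOS]
    for unit in units:
        tokens += [unit, SEP]
    tokens.append(EOS)
    return tokens


def decode(tokens, tags):
    """Add a mark after each token labeled B-x, then drop BOS, EOS and SEP."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must correspond one to one")
    out = []
    for token, tag in zip(tokens, tags):
        if token not in (BOS, EOS, SEP):
            out.append(token)
        if tag.startswith("B-"):
            out.append(MARKS[tag[2:]])
    return "".join(out)


tokens = add_fillers(["X", "Y", "Z", "W"])
# Tokens: BOS X  SEP  Y        SEP      Z        SEP  W        SEP      EOS
tags = ["O", "O", "S", "B-col", "B-upp", "I-upp", "S", "B-low", "B-per", "I-per"]
print(decode(tokens, tags))  # XY:“ZW”.
```

Here the colon and the opening quotation mark fall at two consecutive positions with no unit between them, so the second mark is carried by the interval symbol after "Y", as in the example above.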
It should also be understood that the above embodiments are only examples and do not limit the specific implementation; in practice, the above labeling methods may be combined freely according to the actual situation.
In a specific implementation, the punctuation mark adding system may itself train the language sub-model. As shown in fig. 6, the punctuation mark adding system 60 may further include at least one of the following:
a first language sub-model generating unit 64, adapted to obtain a random first training text from the training corpus and input it into the neural network model, where the neural network model predicts the next word segmentation unit of the first training text given the preceding text, and the pre-trained language sub-model is obtained when the probability of accurate prediction reaches a preset first threshold;
a second language sub-model generating unit 65, adapted to obtain from the training corpus a second training text in which a preset proportion is randomly masked and input it into the neural network model, where the neural network model predicts the masked portion given the context, and the pre-trained language sub-model is obtained when the probability of accurate prediction reaches a preset second threshold.
The training corpus may include, but is not limited to, the third corpus. For the specific training method of the language sub-model, refer to the above embodiments; it is not repeated here.
With this scheme, the selected training corpus is input into the neural network model, which predicts the next word segmentation unit given the preceding text or predicts the masked preset proportion given the context. This strengthens the ability of the pre-trained language sub-model to represent the features of the preceding and following sentences in the text corpus.
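As a non-limiting illustration, the training samples for the two objectives and the threshold test can be sketched in Python. The sketch only prepares data and checks the stopping condition; it does not train a model. The placeholder `[MASK]`, the function names, the 0.2 ratio and the 0.9 threshold are hypothetical values chosen for the example.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked unit


def next_unit_samples(units):
    """(preceding text, next unit) pairs for next-unit prediction."""
    return [(units[:i], units[i]) for i in range(1, len(units))]


def mask_units(units, ratio, rng):
    """Mask a preset proportion of units; return the masked text and targets."""
    count = max(1, round(len(units) * ratio))
    positions = rng.sample(range(len(units)), count)
    masked = list(units)
    targets = {}
    for pos in positions:
        targets[pos] = units[pos]
        masked[pos] = MASK
    return masked, targets


def reached_threshold(correct, total, threshold):
    """True once the prediction accuracy reaches the preset threshold."""
    return total > 0 and correct / total >= threshold


units = list("abcdefghij")
masked, targets = mask_units(units, ratio=0.2, rng=random.Random(0))
print(next_unit_samples(units[:3]))  # [(['a'], 'b'), (['a', 'b'], 'c')]
print(masked.count(MASK), reached_threshold(9, 10, 0.9))  # 2 True
```

The first function corresponds to the objective of the first language sub-model generating unit 64, the second to that of the second language sub-model generating unit 65.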
In a specific implementation, as shown in fig. 6, the punctuation mark adding system 60 may further include:
a third language sub-model generating unit 66, adapted to obtain from the training corpus a third training text in which one sentence is randomly masked and input it into the neural network model, where the neural network model predicts the masked sentence given the preceding text, and the pre-trained language sub-model is obtained when the probability of accurate prediction reaches a preset third threshold.
The training corpus may include, but is not limited to, the third corpus. For the specific training method of the language sub-model, refer to the above embodiments; it is not repeated here.
With this scheme, the selected training corpus is input into the neural network model, which predicts the masked sentence given the preceding text. This further strengthens the ability of the pre-trained language sub-model to understand the relationship between two texts in the text corpus.
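As a non-limiting illustration, a training sample for masked-sentence prediction can be prepared as sketched below in Python. The sketch assumes that the first sentence is never masked, so that some preceding text always exists. The placeholder string, the function name and the sample sentences are hypothetical.

```python
import random

SENTENCE_MASK = "[MASKED SENTENCE]"  # hypothetical placeholder


def mask_one_sentence(sentences, rng):
    """Hide one sentence that has preceding text; return the sample parts."""
    if len(sentences) < 2:
        raise ValueError("need at least one preceding sentence")
    index = rng.randrange(1, len(sentences))  # never the first sentence
    masked_text = list(sentences)
    masked_text[index] = SENTENCE_MASK
    preceding = sentences[:index]
    return masked_text, preceding, sentences[index]


sentences = ["It rained all night.", "The match was delayed.", "Fans waited."]
masked_text, preceding, target = mask_one_sentence(sentences, random.Random(0))
print(masked_text)
print(preceding, "->", target)
```

The neural network model would receive the preceding text and be scored on how well it predicts the hidden sentence.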
In a specific implementation, the pre-trained language sub-model may be obtained by training with any one of the first to third language sub-model generating units, with the first and third language sub-model generating units in combination, or with the second and third language sub-model generating units in combination.
It should be noted that the prefixes "first", "second", "third" and "fourth" are used herein only to distinguish different functional terms and do not indicate order, size or importance.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A system for adding punctuation marks, comprising:
an input unit, adapted to acquire a text to be punctuated;
a punctuation mark labeling unit, adapted to use a punctuation mark labeling model that has completed transfer learning training to predict the positions where punctuation marks are to be added in the text to be punctuated and the corresponding punctuation mark types, to label the word segmentation units before and after each such position with the label combination corresponding to the punctuation mark type, and to output the corresponding punctuation mark labeling result;
a punctuation mark adding unit, adapted to add the corresponding punctuation marks to the text to be punctuated according to the punctuation mark labeling result;
wherein the punctuation mark labeling model that has completed transfer learning training comprises a pre-trained language sub-model, and the language sub-model is a neural network model with a time sequence.
2. The system for adding punctuation marks according to claim 1, wherein the punctuation mark model is adapted to identify punctuation marks corresponding to the type of the punctuation marks according to a preset first set of punctuation marks, and mark, according to a preset set of punctuation mark position tags, the front and rear word segmentation units of the position to be added of the punctuation marks by using a combination of tags corresponding to the punctuation marks.
3. The system for adding punctuations according to claim 2, wherein the model for tagging punctuations is adapted to tag, when there are word segmentation units both before and after the position to be added of the punctuations, the word segmentation units before and after the position to be added of the punctuations according to the set of punctuations tags, by using a combination of tags corresponding to the punctuations;
the punctuation mark adding unit comprises a first adding subunit and is suitable for adding the punctuation mark between the two word segmentation units when the front word segmentation unit and the rear word segmentation unit in the text to be added with the punctuation mark are respectively marked with the corresponding mark combination of the same punctuation mark.
4. The punctuation mark adding system of claim 3, wherein the input unit comprises a mark adding subunit, adapted to add a vacancy completion mark to the input punctuation mark-free text to obtain the punctuation mark-to-be-added text;
the punctuation mark labeling model is adapted to, when a vacancy completion symbol lies immediately before or after a position where a punctuation mark is to be added, label that vacancy completion symbol and the word segmentation unit on the other side of the position with the label combination corresponding to the punctuation mark, according to the punctuation mark position label set;
the punctuation mark adding unit also comprises a second adding subunit, which is suitable for adding corresponding punctuation marks between the vacancy filling symbols and the adjacent word segmentation units when the vacancy filling symbols in the text to be added with punctuation marks and the adjacent word segmentation units are respectively marked with label combinations corresponding to the same punctuation marks.
5. The punctuation mark adding system of claim 4, wherein the punctuation mark position label set comprises: a start position label placed at the word segmentation unit before each position where a punctuation mark is to be added, and an end position label placed at the word segmentation unit after each such position;
the punctuation mark labeling model is adapted to, when the punctuation marks at a plurality of consecutive positions to be punctuated are of the same type and only one word segmentation unit lies between adjacent positions, label, for each of those positions other than the last, only the start position label corresponding to the punctuation mark at the word segmentation unit before that position;
the punctuation mark adding unit further comprises a third adding subunit, adapted to add the corresponding punctuation mark after each of those word segmentation units when a plurality of consecutive word segmentation units in the text to be punctuated are labeled only with the corresponding start position label.
6. The punctuation mark adding system of claim 5, wherein the mark adding subunit is adapted to add an interval mark after each word segmentation unit of the input punctuation mark-free text to obtain the punctuation mark-to-be-added text;
the punctuation mark labeling model is adapted to, when no word segmentation unit lies between two consecutive positions to be punctuated, label the word segmentation unit before the position of the former punctuation mark with the start position label corresponding to the former punctuation mark, and label the interval symbol before the position of the latter punctuation mark and the word segmentation unit after that position with the start position label and the end position label corresponding to the latter punctuation mark, respectively;
the punctuation mark adding unit further comprises a fourth adding subunit, adapted to add the corresponding punctuation marks between a preceding word segmentation unit and a following word segmentation unit when, in the text to be punctuated, the preceding word segmentation unit is labeled with a start position label, and the interval symbol before the following word segmentation unit and the following word segmentation unit itself are labeled with the start position label and the end position label corresponding to the same punctuation mark.
7. The punctuation mark adding system of claim 6, wherein the punctuation mark position label set further comprises: a blank position label that corresponds to no punctuation mark;
the punctuation mark labeling model is adapted to label a word segmentation unit with the blank position label when there is no position to be punctuated either before or after that word segmentation unit.
8. The punctuation mark adding system of claim 7, wherein the punctuation mark adding unit further comprises a symbol deletion subunit, adapted to delete at least one of the vacancy completion symbols and the interval symbols in the text to be punctuated.
9. The punctuation addition system of claim 1 further comprising at least one of:
a first language sub-model generating unit, adapted to acquire a random first training text from a training corpus and input it into the neural network model, wherein the neural network model predicts the next word segmentation unit of the first training text given the preceding text, and the pre-trained language sub-model is obtained when the probability of accurate prediction reaches a preset first threshold;
a second language sub-model generating unit, adapted to acquire from the training corpus a second training text in which a preset proportion is randomly masked and input it into the neural network model, wherein the neural network model predicts the masked portion given the context, and the pre-trained language sub-model is obtained when the probability of accurate prediction reaches a preset second threshold.
10. The punctuation mark adding system of claim 9, further comprising: a third language sub-model generating unit, adapted to acquire from the training corpus a third training text in which one sentence is randomly masked and input it into the neural network model, wherein the neural network model predicts the masked sentence given the preceding text, and the pre-trained language sub-model is obtained when the probability of accurate prediction reaches a preset third threshold.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910925835.0A CN112580303A (en) 2019-09-27 2019-09-27 Punctuation adding system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910925835.0A CN112580303A (en) 2019-09-27 2019-09-27 Punctuation adding system

Publications (1)

Publication Number Publication Date
CN112580303A true CN112580303A (en) 2021-03-30

Family

ID=75110522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910925835.0A Pending CN112580303A (en) 2019-09-27 2019-09-27 Punctuation adding system

Country Status (1)

Country Link
CN (1) CN112580303A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822060A (en) * 2021-09-29 2021-12-21 深圳市木愚科技有限公司 Punctuation recovery method and device, computer equipment and storage medium
CN114880990A (en) * 2022-05-16 2022-08-09 马上消费金融股份有限公司 Punctuation mark prediction model training method, punctuation mark prediction method and punctuation mark prediction device

Similar Documents

Publication Publication Date Title
CN110674629B (en) Punctuation mark labeling model, training method, training equipment and storage medium thereof
CN110688822A (en) Punctuation mark adding method, punctuation mark adding device and punctuation mark adding medium
CN110705264B (en) Punctuation mark correction method, punctuation mark correction equipment and medium
CN109697291B (en) Text semantic paragraph recognition method and device
CN112580326A (en) Punctuation mark model and training system thereof
CN111428485B (en) Judicial document paragraph classifying method, device, computer equipment and storage medium
CN110909549B (en) Method, device and storage medium for punctuating ancient Chinese
CN107980130A (en) It is automatic to answer method, apparatus, storage medium and electronic equipment
CN112541095B (en) Video title generation method and device, electronic equipment and storage medium
CN111046652A (en) Text error correction method, text error correction device, storage medium, and electronic apparatus
CN112131881A (en) Information extraction method and device, electronic equipment and storage medium
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN114282527A (en) Multi-language text detection and correction method, system, electronic device and storage medium
CN112580303A (en) Punctuation adding system
CN111160026B (en) Model training method and device, and text processing method and device
CN112861864A (en) Topic entry method, topic entry device, electronic device and computer-readable storage medium
CN110610006B (en) Morphological double-channel Chinese word embedding method based on strokes and fonts
CN112151019B (en) Text processing method and device and computing equipment
CN110489727A (en) Name recognition methods and relevant apparatus
CN112667768A (en) Correction system for punctuation marks
CN112749544A (en) Training method and system for paragraph segmentation model
CN115906835B (en) Chinese question text representation learning method based on clustering and contrast learning
CN115455143A (en) Document processing method and device
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN114462427A (en) Machine translation method and device based on term protection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination