CN115129877A - Method and device for generating punctuation mark prediction model and electronic equipment - Google Patents

Method and device for generating punctuation mark prediction model and electronic equipment

Info

Publication number
CN115129877A
Authority
CN
China
Prior art keywords
sequence
text
sample
language model
punctuation
Prior art date
Legal status
Pending
Application number
CN202210823101.3A
Other languages
Chinese (zh)
Inventor
蒙嘉颖
安哲成
吴培昊
马泽君
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202210823101.3A
Publication of CN115129877A
Status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the disclosure disclose a method and an apparatus for generating a punctuation mark prediction model, and an electronic device. One embodiment of the method comprises: acquiring a text set, where the texts in the text set carry punctuation marks; for each text in the text set, generating a punctuation-free text sequence and a punctuation mark sequence from the text to form a first sample, masking characters in the text sequence, and combining the masked text sequence with the masked character sequence to form a second sample, where each object in the text sequence is a character of the text; training a pre-trained language model based on the first sample and the corresponding second sample to obtain a retrained language model; and training the retrained language model based on the first sample to obtain a punctuation mark prediction model. This embodiment improves the accuracy of punctuation prediction.

Description

Method and device for generating punctuation mark prediction model and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a punctuation mark prediction model, and an electronic device.
Background
In a speech recognition scenario, the input is a segment of audio. A speech recognition system transcribes the input audio word by word into natural language text; through a series of text processing steps, including a punctuation mark prediction model that adds punctuation marks to the text, the transcript is converted into a readable passage of text.
Taking Chinese as an example, common punctuation marks such as the comma, period, question mark, exclamation mark, quotation marks and dash serve multiple functions, such as marking semantic pauses, segmenting sentences and strengthening emotion. They help improve the readability of a text and sometimes play an important role in eliminating ambiguity. A punctuation mark prediction model aims to understand punctuation-free text and predict punctuation while remaining faithful to the original semantics, thereby improving the quality and readability of the text and reducing ambiguity.
Disclosure of Invention
This Summary is provided to introduce concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, an embodiment of the present disclosure provides a method for generating a punctuation mark prediction model, the method including: acquiring a text set, where the texts in the text set carry punctuation marks; for each text in the text set, generating a punctuation-free text sequence and a punctuation mark sequence from the text to form a first sample, masking characters in the text sequence, and combining the masked text sequence with the masked character sequence to form a second sample, where each object in the text sequence is a character of the text and the characters in the text sequence correspond one to one to the punctuation marks in the punctuation mark sequence; training a pre-trained language model based on the first sample and the corresponding second sample to obtain a retrained language model; and training the retrained language model based on the first sample to obtain a punctuation prediction model.
In a second aspect, an embodiment of the present disclosure provides an apparatus for generating a punctuation prediction model, the apparatus including: an acquisition unit configured to acquire a text set, where the texts in the text set carry punctuation marks; a generating unit configured to, for each text in the text set, generate a punctuation-free text sequence and a punctuation mark sequence from the text to form a first sample, mask characters in the text sequence, and combine the masked text sequence with the masked character sequence to form a second sample, where each object in the text sequence is a character of the text and the characters in the text sequence correspond one to one to the punctuation marks in the punctuation mark sequence; a first training unit configured to train a pre-trained language model based on the first sample and the corresponding second sample to obtain a retrained language model; and a second training unit configured to train the retrained language model based on the first sample to obtain the punctuation mark prediction model.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; a storage device for storing at least one program which, when executed by at least one processor, causes the at least one processor to implement a method of generating a punctuation prediction model as in the first aspect.
In a fourth aspect, the disclosed embodiments provide a computer readable medium, on which a computer program is stored, which program, when executed by a processor, implements the steps of the method for generating a punctuation prediction model as in the first aspect.
With the method, apparatus and electronic device for generating a punctuation mark prediction model provided by the embodiments of the disclosure, a text set with punctuation marks is first acquired; then, for each text in the text set, a punctuation-free text sequence and a punctuation mark sequence are generated from the text to form a first sample, characters in the text sequence are masked, and the masked text sequence and the masked character sequence form a second sample; next, a pre-trained language model is trained based on the first samples and the corresponding second samples to obtain a retrained language model; finally, the retrained language model is trained based on the first samples to obtain the punctuation mark prediction model. Compared with existing punctuation mark prediction models, the same training texts are used both to train the language model on a semantic understanding task with punctuated text and to train it on the punctuation prediction task with unpunctuated text. This helps the language model reduce its dependence on punctuation marks and learn the semantic information in punctuation-free text. At the same time, the multi-task training scheme (i.e. combining the semantic understanding task and the punctuation prediction task) makes full use of the training samples and improves the accuracy of punctuation prediction.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
FIG. 1 is a flow diagram of one embodiment of a method of generating a punctuation prediction model according to the present disclosure;
FIG. 2 is a schematic diagram of an application scenario of a method of generating a punctuation prediction model according to the present disclosure;
FIG. 3 is a flow diagram of one embodiment of training a retrained language model in a method of generating a punctuation prediction model according to the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a method of generating a punctuation prediction model according to the present disclosure;
FIG. 5 is a flow diagram of yet another embodiment of training a retrained language model in a method of generating a punctuation prediction model according to the present disclosure;
FIG. 6 is a schematic block diagram illustrating one embodiment of an apparatus for generating a punctuation prediction model according to the present disclosure;
FIG. 7 is an exemplary system architecture to which the method of generating punctuation prediction models of one embodiment of the present disclosure may be applied;
FIG. 8 is a schematic diagram of a basic structure of an electronic device provided according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an" and "the" in this disclosure are intended to be illustrative rather than limiting; those skilled in the art will understand that they mean "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Referring to FIG. 1, a flow 100 of one embodiment of a method for generating a punctuation prediction model according to the present disclosure is shown. The generation method of the punctuation mark prediction model comprises the following steps:
step 101, a text set is obtained.
In this embodiment, the executive body of the punctuation prediction model generation method may obtain a text set for training. The text in the text collection is usually punctuated.
A sentence has pauses before and after it and carries a certain intonation, expressing a relatively complete meaning. The pauses before, after or within sentences are represented by time intervals in spoken language and by punctuation marks in written language. Punctuation marks are an organic component of written language and an indispensable auxiliary tool of it; they help people express thoughts and emotions clearly and understand written language. Herein, punctuation marks may include, but are not limited to, at least one of the following: the period (。), the comma (，), the question mark (？), the exclamation mark (！), and the pause mark (、).
Step 102, for each text in the text set, generating a punctuation-free text sequence and a punctuation mark sequence from the text to form a first sample, masking characters in the text sequence, and combining the masked text sequence with the masked character sequence to form a second sample.
In this embodiment, for each text in the text set, the execution subject may generate a text sequence without punctuation marks and a punctuation mark sequence by using the text, and form the first sample. Each object in the text sequence described above is typically a character in the text. The text in the text sequence and the punctuation in the punctuation sequence are usually in a one-to-one correspondence.
Specifically, the text sequence may be represented as X = [x_1, x_2, ..., x_n] and the punctuation mark sequence as Y = [y_1, y_2, ..., y_n], where n denotes the length of the input text, i.e. the number of characters in the text sequence, x_i (i ∈ [1, n]) denotes the i-th character in the text, and y_i (i ∈ [1, n]) denotes the punctuation mark following the i-th character in the text.
It should be noted that, if there is no punctuation mark after the ith character, the punctuation mark may be identified by a preset character, for example, "empty" or "null".
As an example, if the text is "虚心使人进步，骄傲使人落后。" ("Modesty helps one make progress; conceit makes one lag behind."), the extracted text sequence is [虚, 心, 使, 人, 进, 步, 骄, 傲, 使, 人, 落, 后], and the extracted punctuation mark sequence is ["null", "null", "null", "null", "null", "，", "null", "null", "null", "null", "null", "。"].
It should be noted that, if the text is 小花猫说：“小公鸡，谢谢你！” ("The little cat said: 'Little rooster, thank you!'"), the extracted text sequence is [小, 花, 猫, 说, 小, 公, 鸡, 谢, 谢, 你], and the extracted punctuation mark sequence is ["null", "null", "null", "：“", "null", "null", "，", "null", "null", "！”"]. That is, since the characters in the text sequence and the punctuation marks in the punctuation mark sequence correspond one to one, if two punctuation marks follow a single character, the two punctuation marks are treated as a whole.
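As a rough illustrative sketch of this extraction step (not part of the original disclosure; the punctuation inventory and the "null" placeholder below are assumptions), the character sequence and punctuation mark sequence could be built as follows:

```python
# Minimal sketch of first-sample extraction (assumed punctuation set and "null" tag).
PUNCTS = set("，。？！、：；“”‘’")  # hypothetical inventory of marks to strip

def extract_first_sample(text: str):
    chars, puncts = [], []
    for ch in text:
        if ch in PUNCTS:
            if chars:
                # Consecutive marks after one character are merged into a single label.
                puncts[-1] = ch if puncts[-1] == "null" else puncts[-1] + ch
        else:
            chars.append(ch)
            puncts.append("null")   # "null" means no punctuation follows this character
    return chars, puncts

chars, puncts = extract_first_sample("虚心使人进步，骄傲使人落后。")
# chars  -> ['虚','心','使','人','进','步','骄','傲','使','人','落','后']
# puncts -> ['null','null','null','null','null','，','null','null','null','null','null','。']
```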
Then, the execution body may perform masking processing on the text in the text sequence, and combine the masked text sequence and the masked character sequence into a second sample. Here, the execution main body may select a preset proportion (for example, 10%) of characters from the text sequence as characters to be masked, and then may replace the characters to be masked with a mask identifier (mask), an original character, or another random character, so as to obtain a masked text sequence.
Specifically, the masked text sequence may be represented as X′ = [x′_1, x′_2, ..., x′_n], and the masked character sequence may be represented as X̃ = [x̃_1, x̃_2, ..., x̃_n], where n denotes the length of the input text, i.e. the number of characters in the text sequence, x′_i (i ∈ [1, n]) denotes the i-th character in the masked text, and x̃_i (i ∈ [1, n]) denotes the masked-out character corresponding to the i-th character of the masked text.
It should be noted that, if the i-th character in the masked text is an unmasked character, the corresponding masked-out character may be identified by a preset character, for example, "empty" or "null".
As an example, if the text is "虚心使人进步，骄傲使人落后。", the masked text sequence obtained by masking the text may be [虚, [mask], 使, 人, 进, 步, 骄, 哈, 使, 人, 落, 后], where the second character has been replaced with the mask identifier and the eighth character has been replaced with a random character; the corresponding masked character sequence is then ["null", 心, "null", "null", "null", "null", "null", 傲, "null", "null", "null", "null"], i.e. it records the original character at each masked position and "null" elsewhere.
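A rough sketch of this second-sample construction follows (not from the original disclosure; the 10% proportion follows the example in the text, while the 80/10/10 replacement split and the tiny vocabulary are assumptions made for illustration):

```python
import random

# Sketch of masking for the second sample (assumed replacement split and vocabulary).
VOCAB = list("虚心使人进步骄傲落后天地日月")

def make_second_sample(chars, mask_rate=0.1, mask_token="[mask]"):
    masked_chars = list(chars)
    masked_out = ["null"] * len(chars)          # masked character sequence (training targets)
    n_mask = max(1, int(len(chars) * mask_rate))
    for i in random.sample(range(len(chars)), n_mask):
        masked_out[i] = chars[i]                # record the original character at this position
        r = random.random()
        if r < 0.8:
            masked_chars[i] = mask_token        # replace with the mask identifier
        elif r < 0.9:
            masked_chars[i] = random.choice(VOCAB)  # replace with a random character
        # otherwise keep the original character unchanged
    return masked_chars, masked_out

masked_chars, masked_out = make_second_sample(list("虚心使人进步骄傲使人落后"))
```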
Step 103, training the pre-trained language model based on the first sample and the corresponding second sample to obtain a retrained language model.
In this embodiment, the executing agent may train the pre-trained language model based on the first sample and the corresponding second sample generated in step 102, so as to obtain a retrained language model. Here, the pre-trained language model is generally a model that already has a certain natural language understanding capability, and the language model may be a masked language model (MLM).
Specifically, the executing agent may train the pre-trained language model by using the text sequence without punctuation marks and the punctuation mark sequence in the first sample as the input and the expected output of the pre-trained language model, respectively. And simultaneously, respectively taking the text sequence and the character sequence subjected to the mask processing in the corresponding second sample as the input and the expected output of the pre-trained language model, and training the pre-trained language model to obtain a retrained language model.
Step 104, training the retrained language model based on the first sample to obtain a punctuation prediction model.
In this embodiment, the executing entity may train the retrained language model based on the first sample to obtain the punctuation prediction model. Note that the first sample used at this time may be a sample that is not used in step 103.
Specifically, the executing entity may train the retrained language model by using the text sequence without punctuation marks and the punctuation mark sequence in the first sample as the input and the expected output of the retrained language model, respectively.
With the method provided by the above embodiment of the present disclosure, a text set with punctuation marks is acquired; then, for each text in the text set, a punctuation-free text sequence and a punctuation mark sequence are generated from the text to form a first sample, characters in the text sequence are masked, and the masked text sequence and the masked character sequence form a second sample; next, the pre-trained language model is trained based on the first samples and the corresponding second samples to obtain a retrained language model; finally, the retrained language model is trained based on the first samples to obtain the punctuation prediction model. Compared with existing punctuation mark prediction models, the same training texts are used both to train the language model on a semantic understanding task with punctuated text and to train it on the punctuation prediction task with unpunctuated text, which helps the language model reduce its dependence on punctuation marks and learn the semantic information in punctuation-free text. At the same time, the multi-task training scheme (i.e. combining the semantic understanding task and the punctuation prediction task) makes full use of the training samples, improves the accuracy of punctuation prediction, and improves the readability of speech-recognition transcripts.
In some optional implementations, the executing entity may train the retrained language model based on the first sample to obtain the punctuation prediction model as follows. The executing entity may input the punctuation-free text sequence in the first sample into the retrained language model to obtain a punctuation mark sequence. Then, a preset first loss function may be used to determine the difference between the punctuation mark sequence in the input first sample and the obtained punctuation mark sequence. Finally, based on this difference, the model parameters of the retrained language model are adjusted until the retrained language model converges, yielding the punctuation prediction model. Here, various implementations may be employed to adjust the model parameters based on the difference; for example, the BP (Back Propagation) algorithm and the SGD (Stochastic Gradient Descent) algorithm may be used.
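A minimal sketch of this fine-tuning step follows (not from the original disclosure; PyTorch, the toy stand-in model, the vocabulary size, the label count and the learning rate are all assumptions made for illustration):

```python
import torch
import torch.nn as nn

# Sketch of the punctuation fine-tuning step (step 104). The tiny model below only
# stands in for the retrained language model; it maps character ids to per-position
# logits over the punctuation label set.
NUM_CHARS, NUM_PUNCT_LABELS = 5000, 6

model = nn.Sequential(nn.Embedding(NUM_CHARS, 128), nn.Linear(128, NUM_PUNCT_LABELS))
criterion = nn.CrossEntropyLoss()                       # the preset "first loss function"
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def punctuation_step(char_ids, punct_labels):
    # char_ids:     (batch, seq_len) ids of the punctuation-free text sequence
    # punct_labels: (batch, seq_len) ids of the punctuation mark sequence
    logits = model(char_ids)                            # (batch, seq_len, NUM_PUNCT_LABELS)
    loss = criterion(logits.reshape(-1, NUM_PUNCT_LABELS), punct_labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                     # back propagation (BP)
    optimizer.step()                                    # stochastic gradient descent (SGD)
    return loss.item()

loss = punctuation_step(torch.randint(0, NUM_CHARS, (2, 12)),
                        torch.randint(0, NUM_PUNCT_LABELS, (2, 12)))
```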
With continued reference to fig. 2, fig. 2 is a schematic diagram of an application scenario of the method for generating a punctuation prediction model according to this embodiment. In the application scenario of fig. 2, the execution body obtains a text 201, a punctuated sentence about the story "Little Tadpoles Looking for Their Mother" that mentions five animals: frog, duck, fish, tortoise and goose. The execution body may generate the punctuation-free text sequence 202 of the text, i.e. the sequence of its characters, and generate the corresponding punctuation mark sequence 203, in which positions that are not followed by punctuation are labeled "null" and the remaining positions carry the corresponding punctuation marks. The execution body may compose the text sequence 202 and the corresponding punctuation mark sequence 203 into a first sample. Then, the text sequence 202 may be masked to obtain a masked text sequence 204, together with the masked character sequence 205, in which the positions of unmasked characters are labeled "null" and the positions of masked characters record the original characters. The execution body may combine the masked text sequence 204 and the masked character sequence 205 into a second sample. The execution body may then train the pre-trained language model using the text sequence 202 and the corresponding punctuation mark sequence 203 as the input and the expected output of the pre-trained language model, respectively, while also using the masked text sequence 204 and the masked character sequence 205 as the input and the expected output of the pre-trained language model, respectively, to obtain a retrained language model. Finally, the text sequence 202 and the corresponding punctuation mark sequence 203 may be used as the input and the expected output of the retrained language model, respectively, and the retrained language model is trained to obtain a punctuation prediction model.
With further reference to FIG. 3, a flow 300 of one embodiment of training a retrained language model in the method for generating a punctuation prediction model is illustrated. The process 300 of training the retrained language model includes the following steps:
step 301, inputting the text sequence without the punctuation marks in the first sample into a pre-trained language model to obtain a punctuation mark sequence, and determining a difference between the punctuation mark sequence in the input first sample and the obtained punctuation mark sequence as a first difference by adopting a preset first loss function.
In this embodiment, the executing agent of the method for generating a punctuation prediction model may input the punctuation-free text sequence in the first sample into the pre-trained language model to obtain a punctuation mark sequence. Then, a preset first loss function may be used to determine the difference between the punctuation mark sequence in the input first sample and the obtained punctuation mark sequence as a first difference. Here, the first loss function may be a cross-entropy loss function, which may be expressed by the following formula (1):

Loss_punc = -(1/N) Σ_{i=1}^{N} Σ_{c=1}^{M_p} y_{ic} log(p_{ic})    (1)

where N is the total number of texts contained in the text set; M_p is the number of punctuation mark labels preset in the punctuation prediction task (for example, if the preset labels include the comma, period, question mark, exclamation mark, dash and "null" (indicating no punctuation), then M_p is 6); c denotes one of the preset punctuation mark labels; y_{ic} is the true value (a Boolean value) of the i-th sample under the c-th punctuation mark label; and p_{ic} is the probability value (a continuous value) predicted by the model for the i-th sample under the c-th punctuation mark label.
Step 302, inputting the text sequence after the masking processing in the corresponding second sample into a pre-trained language model to obtain a predicted masked character sequence, and determining a difference between the masked character sequence in the input second sample and the predicted masked character sequence as a second difference by using a preset second loss function.
In this embodiment, the executing entity may input the masked text sequence in the corresponding second sample into the pre-trained language model to obtain a predicted masked character sequence. Then, a preset second loss function may be used to determine the difference between the masked character sequence in the input second sample and the predicted masked character sequence as a second difference. Here, the second loss function may be a cross-entropy loss function, which may be expressed by the following formula (2):

Loss_mlm = -(1/N) Σ_{i=1}^{N} Σ_{c=1}^{M_m} y_{ic} log(p_{ic})    (2)

where N is the total number of texts contained in the text set; M_m is the number of character labels preset in the semantic understanding task (for example, if there are 1000 characters in the preset character table, M_m is 1000); c denotes one of the preset character labels; y_{ic} is the true value (a Boolean value) of the i-th sample under the c-th character label; and p_{ic} is the probability value (a continuous value) predicted by the model for the i-th sample under the c-th character label.
Step 303, determining a total difference using the first difference and the second difference.
In this embodiment, the execution subject may determine the total difference by using the first difference determined in step 301 and the second difference determined in step 302. Specifically, the execution subject may determine a weighted sum of the first difference and the second difference as a total difference.
Step 304, adjusting the model parameters of the pre-trained language model based on the total difference until the pre-trained language model converges to obtain a retrained language model.
In this embodiment, the executing entity may adjust the model parameters of the pre-trained language model based on the total difference until the pre-trained language model converges to obtain the retrained language model.
Here, various implementations may be employed to adjust the model parameters of the pre-trained language model based on the above total difference. For example, the BP algorithm and the SGD algorithm may be employed to adjust the model parameters of the pre-trained language model.
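As an illustrative sketch only (PyTorch is assumed; the toy two-head model, vocabulary size, label count, loss weights and learning rate below are assumptions, not values from the disclosure), one joint retraining step over a first sample and its corresponding second sample could look as follows:

```python
import torch
import torch.nn as nn

# Sketch of one joint retraining step (steps 301-304) combining the punctuation
# prediction task and the masked-character (semantic understanding) task.
V, H, P = 5000, 128, 6          # character vocabulary, hidden size, punctuation labels

class TwoHeadLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(V, H)
        self.punct_head = nn.Linear(H, P)    # punctuation prediction head
        self.mlm_head = nn.Linear(H, V)      # masked-character prediction head

    def forward(self, chars, masked_chars):
        return self.punct_head(self.embed(chars)), self.mlm_head(self.embed(masked_chars))

model = TwoHeadLM()
ce = nn.CrossEntropyLoss(ignore_index=-100)  # positions labeled -100 (e.g. unmasked MLM targets) are ignored
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def retrain_step(chars, punct_labels, masked_chars, mlm_targets, w_punc=1.0, w_mlm=1.0):
    punct_logits, mlm_logits = model(chars, masked_chars)
    loss_punc = ce(punct_logits.flatten(0, 1), punct_labels.flatten())   # first difference
    loss_mlm = ce(mlm_logits.flatten(0, 1), mlm_targets.flatten())       # second difference
    total = w_punc * loss_punc + w_mlm * loss_mlm                        # weighted sum = total difference
    optimizer.zero_grad()
    total.backward()                                                     # back propagation
    optimizer.step()                                                     # SGD update
    return total.item()
```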
In this embodiment, during model training, the accuracy and recall of each label may be monitored for the punctuation mark prediction task, and the perplexity (ppl) of the language model may be monitored for the language model prediction task (i.e. the semantic understanding task). When both tasks converge, training of the model is finished and the retrained language model is obtained.
In the method provided by the embodiment of the disclosure, in the process of model training, the difference of the model under the punctuation mark prediction task and the semantic understanding task is respectively determined by using the preset first loss function and the preset second loss function, and then the model parameters of the model are adjusted by using the difference until the model converges, so that the accuracy of the retrained language model can be improved.
With further reference to FIG. 4, a flow 400 of yet another embodiment of a method for generating a punctuation prediction model is illustrated. The process 400 of the method for generating the punctuation prediction model comprises the following steps:
step 401, a text set is obtained.
Step 402, aiming at each text in the text set, generating a text sequence without punctuation marks and a punctuation mark sequence by using the text to form a first sample, performing mask processing on the text in the text sequence, and forming a second sample by using the text sequence after the mask processing and the character sequence subjected to the mask processing.
In the present embodiment, the steps 401-402 can be performed in a similar manner to the steps 101-102, and are not described herein again.
Step 403, converting the capital letters in the text sequence into lowercase letters to obtain a full-lowercase text sequence, extracting corresponding upper and lower case label sequences, and combining the full-lowercase text sequence and the corresponding upper and lower case label sequences to form a third sample.
In this embodiment, the text may be English text, which usually contains both upper-case and lower-case letters. The execution body of the method for generating a punctuation prediction model may convert the upper-case letters in the text sequence into lower-case letters to obtain a full-lowercase text sequence. It should be noted that upper-case letters in English may appear as an all-capital abbreviation, for example, CPU, or as an initial capital, for example, Central Processing Unit.
Then, the corresponding case label sequence may be extracted, and the full-lowercase text sequence and the corresponding case label sequence form a third sample. Specifically, the full-lowercase text sequence may be represented as X″ = [x″_1, x″_2, ..., x″_n], and the corresponding case label sequence as Y″ = [y″_1, y″_2, ..., y″_n], where n denotes the length of the input text, i.e. the number of characters in the text sequence, x″_i (i ∈ [1, n]) denotes the i-th character in the full-lowercase text, and y″_i (i ∈ [1, n]) denotes the case label corresponding to the i-th character in the full-lowercase text.
Here, a preset first label, for example "0", may indicate that all letters in the character are lowercase; a preset second label, for example "1", may indicate that the character contains capital letters and is an English abbreviation, i.e. the character is in all capitals; and a preset third label, for example "2", may indicate that the character contains a capital letter and its first letter is capitalized.
As an example, if the text is "Sometimes your plans don't work out because God has better ones", the full-lowercase text sequence is ["sometimes", "your", "plans", "don't", "work", "out", "because", "god", "has", "better", "ones"], and the corresponding case label sequence is ["2", "0", "0", "0", "0", "0", "0", "2", "0", "0", "0"].
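A minimal sketch of this third-sample construction (not part of the original disclosure; the word-level split and the "0"/"1"/"2" label scheme follow the example above):

```python
# Build the full-lowercase sequence and the case label sequence:
# "0" = all lowercase, "1" = all capitals (abbreviation), "2" = initial capital.
def make_third_sample(text: str):
    tokens, labels = [], []
    for tok in text.split():
        if len(tok) > 1 and tok.isupper():
            labels.append("1")        # e.g. "CPU"
        elif tok[:1].isupper():
            labels.append("2")        # e.g. "Sometimes", "God"
        else:
            labels.append("0")
        tokens.append(tok.lower())
    return tokens, labels

tokens, labels = make_third_sample("Sometimes your plans don't work out because God has better ones")
# labels -> ['2', '0', '0', '0', '0', '0', '0', '2', '0', '0', '0']
```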
Step 404, training the pre-trained language model based on the first sample, the corresponding second sample and the corresponding third sample to obtain a retrained language model.
In this embodiment, the executing agent may train a pre-trained language model based on the first sample and the corresponding second sample generated in step 402 and the corresponding third sample generated in step 403, so as to obtain a retrained language model. Here, the pre-trained language model is usually a model that has a certain natural language understanding capability, and the language model may be a mask language model.
Specifically, the executing agent may train the pre-trained language model by using the punctuation-free text sequence and the punctuation mark sequence in the first sample as the input and the expected output of the pre-trained language model, respectively. At the same time, the masked text sequence and the masked character sequence in the corresponding second sample are used as the input and the expected output of the pre-trained language model, respectively, and the full-lowercase text sequence and the corresponding case label sequence in the corresponding third sample are used as the input and the expected output of the pre-trained language model, respectively, and the pre-trained language model is trained to obtain the retrained language model.
Step 405, training the retrained language model based on the first sample to obtain the punctuation prediction model.
In this embodiment, the executing entity may train the retrained language model based on the first sample to obtain the punctuation prediction model. It should be noted that the first sample used at this time may be a training sample that was not used in step 404.
Specifically, the executing entity may train the retrained language model by using the text sequence without punctuation marks and the punctuation mark sequence in the first sample as the input and the expected output of the retrained language model, respectively.
In some cases, while training with the first sample, the performing agent may train the retrained language model with the full-lowercase text sequence and the corresponding case label sequence in the third sample as inputs and expected outputs, respectively, of the retrained language model.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 1, the flow 400 of the method for generating a punctuation prediction model in this embodiment adds the step of composing a third sample from the full-lowercase text sequence and the corresponding case label sequence and training the pre-trained language model with the third sample. Therefore, the scheme described in this embodiment can train a capitalization recovery function in addition to the semantic understanding function and the punctuation prediction function of the pre-trained language model, so that, when a full-lowercase text is input, the punctuation prediction model can recover capital letters while predicting punctuation for the text.
With further reference to FIG. 5, a flow 500 of yet another embodiment of training a retrained language model in the method for generating a punctuation prediction model is illustrated. The process 500 of training the retrained language model includes the following steps:
step 501, inputting the text sequence without the punctuation marks in the first sample into a pre-trained language model to obtain a punctuation mark sequence, and determining a difference between the punctuation mark sequence in the input first sample and the obtained punctuation mark sequence as a first difference by using a preset first loss function.
Step 502, inputting the text sequence after the mask processing in the corresponding second sample into a pre-trained language model to obtain a predicted masked character sequence, and determining a difference between the masked character sequence in the input second sample and the predicted masked character sequence as a second difference by using a preset second loss function.
In the present embodiment, the steps 501-502 can be performed in a manner similar to the steps 301-302, and are not described herein again.
Step 503, inputting the full-lowercase text sequence in the corresponding third sample into a pre-trained language model to obtain a predicted case label sequence, and determining a difference between the case label sequence in the input third sample and the predicted case label sequence as a third difference by using a preset third loss function.
In this embodiment, the executing body of the method for generating a punctuation prediction model may input the full-lowercase text sequence in the corresponding third sample into the pre-trained language model to obtain a predicted case label sequence. Then, a preset third loss function may be used to determine the difference between the case label sequence in the input third sample and the predicted case label sequence as a third difference. Here, the third loss function may be a cross-entropy loss function, which may be expressed by the following formula (3):

Loss_case = -(1/N) Σ_{i=1}^{N} Σ_{c=1}^{M_c} y_{ic} log(p_{ic})    (3)

where N is the total number of texts contained in the text set; M_c is the number of case labels preset for the capitalization recovery task (for example, if the preset labels are all-lowercase, all-capitals and initial-capital, then M_c is 3); c denotes one of the preset case labels; y_{ic} is the true value (a Boolean value) of the i-th sample under the c-th case label; and p_{ic} is the probability value (a continuous value) predicted by the model for the i-th sample under the c-th case label.
Step 504, a total difference is determined using the first difference, the second difference, and the third difference.
In this embodiment, the execution subject may determine the total difference by using the first difference determined in step 501, the second difference determined in step 502, and the third difference determined in step 503. Specifically, the execution subject may determine a weighted sum of the first difference, the second difference, and the third difference as a total difference. That is, the execution subject described above can determine the total difference using the following equation (4):
Loss = w_punc · Loss_punc + w_case · Loss_case + w_mlm · Loss_mlm    (4)

where Loss_punc is the first loss function, Loss_mlm is the second loss function, Loss_case is the third loss function, w_punc is the weight corresponding to the first loss function, w_mlm is the weight corresponding to the second loss function, and w_case is the weight corresponding to the third loss function.
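As a small illustrative sketch (the weight values below are assumptions, not values given in the disclosure), the total difference of formula (4) can be computed from the three per-task losses as follows; back-propagating this single scalar updates the shared model parameters with gradients from all three tasks at once:

```python
# Weighted combination of the three task losses, as in formula (4).
def total_loss(loss_punc, loss_mlm, loss_case, w_punc=1.0, w_mlm=0.5, w_case=0.5):
    return w_punc * loss_punc + w_case * loss_case + w_mlm * loss_mlm
```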
Step 505, based on the total difference, adjusting the model parameters of the pre-trained language model until the pre-trained language model converges to obtain a retrained language model.
In this embodiment, step 505 may be performed in a manner similar to step 304, and will not be described herein again.
In the method provided by the embodiment of the disclosure, in the model training process, the differences of the model under the punctuation mark prediction task, the semantic understanding task and the capital letter recovery task are respectively determined by using the preset first loss function, the preset second loss function and the preset third loss function, and then the model parameters of the model are adjusted by using the differences until the model converges, so that the accuracy of the retrained language model can be improved.
With further reference to fig. 6, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for generating a punctuation mark prediction model, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 1, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 6, the punctuation mark prediction model generation apparatus 600 of the present embodiment includes: an acquisition unit 601, a generation unit 602, a first training unit 603, and a second training unit 604. The obtaining unit 601 is configured to obtain a text set, where texts in the text set have punctuations; the generating unit 602 is configured to generate, for each text in the text set, a text sequence without punctuations and a punctuation sequence by using the text, to form a first sample, perform mask processing on the text in the text sequence, and form a second sample by using the text sequence after the mask processing and the character sequence subjected to the mask processing, where each object in the text sequence is a character in the text, and the text in the text sequence corresponds to the punctuation in the punctuation sequence one to one; the first training unit 603 is configured to train a pre-trained language model based on the first sample and the corresponding second sample to obtain a retrained language model; the second training unit 604 is configured to train the retrained language model based on the first sample to obtain a punctuation prediction model.
In this embodiment, the specific processing of the obtaining unit 601, the generating unit 602, the first training unit 603, and the second training unit 604 of the punctuation mark prediction model generating device 600 may refer to step 101, step 102, step 103, and step 104 in the corresponding embodiment of fig. 1.
In some alternative implementations, the text is English text; the apparatus 600 for generating a punctuation prediction model may further include a conversion unit (not shown in the figure). The conversion unit is configured to convert upper-case letters in the text sequence into lower-case letters to obtain a full-lowercase text sequence, extract the corresponding case label sequence, and combine the full-lowercase text sequence and the corresponding case label sequence into a third sample.
In some optional implementation manners, the first training unit 603 may be further configured to train the pre-trained language model based on the first sample and the corresponding second sample, so as to obtain a retrained language model: the first training unit 603 may train a pre-trained language model based on the first sample, the corresponding second sample, and the corresponding third sample, so as to obtain a retrained language model.
In some optional implementation manners, the first training unit 603 may be further configured to train the pre-trained language model based on the first sample, the corresponding second sample, and the corresponding third sample, so as to obtain a retrained language model: the first training unit 603 may input the text sequence without punctuation marks in the first sample into a pre-trained language model to obtain a punctuation mark sequence, and determine, by using a preset first loss function, a difference between the punctuation mark sequence in the input first sample and the obtained punctuation mark sequence as a first difference; then, the text sequence after the masking processing in the corresponding second sample may be input into a pre-trained language model to obtain a predicted masked character sequence, and a difference between the masked character sequence in the input second sample and the predicted masked character sequence is determined as a second difference by using a preset second loss function; then, the full-lowercase text sequence in the corresponding third sample can be input into a pre-trained language model to obtain a predicted capital and small case label sequence, and the difference between the input capital and small case label sequence in the third sample and the predicted capital and small case label sequence is determined as a third difference by adopting a preset third loss function; thereafter, a total difference may be determined using the first difference, the second difference, and the third difference; finally, based on the total difference, the model parameters of the pre-trained language model are adjusted until the pre-trained language model converges to obtain the retrained language model.
In some optional implementations, the first training unit 603 may be further configured to train the pre-trained language model based on the first sample and the corresponding second sample to obtain a retrained language model by: the first training unit 603 may input the text sequence without punctuation marks in the first sample into a pre-trained language model to obtain a punctuation mark sequence, and determine, by using a preset first loss function, a difference between the punctuation mark sequence in the input first sample and the obtained punctuation mark sequence as a first difference; then, inputting the text sequence after the masking processing in the corresponding second sample into a pre-trained language model to obtain a predicted masked character sequence, and determining a difference between the masked character sequence in the input second sample and the predicted masked character sequence as a second difference by adopting a preset second loss function; then, the total difference can be determined by using the first difference and the second difference; finally, based on the total difference, the model parameters of the pre-trained language model are adjusted until the pre-trained language model converges to obtain the retrained language model.
In some alternative implementations, the second training unit 604 may be further configured to train the retrained language model based on the first sample to obtain a punctuation prediction model by: the second training unit 604 may input the text sequence without the punctuation marks in the first sample into the retrained language model to obtain a punctuation mark sequence, and determine a difference between the punctuation mark sequence in the input first sample and the obtained punctuation mark sequence by using a preset first loss function; then, based on the difference, the model parameters of the retrained language model are adjusted until the retrained language model converges to obtain the punctuation prediction model.
Referring to fig. 7, fig. 7 illustrates an exemplary system architecture to which a method of generating a punctuation prediction model of an embodiment of the present disclosure may be applied.
As shown in fig. 7, the system architecture may include terminal devices 701, 702, 703, a network 704, and a server 705. The network 704 serves to provide a medium for communication links between the terminal devices 701, 702, 703 and the server 705. Network 704 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 701, 702, 703 may interact with a server 705 over a network 704 to receive or send messages or the like. For example, the terminal devices 701, 702, 703 may obtain a pre-trained language model from the server 705. Various client applications, such as a speech recognition application, a text processing application, and instant messaging software, may be installed on the terminal devices 701, 702, and 703.
The terminal devices 701, 702 and 703 can obtain a text set with punctuations; then, aiming at each text in the text set, generating a text sequence without punctuation marks and a punctuation mark sequence by using the text to form a first sample, performing mask processing on the text in the text sequence, and forming a second sample by using the text sequence after the mask processing and the masked character sequence; then, training the pre-trained language model based on the first sample and the corresponding second sample to obtain a retrained language model; finally, the retrained language model can be trained based on the first sample to obtain a punctuation prediction model.
The terminal devices 701, 702, and 703 may be hardware or software. When the terminal devices 701, 702, and 703 are hardware, they may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like. When the terminal devices 701, 702, and 703 are software, they can be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.
The server 705 may be a server providing various services, for example, a text set sent by the receiving terminal devices 701, 702, and 703, and then, for each text in the text set, a text sequence without punctuation marks and a punctuation mark sequence may be generated by using the text to form a first sample, the text in the text sequence is subjected to mask processing, and the masked text sequence and the masked character sequence form a second sample; then, training the pre-trained language model based on the first sample and the corresponding second sample to obtain a retrained language model; finally, the retrained language model can be trained based on the first sample to obtain a punctuation prediction model. The server 705 may send the generated punctuation prediction model to the terminal devices 701, 702, 703.
It should be noted that the generating method of the punctuation mark prediction model provided by the embodiment of the present disclosure may be executed by a terminal device, and accordingly, the generating device of the punctuation mark prediction model may be disposed in the terminal device 701, 702, and 703. In addition, the method for generating the punctuation mark prediction model provided by the embodiment of the present disclosure may also be executed by the server 705, and accordingly, the generating device of the punctuation mark prediction model may be disposed in the server 705.
It should be noted that, if the terminal devices 701, 702, and 703 locally store a language model trained in advance, the server 705 and the network 704 may not exist in the exemplary system architecture; if the server 705 stores the text collection locally, the terminal devices 701, 702, 703 and the network 704 may not be present in the above exemplary system architecture.
It should be understood that the number of terminal devices, networks, and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to fig. 8, shown is a schematic diagram of an electronic device (e.g., a terminal device or a server of fig. 7) suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 8, an electronic device may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 801 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the electronic device 800 are also stored. The processing apparatus 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
Generally, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, or the like; storage 808 including, for example, magnetic tape, hard disk, etc.; and a communication device 809. The communication means 809 may allow the electronic device to communicate with other devices wirelessly or by wire to exchange data. While fig. 8 illustrates an electronic device having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be alternatively implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 809, or installed from the storage means 808, or installed from the ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a text set, wherein texts in the text set are provided with punctuation marks; for each text in the text set, generate a text sequence without punctuation marks and a punctuation mark sequence from the text to form a first sample, perform mask processing on characters in the text sequence, and form a second sample from the text sequence after the mask processing and the sequence of characters subjected to the mask processing, wherein each object in the text sequence is a character in the text, and the characters in the text sequence correspond one to one to the punctuation marks in the punctuation mark sequence; train a pre-trained language model based on the first sample and the corresponding second sample to obtain a retrained language model; and train the retrained language model based on the first sample to obtain a punctuation mark prediction model.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
According to one or more embodiments of the present disclosure, there is provided a method for generating a punctuation mark prediction model, the method including: acquiring a text set, wherein texts in the text set are provided with punctuation marks; for each text in the text set, generating a text sequence without punctuation marks and a punctuation mark sequence from the text to form a first sample, performing mask processing on characters in the text sequence, and forming a second sample from the text sequence after the mask processing and the sequence of characters subjected to the mask processing, wherein each object in the text sequence is a character in the text, and the characters in the text sequence correspond one to one to the punctuation marks in the punctuation mark sequence; training a pre-trained language model based on the first sample and the corresponding second sample to obtain a retrained language model; and training the retrained language model based on the first sample to obtain a punctuation mark prediction model.
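For illustration only, the following Python sketch shows one possible way to build the first sample and the second sample from a punctuated text as described above. The punctuation set, the "O" (no punctuation) label, the [MASK] token, and the masking probability are assumptions made for this example and are not prescribed by the disclosure.

```python
import random

# Illustrative choices, not prescribed by the disclosure
PUNCTS = {"，", "。", "？", "！", ",", ".", "?", "!"}
MASK_TOKEN = "[MASK]"
NO_PUNCT = "O"          # label meaning "no punctuation mark follows this character"

def build_first_and_second_samples(text, mask_prob=0.15, seed=0):
    """Build (character sequence, punctuation label sequence) and a masked variant."""
    chars, punct_labels = [], []
    for ch in text:
        if ch in PUNCTS:
            if punct_labels:
                punct_labels[-1] = ch       # attach the mark to the preceding character
        else:
            chars.append(ch)
            punct_labels.append(NO_PUNCT)
    first_sample = (chars, punct_labels)    # characters and labels correspond one to one

    rng = random.Random(seed)
    masked_chars, masked_out = list(chars), []
    for i, ch in enumerate(chars):
        if rng.random() < mask_prob:
            masked_chars[i] = MASK_TOKEN
            masked_out.append((i, ch))      # record which characters were masked out
    second_sample = (masked_chars, masked_out)
    return first_sample, second_sample

# Example: build_first_and_second_samples("今天天气很好，我们出去玩。")
```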
According to one or more embodiments of the present disclosure, the text is English text; and after the text sequence without punctuation marks and the punctuation mark sequence are generated from the text to form the first sample, the method further comprises: converting capital letters in the text sequence into lowercase letters to obtain a full-lowercase text sequence, extracting the corresponding upper and lower case label sequence, and forming a third sample from the full-lowercase text sequence and the corresponding upper and lower case label sequence.
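As a minimal sketch, the third sample for English text might be derived as follows; the "U"/"L" label names are assumptions made for the example, not terms used by the disclosure.

```python
def build_third_sample(char_seq):
    """Derive the full-lowercase character sequence and a per-character case label.

    "U" marks a character that was originally an uppercase letter, "L" everything else.
    """
    lower_seq = [ch.lower() for ch in char_seq]
    case_labels = ["U" if ch.isupper() else "L" for ch in char_seq]
    return lower_seq, case_labels

# build_third_sample(list("How"))  ->  (['h', 'o', 'w'], ['U', 'L', 'L'])
```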
According to one or more embodiments of the present disclosure, training a pre-trained language model based on a first sample and a corresponding second sample to obtain a retrained language model includes: and training the pre-trained language model based on the first sample, the corresponding second sample and the corresponding third sample to obtain a retrained language model.
According to one or more embodiments of the present disclosure, training a pre-trained language model based on a first sample, a corresponding second sample, and a corresponding third sample to obtain a retrained language model includes: inputting the text sequence without the punctuation marks in the first sample into a pre-trained language model to obtain a punctuation mark sequence, and determining the difference between the punctuation mark sequence in the input first sample and the obtained punctuation mark sequence as a first difference by adopting a preset first loss function; inputting the text sequence after the mask processing in the corresponding second sample into a pre-trained language model to obtain a predicted masked character sequence, and determining the difference between the masked character sequence in the input second sample and the predicted masked character sequence as a second difference by adopting a preset second loss function; inputting the full-lowercase text sequence in the corresponding third sample into a pre-trained language model to obtain a predicted upper and lower case label sequence, and determining the difference between the upper and lower case label sequence in the input third sample and the predicted upper and lower case label sequence as a third difference by adopting a preset third loss function; determining a total difference using the first difference, the second difference, and the third difference; and adjusting model parameters of the pre-trained language model based on the total difference until the pre-trained language model converges to obtain a retrained language model.
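Purely as an illustration, a retraining step that combines the three differences could look like the following PyTorch-style sketch. The encoder, the three classification heads, the batch field names, and the unweighted sum used as the total difference are all assumptions for this example; the disclosure does not fix a particular architecture or weighting scheme.

```python
import torch
import torch.nn as nn

# Hypothetical multi-head model: a shared pre-trained encoder (returning hidden states
# of shape [batch, seq_len, hidden]) plus three token-level classification heads.
class MultiTaskHeads(nn.Module):
    def __init__(self, encoder, hidden_size, num_punct_labels, vocab_size, num_case_labels):
        super().__init__()
        self.encoder = encoder
        self.punct_head = nn.Linear(hidden_size, num_punct_labels)   # predicts punctuation labels
        self.mlm_head = nn.Linear(hidden_size, vocab_size)           # predicts masked characters
        self.case_head = nn.Linear(hidden_size, num_case_labels)     # predicts case labels

first_loss = nn.CrossEntropyLoss()                    # preset first loss function
second_loss = nn.CrossEntropyLoss(ignore_index=-100)  # preset second loss function (unmasked positions ignored)
third_loss = nn.CrossEntropyLoss()                    # preset third loss function

def retraining_step(model, optimizer, batch):
    """One retraining step: compute the three differences, sum them, and update parameters."""
    h1 = model.encoder(batch["unpunctuated_ids"])
    d1 = first_loss(model.punct_head(h1).transpose(1, 2), batch["punct_labels"])
    h2 = model.encoder(batch["masked_ids"])
    d2 = second_loss(model.mlm_head(h2).transpose(1, 2), batch["masked_char_labels"])
    h3 = model.encoder(batch["lowercase_ids"])
    d3 = third_loss(model.case_head(h3).transpose(1, 2), batch["case_labels"])
    total = d1 + d2 + d3          # total difference; an unweighted sum is only one possible choice
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()

# e.g. optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```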
According to one or more embodiments of the present disclosure, training a pre-trained language model based on a first sample and a corresponding second sample to obtain a retrained language model includes: inputting the text sequence without the punctuation marks in the first sample into a pre-trained language model to obtain a punctuation mark sequence, and determining the difference between the punctuation mark sequence in the input first sample and the obtained punctuation mark sequence as a first difference by adopting a preset first loss function; inputting the text sequence after the mask processing in the corresponding second sample into a pre-trained language model to obtain a predicted masked character sequence, and determining the difference between the masked character sequence in the input second sample and the predicted masked character sequence as a second difference by adopting a preset second loss function; determining a total difference using the first difference and the second difference; and adjusting model parameters of the pre-trained language model based on the total difference until the pre-trained language model converges to obtain a retrained language model.
According to one or more embodiments of the present disclosure, training the retrained language model based on the first sample to obtain a punctuation prediction model includes: inputting the text sequence without punctuation marks in the first sample into the retrained language model to obtain a punctuation mark sequence, and determining the difference between the punctuation mark sequence in the input first sample and the obtained punctuation mark sequence by adopting a preset first loss function; and adjusting the model parameters of the retrained language model based on the difference until the retrained language model converges to obtain the punctuation prediction model.
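Continuing the hypothetical names from the previous sketch, the second training stage might use only the punctuation labels and the preset first loss function:

```python
def finetuning_step(model, optimizer, batch):
    """Second-stage step: only the first sample and the first loss function are used."""
    hidden = model.encoder(batch["unpunctuated_ids"])
    logits = model.punct_head(hidden)                        # [batch, seq_len, num_punct_labels]
    difference = first_loss(logits.transpose(1, 2), batch["punct_labels"])
    optimizer.zero_grad()
    difference.backward()
    optimizer.step()
    return difference.item()
```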
According to one or more embodiments of the present disclosure, there is provided an apparatus for generating a punctuation prediction model, the apparatus including: an acquisition unit, configured to acquire a text set, wherein texts in the text set are provided with punctuation marks; a generating unit, configured to generate, for each text in the text set, a text sequence without punctuation marks and a punctuation mark sequence by using the text, to form a first sample, perform mask processing on the text in the text sequence, and form a second sample by using the text sequence after the mask processing and the character sequence subjected to the mask processing, where each object in the text sequence is a character in the text, and the text in the text sequence corresponds to the punctuation marks in the punctuation mark sequence one to one; a first training unit, configured to train a pre-trained language model based on the first sample and the corresponding second sample to obtain a retrained language model; and a second training unit, configured to train the retrained language model based on the first sample to obtain a punctuation mark prediction model.
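A rough structural sketch of how these units might be composed in code follows; all identifiers are illustrative assumptions and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class PunctuationModelGenerator:
    acquire_texts: Callable[[], List[str]]                 # acquisition unit
    build_samples: Callable[[str], Tuple]                  # generating unit (first + second sample)
    retrain: Callable[[Sequence[Tuple]], object]           # first training unit
    finetune: Callable[[object, Sequence[Tuple]], object]  # second training unit

    def generate_prediction_model(self):
        texts = self.acquire_texts()
        samples = [self.build_samples(t) for t in texts]
        retrained_model = self.retrain(samples)
        return self.finetune(retrained_model, samples)
```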
According to one or more embodiments of the present disclosure, the text is english text; and the apparatus further comprises: and the conversion unit is used for converting the capital letters in the text sequence into the lowercase letters to obtain a full-lowercase text sequence, extracting the corresponding upper and lower case label sequence, and forming a third sample by the full-lowercase text sequence and the corresponding upper and lower case label sequence.
According to one or more embodiments of the present disclosure, the first training unit is further configured to train the pre-trained language model based on the first sample and the corresponding second sample to obtain a retrained language model by: and training the pre-trained language model based on the first sample, the corresponding second sample and the corresponding third sample to obtain a retrained language model.
According to one or more embodiments of the present disclosure, the first training unit is further configured to train the pre-trained language model based on the first sample, the corresponding second sample, and the corresponding third sample, to obtain a retrained language model, by: inputting the text sequence without the punctuation marks in the first sample into a pre-trained language model to obtain a punctuation mark sequence, and determining the difference between the punctuation mark sequence in the input first sample and the obtained punctuation mark sequence as a first difference by adopting a preset first loss function; inputting the text sequence after the mask processing in the corresponding second sample into a pre-trained language model to obtain a predicted masked character sequence, and determining the difference between the masked character sequence in the input second sample and the predicted masked character sequence as a second difference by adopting a preset second loss function; inputting the full-lowercase text sequence in the corresponding third sample into a pre-trained language model to obtain a predicted upper and lower case label sequence, and determining the difference between the upper and lower case label sequence in the input third sample and the predicted upper and lower case label sequence as a third difference by adopting a preset third loss function; determining a total difference using the first difference, the second difference, and the third difference; and adjusting model parameters of the pre-trained language model based on the total difference until the pre-trained language model converges to obtain a retrained language model.
According to one or more embodiments of the present disclosure, the first training unit is further configured to train the pre-trained language model based on the first sample and the corresponding second sample, to obtain a retrained language model, as follows: inputting the text sequence without the punctuation marks in the first sample into a pre-trained language model to obtain a punctuation mark sequence, and determining the difference between the punctuation mark sequence in the input first sample and the obtained punctuation mark sequence as a first difference by adopting a preset first loss function; inputting the text sequence after the mask processing in the corresponding second sample into a pre-trained language model to obtain a predicted masked character sequence, and determining the difference between the masked character sequence in the input second sample and the predicted masked character sequence as a second difference by adopting a preset second loss function; determining a total difference using the first difference and the second difference; and adjusting model parameters of the pre-trained language model based on the total difference until the pre-trained language model converges to obtain a retrained language model.
According to one or more embodiments of the present disclosure, the second training unit is further configured to train the retrained language model based on the first sample to obtain a punctuation mark prediction model by: inputting the text sequence without punctuation marks in the first sample into the retrained language model to obtain a punctuation mark sequence, and determining the difference between the punctuation mark sequence in the input first sample and the obtained punctuation mark sequence by adopting a preset first loss function; and adjusting the model parameters of the retrained language model based on the difference until the retrained language model converges to obtain the punctuation prediction model.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation on the unit itself, for example, an acquisition unit may also be described as a "unit to acquire a text set".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other combinations of the above features or their equivalents without departing from the spirit of the disclosure. For example, a technical solution may be formed by replacing the above features with (but not limited to) features disclosed in this disclosure that have similar functions.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (10)

1. A method for generating a punctuation prediction model, comprising:
acquiring a text set, wherein texts in the text set are provided with punctuations;
generating, for each text in the text set, a text sequence without punctuations and a punctuation sequence by using the text to form a first sample, performing mask processing on the text in the text sequence, and forming a second sample by using the masked text sequence and the masked character sequence, wherein each object in the text sequence is a character in the text, and the text in the text sequence corresponds to the punctuations in the punctuation sequence one by one;
training a pre-trained language model based on the first sample and the corresponding second sample to obtain a retrained language model;
and training the retrained language model based on the first sample to obtain a punctuation prediction model.
2. The method of claim 1, wherein the text is English text; and
after the generating, using the text, a text sequence without punctuation and a punctuation sequence, constituting a first sample, the method further comprises:
converting capital letters in the text sequence into lowercase letters to obtain a full-lowercase text sequence, extracting corresponding upper and lower case label sequences, and forming a third sample by the full-lowercase text sequence and the corresponding upper and lower case label sequences.
3. The method of claim 2, wherein training the pre-trained language model based on the first sample and the corresponding second sample to obtain a retrained language model comprises:
and training the pre-trained language model based on the first sample, the corresponding second sample and the corresponding third sample to obtain a retrained language model.
4. The method of claim 3, wherein training the pre-trained language model based on the first sample, the corresponding second sample, and the corresponding third sample to obtain a retrained language model comprises:
inputting the text sequence without the punctuation marks in the first sample into a pre-trained language model to obtain a punctuation mark sequence, and determining the difference between the punctuation mark sequence in the input first sample and the obtained punctuation mark sequence as a first difference by adopting a preset first loss function;
inputting the text sequence after the mask processing in the corresponding second sample into a pre-trained language model to obtain a predicted masked character sequence, and determining the difference between the masked character sequence in the input second sample and the predicted masked character sequence as a second difference by adopting a preset second loss function;
inputting the full-lowercase text sequence in the corresponding third sample into a pre-trained language model to obtain a predicted upper and lower case label sequence, and determining the difference between the upper and lower case label sequence in the input third sample and the predicted upper and lower case label sequence as a third difference by adopting a preset third loss function;
determining a total difference using the first difference, the second difference, and the third difference;
and adjusting model parameters of the pre-trained language model based on the total difference until the pre-trained language model converges to obtain a retrained language model.
5. The method of claim 1, wherein training the pre-trained language model based on the first sample and the corresponding second sample to obtain a retrained language model comprises:
inputting the text sequence without the punctuation marks in the first sample into a pre-trained language model to obtain a punctuation mark sequence, and determining the difference between the punctuation mark sequence in the input first sample and the obtained punctuation mark sequence as a first difference by adopting a preset first loss function;
inputting the text sequence after the mask processing in the corresponding second sample into a pre-trained language model to obtain a predicted masked character sequence, and determining the difference between the masked character sequence in the input second sample and the predicted masked character sequence as a second difference by adopting a preset second loss function;
determining a total difference using the first difference and the second difference;
and adjusting model parameters of the pre-trained language model based on the total difference until the pre-trained language model converges to obtain a retrained language model.
6. The method according to any one of claims 1-5, wherein training the retrained language model based on the first sample to obtain a punctuation prediction model comprises:
inputting the text sequence without punctuation marks in the first sample into the retrained language model to obtain a punctuation mark sequence, and determining the difference between the punctuation mark sequence in the input first sample and the obtained punctuation mark sequence by adopting a preset first loss function;
and adjusting the model parameters of the retrained language model based on the difference until the retrained language model converges to obtain a punctuation mark prediction model.
7. An apparatus for generating a punctuation prediction model, comprising:
an acquisition unit, configured to acquire a text set, wherein texts in the text set are provided with punctuations;
a generating unit, configured to generate, for each text in the text set, a text sequence without punctuation marks and a punctuation mark sequence by using the text, to form a first sample, perform mask processing on the text in the text sequence, and form a second sample by using the text sequence after the mask processing and the character sequence subjected to the mask processing, where each object in the text sequence is a character in the text, and the text in the text sequence corresponds to the punctuation marks in the punctuation mark sequence one to one;
the first training unit is used for training a pre-trained language model based on the first sample and the corresponding second sample to obtain a retrained language model;
and the second training unit is used for training the retrained language model based on the first sample to obtain a punctuation mark prediction model.
8. The apparatus of claim 7, wherein the text is English text; and
the device further comprises:
and the conversion unit is used for converting the capital letters in the text sequence into the lowercase letters to obtain a full-lowercase text sequence, extracting the corresponding upper and lower case label sequence, and forming a third sample by the full-lowercase text sequence and the corresponding upper and lower case label sequence.
9. An electronic device, comprising:
at least one processor;
a storage device having at least one program stored thereon,
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-6.
10. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6.
CN202210823101.3A 2022-07-12 2022-07-12 Method and device for generating punctuation mark prediction model and electronic equipment Pending CN115129877A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210823101.3A CN115129877A (en) 2022-07-12 2022-07-12 Method and device for generating punctuation mark prediction model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210823101.3A CN115129877A (en) 2022-07-12 2022-07-12 Method and device for generating punctuation mark prediction model and electronic equipment

Publications (1)

Publication Number Publication Date
CN115129877A true CN115129877A (en) 2022-09-30

Family

ID=83383387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210823101.3A Pending CN115129877A (en) 2022-07-12 2022-07-12 Method and device for generating punctuation mark prediction model and electronic equipment

Country Status (1)

Country Link
CN (1) CN115129877A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674629A (en) * 2019-09-27 2020-01-10 上海智臻智能网络科技股份有限公司 Punctuation mark model and its training method, equipment and storage medium
CN112580326A (en) * 2019-09-27 2021-03-30 上海智臻智能网络科技股份有限公司 Punctuation mark model and training system thereof
CN112667768A (en) * 2019-09-27 2021-04-16 上海智臻智能网络科技股份有限公司 Correction system for punctuation marks
CN111222321A (en) * 2019-12-24 2020-06-02 北京明略软件系统有限公司 Punctuation symbol processing method and device
WO2021189851A1 (en) * 2020-09-03 2021-09-30 平安科技(深圳)有限公司 Text error correction method, system and device, and readable storage medium
CN113239705A (en) * 2021-07-12 2021-08-10 北京百度网讯科技有限公司 Pre-training method and device of semantic representation model, electronic equipment and storage medium
CN113449489A (en) * 2021-07-22 2021-09-28 深圳追一科技有限公司 Punctuation mark marking method, punctuation mark marking device, computer equipment and storage medium
CN114492390A (en) * 2021-12-17 2022-05-13 深圳市北科瑞讯信息技术有限公司 Data expansion method, device, equipment and medium based on keyword recognition
CN114281997A (en) * 2021-12-28 2022-04-05 维沃移动通信有限公司 Model training method, text processing device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NUNO MIGUEL GUERREIRO ET AL.: "Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts", 《EXPERT SYSTEMS WITH APPLICATIONS》, 14 August 2021 (2021-08-14), pages 1 - 10 *
ZHAO LIANZHEN (赵连振): "Research on automatic punctuation of pre-Qin and Han dynasty classics for digital humanities: taking the SikuBERT pre-trained model as an example", 《图书馆论坛》 (Library Tribune), vol. 42, no. 12, 16 April 2022 (2022-04-16), pages 120 - 128 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117113941A (en) * 2023-10-23 2023-11-24 新声科技(深圳)有限公司 Punctuation mark recovery method and device, electronic equipment and storage medium
CN117113941B (en) * 2023-10-23 2024-02-06 新声科技(深圳)有限公司 Punctuation mark recovery method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109902186B (en) Method and apparatus for generating neural network
CN107526725B (en) Method and device for generating text based on artificial intelligence
CN112489620B (en) Speech synthesis method, device, readable medium and electronic equipment
CN112634876B (en) Speech recognition method, device, storage medium and electronic equipment
CN111753551B (en) Information generation method and device based on word vector generation model
CN112489621B (en) Speech synthesis method, device, readable medium and electronic equipment
WO2023165538A1 (en) Speech recognition method and apparatus, and computer-readable medium and electronic device
CN111435592B (en) Voice recognition method and device and terminal equipment
CN112509562B (en) Method, apparatus, electronic device and medium for text post-processing
CN111597825B (en) Voice translation method and device, readable medium and electronic equipment
CN112270200B (en) Text information translation method and device, electronic equipment and storage medium
CN112883968B (en) Image character recognition method, device, medium and electronic equipment
US11036996B2 (en) Method and apparatus for determining (raw) video materials for news
CN111625649A (en) Text processing method and device, electronic equipment and medium
US20240078385A1 (en) Method and apparatus for generating text
CN111582360A (en) Method, apparatus, device and medium for labeling data
CN111883117A (en) Voice wake-up method and device
CN115908640A (en) Method and device for generating image, readable medium and electronic equipment
CN112364653A (en) Text analysis method, apparatus, server and medium for speech synthesis
CN110008926B (en) Method and device for identifying age
CN113468857B (en) Training method and device for style conversion model, electronic equipment and storage medium
CN113902838A (en) Animation generation method, animation generation device, storage medium and electronic equipment
CN115129877A (en) Method and device for generating punctuation mark prediction model and electronic equipment
CN112906381B (en) Dialog attribution identification method and device, readable medium and electronic equipment
CN113571044A (en) Voice information processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination