CN111428479B - Method and device for predicting punctuation in text


Info

Publication number
CN111428479B
CN111428479B (application CN202010207942.2A)
Authority
CN
China
Prior art keywords
punctuation
text
character
sequence
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010207942.2A
Other languages
Chinese (zh)
Other versions
CN111428479A (en)
Inventor
薛小娜
张文剑
牟小峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd
Priority to CN202010207942.2A
Publication of CN111428479A
Application granted
Publication of CN111428479B
Legal status: Active


Abstract

The invention discloses a method for predicting punctuation in text, which comprises the following steps: preprocessing text with punctuation to obtain a training corpus; training a predetermined training model with the training corpus to obtain a prediction model; and inputting the text without punctuation to be predicted into the prediction model to obtain a prediction result. The invention also discloses a device for predicting punctuation in text.

Description

Method and device for predicting punctuation in text
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for predicting punctuation in a punctuation-free text.
Background
Text generated by Automatic Speech Recognition (ASR) systems typically contains no punctuation and no segmentation. However, the presence of punctuation greatly improves the readability of text, and segmenting text at punctuation locations also improves the performance of many downstream natural language processing tasks, such as relation extraction, semantic parsing, or machine translation.
In the public security field, a large number of valuable voice files are generated every day, but their storage and use costs are high and their utilization rate is low. To reduce these costs and use the voice information effectively, it is desirable to convert the voice files into text files through ASR technology; however, the resulting text contains no punctuation and is not segmented, so it reads poorly and is difficult to use directly for other tasks. It is therefore very meaningful to construct a scheme for punctuating such speech-derived text.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a method and a device for predicting punctuation in text: a model is trained with punctuated text as the training corpus to obtain a corresponding prediction model, which is then used to predict punctuation for unpunctuated text, improving both the readability of the unpunctuated text and the convenience of its further use.
The embodiment of the invention provides a method for predicting punctuation in text, which comprises the following steps:
preprocessing the text with punctuation to obtain training corpus;
training a preset training model by using the training corpus to obtain a prediction model;
inputting the text without punctuation to be predicted into the prediction model to obtain a prediction result.
The embodiment of the invention also provides a device for predicting punctuation in text, which comprises:
the corpus preprocessing module is used for preprocessing the text with the punctuation to obtain training corpus;
the training module is used for training a preset training model by utilizing the training corpus to obtain a prediction model;
and the prediction module is used for inputting the text without the punctuation to be predicted into the prediction model to obtain a prediction result.
The embodiment of the invention also provides a storage medium, wherein the storage medium stores a computer program, and the computer program is configured to execute the method for predicting punctuation in the text when running.
The embodiment of the invention also provides an electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to execute the method for predicting punctuation in the text.
Drawings
FIG. 1 is a flowchart of a method for predicting punctuation in text according to a first embodiment;
FIG. 2 is a flowchart of a method for predicting punctuation in text according to a second embodiment;
FIG. 3 is a schematic diagram of the procedure for solving the sequence labeling problem with BERT-CRF according to the second embodiment;
FIG. 4 is a schematic diagram of the BERT model input representation provided in the second embodiment;
fig. 5 is a block diagram of an apparatus for predicting punctuation in text according to a third embodiment.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the drawings and the embodiments. It should be noted that, in the absence of conflict, the embodiments and the features in the embodiments may be combined with each other arbitrarily.
In the embodiment of the invention, a BERT model is used. BERT is a deep bidirectional language representation model based on the Transformer; essentially, it constructs a multi-layer bidirectional Encoder network using the Transformer structure. Its performance exceeded that of many systems using task-specific architectures, and it set new state-of-the-art records on 11 NLP tasks. The BERT pre-trained model greatly reduces the difficulty of training word vectors and improves the accuracy of various natural language processing tasks, including text classification, sequence labeling, and others.
Example 1
The embodiment of the invention provides a method for predicting punctuation in text, as shown in FIG. 1, which comprises the following steps:
step 101, preprocessing a text with punctuation to obtain a training corpus;
step 102, training a preset training model by using the training corpus to obtain a prediction model;
and step 103, inputting the text without punctuation to be predicted into the prediction model to obtain a prediction result.
Optionally, preprocessing the text with punctuation in step 101 to obtain a training corpus, including:
determining punctuation categories to be predicted;
converting punctuation included in the text with the punctuation according to the determined punctuation category, and then segmenting the converted text to obtain at least one sequence;
and after adding punctuation labels to the at least one sequence, writing the labeled sequences into a file according to a preset rule to obtain the training corpus.
Optionally, adding punctuation labels to the at least one sequence and then writing the labeled sequences into a file according to the preset rule to obtain the training corpus includes:
the following operations are performed in turn for each of the at least one sequence:
a punctuation label is set for each character in the sequence, and non-punctuation characters in the sequence and the corresponding labels are combined to form a labeled row corresponding to the sequence;
and writing the labeled rows corresponding to each sequence into a file according to the preset rule to obtain the training corpus.
Optionally, setting a punctuation label for each character in the sequence includes:
the punctuation labels comprise: comma labels, period labels, and other labels;
the punctuation label of each character is set according to the next character after it, comprising:
the punctuation label of each character is set to the other label by default;
when the next character is a comma, the punctuation label of the current character is modified to the comma label; when the next character is a period, the punctuation label of the current character is modified to the period label; when a character is the end character, its punctuation label remains the other label.
Optionally, the predetermined training model includes:
a Transformer-based deep bidirectional language representation BERT model augmented with a fully connected layer and a conditional random field CRF layer; the fully connected layer is connected to the output layer of the BERT model and is used to map the output vectors of the BERT model to a predefined label set; the CRF layer is connected to the output of the fully connected layer.
Optionally, the inputting the text to be predicted without punctuation into the prediction model to obtain a prediction result includes:
segmenting the text to be predicted without punctuation to obtain at least one sequence;
inputting the at least one sequence into the prediction model, and determining punctuation marks corresponding to each character in each sequence;
and merging according to punctuation marks corresponding to each character in each sequence to form a predicted text with punctuation.
Optionally, the segmenting the text to be predicted without punctuation to obtain at least one sequence includes: and cutting by taking the characters as units to obtain the at least one sequence.
Optionally, the punctuation category to be predicted at least includes: commas and periods;
the converting, according to the determined punctuation category, the punctuation included in the text with punctuation includes:
when a punctuation included in the punctuation-bearing text has a comma function in the punctuation-bearing text, the punctuation is replaced with a comma;
when a punctuation included in the punctuation-bearing text has a period function in the punctuation-bearing text, the punctuation is replaced with a period;
and deleting the punctuation from the text with punctuation when the punctuation included in the text with punctuation has neither comma nor period functions in the text with punctuation.
Optionally, when the text with punctuation and the text to be predicted without punctuation contain English, the characters are English words.
Example 2
The embodiment of the invention provides a method for predicting punctuation in text, as shown in FIG. 2, which comprises the following steps:
step 201: preprocessing the text with punctuation to obtain a training corpus, which is also called the training text;
step 202: fine-tuning the BERT model to determine a training model, and training the training model with the training corpus to obtain a prediction model;
step 203: forming the corpus to be predicted from the text without punctuation according to a preset rule, and inputting the corpus to be predicted into the prediction model to obtain a prediction result.
Before preprocessing the text with punctuation in step 201, the method further includes step 200, preparation of punctuation classification data, which includes:
step 20001: the categories of punctuation to be predicted are determined. For example, in this embodiment there are two categories of punctuation to be predicted: commas and periods. Optionally, the relevant training corpus is selected according to the requirements of the text to be predicted, and the corresponding categories of punctuation to be predicted are determined accordingly; they need not be limited to commas and periods only.
Step 20002: the punctuation marks that may occur are classified according to the categories of punctuation to be predicted. For example, the punctuation to be predicted is of the two categories identified above, comma and period. Then punctuation marks that may appear in the training text and represent a pause within a sentence, such as commas, enumeration commas (、), and semicolons, are classified as having the same or similar function as the comma; punctuation marks that may appear in the training text and indicate the end of a sentence, such as periods, exclamation marks, and question marks, are classified as having the same or similar function as the period; other punctuation marks in the training text that cannot be categorized as comma-like or period-like, such as dashes and quotation marks, are categorized as other punctuation.
Step 20003: after the possible punctuation marks are classified, each punctuation category is saved together with the same or similar punctuation marks it includes. For example, in this embodiment the following is stored:
comma: the corresponding same or similar punctuation marks are commas, enumeration commas, semicolons, and other marks representing a pause within a sentence; these are stored in the corresponding comma list;
period: the corresponding same or similar punctuation marks are periods, exclamation marks, question marks, and other marks representing the end of a sentence; these are stored in the corresponding period list;
other: the corresponding punctuation marks are dashes, quotation marks, and other marks that cannot be categorized as comma-like or period-like; these are stored in the corresponding other-punctuation list.
Optionally, if a further punctuation category A is determined in step 20001 to require prediction, the classification in step 20002 is extended with the same or similar punctuation marks belonging to category A, and in step 20003 category A and the punctuation marks it includes are saved as well. From the above, a person skilled in the art can understand how to determine the categories of punctuation to be predicted and how to classify the various punctuation marks that may appear in the training text; the invention is not limited to the examples illustrated in this embodiment.
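For illustration only, the saving of step 20003 may be sketched as follows; a minimal Python sketch in which the variable name and the exact member lists are illustrative assumptions rather than limitations of this embodiment:

```python
# Minimal sketch of step 20003: each punctuation category is saved together with
# the same or similar marks it includes. Name and members are illustrative.
PUNCT_CLASSES = {
    "comma":  [",", "、", ";"],            # marks representing a pause within a sentence
    "period": ["。", "!", "?"],            # marks representing the end of a sentence
    "other":  ["——", "“", "”", "‘", "’"],  # marks in neither of the two classes above
}
# A further category A determined in step 20001 is added the same way:
# PUNCT_CLASSES["punctuation_A"] = [...]
```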
Optionally, in step 201, preprocessing the text with punctuation to obtain a training corpus, including:
step 20101: the text with punctuation, namely the training text, is segmented in units of characters to obtain a character sequence. For example, the text "花园里有月季、玫瑰、海棠,五颜六色;花园里还有梧桐树、枫树,郁郁葱葱;真是一派生机勃勃!" ("The garden has Chinese roses, roses and crabapples, colorful; the garden also has phoenix trees and maples, lush and green; truly a scene full of vitality!") is split into the sequence: [花, 园, 里, 有, 月, 季, 、, 玫, 瑰, 、, 海, 棠, ,, 五, 颜, 六, 色, ;, 花, 园, 里, 还, 有, 梧, 桐, 树, 、, 枫, 树, ,, 郁, 郁, 葱, 葱, ;, 真, 是, 一, 派, 生, 机, 勃, 勃, !];
step 20102: punctuation conversion and filtering is performed, in which each character after segmentation is converted, i.e., replaced and/or filtered, as follows:
if the character appears in the comma list, it is replaced by a comma; if the character appears in the period list, it is replaced by a period; if the character appears in the other-punctuation list, it is deleted; all other characters remain unchanged.
For example, the above character sequence, after replacement and/or filtering, becomes: [花, 园, 里, 有, 月, 季, ,, 玫, 瑰, ,, 海, 棠, ,, 五, 颜, 六, 色, ,, 花, 园, 里, 还, 有, 梧, 桐, 树, ,, 枫, 树, ,, 郁, 郁, 葱, 葱, ,, 真, 是, 一, 派, 生, 机, 勃, 勃, 。];
Step 20103: the replaced and/or filtered sequence is merged into the sequence S "花园里有月季,玫瑰,海棠,五颜六色,花园里还有梧桐树,枫树,郁郁葱葱,真是一派生机勃勃。";
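For illustration only, steps 20101 to 20103 may be sketched as follows; a minimal Python sketch reusing the comma/period/other lists above, in which all names are illustrative:

```python
# Minimal sketch of steps 20101-20103: split the training text into characters,
# replace or filter each character by its punctuation class, and merge into S.
COMMA_LIST  = [",", "、", ";"]
PERIOD_LIST = ["。", "!", "?"]
OTHER_LIST  = ["——", "“", "”", "‘", "’"]

def convert(text):
    out = []
    for ch in list(text):          # step 20101: segment in units of characters
        if ch in COMMA_LIST:       # step 20102: replace and/or filter
            out.append(",")
        elif ch in PERIOD_LIST:
            out.append("。")
        elif ch in OTHER_LIST:
            continue               # delete other-class punctuation
        else:
            out.append(ch)         # keep all remaining characters
    return "".join(out)            # step 20103: merge into the sequence S

S = convert("花园里有月季、玫瑰、海棠,五颜六色;花园里还有梧桐树、枫树,郁郁葱葱;真是一派生机勃勃!")
# S == "花园里有月季,玫瑰,海棠,五颜六色,花园里还有梧桐树,枫树,郁郁葱葱,真是一派生机勃勃。"
```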
step 20104: dynamically segmenting the sequence S, wherein the operation parameters during segmentation at least comprise: maximum length sentenceLen of the dynamic sequence, and maximum overlapping character number preTextLen of the current sequence and the above; the maximum length sentenceLen of a dynamic sequence represents the maximum number of characters of a sub-sequence when slicing.
For example, with sentenceLen = 10 and preTextLen = 3, the above sequence S is split into the following sub-sequences:
S1: 花园里有月季,玫瑰,
S2: 海棠,五颜六色,花园
S3: 花园里还有梧桐树,枫
S4: 枫树,郁郁葱葱,真是
S5: 真是一派生机勃勃。
The cutting steps are as follows:
according to sentenceLen = 10, S1 "花园里有月季,玫瑰," is cut first;
continuing to cut the next sub-sequence: the distance between the last punctuation in S1 and the beginning "海" of the next sequence is 0, and 0 is smaller than preTextLen (3), so the cutting starts directly from "海", and cutting according to sentenceLen = 10 gives S2 "海棠,五颜六色,花园";
continuing to cut the next sub-sequence: the distance between the last punctuation in S2 and the beginning "里" of the next sequence is 2, and 2 is smaller than preTextLen (3), so 2 characters can be repeated; the cutting starts from "花", and cutting according to sentenceLen = 10 gives S3 "花园里还有梧桐树,枫";
continuing to cut the next sub-sequence: the distance between the last punctuation in S3 and the beginning of the next sequence is 2, and 2 is smaller than preTextLen (3), so characters can be repeated; the cutting starts from "枫", and cutting according to sentenceLen = 10 gives S4 "枫树,郁郁葱葱,真是";
continuing to cut the next sub-sequence: the distance between the last punctuation in S4 and the next character "一" is 2, and 2 is smaller than preTextLen (3), so 2 characters can be repeated; the cutting starts from "真", and the cut gives S5 "真是一派生机勃勃。".
For another example, if S is "今天,我家花园里有月季,玫瑰,…", then cutting according to sentenceLen = 10 gives S1 "今天,我家花园里有月"; continuing to cut the next sub-sequence, since the distance between the last punctuation in S1 and the next character "季" is 7, and 7 is greater than preTextLen (3), at most 3 characters can be repeated, so the cutting starts from "里", giving S2 "里有月季,玫瑰,…".
The values sentenceLen = 10 and preTextLen = 3 above are chosen for convenience of illustration. Optionally, in practical applications sentenceLen may be set to 150, preTextLen to 20, and so on, adjusted according to the sentence lengths of the training text.
Step 20105: punctuation labels are added to each character of each sub-sequence, and the results are combined according to a preset format to form the training corpus.
The punctuation labels include: the comma label, the period label, and the other label. A punctuation label is added to each character of a sub-sequence as follows:
the punctuation label of each character is set to the other label O by default;
when the next character is a comma, the punctuation label of the current character is modified to the comma label C; when the next character is a period, the punctuation label of the current character is modified to the period label P; when a character is the end character, its punctuation label remains the other label O.
Taking the sub-sequence S1 as an example, the label setting result is as follows:
S1: 花 O;园 O;里 O;有 O;月 O;季 C;, O;玫 O;瑰 C;, O;
after the sub-sequence S1 is labeled, the entries whose character is itself a punctuation mark, namely the ", O" entries, are removed;
the sub-sequence S1 is thus processed into S1': 花 O;园 O;里 O;有 O;月 O;季 C;玫 O;瑰 C.
According to the setting and processing results of all the sub-sequences, the rows are merged and written into a file according to the following preset format to form the training corpus (a sketch follows the list):
1) each character in each sub-sequence and its label are separated by a space;
2) each sequence together with its label values forms one row;
3) the sub-sequences are separated by empty rows.
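For illustration only, step 20105 may be sketched as follows; a minimal Python sketch using the O/C/P labels above, in which the function names are illustrative:

```python
# Minimal sketch of step 20105: set a label for each character from the character
# that follows it, drop entries whose character is itself punctuation, and write
# one row per sub-sequence with empty rows between sub-sequences.
def label_subsequence(sub):
    pairs = []
    for k, ch in enumerate(sub):
        nxt = sub[k + 1] if k + 1 < len(sub) else None
        label = "C" if nxt == "," else "P" if nxt == "。" else "O"
        if ch not in (",", "。"):   # remove entries whose character is punctuation
            pairs.append((ch, label))
    return pairs                     # e.g. S1 -> 花O 园O 里O 有O 月O 季C 玫O 瑰C

def write_corpus(subsequences, path="train.txt"):
    with open(path, "w", encoding="utf-8") as f:
        for sub in subsequences:
            row = " ".join(f"{ch} {label}" for ch, label in label_subsequence(sub))
            f.write(row + "\n\n")    # one row per sequence, empty row between them
```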
Optionally, the preset format corresponds to a format recognizable by the fine-tuned model (the training model) in step 202 and may be specified according to the specific requirements of that model; it is not limited to the example illustrated in this embodiment.
Optionally, in step 202, fine-tuning the BERT model to determine the training model includes:
augmenting the Transformer-based deep bidirectional language representation BERT model with a fully connected layer and a conditional random field CRF layer; the fully connected layer is connected to the output layer of the BERT model and is used to map the output vectors of the BERT model to a predefined label set; the CRF layer is connected to the output of the fully connected layer.
FIG. 3 illustrates the process of solving the Chinese text sequence labeling problem using the BERT-CRF method.
FIG. 4 shows the input representation of the BERT-CRF model: when a sequence is input to the model, a special symbol [CLS] is prepended to mark its start. The start symbol [CLS] and each character of the input sequence in FIG. 4 are Tokens. For each input Token, its vector representation is the sum of its corresponding word vector, segment vector, and position vector, all three of which are randomly initialized in advance and then adjusted through model training. When the model converges, these vectors have respectively learned the word, segment, and position representations corresponding to each Token.
In the BERT model output, each input Token corresponds to an output vector. Because BERT uses a bidirectional encoder based on the Transformer model, the output vector of each character contains context information, which gives BERT its strong representation capability. For the sequence labeling problem, the output vectors can be mapped to a predefined label set by adding a fully connected layer on top of BERT and setting the output dimension of the fully connected layer accordingly.
Combined with a fully connected layer, the BERT model can already solve the sequence labeling problem. Taking punctuation prediction as an example, after the encoding vectors of BERT are mapped to the punctuation label set through the fully connected layer, the output vector of a single Token is processed by a Softmax function, and the value of each dimension represents the probability that the Token takes the corresponding punctuation label. The solution of this embodiment adds a CRF layer on top of the combination of BERT and the fully connected layer. The CRF is a classical probabilistic graphical model; the CRF layer ensures that the final prediction result is valid by adding constraints, and these constraints can be learned automatically by the CRF layer from the training data, significantly reducing errors in the predicted sequence.
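For illustration only, the structure described above may be sketched as follows; a minimal sketch assuming the Hugging Face transformers and pytorch-crf packages, in which the class name, label count, and pretrained checkpoint are illustrative assumptions rather than details fixed by this embodiment:

```python
# Minimal sketch of the training model: BERT + fully connected layer + CRF layer.
# Assumes the `transformers` and `pytorch-crf` packages; names are illustrative.
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF

class BertCrfTagger(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", num_labels=3):  # O, C, P
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        # Fully connected layer: maps each token's output vector to the label set.
        self.fc = nn.Linear(self.bert.config.hidden_size, num_labels)
        # CRF layer: learns transition constraints between adjacent labels.
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.fc(hidden)              # (batch, seq_len, num_labels)
        mask = attention_mask.bool()
        if labels is not None:                   # training: negative log-likelihood
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # inference: best label paths
```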
Optionally, in step 202, training the training model with the training corpus to obtain the prediction model further includes:
presetting, before training, the maximum sequence length parameter maxSeqLen of the training model, which is greater than the maximum dynamic sequence length sentenceLen in step 20104.
Forming a corpus to be predicted according to a preset rule by using the text without punctuation, inputting the corpus to be predicted into the prediction model to obtain a prediction result, wherein the method comprises the following steps:
the text without punctuation to be predicted is segmented to obtain at least one sequence;
inputting the at least one sequence into the prediction model, and determining punctuation marks corresponding to each character in each sequence;
and merging according to punctuation marks corresponding to each character in each sequence to form a predicted text with punctuation.
If, in this embodiment, the text to be predicted is "盛产水果包括苹果橘子柿子等" ("rich in fruits including apples oranges persimmons etc."), a plurality of sub-sequences are obtained after cutting in units of characters:
T1: 盛
T2: 产
T3: 水
T4: 果
T5: 包
T6: 括
T7: 苹
T8: 果
T9: 橘
T10: 子
T11: 柿
T12: 子
T13: 等
The sub-sequences T1-T13 are input into the prediction model trained in step 202 to obtain the punctuation label predicted for each character, where the predicted label represents the punctuation mark that may immediately follow the character. The result is exemplified as follows:
t1: hold O
T2: o production
T3: water O
T4: fruit C
T5: bag O
T6: draw O
T7: apple O
T8: fruit C
T9: orange O
T10: son C
T11: persimmon O
T12: son C
T13: equal P
Here the comma label C indicates that the punctuation immediately following the character is a comma, the period label P indicates that the punctuation immediately following the character is a period, and the other label O indicates that no punctuation follows the character.
Then, according to the punctuation label predicted for each character in the above sub-sequences, the prediction result is obtained by splicing: "盛产水果,包括苹果,橘子,柿子等。".
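For illustration only, the splicing may be sketched as follows; a minimal Python sketch in which the mapping table and function name are illustrative:

```python
# Minimal sketch of the splicing step: append the punctuation implied by each
# character's predicted label. LABEL_TO_PUNCT and splice() are illustrative names.
LABEL_TO_PUNCT = {"O": "", "C": ",", "P": "。"}

def splice(chars, labels):
    return "".join(ch + LABEL_TO_PUNCT[label] for ch, label in zip(chars, labels))

chars = list("盛产水果包括苹果橘子柿子等")
labels = ["O", "O", "O", "C", "O", "O", "O", "C", "O", "C", "O", "C", "P"]
print(splice(chars, labels))  # 盛产水果,包括苹果,橘子,柿子等。
```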
Optionally, if the text to be predicted includes English, text with punctuation that includes English is correspondingly selected as the training corpus, and the processing steps performed in units of characters in the above example are performed in units of English words instead. If other languages are included, corresponding adaptations are made according to the characteristics of the different languages.
Optionally, the dynamic splitting of the replaced and/or filtered combined sequence S in step 20104 is implemented as follows.
Symbol and variable description:
S is the whole character sequence;
sent_i denotes the i-th sequence after dynamic segmentation;
punct_{i,end} is the last punctuation mark in sent_i;
character_{i,first} denotes the first character of sent_i before the overlap adjustment, i.e., the first character of S not yet covered by the preceding sequences;
punctId_{i-1} = index(punct_{i-1,end}) denotes the subscript of punct_{i-1,end} in S;
characterId_i = index(character_{i,first}) denotes the subscript of character_{i,first} in S;
dist_i = characterId_i - punctId_{i-1} - 1 denotes the number of characters between punct_{i-1,end} and character_{i,first};
S[start:end] denotes the character sequence composed of the characters whose subscripts fall in the interval [start, end);
min(a, b) denotes the minimum of a and b;
len(S) denotes the number of characters in S;
D is an ordered list storing the dynamic sequences.
Segmentation steps:
(1) first, the characters of S whose positions fall in [0, sentenceLen) are taken as the first dynamic sequence sent_i (i = 0), i.e., sent_0 = S[0:sentenceLen];
(2) when acquiring the i-th (i > 0) dynamic sequence:
1) if dist_i <= preTextLen, the start and end positions of the i-th dynamic sequence in S are respectively
start_i = punctId_{i-1} + 1,
end_i = min(len(S), start_i + sentenceLen);
2) if dist_i > preTextLen, then
start_i = characterId_i - preTextLen,
end_i = min(len(S), start_i + sentenceLen);
at this point the i-th dynamic sequence is
sent_i = S[start_i : end_i];
3) the sequence sent_i is saved in D;
4) when end_i equals len(S), the segmentation ends.
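For illustration only, the above procedure may be implemented as follows; a minimal Python sketch assuming the converted sequence uses only the punctuation marks "," and "。" and that every cut sequence contains at least one punctuation mark (function and variable names are illustrative):

```python
# Minimal sketch of the dynamic segmentation procedure. PUNCTS and dynamic_split
# are illustrative names; each cut sequence is assumed to contain punctuation.
PUNCTS = {",", "。"}

def dynamic_split(S, sentence_len=10, pre_text_len=3):
    start, end = 0, min(len(S), sentence_len)    # sent_0 = S[0:sentenceLen]
    D = [S[start:end]]
    while end < len(S):
        sent = S[start:end]
        # punctId_{i-1}: subscript in S of the last punctuation of the previous sequence
        punct_id = start + max(k for k, ch in enumerate(sent) if ch in PUNCTS)
        character_id = end                       # characterId_i: first uncovered character
        dist = character_id - punct_id - 1       # dist_i
        if dist <= pre_text_len:
            start = punct_id + 1                 # start_i = punctId_{i-1} + 1
        else:
            start = character_id - pre_text_len  # start_i = characterId_i - preTextLen
        end = min(len(S), start + sentence_len)  # end_i
        D.append(S[start:end])
    return D

S = "花园里有月季,玫瑰,海棠,五颜六色,花园里还有梧桐树,枫树,郁郁葱葱,真是一派生机勃勃。"
print(dynamic_split(S))
# ['花园里有月季,玫瑰,', '海棠,五颜六色,花园', '花园里还有梧桐树,枫', '枫树,郁郁葱葱,真是', '真是一派生机勃勃。']
```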
Example 3
An embodiment of the present invention provides a device 50 for predicting punctuation in text, which is configured as shown in fig. 5, including,
the corpus preprocessing module 501 is used for preprocessing the text with punctuation to obtain training corpus;
the training module 502 is configured to train a predetermined training model by using the training corpus to obtain a prediction model;
and the prediction module 503 is used for inputting the text without punctuation to be predicted into the prediction model to obtain a prediction result.
Optionally, the corpus preprocessing module 501 performs preprocessing on the text with punctuation to obtain a training corpus, including:
determining punctuation categories to be predicted;
converting punctuation included in the text with the punctuation according to the determined punctuation category, and then segmenting the converted text to obtain at least one sequence;
and after adding punctuation labels to the at least one sequence, writing the labeled sequences into a file according to a preset rule to obtain the training corpus.
Optionally, the corpus preprocessing module 501, after adding punctuation labels to the at least one sequence, writes the labeled sequences into a file according to a preset rule to obtain the training corpus, including:
the following operations are performed in turn for each of the at least one sequence:
a punctuation label is set for each character in the sequence, and non-punctuation characters in the sequence and the corresponding labels are combined to form a labeled row corresponding to the sequence;
and writing the labeled rows corresponding to each sequence into a file according to the preset rule to obtain the training corpus.
Optionally, the punctuation labels include: comma labels, period labels, and other labels.
Optionally, the corpus preprocessing module 501 sets the punctuation label of each character according to the next character after it, including:
the punctuation label of each character is set to the other label by default;
when the next character is a comma, the punctuation label of the current character is modified to the comma label; when the next character is a period, the punctuation label of the current character is modified to the period label; when a character is the end character, its punctuation label remains the other label.
Optionally, the predetermined training model includes:
a Transformer-based deep bidirectional language representation BERT model augmented with a fully connected layer and a conditional random field CRF layer; the fully connected layer is connected to the output layer of the BERT model and is used to map the output vectors of the BERT model to a predefined label set; the CRF layer is connected to the output of the fully connected layer.
Optionally, the predicting module 503 inputs the text to be predicted without punctuation into the prediction model to obtain a prediction result, including:
segmenting the text to be predicted without punctuation to obtain at least one sequence;
inputting the at least one sequence into the prediction model, and determining punctuation marks corresponding to each character in each sequence;
and merging according to punctuation marks corresponding to each character in each sequence to form a predicted text with punctuation.
Optionally, the predicting module 503 segments the text to be predicted without punctuation to obtain at least one sequence, including: and cutting by taking the characters as units to obtain the at least one sequence.
Optionally, the punctuation category to be predicted at least includes: commas and periods;
the converting, according to the determined punctuation category, the punctuation included in the text with punctuation includes:
when a punctuation included in the punctuation-bearing text has a comma function in the punctuation-bearing text, the punctuation is replaced with a comma;
when a punctuation included in the punctuation-bearing text has a period function in the punctuation-bearing text, the punctuation is replaced with a period;
and deleting the punctuation from the text with punctuation when the punctuation included in the text with punctuation has neither comma nor period functions in the text with punctuation.
Optionally, when the text with punctuation and the text to be predicted without punctuation contain English, the characters are English words.
The embodiment of the invention also provides a storage medium, wherein the storage medium stores a computer program, and the computer program is configured to execute any method for predicting punctuation in the text when running.
The embodiment of the invention also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor is used for running any method for predicting punctuation in the text.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented as a computer program flow, which may be stored on a computer-readable storage medium and which, when executed, comprises one of or a combination of the steps of the method embodiments, executed on a corresponding hardware platform (e.g., system, apparatus, device, etc.).
Alternatively, all or part of the steps of the above embodiments may be implemented using integrated circuits; these steps may be implemented as individual integrated circuit modules, or several of them may be combined into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The devices/functional modules/functional units in the above embodiments may be implemented by using general-purpose computing devices, and they may be centralized in a single computing device, or may be distributed over a network formed by a plurality of computing devices.
Each of the devices/functional modules/functional units in the above-described embodiments may be stored in a computer-readable storage medium when implemented in the form of a software functional module and sold or used as a separate product. The above-mentioned computer readable storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A method for predicting punctuation in text, comprising,
preprocessing the text with punctuation to obtain training corpus;
training a preset training model by using the training corpus to obtain a prediction model;
inputting the text without punctuation to be predicted into the prediction model to obtain a prediction result,
the preprocessing of the text with punctuation to obtain a training corpus comprises the following steps:
determining punctuation categories to be predicted;
converting punctuation included in the text with the punctuation according to the determined punctuation category, and then segmenting the converted text to obtain at least one sequence;
after adding punctuation labels to the at least one sequence, writing the punctuation labels into a file according to a preset rule to obtain the training corpus, wherein the training corpus comprises the following components: the following operations are performed in turn for each of the at least one sequence: a punctuation label is set for each character in the sequence, and non-punctuation characters in the sequence and the corresponding labels are combined to form a labeled row corresponding to the sequence; writing the labeled row corresponding to each sequence into a file according to the preset rule to obtain the training corpus;
the step of setting punctuation marks for each character in the sequence comprises the following steps: the punctuation mark comprises: comma labels, period labels, and other labels; setting the punctuation mark of each character according to the next character of each character, comprising: punctuation marks of each character are set as other marks by default; when the next character is comma, modifying the punctuation mark of the current character into comma mark; when the next character is a period, the punctuation mark of the current character is modified into a period mark; when each character is an end character, the punctuation mark of the current character is still other marks;
the punctuation category to be predicted at least comprises: commas and periods;
the converting, according to the determined punctuation category, the punctuation included in the text with punctuation includes:
when a punctuation included in the punctuation-bearing text has a comma function in the punctuation-bearing text, the punctuation is replaced with a comma;
when a punctuation included in the punctuation-bearing text has a period function in the punctuation-bearing text, the punctuation is replaced with a period;
and deleting the punctuation from the text with punctuation when the punctuation included in the text with punctuation has neither comma nor period functions in the text with punctuation.
2. The method of claim 1, wherein,
the predetermined training model comprises:
adding a Transformer-based deep bidirectional language representation BERT model with a full connection layer and a conditional random field CRF layer; the full connection layer is connected with an output layer of the BERT model and is used for mapping output vectors of the BERT model to a predefined label set; the CRF layer is connected with the output of the full connection layer.
3. The method of claim 1, wherein,
inputting the text without punctuation to be predicted into the prediction model to obtain a prediction result, wherein the method comprises the following steps:
segmenting the text to be predicted without punctuation to obtain at least one sequence;
inputting the at least one sequence into the prediction model, and determining punctuation marks corresponding to each character in each sequence;
and merging according to punctuation marks corresponding to each character in each sequence to form a predicted text with punctuation.
4. The method of claim 3, wherein,
the method for segmenting the text to be predicted without punctuation to obtain at least one sequence comprises the following steps:
and cutting by taking the characters as units to obtain the at least one sequence.
5. The method of claim 1, wherein,
when the text with punctuation and the text without punctuation to be predicted contain English, the character is an English word.
6. An apparatus for predicting punctuation in a text, comprising,
the corpus preprocessing module is used for preprocessing the text with the punctuation to obtain training corpus;
the training module is used for training a preset training model by utilizing the training corpus to obtain a prediction model;
a prediction module for inputting the text without punctuation to be predicted into the prediction model to obtain a prediction result,
the corpus preprocessing module is also used for determining punctuation categories needing to be predicted; converting punctuation included in the text with the punctuation according to the determined punctuation category, and then segmenting the converted text to obtain at least one sequence; after adding punctuation labels to the at least one sequence, writing the punctuation labels into a file according to a preset rule to obtain the training corpus, wherein the training corpus comprises the following components: the following operations are performed in turn for each of the at least one sequence: a punctuation label is set for each character in the sequence, and non-punctuation characters in the sequence and the corresponding labels are combined to form a labeled row corresponding to the sequence; writing the labeled row corresponding to each sequence into a file according to the preset rule to obtain the training corpus;
the step of setting punctuation marks for each character in the sequence comprises the following steps: the punctuation mark comprises: comma labels, period labels, and other labels; setting the punctuation mark of each character according to the next character of each character, comprising: punctuation marks of each character are set as other marks by default; when the next character is comma, modifying the punctuation mark of the current character into comma mark; when the next character is a period, the punctuation mark of the current character is modified into a period mark; when each character is an end character, the punctuation mark of the current character is still other marks;
the punctuation category to be predicted at least comprises: commas and periods;
the corpus preprocessing module is further used for:
when a punctuation included in the punctuation-bearing text has a comma function in the punctuation-bearing text, the punctuation is replaced with a comma;
when a punctuation included in the punctuation-bearing text has a period function in the punctuation-bearing text, the punctuation is replaced with a period;
and deleting the punctuation from the text with punctuation when the punctuation included in the text with punctuation has neither comma nor period functions in the text with punctuation.
CN202010207942.2A 2020-03-23 2020-03-23 Method and device for predicting punctuation in text Active CN111428479B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010207942.2A CN111428479B (en) 2020-03-23 2020-03-23 Method and device for predicting punctuation in text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010207942.2A CN111428479B (en) 2020-03-23 2020-03-23 Method and device for predicting punctuation in text

Publications (2)

Publication Number Publication Date
CN111428479A CN111428479A (en) 2020-07-17
CN111428479B true CN111428479B (en) 2024-01-30

Family

ID=71549098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010207942.2A Active CN111428479B (en) 2020-03-23 2020-03-23 Method and device for predicting punctuation in text

Country Status (1)

Country Link
CN (1) CN111428479B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148856B (en) * 2020-09-22 2024-01-23 北京百度网讯科技有限公司 Method and device for establishing punctuation prediction model
CN112906366B (en) * 2021-01-29 2023-07-07 深圳力维智联技术有限公司 ALBERT-based model construction method, device, system and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564953A (en) * 2018-04-20 2018-09-21 科大讯飞股份有限公司 A kind of punctuate processing method and processing device of speech recognition text
CN110674629A (en) * 2019-09-27 2020-01-10 上海智臻智能网络科技股份有限公司 Punctuation mark model and its training method, equipment and storage medium
CN110688822A (en) * 2019-09-27 2020-01-14 上海智臻智能网络科技股份有限公司 Punctuation mark adding method, punctuation mark adding device and punctuation mark adding medium
CN110705264A (en) * 2019-09-27 2020-01-17 上海智臻智能网络科技股份有限公司 Punctuation correction method, punctuation correction apparatus, and punctuation correction medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564953A (en) * 2018-04-20 2018-09-21 科大讯飞股份有限公司 A kind of punctuate processing method and processing device of speech recognition text
CN110674629A (en) * 2019-09-27 2020-01-10 上海智臻智能网络科技股份有限公司 Punctuation mark model and its training method, equipment and storage medium
CN110688822A (en) * 2019-09-27 2020-01-14 上海智臻智能网络科技股份有限公司 Punctuation mark adding method, punctuation mark adding device and punctuation mark adding medium
CN110705264A (en) * 2019-09-27 2020-01-17 上海智臻智能网络科技股份有限公司 Punctuation correction method, punctuation correction apparatus, and punctuation correction medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于BERT的古文断句研究与应用 (Research and Application of BERT-based Sentence Segmentation for Classical Chinese); 俞敬松 et al.; 《中文信息学报》 (Journal of Chinese Information Processing); 2019-11-30; Vol. 33, No. 11; pp. 59-60 *

Also Published As

Publication number Publication date
CN111428479A (en) 2020-07-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant