CN111428479B - Method and device for predicting punctuation in text


Info

Publication number
CN111428479B
CN111428479B (application CN202010207942.2A)
Authority
CN
China
Prior art keywords
punctuation
text
character
sequence
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010207942.2A
Other languages
Chinese (zh)
Other versions
CN111428479A (en)
Inventor
薛小娜
张文剑
牟小峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd
Priority to CN202010207942.2A
Publication of CN111428479A
Application granted
Publication of CN111428479B
Legal status: Active


Abstract

The invention discloses a method for predicting punctuation in text, which comprises the following steps: preprocessing text with punctuation to obtain a training corpus; training a predetermined training model with the training corpus to obtain a prediction model; and inputting the text without punctuation to be predicted into the prediction model to obtain a prediction result. The invention also discloses a device for predicting punctuation in text.

Description

Method and device for predicting punctuation in text
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for predicting punctuation in a punctuation-free text.
Background
Text generated by Automatic Speech Recognition (ASR) systems typically contains no punctuation and no segmentation. However, the presence of punctuation greatly improves the readability of text, and segmenting text at punctuation locations also improves the performance of many downstream natural language processing tasks, such as relation extraction, semantic parsing, or machine translation.
In the public security field, a large number of valuable voice files are generated every day, but their storage and use costs are high and their utilization rate is low. To reduce these costs and use the voice information effectively, it is desirable to convert the voice files into text files through ASR technology; however, the resulting text contains no punctuation and is not segmented, so it reads poorly and is difficult to use directly for other tasks. It is therefore very meaningful to construct a scheme for punctuating such speech-derived text.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a method and a device for predicting punctuation in text: a model is trained with punctuated text as the training corpus to obtain a corresponding prediction model, which is then used to predict punctuation for unpunctuated text, improving both the readability of the unpunctuated text and the convenience of its further use.
The embodiment of the invention provides a method for predicting punctuation in text, which comprises the following steps:
preprocessing the text with punctuation to obtain training corpus;
training a preset training model by using the training corpus to obtain a prediction model;
inputting the text without punctuation to be predicted into the prediction model to obtain a prediction result.
The embodiment of the invention also provides a device for predicting punctuation in text, which comprises:
the corpus preprocessing module is used for preprocessing the text with the punctuation to obtain training corpus;
the training module is used for training a preset training model by utilizing the training corpus to obtain a prediction model;
and the prediction module is used for inputting the text without the punctuation to be predicted into the prediction model to obtain a prediction result.
The embodiment of the invention also provides a storage medium, wherein the storage medium stores a computer program, and the computer program is configured to execute the method for predicting punctuation in the text when running.
The embodiment of the invention also provides an electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to execute the method for predicting punctuation in the text.
Drawings
FIG. 1 is a flowchart of a method for predicting punctuation in text according to a first embodiment;
FIG. 2 is a flowchart of a method for predicting punctuation in text according to a second embodiment;
FIG. 3 is a schematic diagram of the procedure for solving the sequence labeling problem with BERT-CRF according to the second embodiment;
FIG. 4 is a schematic diagram of the BERT model input representation provided in the second embodiment;
fig. 5 is a block diagram of an apparatus for predicting punctuation in text according to a third embodiment.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the drawings and the embodiments. It should be noted that, in the absence of conflict, the embodiments and the features in the embodiments may be combined with each other arbitrarily.
In the embodiment of the invention, a BERT model is used. BERT is a deep bidirectional language representation model based on the Transformer; essentially, it constructs a multi-layer bidirectional Encoder network using the Transformer structure. Its performance exceeded that of many systems using task-specific architectures, and it set new state-of-the-art records on 11 NLP tasks. The BERT pre-trained model greatly reduces the difficulty of training word vectors and improves the accuracy of various natural language processing tasks, including text classification, sequence labeling, and others.
Example 1
The embodiment of the invention provides a method for predicting punctuation in text, as shown in FIG. 1, which comprises the following steps:
step 101, preprocessing a text with punctuation to obtain a training corpus;
step 102, training a preset training model by using the training corpus to obtain a prediction model;
and step 103, inputting the text without punctuation to be predicted into the prediction model to obtain a prediction result.
Optionally, preprocessing the text with punctuation in step 101 to obtain a training corpus, including:
determining punctuation categories to be predicted;
converting punctuation included in the text with the punctuation according to the determined punctuation category, and then segmenting the converted text to obtain at least one sequence;
and after adding punctuation labels to the at least one sequence, writing the labeled sequences into a file according to a preset rule to obtain the training corpus.
Optionally, adding punctuation labels to the at least one sequence and then writing the labeled sequences into a file according to the preset rule to obtain the training corpus includes:
the following operations are performed in turn for each of the at least one sequence:
a punctuation label is set for each character in the sequence, and non-punctuation characters in the sequence and the corresponding labels are combined to form a labeled row corresponding to the sequence;
and writing the labeled rows corresponding to each sequence into a file according to the preset rule to obtain the training corpus.
Optionally, setting a punctuation label for each character in the sequence includes:
the punctuation labels comprise: comma labels, period labels, and other labels;
the punctuation label of each character is set according to the next character after it, comprising:
the punctuation label of each character is set to the other label by default;
when the next character is a comma, the punctuation label of the current character is modified to the comma label; when the next character is a period, the punctuation label of the current character is modified to the period label; when a character is the end character, its punctuation label remains the other label.
Optionally, the predetermined training model includes:
a Transformer-based deep bidirectional language representation BERT model augmented with a fully connected layer and a conditional random field CRF layer; the fully connected layer is connected to the output layer of the BERT model and is used to map the output vectors of the BERT model to a predefined label set; the CRF layer is connected to the output of the fully connected layer.
Optionally, the inputting the text to be predicted without punctuation into the prediction model to obtain a prediction result includes:
segmenting the text to be predicted without punctuation to obtain at least one sequence;
inputting the at least one sequence into the prediction model, and determining punctuation marks corresponding to each character in each sequence;
and merging according to punctuation marks corresponding to each character in each sequence to form a predicted text with punctuation.
Optionally, the segmenting the text to be predicted without punctuation to obtain at least one sequence includes: and cutting by taking the characters as units to obtain the at least one sequence.
Optionally, the punctuation category to be predicted at least includes: commas and periods;
the converting, according to the determined punctuation category, the punctuation included in the text with punctuation includes:
when a punctuation included in the punctuation-bearing text has a comma function in the punctuation-bearing text, the punctuation is replaced with a comma;
when a punctuation included in the punctuation-bearing text has a period function in the punctuation-bearing text, the punctuation is replaced with a period;
and deleting the punctuation from the text with punctuation when the punctuation included in the text with punctuation has neither comma nor period functions in the text with punctuation.
Optionally, when the text with punctuation and the text to be predicted without punctuation contain English, the characters are English words.
Example 2
The embodiment of the invention provides a method for predicting punctuation in text, as shown in FIG. 2, which comprises the following steps:
step 201: preprocessing the text with punctuation to obtain a training corpus, which is also called the training text;
step 202: fine-tuning the BERT model to determine a training model, and training the training model with the training corpus to obtain a prediction model;
step 203: forming the corpus to be predicted from the text without punctuation according to a preset rule, and inputting the corpus to be predicted into the prediction model to obtain a prediction result.
Before preprocessing the text with punctuation in step 201, the method further includes step 200, preparation of punctuation classification data, which includes:
step 20001: the categories of punctuation to be predicted are determined. For example, in this embodiment there are two categories of punctuation to be predicted: commas and periods. Optionally, the relevant training corpus is selected according to the requirements of the text to be predicted, and the corresponding categories of punctuation to be predicted are determined accordingly; they need not be limited to commas and periods only.
Step 20002: the punctuation marks that may occur are classified according to the categories of punctuation to be predicted. For example, the punctuation to be predicted is of the two categories identified above, comma and period. Then punctuation marks that may appear in the training text and represent a pause within a sentence, such as commas, enumeration commas (、), and semicolons, are classified as having the same or similar function as the comma; punctuation marks that may appear in the training text and indicate the end of a sentence, such as periods, exclamation marks, and question marks, are classified as having the same or similar function as the period; other punctuation marks in the training text that cannot be categorized as comma-like or period-like, such as dashes and quotation marks, are categorized as other punctuation.
Step 20003: after the possible punctuation marks are classified, each punctuation category is saved together with the same or similar punctuation marks it includes. For example, in this embodiment the following is stored:
comma: the corresponding same or similar punctuation marks are commas, enumeration commas, semicolons, and other marks representing a pause within a sentence; these are stored in the corresponding comma list;
period: the corresponding same or similar punctuation marks are periods, exclamation marks, question marks, and other marks representing the end of a sentence; these are stored in the corresponding period list;
other: the corresponding punctuation marks are dashes, quotation marks, and other marks that cannot be categorized as comma-like or period-like; these are stored in the corresponding other-punctuation list.
Optionally, if a further punctuation category A is determined in step 20001 to require prediction, the classification in step 20002 is extended with the same or similar punctuation marks belonging to category A, and in step 20003 category A and the punctuation marks it includes are saved as well. From the above, a person skilled in the art can understand how to determine the categories of punctuation to be predicted and how to classify the various punctuation marks that may appear in the training text; the invention is not limited to the examples illustrated in this embodiment.
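For illustration only, the saving of step 20003 may be sketched as follows; a minimal Python sketch in which the variable name and the exact member lists are illustrative assumptions rather than limitations of this embodiment:

```python
# Minimal sketch of step 20003: each punctuation category is saved together with
# the same or similar marks it includes. Name and members are illustrative.
PUNCT_CLASSES = {
    "comma":  [",", "、", ";"],            # marks representing a pause within a sentence
    "period": ["。", "!", "?"],            # marks representing the end of a sentence
    "other":  ["——", "“", "”", "‘", "’"],  # marks in neither of the two classes above
}
# A further category A determined in step 20001 is added the same way:
# PUNCT_CLASSES["punctuation_A"] = [...]
```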
Optionally, in step 201, preprocessing the text with punctuation to obtain a training corpus, including:
step 20101: the text with punctuation, namely the training text, is segmented in units of characters to obtain a character sequence. For example, the text "花园里有月季、玫瑰、海棠,五颜六色;花园里还有梧桐树、枫树,郁郁葱葱;真是一派生机勃勃!" ("The garden has Chinese roses, roses and crabapples, colorful; the garden also has phoenix trees and maples, lush and green; truly a scene full of vitality!") is split into the sequence: [花, 园, 里, 有, 月, 季, 、, 玫, 瑰, 、, 海, 棠, ,, 五, 颜, 六, 色, ;, 花, 园, 里, 还, 有, 梧, 桐, 树, 、, 枫, 树, ,, 郁, 郁, 葱, 葱, ;, 真, 是, 一, 派, 生, 机, 勃, 勃, !];
step 20102: punctuation conversion and filtering is performed, in which each character after segmentation is converted, i.e., replaced and/or filtered, as follows:
if the character appears in the comma list, it is replaced by a comma; if the character appears in the period list, it is replaced by a period; if the character appears in the other-punctuation list, it is deleted; all other characters remain unchanged.
For example, the above character sequence, after replacement and/or filtering, becomes: [花, 园, 里, 有, 月, 季, ,, 玫, 瑰, ,, 海, 棠, ,, 五, 颜, 六, 色, ,, 花, 园, 里, 还, 有, 梧, 桐, 树, ,, 枫, 树, ,, 郁, 郁, 葱, 葱, ,, 真, 是, 一, 派, 生, 机, 勃, 勃, 。];
Step 20103: the replaced and/or filtered sequence is merged into the sequence S "花园里有月季,玫瑰,海棠,五颜六色,花园里还有梧桐树,枫树,郁郁葱葱,真是一派生机勃勃。";
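For illustration only, steps 20101 to 20103 may be sketched as follows; a minimal Python sketch reusing the comma/period/other lists above, in which all names are illustrative:

```python
# Minimal sketch of steps 20101-20103: split the training text into characters,
# replace or filter each character by its punctuation class, and merge into S.
COMMA_LIST  = [",", "、", ";"]
PERIOD_LIST = ["。", "!", "?"]
OTHER_LIST  = ["——", "“", "”", "‘", "’"]

def convert(text):
    out = []
    for ch in list(text):          # step 20101: segment in units of characters
        if ch in COMMA_LIST:       # step 20102: replace and/or filter
            out.append(",")
        elif ch in PERIOD_LIST:
            out.append("。")
        elif ch in OTHER_LIST:
            continue               # delete other-class punctuation
        else:
            out.append(ch)         # keep all remaining characters
    return "".join(out)            # step 20103: merge into the sequence S

S = convert("花园里有月季、玫瑰、海棠,五颜六色;花园里还有梧桐树、枫树,郁郁葱葱;真是一派生机勃勃!")
# S == "花园里有月季,玫瑰,海棠,五颜六色,花园里还有梧桐树,枫树,郁郁葱葱,真是一派生机勃勃。"
```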
step 20104: dynamically segmenting the sequence S, wherein the operation parameters during segmentation at least comprise: maximum length sentenceLen of the dynamic sequence, and maximum overlapping character number preTextLen of the current sequence and the above; the maximum length sentenceLen of a dynamic sequence represents the maximum number of characters of a sub-sequence when slicing.
For example, with sentenceLen = 10 and preTextLen = 3, the above sequence S is split into the following sub-sequences:
S1: 花园里有月季,玫瑰,
S2: 海棠,五颜六色,花园
S3: 花园里还有梧桐树,枫
S4: 枫树,郁郁葱葱,真是
S5: 真是一派生机勃勃。
The cutting steps are as follows:
according to sentenceLen = 10, S1 "花园里有月季,玫瑰," is cut first;
continuing to cut the next sub-sequence: the distance between the last punctuation in S1 and the beginning "海" of the next sequence is 0, and 0 is smaller than preTextLen (3), so the cutting starts directly from "海", and cutting according to sentenceLen = 10 gives S2 "海棠,五颜六色,花园";
continuing to cut the next sub-sequence: the distance between the last punctuation in S2 and the beginning "里" of the next sequence is 2, and 2 is smaller than preTextLen (3), so 2 characters can be repeated; the cutting starts from "花", and cutting according to sentenceLen = 10 gives S3 "花园里还有梧桐树,枫";
continuing to cut the next sub-sequence: the distance between the last punctuation in S3 and the beginning of the next sequence is 2, and 2 is smaller than preTextLen (3), so characters can be repeated; the cutting starts from "枫", and cutting according to sentenceLen = 10 gives S4 "枫树,郁郁葱葱,真是";
continuing to cut the next sub-sequence: the distance between the last punctuation in S4 and the next character "一" is 2, and 2 is smaller than preTextLen (3), so 2 characters can be repeated; the cutting starts from "真", and the cut gives S5 "真是一派生机勃勃。".
For another example, if S is "今天,我家花园里有月季,玫瑰,…", then cutting according to sentenceLen = 10 gives S1 "今天,我家花园里有月"; continuing to cut the next sub-sequence, since the distance between the last punctuation in S1 and the next character "季" is 7, and 7 is greater than preTextLen (3), at most 3 characters can be repeated, so the cutting starts from "里", giving S2 "里有月季,玫瑰,…".
The values sentenceLen = 10 and preTextLen = 3 above are chosen for convenience of illustration. Optionally, in practical applications sentenceLen may be set to 150, preTextLen to 20, and so on, adjusted according to the sentence lengths of the training text.
Step 20105: punctuation labels are added to each character of each sub-sequence, and the results are combined according to a preset format to form the training corpus.
The punctuation labels include: the comma label, the period label, and the other label. A punctuation label is added to each character of a sub-sequence as follows:
the punctuation label of each character is set to the other label O by default;
when the next character is a comma, the punctuation label of the current character is modified to the comma label C; when the next character is a period, the punctuation label of the current character is modified to the period label P; when a character is the end character, its punctuation label remains the other label O.
Taking the sub-sequence S1 as an example, the label setting result is as follows:
S1: 花 O;园 O;里 O;有 O;月 O;季 C;, O;玫 O;瑰 C;, O;
after the sub-sequence S1 is labeled, the entries whose character is itself a punctuation mark, namely the ", O" entries, are removed;
the sub-sequence S1 is thus processed into S1': 花 O;园 O;里 O;有 O;月 O;季 C;玫 O;瑰 C.
According to the setting and processing results of all the sub-sequences, the rows are merged and written into a file according to the following preset format to form the training corpus (a sketch follows the list):
1) each character in each sub-sequence and its label are separated by a space;
2) each sequence together with its label values forms one row;
3) the sub-sequences are separated by empty rows.
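For illustration only, step 20105 may be sketched as follows; a minimal Python sketch using the O/C/P labels above, in which the function names are illustrative:

```python
# Minimal sketch of step 20105: set a label for each character from the character
# that follows it, drop entries whose character is itself punctuation, and write
# one row per sub-sequence with empty rows between sub-sequences.
def label_subsequence(sub):
    pairs = []
    for k, ch in enumerate(sub):
        nxt = sub[k + 1] if k + 1 < len(sub) else None
        label = "C" if nxt == "," else "P" if nxt == "。" else "O"
        if ch not in (",", "。"):   # remove entries whose character is punctuation
            pairs.append((ch, label))
    return pairs                     # e.g. S1 -> 花O 园O 里O 有O 月O 季C 玫O 瑰C

def write_corpus(subsequences, path="train.txt"):
    with open(path, "w", encoding="utf-8") as f:
        for sub in subsequences:
            row = " ".join(f"{ch} {label}" for ch, label in label_subsequence(sub))
            f.write(row + "\n\n")    # one row per sequence, empty row between them
```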
Optionally, the preset format corresponds to a format recognizable by the fine-tuned model (the training model) in step 202 and may be specified according to the specific requirements of that model; it is not limited to the example illustrated in this embodiment.
Optionally, in step 202, fine-tuning the BERT model to determine the training model includes:
augmenting the Transformer-based deep bidirectional language representation BERT model with a fully connected layer and a conditional random field CRF layer; the fully connected layer is connected to the output layer of the BERT model and is used to map the output vectors of the BERT model to a predefined label set; the CRF layer is connected to the output of the fully connected layer.
FIG. 3 illustrates the process of solving the Chinese text sequence labeling problem using the BERT-CRF method.
FIG. 4 shows the input representation of the BERT-CRF model: when a sequence is input to the model, a special symbol [CLS] is prepended to mark its start. The start symbol [CLS] and each character of the input sequence in FIG. 4 are Tokens. For each input Token, its vector representation is the sum of its corresponding word vector, segment vector, and position vector, all three of which are randomly initialized in advance and then adjusted through model training. When the model converges, these vectors have respectively learned the word, segment, and position representations corresponding to each Token.
In the BERT model output, each input Token corresponds to an output vector. Because BERT uses a bidirectional encoder based on the Transformer model, the output vector of each character contains context information, which gives BERT its strong representation capability. For the sequence labeling problem, the output vectors can be mapped to a predefined label set by adding a fully connected layer on top of BERT and setting the output dimension of the fully connected layer accordingly.
Combined with a fully connected layer, the BERT model can already solve the sequence labeling problem. Taking punctuation prediction as an example, after the encoding vectors of BERT are mapped to the punctuation label set through the fully connected layer, the output vector of a single Token is processed by a Softmax function, and the value of each dimension represents the probability that the Token takes the corresponding punctuation label. The solution of this embodiment adds a CRF layer on top of the combination of BERT and the fully connected layer. The CRF is a classical probabilistic graphical model; the CRF layer ensures that the final prediction result is valid by adding constraints, and these constraints can be learned automatically by the CRF layer from the training data, significantly reducing errors in the predicted sequence.
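For illustration only, the structure described above may be sketched as follows; a minimal sketch assuming the Hugging Face transformers and pytorch-crf packages, in which the class name, label count, and pretrained checkpoint are illustrative assumptions rather than details fixed by this embodiment:

```python
# Minimal sketch of the training model: BERT + fully connected layer + CRF layer.
# Assumes the `transformers` and `pytorch-crf` packages; names are illustrative.
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF

class BertCrfTagger(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", num_labels=3):  # O, C, P
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        # Fully connected layer: maps each token's output vector to the label set.
        self.fc = nn.Linear(self.bert.config.hidden_size, num_labels)
        # CRF layer: learns transition constraints between adjacent labels.
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.fc(hidden)              # (batch, seq_len, num_labels)
        mask = attention_mask.bool()
        if labels is not None:                   # training: negative log-likelihood
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # inference: best label paths
```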
Optionally, in step 202, training the training model with the training corpus to obtain the prediction model further includes:
presetting, before training, the maximum sequence length parameter maxSeqLen of the training model, which is greater than the maximum dynamic sequence length sentenceLen in step 20104.
Forming a corpus to be predicted according to a preset rule by using the text without punctuation, inputting the corpus to be predicted into the prediction model to obtain a prediction result, wherein the method comprises the following steps:
the text without punctuation to be predicted is segmented to obtain at least one sequence;
inputting the at least one sequence into the prediction model, and determining punctuation marks corresponding to each character in each sequence;
and merging according to punctuation marks corresponding to each character in each sequence to form a predicted text with punctuation.
If, in this embodiment, the text to be predicted is "盛产水果包括苹果橘子柿子等" ("rich in fruits including apples oranges persimmons etc."), a plurality of sub-sequences are obtained after cutting in units of characters:
T1: 盛
T2: 产
T3: 水
T4: 果
T5: 包
T6: 括
T7: 苹
T8: 果
T9: 橘
T10: 子
T11: 柿
T12: 子
T13: 等
The sub-sequences T1-T13 are input into the prediction model trained in step 202 to obtain the punctuation label predicted for each character, where the predicted label represents the punctuation mark that may immediately follow the character. The result is exemplified as follows:
t1: hold O
T2: o production
T3: water O
T4: fruit C
T5: bag O
T6: draw O
T7: apple O
T8: fruit C
T9: orange O
T10: son C
T11: persimmon O
T12: son C
T13: equal P
Here the comma label C indicates that the punctuation immediately following the character is a comma, the period label P indicates that the punctuation immediately following the character is a period, and the other label O indicates that no punctuation follows the character.
Then, according to the punctuation label predicted for each character in the above sub-sequences, the prediction result is obtained by splicing: "盛产水果,包括苹果,橘子,柿子等。".
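For illustration only, the splicing may be sketched as follows; a minimal Python sketch in which the mapping table and function name are illustrative:

```python
# Minimal sketch of the splicing step: append the punctuation implied by each
# character's predicted label. LABEL_TO_PUNCT and splice() are illustrative names.
LABEL_TO_PUNCT = {"O": "", "C": ",", "P": "。"}

def splice(chars, labels):
    return "".join(ch + LABEL_TO_PUNCT[label] for ch, label in zip(chars, labels))

chars = list("盛产水果包括苹果橘子柿子等")
labels = ["O", "O", "O", "C", "O", "O", "O", "C", "O", "C", "O", "C", "P"]
print(splice(chars, labels))  # 盛产水果,包括苹果,橘子,柿子等。
```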
Optionally, if the text to be predicted includes English, text with punctuation that includes English is correspondingly selected as the training corpus, and the processing steps performed in units of characters in the above example are performed in units of English words instead. If other languages are included, corresponding adaptations are made according to the characteristics of the different languages.
Optionally, the dynamic splitting of the replaced and/or filtered combined sequence S in step 20104 is implemented as follows.
Symbol and variable description:
S is the whole character sequence;
sent_i denotes the i-th sequence after dynamic segmentation;
punct_{i,end} is the last punctuation mark in sent_i;
character_{i,first} denotes the first character of sent_i before the overlap adjustment, i.e., the first character of S not yet covered by the preceding sequences;
punctId_{i-1} = index(punct_{i-1,end}) denotes the subscript of punct_{i-1,end} in S;
characterId_i = index(character_{i,first}) denotes the subscript of character_{i,first} in S;
dist_i = characterId_i - punctId_{i-1} - 1 denotes the number of characters between punct_{i-1,end} and character_{i,first};
S[start:end] denotes the character sequence composed of the characters whose subscripts fall in the interval [start, end);
min(a, b) denotes the minimum of a and b;
len(S) denotes the number of characters in S;
D is an ordered list storing the dynamic sequences.
Segmentation steps:
(1) first, the characters of S whose positions fall in [0, sentenceLen) are taken as the first dynamic sequence sent_i (i = 0), i.e., sent_0 = S[0:sentenceLen];
(2) when acquiring the i-th (i > 0) dynamic sequence:
1) if dist_i <= preTextLen, the start and end positions of the i-th dynamic sequence in S are respectively
start_i = punctId_{i-1} + 1,
end_i = min(len(S), start_i + sentenceLen);
2) if dist_i > preTextLen, then
start_i = characterId_i - preTextLen,
end_i = min(len(S), start_i + sentenceLen);
at this point the i-th dynamic sequence is
sent_i = S[start_i : end_i];
3) the sequence sent_i is saved in D;
4) when end_i equals len(S), the segmentation ends.
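For illustration only, the above procedure may be implemented as follows; a minimal Python sketch assuming the converted sequence uses only the punctuation marks "," and "。" and that every cut sequence contains at least one punctuation mark (function and variable names are illustrative):

```python
# Minimal sketch of the dynamic segmentation procedure. PUNCTS and dynamic_split
# are illustrative names; each cut sequence is assumed to contain punctuation.
PUNCTS = {",", "。"}

def dynamic_split(S, sentence_len=10, pre_text_len=3):
    start, end = 0, min(len(S), sentence_len)    # sent_0 = S[0:sentenceLen]
    D = [S[start:end]]
    while end < len(S):
        sent = S[start:end]
        # punctId_{i-1}: subscript in S of the last punctuation of the previous sequence
        punct_id = start + max(k for k, ch in enumerate(sent) if ch in PUNCTS)
        character_id = end                       # characterId_i: first uncovered character
        dist = character_id - punct_id - 1       # dist_i
        if dist <= pre_text_len:
            start = punct_id + 1                 # start_i = punctId_{i-1} + 1
        else:
            start = character_id - pre_text_len  # start_i = characterId_i - preTextLen
        end = min(len(S), start + sentence_len)  # end_i
        D.append(S[start:end])
    return D

S = "花园里有月季,玫瑰,海棠,五颜六色,花园里还有梧桐树,枫树,郁郁葱葱,真是一派生机勃勃。"
print(dynamic_split(S))
# ['花园里有月季,玫瑰,', '海棠,五颜六色,花园', '花园里还有梧桐树,枫', '枫树,郁郁葱葱,真是', '真是一派生机勃勃。']
```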
Example 3
An embodiment of the present invention provides a device 50 for predicting punctuation in text, which is configured as shown in fig. 5, including,
the corpus preprocessing module 501 is used for preprocessing the text with punctuation to obtain training corpus;
the training module 502 is configured to train a predetermined training model by using the training corpus to obtain a prediction model;
and the prediction module 503 is used for inputting the text without punctuation to be predicted into the prediction model to obtain a prediction result.
Optionally, the corpus preprocessing module 501 performs preprocessing on the text with punctuation to obtain a training corpus, including:
determining punctuation categories to be predicted;
converting punctuation included in the text with the punctuation according to the determined punctuation category, and then segmenting the converted text to obtain at least one sequence;
and after adding punctuation labels to the at least one sequence, writing the labeled sequences into a file according to a preset rule to obtain the training corpus.
Optionally, the corpus preprocessing module 501, after adding punctuation labels to the at least one sequence, writes the labeled sequences into a file according to a preset rule to obtain the training corpus, including:
the following operations are performed in turn for each of the at least one sequence:
a punctuation label is set for each character in the sequence, and non-punctuation characters in the sequence and the corresponding labels are combined to form a labeled row corresponding to the sequence;
and writing the labeled rows corresponding to each sequence into a file according to the preset rule to obtain the training corpus.
Optionally, the punctuation labels include: comma labels, period labels, and other labels.
Optionally, the corpus preprocessing module 501 sets the punctuation label of each character according to the next character after it, including:
the punctuation label of each character is set to the other label by default;
when the next character is a comma, the punctuation label of the current character is modified to the comma label; when the next character is a period, the punctuation label of the current character is modified to the period label; when a character is the end character, its punctuation label remains the other label.
Optionally, the predetermined training model includes:
a Transformer-based deep bidirectional language representation BERT model augmented with a fully connected layer and a conditional random field CRF layer; the fully connected layer is connected to the output layer of the BERT model and is used to map the output vectors of the BERT model to a predefined label set; the CRF layer is connected to the output of the fully connected layer.
Optionally, the predicting module 503 inputs the text to be predicted without punctuation into the prediction model to obtain a prediction result, including:
segmenting the text to be predicted without punctuation to obtain at least one sequence;
inputting the at least one sequence into the prediction model, and determining punctuation marks corresponding to each character in each sequence;
and merging according to punctuation marks corresponding to each character in each sequence to form a predicted text with punctuation.
Optionally, the predicting module 503 segments the text to be predicted without punctuation to obtain at least one sequence, including: and cutting by taking the characters as units to obtain the at least one sequence.
Optionally, the punctuation category to be predicted at least includes: commas and periods;
the converting, according to the determined punctuation category, the punctuation included in the text with punctuation includes:
when a punctuation included in the punctuation-bearing text has a comma function in the punctuation-bearing text, the punctuation is replaced with a comma;
when a punctuation included in the punctuation-bearing text has a period function in the punctuation-bearing text, the punctuation is replaced with a period;
and deleting the punctuation from the text with punctuation when the punctuation included in the text with punctuation has neither comma nor period functions in the text with punctuation.
Optionally, when the text with punctuation and the text to be predicted without punctuation contain English, the characters are English words.
The embodiment of the invention also provides a storage medium, wherein the storage medium stores a computer program, and the computer program is configured to execute any method for predicting punctuation in the text when running.
The embodiment of the invention also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor is used for running any method for predicting punctuation in the text.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented as a computer program flow, which may be stored on a computer-readable storage medium and which, when executed, comprises one of or a combination of the steps of the method embodiments, executed on a corresponding hardware platform (e.g., system, apparatus, device, etc.).
Alternatively, all or part of the steps of the above embodiments may be implemented using integrated circuits; these steps may be implemented as individual integrated circuit modules, or several of them may be combined into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The devices/functional modules/functional units in the above embodiments may be implemented by using general-purpose computing devices, and they may be centralized in a single computing device, or may be distributed over a network formed by a plurality of computing devices.
Each of the devices/functional modules/functional units in the above-described embodiments may be stored in a computer-readable storage medium when implemented in the form of a software functional module and sold or used as a separate product. The above-mentioned computer readable storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A method for predicting punctuation in text, comprising,
preprocessing the text with punctuation to obtain training corpus;
training a preset training model by using the training corpus to obtain a prediction model;
inputting the text without punctuation to be predicted into the prediction model to obtain a prediction result,
the preprocessing of the text with punctuation to obtain a training corpus comprises the following steps:
determining punctuation categories to be predicted;
converting punctuation included in the text with the punctuation according to the determined punctuation category, and then segmenting the converted text to obtain at least one sequence;
after adding punctuation labels to the at least one sequence, writing the punctuation labels into a file according to a preset rule to obtain the training corpus, wherein the training corpus comprises the following components: the following operations are performed in turn for each of the at least one sequence: a punctuation label is set for each character in the sequence, and non-punctuation characters in the sequence and the corresponding labels are combined to form a labeled row corresponding to the sequence; writing the labeled row corresponding to each sequence into a file according to the preset rule to obtain the training corpus;
the step of setting punctuation marks for each character in the sequence comprises the following steps: the punctuation mark comprises: comma labels, period labels, and other labels; setting the punctuation mark of each character according to the next character of each character, comprising: punctuation marks of each character are set as other marks by default; when the next character is comma, modifying the punctuation mark of the current character into comma mark; when the next character is a period, the punctuation mark of the current character is modified into a period mark; when each character is an end character, the punctuation mark of the current character is still other marks;
the punctuation category to be predicted at least comprises: commas and periods;
the converting, according to the determined punctuation category, the punctuation included in the text with punctuation includes:
when a punctuation included in the punctuation-bearing text has a comma function in the punctuation-bearing text, the punctuation is replaced with a comma;
when a punctuation included in the punctuation-bearing text has a period function in the punctuation-bearing text, the punctuation is replaced with a period;
and deleting the punctuation from the text with punctuation when the punctuation included in the text with punctuation has neither comma nor period functions in the text with punctuation.
2. The method of claim 1, wherein,
the predetermined training model comprises:
adding a Transformer-based deep bidirectional language representation BERT model with a full connection layer and a conditional random field CRF layer; the full connection layer is connected with an output layer of the BERT model and is used for mapping output vectors of the BERT model to a predefined label set; the CRF layer is connected with the output of the full connection layer.
3. The method of claim 1, wherein,
inputting the text without punctuation to be predicted into the prediction model to obtain a prediction result, wherein the method comprises the following steps:
segmenting the text to be predicted without punctuation to obtain at least one sequence;
inputting the at least one sequence into the prediction model, and determining punctuation marks corresponding to each character in each sequence;
and merging according to punctuation marks corresponding to each character in each sequence to form a predicted text with punctuation.
4. The method of claim 3, wherein,
the method for segmenting the text to be predicted without punctuation to obtain at least one sequence comprises the following steps:
and cutting by taking the characters as units to obtain the at least one sequence.
5. The method of claim 1, wherein,
when the text with punctuation and the text without punctuation to be predicted contain English, the character is an English word.
6. An apparatus for predicting punctuation in a text, comprising,
the corpus preprocessing module is used for preprocessing the text with the punctuation to obtain training corpus;
the training module is used for training a preset training model by utilizing the training corpus to obtain a prediction model;
a prediction module for inputting the text without punctuation to be predicted into the prediction model to obtain a prediction result,
the corpus preprocessing module is also used for determining punctuation categories needing to be predicted; converting punctuation included in the text with the punctuation according to the determined punctuation category, and then segmenting the converted text to obtain at least one sequence; after adding punctuation labels to the at least one sequence, writing the punctuation labels into a file according to a preset rule to obtain the training corpus, wherein the training corpus comprises the following components: the following operations are performed in turn for each of the at least one sequence: a punctuation label is set for each character in the sequence, and non-punctuation characters in the sequence and the corresponding labels are combined to form a labeled row corresponding to the sequence; writing the labeled row corresponding to each sequence into a file according to the preset rule to obtain the training corpus;
the step of setting punctuation marks for each character in the sequence comprises the following steps: the punctuation mark comprises: comma labels, period labels, and other labels; setting the punctuation mark of each character according to the next character of each character, comprising: punctuation marks of each character are set as other marks by default; when the next character is comma, modifying the punctuation mark of the current character into comma mark; when the next character is a period, the punctuation mark of the current character is modified into a period mark; when each character is an end character, the punctuation mark of the current character is still other marks;
the punctuation category to be predicted at least comprises: commas and periods;
the corpus preprocessing module is further used for:
when a punctuation included in the punctuation-bearing text has a comma function in the punctuation-bearing text, the punctuation is replaced with a comma;
when a punctuation included in the punctuation-bearing text has a period function in the punctuation-bearing text, the punctuation is replaced with a period;
and deleting the punctuation from the text with punctuation when the punctuation included in the text with punctuation has neither comma nor period functions in the text with punctuation.
CN202010207942.2A 2020-03-23 2020-03-23 Method and device for predicting punctuation in text Active CN111428479B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010207942.2A CN111428479B (en) 2020-03-23 2020-03-23 Method and device for predicting punctuation in text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010207942.2A CN111428479B (en) 2020-03-23 2020-03-23 Method and device for predicting punctuation in text

Publications (2)

Publication Number Publication Date
CN111428479A CN111428479A (en) 2020-07-17
CN111428479B true CN111428479B (en) 2024-01-30

Family

ID=71549098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010207942.2A Active CN111428479B (en) 2020-03-23 2020-03-23 Method and device for predicting punctuation in text

Country Status (1)

Country Link
CN (1) CN111428479B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148856B (en) * 2020-09-22 2024-01-23 北京百度网讯科技有限公司 Method and device for establishing punctuation prediction model
CN112906366B (en) * 2021-01-29 2023-07-07 深圳力维智联技术有限公司 ALBERT-based model construction method, device, system and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564953A (en) * 2018-04-20 2018-09-21 科大讯飞股份有限公司 A kind of punctuate processing method and processing device of speech recognition text
CN110674629A (en) * 2019-09-27 2020-01-10 上海智臻智能网络科技股份有限公司 Punctuation mark model and its training method, equipment and storage medium
CN110688822A (en) * 2019-09-27 2020-01-14 上海智臻智能网络科技股份有限公司 Punctuation mark adding method, punctuation mark adding device and punctuation mark adding medium
CN110705264A (en) * 2019-09-27 2020-01-17 上海智臻智能网络科技股份有限公司 Punctuation correction method, punctuation correction apparatus, and punctuation correction medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564953A (en) * 2018-04-20 2018-09-21 科大讯飞股份有限公司 A kind of punctuate processing method and processing device of speech recognition text
CN110674629A (en) * 2019-09-27 2020-01-10 上海智臻智能网络科技股份有限公司 Punctuation mark model and its training method, equipment and storage medium
CN110688822A (en) * 2019-09-27 2020-01-14 上海智臻智能网络科技股份有限公司 Punctuation mark adding method, punctuation mark adding device and punctuation mark adding medium
CN110705264A (en) * 2019-09-27 2020-01-17 上海智臻智能网络科技股份有限公司 Punctuation correction method, punctuation correction apparatus, and punctuation correction medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于BERT的古文断句研究与应用 (Research and Application of BERT-based Sentence Segmentation for Classical Chinese); 俞敬松 et al.; 《中文信息学报》 (Journal of Chinese Information Processing); 2019-11-30; Vol. 33, No. 11; pp. 59-60 *

Also Published As

Publication number Publication date
CN111428479A (en) 2020-07-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant