CN112069816A - Chinese punctuation adding method, system and equipment - Google Patents

Chinese punctuation adding method, system and equipment

Info

Publication number
CN112069816A
Authority
CN
China
Prior art keywords
word
level
information
sequence
phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010958997.7A
Other languages
Chinese (zh)
Inventor
黄石磊
刘轶
王昕�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Raisound Technology Co ltd
Original Assignee
Shenzhen Raisound Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Raisound Technology Co ltd
Priority to CN202010958997.7A
Publication of CN112069816A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a Chinese punctuation adding method, system and equipment. The method comprises the following steps: acquiring text information and preprocessing it to obtain training data; performing phoneme-level coding, character-level coding and word-level coding on the training data to convert the character sequence of the training data into phoneme-level feature vectors, character-level feature vectors and word-level feature vectors, and fusing them by superposition to obtain feature vectors that combine the three levels of information; and performing feature extraction and classifier training on the fused feature vectors to obtain a prediction model, which is used to add Chinese punctuation marks to input text information. Because word-level information is added in the model training stage, the prediction model does not need to be retrained when out-of-vocabulary (OOV) words appear; because phoneme-level information is added, the model learns regularities between punctuation marks and the pronunciation of Chinese characters, which can improve punctuation prediction accuracy.

Description

Chinese punctuation adding method, system and equipment
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a method, system and equipment for adding Chinese punctuation marks.
Background
In recent years, driven by rapid progress in machine learning and deep learning and by the accumulation of large corpora, speech recognition technology has advanced dramatically. Speech recognition converts human speech into text, which can be passed to a product as an instruction or used as a final result, and this greatly improves the efficiency of human-machine interaction. Its importance to computing and to everyday life is increasingly evident, and products built on it are used in a growing range of fields, such as intelligent conference systems, medical services and banking services, reaching almost every industry. Companies such as Baidu, Tencent and Alibaba provide cloud services to which a user uploads an audio clip together with related information such as its sampling rate, and which return the text content of the clip.
Current mainstream speech recognition systems convert speech content into characters such as English letters, Chinese characters or digits. When the input is a long utterance or document-level speech, most systems output a run of characters without any punctuation, so the transcript is hard to read. Properly punctuating the transcribed text greatly improves its intelligibility.
In recent years, a growing number of researchers have worked on punctuation prediction for speech transcripts, and many attempts have been made to restore or predict their punctuation marks. Early work treated punctuation as hidden events in the language model and integrated that knowledge into the speech recognition system. More recent work tends to use a separate punctuation prediction module that is attached to the speech recognition system as a post-processing step. Methods that use such an independent module fall into three main groups: first, punctuation prediction based on acoustic features; second, punctuation prediction based on text features; and third, punctuation prediction based on both acoustic and text features.
Punctuation prediction based on acoustic features requires audio as input. Audio is more complex to process than text and is easily affected by spoken-language phenomena, which reduces prediction accuracy. Punctuation prediction based on both text and acoustic features takes text and audio as input and requires the two to be aligned; because the transcript itself contains errors, the computation is complex and costly, and training data are difficult to obtain. Punctuation prediction based on text features, by contrast, uses only text data: almost any text material can be used, such material is readily available, and the resulting model can still be combined freely with acoustic features through late fusion. For these reasons, a good text-based punctuation prediction algorithm is widely applicable.
Most existing text-based Chinese punctuation prediction techniques train a model on a large amount of text data (for example, data collected from the Internet) and then deploy the trained model in a speech recognition system. In a real speech recognition system, however, the recognition results are not limited to the vocabulary contained in the training data, so an OOV (out-of-vocabulary) problem arises. If the training data do not contain a given word, the punctuation prediction model may insert a punctuation mark inside that word, splitting it and misleading the reader. The usual remedy in the prior art is to retrain the punctuation model after adding a large amount of text containing the new words to the training data. Collecting new data and retraining in this way is time-consuming and labor-intensive, and the OOV problem recurs as further new words appear.
Disclosure of Invention
The invention aims to provide a Chinese punctuation adding method, system and equipment that overcome the shortcomings of existing text-based Chinese punctuation prediction techniques, help to avoid the model retraining caused by the OOV problem, and help to improve punctuation prediction accuracy.
In order to achieve the purpose, the invention adopts the following technical scheme.
In a first aspect, a method for adding Chinese punctuation marks is provided, comprising: acquiring text information and preprocessing it to obtain training data comprising character sequences and corresponding label sequences, wherein each label in a label sequence represents the punctuation mark to be added after the corresponding character in the character sequence; performing phoneme-level coding, character-level coding and word-level coding on the training data to convert the character sequence of the training data into phoneme-level feature vectors, character-level feature vectors and word-level feature vectors, and fusing them by superposition to obtain feature vectors that combine the three levels of information; and performing feature extraction and classifier training on the fused feature vectors to obtain a prediction model, wherein the prediction model is used to add Chinese punctuation marks to input text information.
In one possible implementation, the preprocessing includes data cleaning and data normalization; the data cleaning includes: replacing punctuation marks in the text information, such that every punctuation mark other than a comma, period, question mark or exclamation mark is replaced with a comma, period, question mark or exclamation mark; the data normalization includes: assigning a label to each character in the text information, the labels comprising: a label C representing a comma, a label P representing a period, a label Q representing a question mark, a label E representing an exclamation mark, and a label O representing no punctuation.
In one possible implementation, the phoneme-level coding includes: converting the character sequence in the training data into a phoneme sequence by using a grapheme-to-phoneme (G2P) model; and encoding the resulting phoneme sequence with one-hot encoding to obtain the phoneme-level feature vector.
In one possible implementation, the character-level coding includes: encoding the character sequence of the training data with one-hot encoding to obtain the character-level feature vector.
In one possible implementation, the word-level coding includes: performing word segmentation and part-of-speech tagging on the training data with a Chinese word segmentation tool to obtain a word segmentation sequence and a corresponding part-of-speech sequence, and adding the segmentation information to the part-of-speech sequence by means of BMES labels; and encoding the part-of-speech sequence carrying the segmentation information with one-hot encoding to obtain the word-level feature vector.
In one possible implementation, the method further includes: adding new words to the Chinese word segmentation tool, performing part-of-speech tagging on the new words, and adding segmentation information to the part-of-speech sequences of the new words by means of BMES labels.
In one possible implementation, the method further includes: receiving text information input by a speech recognition system, and using the prediction model to output the text information with Chinese punctuation marks added back to the speech recognition system.
In a second aspect, a system for adding Chinese punctuation marks is provided, comprising: a data processing module, used for acquiring text information and preprocessing it to obtain training data comprising character sequences and corresponding label sequences, wherein each label in a label sequence represents the punctuation mark to be added after the corresponding character in the character sequence; and a model training module, used for performing phoneme-level coding, character-level coding and word-level coding on the training data to convert the character sequence of the training data into phoneme-level feature vectors, character-level feature vectors and word-level feature vectors, fusing them by superposition to obtain feature vectors that combine the three levels of information, and performing feature extraction and classifier training on the fused feature vectors to obtain a prediction model, wherein the prediction model is used to add Chinese punctuation marks to input text information.
In a third aspect, a computer device is provided, comprising a processor and a memory, the memory storing a program that comprises computer-executable instructions; when the computer device is running, the processor executes the computer-executable instructions stored in the memory to cause the computer device to perform the method for adding Chinese punctuation marks according to the first aspect.
In a fourth aspect, there is provided a computer readable storage medium storing one or more programs, the one or more programs comprising computer executable instructions, which when executed by a computer device, cause the computer device to perform the method of Chinese punctuation addition according to the first aspect.
According to the technical scheme, the embodiment of the invention has the following advantages:
In the model training stage, the invention extracts features from the training data at three levels of information rather than at the character level alone, performing phoneme-level coding, character-level coding and word-level coding together.
First, because word-level coding is performed in the model training stage and word-level information is used, a new word encountered later only needs to be added to the word segmentation tool. The prediction model then knows, without retraining, that the characters making up the new word form a single unit and does not insert punctuation between them. Retraining the model to accommodate new words is thereby avoided, which saves labor and reduces cost.
Second, because phoneme-level coding is performed in the model training stage and phoneme-level information is added, the prediction model can learn regularities between punctuation marks and the pronunciation of adjacent Chinese characters and use them to assist punctuation prediction, thereby improving prediction accuracy.
In conclusion, a prediction model trained on features at multiple levels (phoneme level, character level and word level) overcomes the problems of conventional Chinese punctuation prediction techniques, helps to avoid the model retraining caused by the OOV problem, and helps to improve punctuation prediction accuracy.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments and of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for adding Chinese punctuation marks according to an embodiment of the present invention;
FIG. 2 is a flow chart of obtaining training data in one embodiment of the invention;
FIG. 3 is a flow chart of model training in one embodiment of the invention;
FIG. 4 is a flow chart of performing model testing in one embodiment of the invention;
FIG. 5 is a flow diagram of training the G2P model in one embodiment of the invention;
FIG. 6 is a block diagram of a Chinese punctuation addition system according to an embodiment of the present invention;
FIG. 7 is a block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," and the like in the description and in the claims, and in the above-described drawings, are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
The following will explain details by way of specific examples.
As mentioned above, because new vocabulary keeps emerging, a speech recognition system inevitably encounters the OOV (out-of-vocabulary) problem. The solution commonly adopted in the prior art is to obtain new data and retrain the model, which is time-consuming and labor-intensive and still cannot eliminate the OOV problem.
In addition, the applicant has found that Chinese punctuation marks are related to the pronunciation of sentences. For example, in exclamatory sentences such as "find an o in good joy today", "you are like a cow" and "you are like a stick", the overall pinyin intonation is heavy, the syllables are predominantly in the second and fourth tones, and most contain finals such as "a" and "ang". The punctuation of a sentence is therefore associated with its pronunciation.
To address the OOV problem and to exploit the regularities between sentence pronunciation and punctuation, the invention provides a Chinese punctuation adding method. The invention also provides a corresponding system and related equipment.
The invention provides a Chinese punctuation adding method (system) based on multi-level features, which consists of three main parts: acquiring training data, model training, and model testing. The training data part obtains text information from the Internet, for example from Wikipedia, and then normalizes it into the form required by the model. The model training part uses large-scale training data to train a prediction model that adds punctuation marks to input text, so that when the model is later given a text, it outputs that text with punctuation marks and feeds it back to the speech recognition system. The model testing part, once training is complete, receives text from the speech recognition system, obtains the output text from the model, and returns it to the speech recognition system.
Referring to FIG. 1, a flow chart of a method for adding Chinese punctuation marks according to an embodiment of the present invention is shown. The method includes the following steps:
S1, acquiring training data:
Acquiring text information and preprocessing it to obtain training data comprising character sequences and corresponding label sequences, wherein each label in a label sequence represents the punctuation mark to be added after the corresponding character in the character sequence; the preprocessing includes data cleaning and data normalization.
S2, model training:
Performing phoneme-level coding, character-level coding and word-level coding on the training data to convert the character sequence of the training data into phoneme-level feature vectors, character-level feature vectors and word-level feature vectors, and fusing them by superposition to obtain feature vectors that combine the three levels of information;
Performing feature extraction and classifier training on the fused feature vectors to obtain a prediction model, wherein the prediction model is used to add Chinese punctuation marks to input text information.
S3, model testing:
Receiving text information input by a speech recognition system, and using the prediction model to output the text information with Chinese punctuation marks added back to the speech recognition system.
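As a minimal sketch of the last part of step S3, the snippet below writes a predicted label sequence back into a character sequence. It assumes the five labels C, P, Q, E and O used by the method and full-width Chinese punctuation. The function name `apply_tags` and the example sentence are illustrative assumptions; in practice the labels would come from the trained prediction model rather than being written by hand.

```python
# Map each label to the punctuation mark it stands for ("O" means no mark).
TAG_TO_PUNCT = {"C": "，", "P": "。", "Q": "？", "E": "！", "O": ""}

def apply_tags(chars, tags):
    """Insert the punctuation mark named by each label after its character."""
    if len(chars) != len(tags):
        raise ValueError("one label is required per character")
    return "".join(ch + TAG_TO_PUNCT[tag] for ch, tag in zip(chars, tags))

# Hypothetical model output for an unpunctuated transcript.
transcript = list("今天你吃饭了么")
predicted = ["O", "O", "O", "O", "O", "O", "Q"]
print(apply_tags(transcript, predicted))
```

Keeping this step as a pure function of characters and labels means it does not depend on which classifier produced the labels.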
The above steps will be described in detail below.
【1】 Acquiring training data
The process of acquiring training data is shown in FIG. 2 and includes the following steps.
1.1 Obtaining text information
A) A large amount of text information is available on the Internet. Depending on the task, news web pages within a specified range are selected as target web pages, since news articles generally use standard punctuation. The raw pages, which contain large passages of text, can first be fetched with a web crawler or similar tool using existing techniques.
B) From the various kinds of information contained in a web page, the text is extracted first. Text in a web page falls into two types: text without punctuation, such as titles and buttons, and large passages of punctuated text.
C) The text extraction process is as follows (existing techniques can be used):
C.1) reading in the HTML document (DH1) of the web page and converting the HTML into a tree structure (DTR1) with an existing tool, such as Microsoft XML Parser;
C.2) if the website is a commonly encountered one, using a predefined template (TPL) to extract from the tree-structured HTML document (DTR1) the part that the template identifies as body text, and saving that text to a txt file to obtain the text information.
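The extraction in step C can be sketched as follows. This is not the template mechanism of the embodiment: it uses Python's standard `html.parser` module instead of a tree-building parser, and the "template" is reduced to an assumed set of tag names (`<p>` by default) taken to hold body text. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collect text found inside the tags that the (assumed) template marks as body text."""

    def __init__(self, body_tags=("p",)):
        super().__init__()
        self.body_tags = set(body_tags)
        self.depth = 0      # how many body tags are currently open
        self.chunks = []    # one string per body-text block

    def handle_starttag(self, tag, attrs):
        if tag in self.body_tags:
            self.depth += 1
            self.chunks.append("")

    def handle_endtag(self, tag):
        if tag in self.body_tags and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks[-1] += data

def extract_text(html):
    """Return the non-empty body-text blocks of an HTML document."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return [chunk.strip() for chunk in parser.chunks if chunk.strip()]

page = "<html><body><h1>标题</h1><button>分享</button><p>你好，世界。</p></body></html>"
print(extract_text(page))
```

Titles and buttons fall outside the selected tags, so the unpunctuated text described in step B is skipped without further filtering.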
1.2 Preprocessing the text information
A) First, data cleaning is performed on the text data.
A.1) The crawled text lines contain a large number of special characters, which can be cleaned out according to a given lexicon.
A.2) Lines that contain no punctuation marks and lines that do not end with a punctuation mark are deleted.
A.3) Punctuation marks are replaced according to the designated set of retained marks (that is, only the marks to be predicted are retained). Optionally, only commas, periods, question marks and exclamation marks are retained, since these four marks are sufficient for understanding the text. When punctuation marks in the text information are replaced, every other mark can be replaced with a comma, period, question mark or exclamation mark; for example, a semicolon is replaced with a comma. Alternatively, all marks other than commas, periods, question marks and exclamation marks can simply be deleted.
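The cleaning steps A.1 to A.3 can be sketched as below. Only the semicolon-to-comma replacement comes from the description; the other entries of `REPLACE`, the whitelist of allowed characters (used here in place of the "given lexicon"), and the function name `clean_line` are assumptions made for illustration.

```python
import re

# The four retained marks, in their full-width Chinese forms.
KEEP = "，。？！"

# Replacement table. Only the semicolon -> comma mapping is taken from the text;
# the remaining entries are assumed normalizations of half-width marks.
REPLACE = {"；": "，", ";": "，", ",": "，", "?": "？", "!": "！"}

# Anything that is not a CJK ideograph, a digit, a Latin letter or a retained mark
# is treated as a special character and removed (assumed stand-in for the lexicon).
OTHER = re.compile(r"[^\u4e00-\u9fff0-9A-Za-z，。？！]")

def clean_line(line):
    """Return the cleaned line, or None if the line should be dropped."""
    line = "".join(REPLACE.get(ch, ch) for ch in line.strip())
    line = OTHER.sub("", line)
    # A.2: drop lines that are empty or do not end with a retained mark.
    if not line or line[-1] not in KEEP:
        return None
    return line

print(clean_line("你好；我想告诉你一件事情。"))
print(clean_line("首页"))
```

This sketch takes the "delete everything else" option for marks outside the replacement table, which the description allows as an alternative.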
B) Data normalization.
In text-based punctuation prediction research, most methods use sequence labeling, and the same approach is used here. In sequence labeling for punctuation prediction, each character in the text information is paired with a label that indicates which punctuation mark, if any, follows that character.
Optionally, the labels comprise: a label C representing a comma, a label P representing a period, a label Q representing a question mark, a label E representing an exclamation mark, and a label O representing no punctuation.
For example, the original text is "Hello, I want to tell you something." (eleven Chinese characters in the original). After conversion, its character sequence is those eleven characters with the punctuation removed, and its label sequence is "O C O O O O O O O O P". The label "C" on the second character (the "good" in "hello") indicates that this character is followed by a comma. Similarly, the label "P" on the last character (the "thing" in "something") indicates that it is followed by a period. The label "O" indicates that nothing follows the character. Only four punctuation marks are predicted here, namely the comma, period, question mark and exclamation mark, with corresponding labels C, P, Q and E; together with the no-punctuation label O, five labels are used in total.
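The conversion in the example above can be sketched as follows. The Chinese sentence is an assumed back-translation of the example (it has eleven characters, matching the eleven labels), and the function name `to_sequence_labels` is hypothetical.

```python
# Map each retained punctuation mark to its label.
PUNCT_TO_TAG = {"，": "C", "。": "P", "？": "Q", "！": "E"}

def to_sequence_labels(text):
    """Split punctuated text into a character sequence and a per-character label sequence."""
    chars, tags = [], []
    for ch in text:
        if ch in PUNCT_TO_TAG:
            if tags:
                # The mark belongs to the character immediately before it.
                tags[-1] = PUNCT_TO_TAG[ch]
        else:
            chars.append(ch)
            tags.append("O")
    return chars, tags

chars, tags = to_sequence_labels("你好，我想告诉你一件事情。")
print(" ".join(chars))
print(" ".join(tags))
```

A mark that appears before any character is ignored, and when two marks follow the same character the later one wins; both choices are assumptions, since the description does not cover these cases.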
Optionally, in some embodiments, the data preprocessing may specifically include: first splitting the cleaned text information by number of lines in the ratio 3:1:1 into a training data set, a validation data set and a test data set; and then normalizing the text data in each of the three data sets into the sequence-label format.
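A 3:1:1 split by line count might look like the sketch below. Shuffling the lines before splitting and the fixed seed are assumptions; the description only specifies the ratio. The function name `split_lines` is hypothetical.

```python
import random

def split_lines(lines, ratios=(3, 1, 1), seed=0):
    """Split lines into training, validation and test sets by the given ratio."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)  # assumed: shuffle so each set is a random sample
    total = sum(ratios)
    n_train = len(lines) * ratios[0] // total
    n_valid = len(lines) * ratios[1] // total
    train = lines[:n_train]
    valid = lines[n_train:n_train + n_valid]
    test = lines[n_train + n_valid:]    # the remainder goes to the test set
    return train, valid, test

train, valid, test = split_lines([f"line {i}" for i in range(10)])
print(len(train), len(valid), len(test))
```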
After the data cleaning and normalization described above, the resulting data comprise character sequences and corresponding label sequences. Part of the data is used as training data for model training in the next stage. The remainder can be divided into validation data and test data for validating and testing the model obtained by subsequent training.
【2】 Model training
The flow of the model training phase is shown in fig. 3 and includes the following steps.
2.1 input feature representation
The character sequence of the training data is plain text, which must be encoded into numerical values that a computer can process before being input into the model. In the present invention, three kinds of information are used: phoneme level, character level and word level. The input text is converted into a combination of three encoded vectors by phoneme-level coding, character-level coding and word-level coding; the resulting phoneme-level, character-level and word-level feature vectors are fused by superposition into a feature vector that combines the three levels of information, and model training is then performed.
1) Phoneme-level coding: the character sequence is converted into a phoneme sequence, which is then converted into a phoneme-level feature vector.
The applicant has found certain regularities between punctuation marks and the pinyin, or pronunciation, of the text, and uses these pronunciation regularities to improve the accuracy of punctuation prediction. The present invention uses Chinese pronunciation phoneme information, that is, each character is mapped to a phoneme sequence. For example, the phoneme sequence corresponding to "you" is "n i3", while the interjection "a" has four possible phoneme sequences, "aa a1", "aa a2", "aa a4" and "aa a5", because it is pronounced differently in different contexts. Each phoneme sequence consists of two parts: an initial and a final. Because the text data consist of Chinese characters, the characters must be converted into phoneme information; however, many Chinese characters are polyphonic, and looking at a character in isolation does not reveal which of its phoneme sequences applies, so one cannot simply be chosen at random.
In the present invention, a G2P (grapheme-to-phoneme) conversion tool, which is prior art, is used. The G2P tool is trained on Chinese words and their corresponding phoneme sequences (for example, the Chinese pronunciation phoneme data provided by AISHELL) to obtain a G2P model; the training process is shown in FIG. 5. The trained G2P model can then be used to obtain the phoneme sequence of each character in a sentence.
After the G2P tool has been trained, the character sequence in the training data obtained in step 【1】 is converted into a phoneme sequence. For example, the character sequence "have you eaten today" corresponds to the phoneme sequence "j in1 t ian1 n i3 ch ix1 f an4 l e5 m e5". Because the phoneme sequence of each character consists of two parts, the two parts must be merged to obtain the phoneme of a single character; in the example above, "j in1" is merged into "jin1".
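The merging step can be sketched as follows, assuming the G2P output is a space-separated string in which every character contributes exactly one initial followed by one final. The function name `merge_initial_final` is hypothetical.

```python
def merge_initial_final(phonemes):
    """Join each (initial, final) pair of a G2P output into one per-character unit."""
    tokens = phonemes.split()
    if len(tokens) % 2:
        raise ValueError("expected one initial and one final per character")
    return [tokens[i] + tokens[i + 1] for i in range(0, len(tokens), 2)]

print(merge_initial_final("j in1 t ian1 n i3 ch ix1 f an4 l e5 m e5"))
```

After merging, the sequence has one unit per character, so it lines up position by position with the character sequence and the label sequence.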
The resulting phoneme sequence can then be encoded with one-hot encoding to obtain a phoneme-level feature vector. The phoneme inventory contains 27 initials and 189 finals, giving at most 27 × 189 = 5103 initial-final combinations, which is enough to cover the phonemes of every character; one-hot encoding is therefore applied over these 5103 combinations. In this way, phoneme-level feature vectors are obtained for all character sequences in the training data.
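One way to index the 5103 combinations is sketched below. The row-major index formula and the integer ids for initials and finals are assumptions; the description fixes only the inventory sizes and the use of one-hot encoding.

```python
N_INITIALS, N_FINALS = 27, 189          # inventory sizes given in the description
N_COMBINATIONS = N_INITIALS * N_FINALS  # 5103 possible initial-final pairs

def combination_index(initial_id, final_id):
    """Map an (initial, final) id pair to a unique index (assumed row-major layout)."""
    if not (0 <= initial_id < N_INITIALS and 0 <= final_id < N_FINALS):
        raise ValueError("id out of range")
    return initial_id * N_FINALS + final_id

def one_hot(index, size):
    """Return a vector of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec

vec = one_hot(combination_index(2, 5), N_COMBINATIONS)
print(len(vec), vec.index(1))
```

In practice a framework would usually store only the index and look up a row of a weight matrix, which is equivalent to multiplying the one-hot vector by that matrix.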
2) Word-level coding: vector features of information used to represent words.
As with most Chinese punctuation prediction research, one-hot codes can be used to encode all Chinese characters, each code representing a Chinese character, and we can obtain the code vector of each character. Thereby converting the word sequence of the training data into word-level feature vectors.
3) And (3) word level coding: vector features that indicate what part of speech the current word belongs to.
To solve the OOV problem, i.e., when the training data does not contain a word, the punctuation prediction model may add punctuation symbols to the middle of the word, resulting in the problem of separating the word. The invention adopts a Chinese word segmentation tool (and a Paddle part-of-speech tagging mode of the tool) to perform word segmentation and part-of-speech tagging on training data to obtain a word segmentation sequence and a corresponding part-of-speech sequence. For example, "how you have eaten today" corresponding participle sequence "how you have eaten today" and part-of-speech sequence "t t r v v xc xc". From the word segmentation sequence, we can find that the words of "today" and "eat" are connected together and should not be separated, and in addition, "today" in the word segmentation sequence corresponds to "t t", "t" represents time, and "r" represents pronouns. In the invention, the word part and the word segmentation sequence are further merged by using a BMES label, namely, word segmentation information is added into the word part sequence. The "BMES" tag includes four kinds of tags, i.e., B denotes a prefix value of a word, M denotes a middle position of a word, E denotes an end position of a word, and S denotes a single word.
For example, the part-of-speech sequence "t t r v v xc xc" in the above example can be converted into "tB tE rS vB vE xcS xcS", where "today" originally corresponds to the part-of-speech sequence "t t" and now corresponds to "tB tE". "B" marks the beginning of a word and "E" marks the end of a word; when a word has more than two characters, all of its middle characters are labelled "M"; when a word consists of a single character, it is labelled "S", as with "rS" for "you" in the above example.
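The merging of part-of-speech tags with BMES labels described above can be sketched as follows. This is a minimal illustration, not the tool's own code: the function name is hypothetical, and the input is assumed to be a list of (word, part-of-speech) pairs such as a segmentation tool with part-of-speech tagging would return.

```python
from typing import List, Tuple


def merge_pos_with_bmes(segments: List[Tuple[str, str]]) -> List[str]:
    """Expand (word, pos) pairs into one 'pos + BMES' label per character."""
    labels: List[str] = []
    for word, pos in segments:
        n = len(word)
        if n == 0:
            continue  # skip empty tokens defensively
        if n == 1:
            labels.append(pos + "S")  # single-character word
        else:
            labels.append(pos + "B")  # first character of the word
            labels.extend([pos + "M"] * (n - 2))  # any middle characters
            labels.append(pos + "E")  # last character of the word
    return labels


# Placeholder strings stand in for the 2-, 1-, 2-, 1- and 1-character words
# of the seven-character example sentence in the text.
example = [("AB", "t"), ("C", "r"), ("DE", "v"), ("F", "xc"), ("G", "xc")]
print(" ".join(merge_pos_with_bmes(example)))
```

With the placeholder segmentation above, the printed labels match the converted sequence given in the text, "tB tE rS vB vE xcS xcS".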
In the Paddle part-of-speech tagging mode there are 28 part-of-speech tags in total; combined with the 4 BMES word segmentation labels, this gives 28 × 4 = 112 labels, each of which can likewise be one-hot encoded. Word-level coding can thus be performed on the word sequence of the training data to obtain word-level information; that is, the part-of-speech sequence with added word segmentation information is encoded to obtain a word-level feature vector.
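The label count above can be checked with a short sketch that builds the combined label vocabulary and one-hot encodes a label. The 28 tag names listed here are an assumption: they are intended to mirror the tag set documented for jieba's paddle mode (24 part-of-speech tags plus 4 named-entity tags) and should be verified against the tool actually used.

```python
from typing import Dict, List

# Assumed tag set (24 part-of-speech tags + 4 named-entity tags = 28).
POS_TAGS: List[str] = [
    "n", "f", "s", "t", "nr", "ns", "nt", "nw", "nz", "v", "vd", "vn",
    "a", "ad", "an", "d", "m", "q", "r", "p", "c", "u", "xc", "w",
    "PER", "LOC", "ORG", "TIME",
]
BMES: List[str] = ["B", "M", "E", "S"]

# Every part-of-speech tag combined with every BMES label: 28 * 4 = 112.
TAGS: List[str] = [pos + seg for pos in POS_TAGS for seg in BMES]
TAG_TO_ID: Dict[str, int] = {tag: i for i, tag in enumerate(TAGS)}


def one_hot_tag(tag: str) -> List[int]:
    """One-hot encode a combined label such as 'tB' or 'xcS'."""
    vec = [0] * len(TAGS)
    vec[TAG_TO_ID[tag]] = 1
    return vec
```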
Adding word-level information during training is equivalent to telling the network where each word begins and ends, so that the network model can learn this knowledge; at test time, new words can then simply be added to the word segmentation tool. Under normal conditions the model will not predict punctuation marks in the middle of a word.
In summary, the word sequence of the training data is converted into phoneme-level, character-level and word-level feature vectors, and the three kinds of information are fused by vector addition, finally yielding feature vectors that fuse the three levels of information as the input features.
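The fusion by vector addition can be sketched as below. Vector addition requires the three vectors to share one dimension, and the text does not say how the one-hot codes of different sizes are brought to a common size; this sketch assumes each one-hot code is projected through its own embedding table (multiplying a one-hot vector by a matrix is a row lookup). The table sizes, the dimension and all names are hypothetical.

```python
import random
from typing import List


def make_embedding_table(num_classes: int, dim: int, seed: int) -> List[List[float]]:
    """Create a randomly initialised table with one `dim`-sized row per class."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_classes)]


def fuse_features(
    phoneme_ids: List[int],
    char_ids: List[int],
    tag_ids: List[int],
    phoneme_table: List[List[float]],
    char_table: List[List[float]],
    tag_table: List[List[float]],
) -> List[List[float]]:
    """Look up the three embeddings per position and add them element-wise."""
    if not (len(phoneme_ids) == len(char_ids) == len(tag_ids)):
        raise ValueError("the three id sequences must have the same length")
    fused = []
    for p, c, t in zip(phoneme_ids, char_ids, tag_ids):
        fused.append(
            [a + b + d for a, b, d in zip(phoneme_table[p], char_table[c], tag_table[t])]
        )
    return fused


DIM = 8  # hypothetical shared feature dimension
phoneme_table = make_embedding_table(5103, DIM, seed=0)  # 27 * 189 classes
char_table = make_embedding_table(6000, DIM, seed=1)     # hypothetical vocabulary size
tag_table = make_embedding_table(112, DIM, seed=2)       # 28 * 4 labels

features = fuse_features([191, 7], [42, 43], [12, 14], phoneme_table, char_table, tag_table)
```

Each position of the input sequence ends up with a single vector of length `DIM`, which is what a downstream feature extraction layer would consume.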
2.2 feature extraction
As above, once the input feature representation is complete, the feature vector for the input is available. Optionally, the feature vector may be fed into a BiLSTM (bidirectional LSTM) layer for further feature extraction, yielding a more robust feature vector for the subsequent classification. It should be noted that feature extraction networks other than BiLSTM may also be used; the specific feature extraction network is not limited herein.
2.3 classifier
Herein, a CRF (conditional random field) is preferably selected as the classifier. Because adjacent words are related to one another, the CRF is used to score the prediction for the whole sequence, and through model training and optimization it selects the optimal label sequence. During training, the predicted label sequence is compared with the real label sequence so as to optimize the whole model.
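How a CRF layer picks the best-scoring label sequence can be illustrated with Viterbi decoding over per-position emission scores and label-to-label transition scores. This is a generic sketch of the standard decoding technique, not code from the patent; the function name and the score layout are assumptions, and training of the scores is not shown.

```python
from typing import List


def viterbi_decode(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label path.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    if not emissions:
        return []
    num_labels = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_score = []
        pointers = []
        for j in range(num_labels):
            # Best previous label for arriving at label j.
            best_i = max(range(num_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    best = max(range(num_labels), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path
```

The transition scores are what let the model take the relation between neighbouring labels into account instead of classifying each position independently.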
In general, in the training stage, phoneme-level, character-level and word-level information is used as input, BiLSTM is used as the feature extractor, and a CRF is used as the classifier to train the punctuation prediction algorithm; training yields a prediction model for adding Chinese punctuation marks to input text information.
It should be noted that classifiers other than CRF may also be used; the specific classifier used for model training is not limited herein.
【3】 Model testing
The flow of the model test phase is shown in fig. 4.
The testing phase is much the same as the model training phase. A piece of unpunctuated text is input from the speech recognition system, and the data flows through the feature representation layer and the BiLSTM layer to the CRF classifier, which outputs a label sequence. Unlike the model training stage, the prediction model is not trained during the testing stage and all of its parameters are fixed. After the label sequence is obtained, the labels are converted into Chinese punctuation marks and inserted into the word sequence to obtain the final punctuated text, which is then returned to the speech recognition system.
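The final step, converting the label sequence back into punctuation marks, can be sketched as follows. The label names C, P, Q, E and O are those defined in the claims; the mapping to full-width punctuation characters and the function name are assumptions made for illustration.

```python
from typing import Dict, List

# Labels as defined in the claims; the target characters are assumed.
LABEL_TO_PUNCT: Dict[str, str] = {
    "C": "，",  # comma
    "P": "。",  # period
    "Q": "？",  # question mark
    "E": "！",  # exclamation mark
    "O": "",    # no punctuation after this character
}


def restore_punctuation(chars: List[str], labels: List[str]) -> str:
    """Append the punctuation mark named by each label after its character."""
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    return "".join(ch + LABEL_TO_PUNCT[label] for ch, label in zip(chars, labels))
```

For instance, `restore_punctuation(list("你好吗"), ["O", "O", "Q"])` appends a question mark after the last character and leaves the other two unchanged.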
In addition, if a new word appears, it only needs to be added to the jieba Chinese word segmentation tool. The tool then keeps the new word together when segmenting a sentence, so that the boundary information of the word is obtained, and under normal conditions the model will not predict punctuation marks in the middle of the word. Compared with conventional punctuation prediction algorithms, the method does not need to collect training data again or retrain the model, which saves time and labor.
The method for adding Chinese punctuation marks disclosed in the present invention is explained in detail above. It is worth to be noted that the key points of the present invention are:
1. A Chinese punctuation prediction method based on multi-level features is provided for the Chinese punctuation prediction task, with model training based on phoneme-level, character-level and word-level feature vectors.
2. For phoneme-level feature vector coding, a plain Pinyin sequence is not used; instead, a G2P tool derives the phoneme sequence from the sentence as a whole, so that the phoneme sequence of the entire sentence is obtained. The advantage is that polyphonic characters are handled more accurately than by direct dictionary lookup.
3. For word-level feature vector coding, the jieba Chinese word segmentation tool is used to obtain word segmentation and part-of-speech information, and the two are fused using BMES labels to obtain the word-level information.
Referring to fig. 6, an embodiment of the present invention further provides a system for adding chinese punctuation marks, which may include:
the data processing module 61 is configured to obtain text information and perform standardization to obtain training data including a word sequence and a corresponding tag sequence, where a tag in the tag sequence represents a punctuation mark to be added after a corresponding word in the word sequence;
the model training module 62 is configured to perform phoneme-level coding, character-level coding and word-level coding on training data, convert a word sequence of the training data into a phoneme-level feature vector, a character-level feature vector and a word-level feature vector, and perform superposition and fusion to obtain a feature vector fusing information of three levels; based on the feature vector fusing the three levels of information, performing feature extraction and classifier training to obtain a prediction model, wherein the prediction model is used for adding Chinese punctuation marks to input text information;
and,
the model test module 63 is configured to receive text information input by the speech recognition system, and output the text information added with the Chinese punctuation marks to the speech recognition system by using the prediction model.
For a more detailed description of the system, reference is made to the description of the method embodiments above.
Referring to fig. 7, an embodiment of the present invention further provides a computer device 70, which includes a processor 71 and a memory 72, where the memory 72 stores a program, and the program includes computer-executable instructions, and when the computer device 70 runs, the processor 71 executes the computer-executable instructions stored in the memory 72, so as to cause the computer device 70 to execute the method for adding a chinese punctuation mark as described above.
An embodiment of the present invention also provides a computer-readable storage medium storing one or more programs, the one or more programs comprising computer-executable instructions, which when executed by a computer device, cause the computer device to perform the chinese punctuation mark addition method as recited in any one of claims 1 to 7.
To sum up, the embodiment of the invention discloses a method, a system and equipment for adding Chinese punctuation marks. According to the technical scheme, the embodiment of the invention has the following technical effects:
In the model training stage, the invention extracts features of the training data using three levels of information rather than character-level information alone, performing phoneme-level coding, character-level coding and word-level coding together.
First, because word-level coding is performed in the model training stage and word-level information is used, a new word encountered later only needs to be added to the word segmentation tool. The prediction model then knows, without retraining, that the characters of the new word form a single unit and does not add punctuation marks in the middle, which avoids retraining the model for new words, saves labor, and reduces cost.
Second, because phoneme-level coding is performed in the model training stage and phoneme-level information is added, the prediction model can learn the regularities between punctuation marks and the pronunciation of adjacent Chinese characters and use these pronunciation regularities to assist punctuation prediction, thereby improving the accuracy of punctuation prediction.
In short, a prediction model trained on multi-level (phoneme-level, character-level and word-level) features can overcome various problems in conventional Chinese punctuation prediction techniques: it helps avoid the model retraining otherwise caused by the OOV problem, and it helps improve punctuation prediction accuracy.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
The above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; those of ordinary skill in the art will understand that: the technical solutions described in the above embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A Chinese punctuation mark adding method is characterized by comprising the following steps:
acquiring text information and preprocessing the text information to obtain training data comprising word sequences and corresponding label sequences, wherein labels in the label sequences represent punctuation marks which are added behind corresponding words in the word sequences;
converting a word sequence of the training data into phoneme-level feature vectors, character-level feature vectors and word-level feature vectors by performing phoneme-level coding, character-level coding and word-level coding on the training data, and performing superposition fusion to obtain feature vectors fusing three levels of information;
and performing feature extraction and classifier training based on the feature vector fused with the three levels of information to obtain a prediction model, wherein the prediction model is used for adding Chinese punctuation marks to the input text information.
2. The method of claim 1, wherein the pre-processing comprises data cleansing and data normalization;
the data cleansing includes: replacing punctuation marks in the text information, and replacing punctuation marks except commas, periods, question marks and exclamation marks with commas, periods, question marks or exclamation marks;
the data normalization comprises: corresponding each word in the text message to a label, the label comprising: a tag C representing a comma, a tag P representing a period, a tag Q representing a question mark, a tag E representing an exclamation mark, and an unsigned tag O.
3. The method of claim 1, wherein said phoneme-level encoding comprises:
converting the word sequence in the training data into a phoneme sequence by using a grapheme-to-phoneme (G2P) model;
and coding the obtained phoneme sequence by adopting a one-hot coding mode to obtain a phoneme-level feature vector.
4. The method of claim 1, wherein the character-level encoding comprises:
and coding the word sequence of the training data by adopting a one-hot coding mode to obtain a character-level feature vector.
5. The method of claim 1, wherein said word-level encoding comprises:
performing word segmentation and part-of-speech tagging on training data by adopting a Chinese word segmentation tool, acquiring a word segmentation sequence and a corresponding part-of-speech sequence, and adding word segmentation information into the part-of-speech sequence by utilizing a BMES label;
and coding the part-of-speech sequence added with the word segmentation information by adopting a one-hot coding mode to obtain a word-level feature vector.
6. The method of claim 5, further comprising:
adding new words in the Chinese word segmentation tool, performing part-of-speech tagging on the new words, and adding segmentation information in part-of-speech sequences of the new words by using BMES labels.
7. The method of any of claims 1-6, further comprising:
and receiving text information input by a voice recognition system, and outputting the text information added with the Chinese punctuation marks to the voice recognition system by utilizing the prediction model.
8. A Chinese punctuation adding system is characterized by comprising:
the data processing module is used for acquiring text information and preprocessing the text information to obtain training data comprising word sequences and corresponding label sequences, wherein labels in the label sequences represent punctuation marks which are added behind corresponding words in the word sequences;
the model training module is used for converting a word sequence of the training data into a phoneme-level feature vector, a character-level feature vector and a word-level feature vector by performing phoneme-level coding, character-level coding and word-level coding on the training data, and performing superposition fusion to obtain a feature vector fusing information of three levels; and performing feature extraction and classifier training based on the feature vector fused with the three levels of information to obtain a prediction model, wherein the prediction model is used for adding Chinese punctuation marks to the input text information.
9. A computer device comprising a processor and a memory, the memory having stored therein a program comprising computer-executable instructions that, when executed by the computer device, the processor executes the computer-executable instructions stored by the memory to cause the computer device to perform the chinese punctuation addition method according to any one of claims 1 to 7.
10. A computer readable storage medium storing one or more programs, the one or more programs comprising computer executable instructions, which when executed by a computer device, cause the computer device to perform the method of chinese punctuation addition according to any one of claims 1 to 7.
CN202010958997.7A 2020-09-14 2020-09-14 Chinese punctuation adding method, system and equipment Pending CN112069816A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010958997.7A CN112069816A (en) 2020-09-14 2020-09-14 Chinese punctuation adding method, system and equipment

Publications (1)

Publication Number Publication Date
CN112069816A true CN112069816A (en) 2020-12-11

Family

ID=73695881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010958997.7A Pending CN112069816A (en) 2020-09-14 2020-09-14 Chinese punctuation adding method, system and equipment

Country Status (1)

Country Link
CN (1) CN112069816A (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107945786A (en) * 2017-11-27 2018-04-20 北京百度网讯科技有限公司 Phoneme synthesizing method and device
CN109168051A (en) * 2018-09-11 2019-01-08 天津理工大学 A kind of network direct broadcasting platform supervision evidence-obtaining system based on blue-ray storage
CN109284352A (en) * 2018-09-30 2019-01-29 哈尔滨工业大学 A kind of querying method of the assessment class document random length words and phrases based on inverted index
CN109767755A (en) * 2019-03-01 2019-05-17 广州多益网络股份有限公司 A kind of phoneme synthesizing method and system
CN109858038A (en) * 2019-03-01 2019-06-07 科大讯飞股份有限公司 A kind of text punctuate determines method and device
CN109918666A (en) * 2019-03-06 2019-06-21 北京工商大学 A kind of Chinese punctuation mark adding method neural network based
CN110310619A (en) * 2019-05-16 2019-10-08 平安科技(深圳)有限公司 Polyphone prediction technique, device, equipment and computer readable storage medium
JP6605105B1 (en) * 2018-10-15 2019-11-13 株式会社野村総合研究所 Sentence symbol insertion apparatus and method
CN110808026A (en) * 2019-11-04 2020-02-18 金华航大北斗应用技术有限公司 Electroglottography voice conversion method based on LSTM
CN111147444A (en) * 2019-11-20 2020-05-12 维沃移动通信有限公司 Interaction method and electronic equipment
CN111241233A (en) * 2019-12-24 2020-06-05 浙江大学 Service robot instruction analysis method based on key verb feature full-density transmission
CN111460807A (en) * 2020-03-13 2020-07-28 平安科技(深圳)有限公司 Sequence labeling method and device, computer equipment and storage medium
CN111597807A (en) * 2020-04-30 2020-08-28 腾讯科技(深圳)有限公司 Method, device and equipment for generating word segmentation data set and storage medium thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张钹、党建武、俞凯: "《听觉信息处理研究前沿》", 上海交通大学出版社, pages: 399 - 400 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435179A (en) * 2021-06-24 2021-09-24 科大讯飞股份有限公司 Composition evaluation method, device, equipment and storage medium
CN113435179B (en) * 2021-06-24 2024-04-30 科大讯飞股份有限公司 Composition review method, device, equipment and storage medium
CN115017883A (en) * 2021-12-20 2022-09-06 昆明理工大学 Text punctuation recovery method based on pre-training fusion voice features
CN115017883B (en) * 2021-12-20 2023-03-07 昆明理工大学 Text punctuation recovery method based on pre-training fusion voice features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination