WO2020224121A1 - Corpus screening method, device and computer equipment for speech recognition training - Google Patents

Corpus screening method, device and computer equipment for speech recognition training

Info

Publication number
WO2020224121A1
WO2020224121A1 (PCT/CN2019/103470)
Authority
WO
WIPO (PCT)
Prior art keywords
corpus
word
rate
fragments
word sequence
Prior art date
Application number
PCT/CN2019/103470
Other languages
English (en)
French (fr)
Inventor
Wang Tao (王涛)
Original Assignee
Ping An Technology (Shenzhen) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd.
Publication of WO2020224121A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63 Querying
    • G06F16/635 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Definitions

  • This application relates to the field of speech recognition technology, and in particular to a corpus screening method, device, computer equipment and computer-readable storage medium for speech recognition training.
  • A good speech recognition model is inseparable from annotated corpus of good labeling quality, but the corpus collected through various channels usually cannot guarantee labeling accuracy. If the collected corpus is used directly to train a speech recognition model, incorrectly labeled corpus is not only useless for training but also reduces the accuracy of the resulting model.
  • The embodiments of this application provide a corpus screening method, device, computer equipment, and computer-readable storage medium for speech recognition training, which can solve the problem in traditional technology of inaccurate speech recognition models caused by inaccurate corpus.
  • an embodiment of the present application provides a corpus screening method for speech recognition training.
  • The method includes: labeling the corpus with timestamps to obtain multiple corpus fragments, and combining the multiple corpus fragments into a first corpus set; using the first corpus set to train a speech recognition model to obtain a first speech recognition model; using the first speech recognition model to recognize each corpus fragment in the first corpus set to obtain the first word sequence corresponding to each corpus fragment; comparing each first word sequence with the standard word sequence corresponding to it to calculate the first word recognition rate of each corpus fragment, where the first word recognition rate includes a word error rate or a word correct rate; judging whether the first word recognition rate of each corpus fragment meets a preset condition of the first word recognition rate; and storing the corpus fragments whose first word recognition rate meets the preset condition to form a screened second corpus set.
  • An embodiment of the present application also provides a corpus screening device for speech recognition training, including: a tagging unit for timestamping the corpus to obtain multiple corpus fragments and combining the multiple corpus fragments into a first corpus set; a first training unit for training a speech recognition model with the first corpus set to obtain a first speech recognition model; a first decoding unit for recognizing each corpus fragment in the first corpus set with the first speech recognition model to obtain the first word sequence corresponding to each corpus fragment; a first statistical unit for comparing each first word sequence with the standard word sequence corresponding to it to calculate the first word recognition rate of each corpus fragment, where the first word recognition rate includes a word error rate or a word correct rate; a first judgment unit for determining whether the first word recognition rate of each corpus fragment meets a preset condition of the first word recognition rate; and a first screening unit configured to store the corpus fragments whose first word recognition rate meets the preset condition to form a screened second corpus set.
  • An embodiment of the present application also provides a computer device, which includes a memory and a processor; the memory stores a computer program, and the processor implements the above corpus screening method for speech recognition training when executing the computer program.
  • The embodiments of the present application also provide a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the processor executes the above corpus screening method for speech recognition training.
  • FIG. 1 is a schematic diagram of an application scenario of a corpus screening method for speech recognition training provided by an embodiment of this application;
  • FIG. 2 is a schematic flowchart of a corpus screening method for speech recognition training provided by an embodiment of this application;
  • FIG. 3 is a schematic diagram of the corpus being timestamped in the corpus screening for speech recognition training provided by an embodiment of the application;
  • FIG. 4 is a flowchart of the principle of speech recognition in the corpus screening method for speech recognition training provided by an embodiment of the application;
  • FIG. 5 is a schematic diagram of voice coding in a corpus screening method for speech recognition training provided by an embodiment of the application;
  • FIG. 6 is a schematic block diagram of a corpus screening device for speech recognition training provided by an embodiment of this application.
  • FIG. 7 is another schematic block diagram of a corpus screening device for speech recognition training provided by an embodiment of the application.
  • FIG. 8 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • FIG. 1 is a schematic diagram of an application scenario of a corpus screening method for speech recognition training provided by an embodiment of the application.
  • the application scenarios include:
  • The terminal, which can also be called the front end, collects the corpus for training the speech recognition model. The terminal can be an electronic device such as a notebook computer, a smart watch, a tablet computer, or a desktop computer, and is connected to the server.
  • The server performs speech recognition. The server can be a single server, a server cluster, or a cloud server; if the server is a server cluster, it can also include a master server and slave servers.
  • The steps executed on the server side are mainly taken as an example below to explain the corpus screening method for speech recognition training of this application.
  • In the technical solution, the working process of each subject in FIG. 1 is as follows: the terminal collects the corpus for speech recognition model training and sends it to the server so that the server can screen the corpus; the server labels the corpus with timestamps to obtain multiple corpus fragments and combines them into a first corpus set; the first corpus set is used to train a speech recognition model to obtain a first speech recognition model; the first speech recognition model recognizes each corpus fragment in the first corpus set to obtain the first word sequence corresponding to each fragment; and each first word sequence is compared with the standard word sequence corresponding to it to calculate the first word recognition rate of each corpus fragment, where the first word recognition rate includes the word error rate or the word correct rate.
  • the corpus screening method for speech recognition training in the embodiment of the present application can be applied to a terminal or a server, as long as the training corpus is processed before the server recognizes speech.
  • The application environment of the corpus screening method for speech recognition training in the embodiments of this application is not limited to the environment shown in FIG. 1; the corpus screening method can also be applied together with speech recognition to computer equipment such as a terminal, as long as the screening is performed before the computer equipment performs speech recognition.
  • the application scenarios of the corpus screening method for speech recognition training are only used to illustrate the technical solutions of this application, and are not used to limit the technical solutions of this application.
  • the connection relationship can also have other forms.
  • FIG. 2 is a schematic flowchart of a corpus screening method for speech recognition training provided by an embodiment of the application. The method is applied to the server in FIG. 1 to complete all or part of the functions of the corpus screening method for speech recognition training. As shown in FIG. 2, the method includes the following steps S210-S270:
  • S210: Label the corpus with timestamps to obtain multiple corpus fragments, and combine the multiple corpus fragments into a first corpus set.
  • A corpus fragment, which can also be called a segment (Segment in English), refers to a labeled section obtained by marking the corpus with timestamps; each labeled section is one segment.
  • the corpus for training a speech recognition model generally includes speech and text corresponding to the speech. The accuracy of speech recognition by the speech recognition model is judged by comparing the word sequence recognized by the speech recognition model with the text corresponding to the speech. Marking the corpus can also be referred to as labeling the corpus, which refers to matching the voice and the text expressed by the voice.
  • Generally, a piece of text corresponds to a standard pronunciation, that is, to a piece of standard speech. In actual speech recognition, however, because each person pronounces differently and/or the background environment varies, even the same piece of text produces different speech from different people, so the text and the speech do not match exactly.
  • When different people express the same text by voice, different pronunciations or different background noises produce different speech signals.
  • Although the underlying text is the same, these voice differences, caused by differences in pronunciation or background environment, lead to different text content being recognized during speech recognition.
  • Therefore, when training a speech recognition model, the model should be trained with corpus in which speech and text match as exactly as possible, in order to obtain a model with good recognition performance.
  • The accuracy with which the speech contained in the corpus matches the text it expresses is called the labeling accuracy of the corpus.
  • A timestamp (Timestamp in English) is a complete, verifiable piece of data that can indicate that some data existed before a certain time; it is usually a character sequence that uniquely identifies a certain moment.
  • The corpus used for speech recognition model training generally includes speech and the text corresponding to the speech, and is generally referred to as annotated corpus or tagged corpus.
  • For convenience of recording, one way of labeling is to mark a long speech recording with timestamps; each labeled section is a segment.
  • The long training speech is timestamped to form multiple labeled sections; one labeled section corresponds to one corpus fragment, and each corpus fragment contains a piece of speech and the text description corresponding to that speech.
  • Multiple corpus fragments segmented according to the timestamps are thus obtained, and these fragments form the first corpus set for speech recognition training.
  • FIG. 3 is a schematic diagram of the corpus being timestamped in the corpus screening for speech recognition training provided by an embodiment of the application. As shown in FIG. 3, the corpus L is marked with five timestamps into five segments; that is, the corpus L is divided into five corpus fragments by timestamp 1 to timestamp 5, and the five corpus fragments form the first corpus set.
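As a loose illustration of this segmentation step (the data layout and function name below are hypothetical, not the patent's actual format), cutting a long recording at timestamp marks might look like:

```python
def split_by_timestamps(audio, sample_rate, marks):
    """Cut `audio` (a sequence of samples) at the times in `marks` (seconds).

    Mirrors how timestamp labeling divides corpus L into segments in FIG. 3:
    each interval between consecutive marks becomes one corpus fragment.
    """
    boundaries = [0.0] + sorted(marks)
    fragments = []
    for start, end in zip(boundaries, boundaries[1:]):
        fragments.append(audio[int(start * sample_rate):int(end * sample_rate)])
    fragments.append(audio[int(boundaries[-1] * sample_rate):])  # tail after the last mark
    return fragments

# 1 second of dummy "audio" at 8 kHz, cut at 0.25 s, 0.5 s and 0.75 s -> 4 fragments
corpus_l = list(range(8000))
first_corpus_set = split_by_timestamps(corpus_l, 8000, [0.25, 0.5, 0.75])
assert len(first_corpus_set) == 4
```

In practice each fragment would also carry the text labeled for its time span, so that a fragment is a (speech, text) pair.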
  • The methods for timestamping video and audio respectively are as follows.
  • For video, the display timestamp of each frame is calculated as pts = inc * (1000 / fps), where:
  • pts refers to the display time;
  • inc is a static counter with initial value 0, incremented by 1 each time a timestamp is computed;
  • fps means Frames Per Second.
  • Timestamping can be done with FFMpeg (Fast Forward Mpeg in English), an open-source computer program that can record and convert digital audio and video and turn them into streams.
  • For audio, the display timestamp of each frame is calculated as pts = inc * (frame_size * 1000 / sample_rate), where:
  • pts refers to the display time;
  • inc is a static counter with initial value 0, incremented by 1 for each timestamp;
  • frame_size is the number of samples in one audio frame;
  • sample_rate refers to the sampling rate, also known as sample rate or sampling speed.
  • The sampling frequency refers to the number of times per second that the amplitude of the sound wave is sampled when the analog sound waveform is digitized.
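As an illustration of the two calculations (the formulas pts = inc * 1000/fps for video and pts = inc * frame_size * 1000/sample_rate for audio are the conventional FFMpeg-style computation, stated here as an assumption about the method):

```python
def video_pts(inc, fps):
    """Display timestamp (ms) of video frame number `inc` at `fps` frames/second."""
    # each video frame advances the clock by 1000/fps milliseconds
    return inc * (1000.0 / fps)

def audio_pts(inc, frame_size, sample_rate):
    """Display timestamp (ms) of audio frame `inc` with `frame_size` samples per frame."""
    # each audio frame holds frame_size samples, i.e. frame_size/sample_rate seconds
    return inc * (frame_size * 1000.0 / sample_rate)

# 25 fps video: frame 50 is displayed at 2000 ms
assert video_pts(50, 25) == 2000.0
# 1024-sample frames at 44100 Hz: frame 43 falls at roughly 998.3 ms
```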
  • The current Unix timestamp (also called Unix time) can be obtained in different programming languages.
  • In Java, the time method is adopted.
  • In JavaScript, the corresponding time method is adopted.
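The patent names Java and JavaScript; purely as an illustration, the same Unix timestamp can be read in Python with the standard library:

```python
import time

ts = int(time.time())            # whole seconds since 1970-01-01 00:00:00 UTC
ts_ms = int(time.time() * 1000)  # millisecond precision, as Java/JavaScript return

assert ts_ms >= ts * 1000
```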
  • S220: Use the first corpus set to train a speech recognition model to obtain a first speech recognition model.
  • Speech recognition is also called Automatic Speech Recognition (ASR).
  • A speech recognition system includes a speech recognition model; besides the model itself, the system also includes other related content that provides service support for the model.
  • In speech recognition, the analog signal is converted into a digital signal, and the digital signal is then converted into text.
  • Training the speech recognition model is a process in which the model automatically adjusts its parameters according to the training corpus so as to adapt to it; the data and the model are thereby matched. Therefore, every training run with a different training corpus changes the parameters of the speech recognition model.
  • In acoustic modeling for speech recognition based on deep neural networks (DNN), different network structures and optimization strategies can greatly improve the performance of the acoustic model.
  • The training corpus can be used to train the speech recognition model through supervised learning, and each change in the training corpus changes the parameters of the model. Therefore, in the embodiments of the present application, during iterative training, each round of corpus screening causes an adjustment of the parameters in the speech recognition model, which in turn optimizes the model.
  • FIG. 4 is a flowchart of the principle of speech recognition in the corpus screening method for speech recognition training provided by an embodiment of the application.
  • Each training of a speech recognition model goes through the following process. As the training corpus changes, the parameters in the speech recognition model are changed accordingly, so as to adjust and optimize the model and improve its accuracy.
  • As shown in FIG. 4, the principle of speech recognition includes the following processes: 1) Voice input: acquiring speech, for example the collected training speech corpus; 2) Encoding: performing feature extraction on the input speech through encoding, for example encoding the speech corpus and extracting its features; 3) Decoding: decoding the extracted speech features through an acoustic model and a language model, where the acoustic model is trained with training data 1 to achieve the desired effect and the language model is trained with training data 2 to meet the requirements; speech recognition converts voice sound waves into text, and given training data for the target speech, a statistical model for recognition can be trained; 4) Text output: converting the speech features decoded by the acoustic model and the language model into text output, for example converting the training speech corpus into text, thereby realizing speech-to-text recognition.
  • The acoustic model (Acoustic Model in English) is, in most current mainstream systems, built with hidden Markov models.
  • A language model is an abstract mathematical model of language based on objective facts about language; it is a kind of correspondence: the relationship between a language model and the objective facts of language is like the relationship between an abstract line and a concrete line in mathematics.
  • FIG. 5 is a schematic diagram of voice encoding in the corpus screening method for speech recognition training provided by an embodiment of the application. As shown in FIG. 5, voice encoding generally requires the three steps of sampling, quantization, and encoding.
  • Sound decoding is the process of converting the digitized speech signal back into an analog speech signal and outputting it.
  • the decoding process is the process of finding the most likely corresponding phrase given the acoustic characteristics.
  • S230: Recognize each corpus fragment in the first corpus set through the first speech recognition model to obtain a first word sequence corresponding to each corpus fragment.
  • After the first speech recognition model is obtained, it is used to recognize each corpus fragment in the first corpus set; that is, given the acoustic features of each corpus fragment, the most likely corresponding phrase is found, yielding the first word sequence corresponding to each corpus fragment.
  • The word recognition rate refers to the proportion of words recognized correctly (or incorrectly) by the speech recognition model for each corpus fragment, relative to the total number of words in the standard word sequence corresponding to that fragment.
  • The word recognition rate includes the word error rate and the word correct rate.
  • The word error rate (Word Error Rate in English, WER for short) refers to the proportion of incorrectly recognized words in a corpus fragment to the total number of words in the fragment's standard word sequence.
  • The word correct rate refers to the proportion of correctly recognized words in a corpus fragment to the total number of words in the fragment's standard word sequence.
  • Each first word sequence is compared with the standard word sequence corresponding to it to calculate the first word error rate or the first word correct rate of each corpus fragment.
  • Specifically, the step of comparing each first word sequence with its corresponding standard word sequence to count the first word error rate of each corpus fragment includes: comparing the words of each first word sequence with the corresponding words of its standard word sequence one by one in sequence order, to obtain the insertion words, replacement words, and deletion words needed to adjust the first word sequence into the standard word sequence; and calculating the ratio of the sum of the numbers of insertion words, replacement words, and deletion words to the number of words in the standard word sequence to obtain the first word error rate, that is, Formula (1): WER = (S + D + I) / N * 100%, where:
  • S is the abbreviation of Substitution and refers to the replacement words, i.e. the words that need to be replaced to keep the recognized word sequence consistent with the standard word sequence;
  • D is the abbreviation of Deletion and refers to the deletion words, i.e. the words that need to be deleted to keep the recognized word sequence consistent with the standard word sequence;
  • I is the abbreviation of Insertion and refers to the insertion words, i.e. the words that need to be inserted to keep the recognized word sequence consistent with the standard word sequence;
  • N is the abbreviation of Number and refers to the number of words in the standard word sequence;
  • Accuracy is the accuracy rate, also called the correct rate, i.e. the proportion of words accurately recognized in speech recognition.
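A minimal sketch of computing the word error rate WER = (S + D + I) / N with a standard edit-distance alignment (word sequences as Python lists; N is the length of the standard sequence):

```python
def word_error_rate(hyp, ref):
    """WER via dynamic-programming edit distance between recognized and standard words.

    hyp: recognized word sequence; ref: standard word sequence (N = len(ref)).
    """
    # d[i][j]: minimum edits to turn hyp[:j] into ref[:i]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref = "the cat sat on the mat".split()
hyp = "the cat sat on mat".split()              # one deletion out of 6 words
assert abs(word_error_rate(hyp, ref) - 1 / 6) < 1e-9
```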
  • S250: Determine whether the first word recognition rate of each corpus fragment meets a preset condition of the first word recognition rate.
  • The preset condition of the first word recognition rate refers to a condition involving a preset threshold of the first word recognition rate. For example, if the first word recognition rate is the first word error rate, the preset condition is being less than or equal to a first preset word error rate threshold; if the first word recognition rate is the first word correct rate, the preset condition is being greater than or equal to a first preset word correct rate threshold.
  • A preset word recognition rate threshold is set to screen the corpus fragments: fragments that do not meet the labeling accuracy requirements are filtered out, and fragments that meet the requirements are kept to obtain an effective training corpus.
  • If a corpus fragment meets the labeling accuracy requirements, the corpus fragment corresponding to the first word recognition rate is retained and stored to form the screened second corpus set; that is, the corpus fragments whose first word recognition rate meets the preset condition are stored as the effective sentences obtained by screening, and the speech recognition model is further trained with them.
  • If a corpus fragment does not meet the labeling accuracy requirements, the corpus fragment corresponding to the first word recognition rate is filtered out; fragments whose first word recognition rate does not meet the preset condition are eliminated to complete the screening of the training corpus for the speech recognition model.
  • In the embodiments of the present application, the corpus for speech recognition model training is screened in advance: the corpus is timestamped to obtain multiple corpus fragments, which are combined into a first corpus set; the first corpus set is used to train a speech recognition model to obtain a first speech recognition model; the first speech recognition model recognizes each corpus fragment in the first corpus set to obtain the first word sequence corresponding to each fragment; each first word sequence is compared with its corresponding standard word sequence to calculate the first word recognition rate of each corpus fragment, where the first word recognition rate includes a word error rate or a word correct rate; whether the first word recognition rate of each corpus fragment meets the preset condition is judged; and the corpus fragments whose first word recognition rate meets the preset condition are stored to form a screened second corpus set.
  • In one embodiment, after the step of storing the corpus fragments whose first word recognition rate meets the preset condition to form a screened second corpus set, the method further includes:
  • retraining the speech recognition model with the corpus fragments obtained from the first screening, that is, training the first speech recognition model with the second corpus set to obtain a second speech recognition model; recognizing each corpus fragment in the second corpus set with the second speech recognition model to obtain the second word sequence of each fragment; and comparing each second word sequence with its corresponding standard word sequence to calculate the second word recognition rate of each corpus fragment, where the second word recognition rate includes a word error rate or a word correct rate.
  • The above steps are iterated until all remaining corpus fragments satisfy the preset word recognition rate condition and form the screened corpus, i.e. until a corpus that meets the requirements is finally obtained. For example, if the corpus is required to have a WER below 5%, the fragments whose WER is below 5% can be screened out; this effectively filters corpus labeled in segment form and yields a training corpus whose labeling accuracy meets the requirements, thereby improving the accuracy of the trained speech recognition model.
  • This method of iterative corpus screening for speech recognition selects the corpus by training and decoding, then uses the screened corpus to train the speech recognition model again, and iterates repeatedly to finally obtain a screened corpus of high accuracy; it can effectively filter corpus labeled in Segment form and obtain a training corpus whose labeling accuracy meets the requirements.
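The train-decode-filter iteration can be sketched at a high level as follows; `train`, `recognize`, and `wer` are stand-ins for a real ASR toolkit, and only the control flow reflects the described method:

```python
def screen_corpus(corpus, train, recognize, wer, threshold):
    """Repeat train -> decode -> filter until every remaining fragment passes the threshold."""
    while True:
        model = train(corpus)                       # train on the current corpus set
        kept = [frag for frag in corpus
                if wer(recognize(model, frag), frag) <= threshold]
        if len(kept) == len(corpus):                # nothing was filtered out: done
            return kept, model
        corpus = kept                               # iterate on the screened set

# Toy stand-ins: a "fragment" is (true_text, noisy_label); wer is 0 when they agree.
toy = [("a", "a"), ("b", "x"), ("c", "c")]
kept, _ = screen_corpus(
    toy,
    train=lambda c: None,                           # placeholder model
    recognize=lambda model, frag: frag[0],          # "recognizes" the true text
    wer=lambda hyp, frag: 0.0 if hyp == frag[1] else 1.0,
    threshold=0.25,
)
assert kept == [("a", "a"), ("c", "c")]
```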
  • In one embodiment, the first word recognition rate is a first word error rate.
  • The step of judging whether the first word recognition rate of each corpus fragment meets the preset condition includes: judging whether the first word error rate of each corpus fragment is less than or equal to a first preset word error rate threshold.
  • The step of storing the corpus fragments whose first word recognition rate meets the preset condition to form a screened second corpus set includes: storing the corpus fragments whose first word error rate is less than or equal to the first preset word error rate threshold to form the screened second corpus set.
  • That is, when the first word recognition rate is a first word error rate, whether the first word error rate of each corpus fragment is less than or equal to the first preset word error rate threshold is judged; the corpus fragments whose first word error rate is less than or equal to the threshold are stored to form the screened second corpus set, and if the first word error rate of a corpus fragment is greater than the threshold, that fragment is filtered out to eliminate the fragments that do not meet the requirements.
  • For the specific calculation, refer to Formula (1) in the first embodiment.
  • A WER threshold is set to screen the segments. For example, with a WER threshold of 25%, fragments with a word error rate greater than 25% are filtered out, leaving the training corpus with a word error rate less than or equal to 25%, so as to obtain a corpus that meets the requirements.
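Continuing the 25% example, the screening itself reduces to a simple threshold filter (the fragment names and precomputed WER values below are made up for illustration):

```python
# Hypothetical fragments with precomputed word error rates; keep those at or below 25%.
fragments = [("seg1", 0.10), ("seg2", 0.40), ("seg3", 0.25), ("seg4", 0.30)]
WER_THRESHOLD = 0.25

second_corpus_set = [seg for seg, wer in fragments if wer <= WER_THRESHOLD]
assert second_corpus_set == ["seg1", "seg3"]
```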
  • In one embodiment, the first word recognition rate is a first word correct rate.
  • The step of judging whether the first word recognition rate of each corpus fragment meets the preset condition includes: judging whether the first word correct rate of each corpus fragment is greater than or equal to a first preset word correct rate threshold.
  • The step of storing the corpus fragments whose first word recognition rate meets the preset condition to form a screened second corpus set includes: storing the corpus fragments whose first word correct rate is greater than or equal to the first preset word correct rate threshold to form the screened second corpus set.
  • That is, not only can fragments that fail the requirements be filtered out by the word error rate of the recognized words; fragments that meet the requirements can also be screened directly by the word correct rate. When the first word recognition rate is a first word correct rate, whether the first word correct rate of each corpus fragment is greater than or equal to the first preset word correct rate threshold is judged: if it is, the corresponding corpus fragments are retained and stored to form the screened second corpus set; if it is less than the threshold, the corresponding fragments are filtered out to eliminate those that do not meet the requirements, so that the effective corpus fragments meeting the requirements are screened out as the final training corpus.
  • the step of comparing each first word sequence with its corresponding standard word sequence to compute the first word correct rate of each corpus segment includes:
  • comparing the corresponding words of each first word sequence and its standard word sequence one by one in sequence order to obtain the matching words by which the first word sequence is aligned to the standard word sequence, and calculating the ratio of the number of matching words to the number of words in the standard word sequence to obtain the first word correct rate.
  • alternatively, the step of comparing each first word sequence with its corresponding standard word sequence to compute the first word correct rate of each corpus segment includes: comparing the corresponding words of each first word sequence and its standard word sequence one by one in sequence order to obtain the insertion words, substitution words and deletion words needed to adjust the first word sequence into the standard word sequence; calculating the ratio of the total number of insertion, substitution and deletion words to the number of words in the standard word sequence to obtain the first word error rate; and obtaining the first word correct rate of the corpus segment from the first word error rate.
  • in the direct method, the corresponding words of each first word sequence and its standard word sequence are compared one by one in sequence order to obtain the matching words by which the first word sequence is aligned to the standard word sequence; the matching words are the correctly recognized words (also called accurately recognized words), and the ratio of the number of matching words to the number of words in the standard word sequence gives the first word correct rate.
  • in the indirect method, the word error rate is computed first and the word correct rate is derived from it: the corresponding words of each first word sequence and its standard word sequence are compared one by one in sequence order to obtain the insertion, substitution and deletion words needed to adjust the first word sequence into the standard word sequence; the ratio of their total number to the number of words in the standard word sequence gives the first word error rate, from which the first word correct rate of the corpus segment is obtained.
  • for the specific calculation, refer to formula (1) and formula (2) in the first embodiment.
  • before the step of annotating the corpus with timestamps to obtain multiple corpus segments and forming them into a first corpus set, the method further includes: acquiring multiple corpus sections carrying preset order identifiers, where the corpus sections are obtained by cutting the corpus into pieces of a preset size.
  • the step of annotating the corpus with timestamps to obtain multiple corpus segments and forming them into a first corpus set includes: using a distributed system to timestamp-annotate each corpus section in parallel to obtain a first corpus set composed of multiple corpus segments that are divided according to the timestamps and carry the preset order identifiers.
  • a preset order identifier is an identifier describing the position of a corpus section within the entire long speech corpus, such as a sequence number A, B, C or 1, 2, 3.
  • screening a long corpus as a single audio file reduces screening efficiency because the file is too large.
  • the corpus can therefore be cut into multiple sections of a preset size, each carrying a preset order identifier that records its position in the long speech corpus so the section can be located later.
  • the corpus is cut into multiple sections carrying preset order identifiers, each section is timestamp-annotated in parallel by a distributed system, the corpus segments obtained by dividing each section according to the timestamps are combined into a first corpus set, and the segments in the first corpus set are then screened.
  • different cutting methods can be used in different programming languages; for example, the string-cutting function Split can be used in C, and the CUT method can be used in JAVA.
  • voice activity detection can also be used to remove silent-period signals from the corpus before cutting.
  • voice activity detection is abbreviated VAD (Voice Activity Detection).
  • VAD can identify and remove long silent periods from a sound signal stream.
  • introducing VAD to remove silent-period signals further improves the accuracy and hence the quality of the corpus, and the higher-quality effective corpus in turn further improves the accuracy of the trained speech recognition model.
  • FIG. 6 is a schematic block diagram of a corpus screening device for speech recognition training provided by an embodiment of the application.
  • an embodiment of the present application also provides a corpus screening device for speech recognition training.
  • the corpus screening device for speech recognition training includes a unit for executing the aforementioned corpus screening method for speech recognition training.
  • the device can be configured in a computer device such as a server.
  • the corpus screening device 600 for speech recognition training includes an annotation unit 601, a first training unit 602, a first decoding unit 603, a first statistical unit 604, a first judgment unit 605 and a first screening unit 606.
  • the annotation unit 601 is used for timestamp-annotating the corpus to obtain multiple corpus segments and forming them into a first corpus set;
  • the first training unit 602 is used to train the speech recognition model with the first corpus set to obtain a first speech recognition model;
  • the first decoding unit 603 is configured to recognize each corpus segment in the first corpus set through the first speech recognition model to obtain the first word sequence corresponding to each corpus segment;
  • the first statistical unit 604 is configured to compare each first word sequence with its corresponding standard word sequence to compute the first word recognition rate of each corpus segment, where the first word recognition rate is a word error rate or a word correct rate;
  • the first judgment unit 605 is configured to judge whether the first word recognition rate of each corpus segment satisfies the preset first-word-recognition-rate condition;
  • the first screening unit 606 is configured to store the corpus segments whose first word recognition rate satisfies the preset condition to form a screened second corpus set.
  • FIG. 7 is another schematic block diagram of the corpus screening device for speech recognition training provided by an embodiment of the application.
  • the corpus screening device 600 for speech recognition training further includes: a second training unit 607, configured to train the first speech recognition model with the second corpus set to obtain a second speech recognition model; a second decoding unit 608, configured to recognize each corpus segment in the second corpus set through the second speech recognition model to obtain the second word sequence of each corpus segment; a second statistical unit 609, configured to compare each second word sequence with its corresponding standard word sequence to compute the second word recognition rate of each corpus segment, where the second word recognition rate is a word error rate or a word correct rate; and a second judgment unit 610, configured to judge whether the second word recognition rate of each corpus segment satisfies a preset second-word-recognition-rate condition;
  • the second screening unit 611 is configured to store the corpus segments whose second word recognition rate satisfies the preset second-word-recognition-rate condition to form a screened third corpus set.
  • the first word recognition rate is a first word error rate.
  • the first judgment unit 605 is configured to judge whether the first word error rate of each corpus segment is less than or equal to a first preset word error rate threshold.
  • the first screening unit 606 is configured to store the corpus segments whose first word error rate is less than or equal to the first preset word error rate threshold to form the screened second corpus set.
  • the first statistical unit 604 includes: a first comparison subunit, configured to compare the corresponding words of each first word sequence and its standard word sequence one by one in sequence order to obtain the insertion, substitution and deletion words needed to adjust the first word sequence into the standard word sequence; and a calculation subunit, configured to calculate the ratio of the total number of insertion, substitution and deletion words to the number of words in the standard word sequence to obtain the first word error rate.
  • the first word recognition rate is the first word correct rate.
  • the first judgment unit 605 is configured to judge whether the first word correct rate of each corpus segment is greater than or equal to a first preset word correct rate threshold.
  • the first screening unit 606 is configured to store the corpus segments whose first word correct rate is greater than or equal to the first preset word correct rate threshold to form the screened second corpus set.
  • in one embodiment, the first statistical unit 604 includes: a second comparison subunit, configured to compare the corresponding words of each first word sequence and its standard word sequence one by one in sequence order to obtain the matching words by which the first word sequence is aligned to the standard word sequence; and a second calculation subunit, configured to calculate the ratio of the number of matching words to the number of words in the standard word sequence to obtain the first word correct rate.
  • in another embodiment, the first statistical unit 604 includes: a third comparison subunit, configured to compare the corresponding words of each first word sequence and its standard word sequence one by one in sequence order to obtain the insertion, substitution and deletion words needed to adjust the first word sequence into the standard word sequence; a third calculation subunit, configured to calculate the ratio of the total number of insertion, substitution and deletion words to the number of words in the standard word sequence to obtain the first word error rate; and an obtaining subunit, configured to obtain the first word correct rate of the corpus segment from the first word error rate.
  • the corpus screening device 600 for speech recognition training further includes: an acquisition unit 613, configured to acquire multiple corpus sections carrying preset order identifiers, where the corpus sections are obtained by cutting the corpus into pieces of a preset size.
  • the annotation unit 601 is configured to use a distributed system to timestamp-annotate each corpus section in parallel to obtain a first corpus set composed of multiple corpus segments that are divided according to the timestamps and carry the preset order identifiers.
  • the division and connection of the units in the corpus screening device for speech recognition training are only illustrative.
  • in other embodiments, the corpus screening device for speech recognition training can be divided into different units as needed, and its units can be connected in different orders and ways, to complete all or part of the functions of the device.
  • the above-mentioned corpus screening device for speech recognition training can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 8.
  • FIG. 8 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 800 may be a desktop computer or a server, or may be a component or part of another device.
  • the computer device 800 includes a processor 802, a memory, and a network interface 805 connected through a system bus 801, where the memory may include a non-volatile storage medium 803 and an internal memory 804.
  • the non-volatile storage medium 803 can store an operating system 8031 and a computer program 8032.
  • the processor 802 can execute a corpus screening method for speech recognition training.
  • the processor 802 is used to provide calculation and control capabilities to support the operation of the entire computer device 800.
  • the internal memory 804 provides an environment for the operation of the computer program 8032 in the non-volatile storage medium 803.
  • when the computer program 8032 runs, the processor 802 can perform the aforementioned corpus screening method for speech recognition training.
  • the network interface 805 is used for network communication with other devices.
  • the structure shown in FIG. 8 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 800 to which the solution of the present application is applied.
  • the specific computer device 800 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
  • in some embodiments, the computer device may include only a memory and a processor; in such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 8 and are not repeated here.
  • the processor 802 is configured to run a computer program 8032 stored in a memory to implement the corpus screening method for speech recognition training in the embodiment of the present application.
  • the processor 802 may be a central processing unit (Central Processing Unit, CPU), and the processor 802 may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSPs), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • a person of ordinary skill in the art can understand that all or part of the processes in the methods of the foregoing embodiments can be implemented by a computer program, and the computer program can be stored in a computer-readable storage medium.
  • the computer program is executed by at least one processor in the computer system to implement the steps of the embodiment of the corpus screening method for speech recognition training.
  • the embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium that stores a computer program which, when executed by a processor, causes the processor to perform the steps of the corpus screening method for speech recognition training.
  • a computer program product, when run on a computer, causes the computer to execute the steps of the corpus screening method for speech recognition training described in the above embodiments.
  • the computer-readable storage medium may be an internal storage unit of the aforementioned device, such as a hard disk or memory of the device.
  • the computer-readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the device.
  • the computer-readable storage medium may also include both an internal storage unit of the device and an external storage device.
  • the storage medium is a physical, non-transitory storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disk, or another physical storage medium that can store computer programs.


Abstract

Embodiments of this application provide a corpus screening method, apparatus, computer device and computer-readable storage medium for speech recognition training. The embodiments belong to the technical field of speech recognition: a corpus is annotated with timestamps to obtain a first corpus set; the first corpus set is used to train a speech recognition model to obtain a first speech recognition model; each corpus segment in the first corpus set is decoded by the first speech recognition model to obtain the first word sequence corresponding to each corpus segment; each first word sequence is compared with its corresponding standard word sequence to compute the first word recognition rate of each corpus segment; whether the first word recognition rate of each corpus segment satisfies a preset first-word-recognition-rate condition is judged; and the corpus segments whose first word recognition rate satisfies the preset condition are stored to form a screened second corpus set.

Description

Corpus screening method, apparatus and computer device for speech recognition training
This application claims priority to Chinese Patent Application No. 201910372331.0, entitled "Corpus screening method, apparatus and computer device for speech recognition training", filed with the China National Intellectual Property Administration on May 6, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of speech recognition, and in particular to a corpus screening method, apparatus, computer device and computer-readable storage medium for speech recognition training.
Background
A good speech recognition model depends on well-labeled training corpora, but the labeling accuracy of corpora gathered through various channels usually cannot be guaranteed. If collected corpora are used directly to train a speech recognition model, the incorrectly labeled corpora are not only useless for training but also reduce the accuracy of the model.
Summary
Embodiments of this application provide a corpus screening method, apparatus, computer device and computer-readable storage medium for speech recognition training, which can solve the problem in conventional technology that inaccurate corpora lead to low accuracy of the speech recognition model.
In a first aspect, an embodiment of this application provides a corpus screening method for speech recognition training, the method including: annotating a corpus with timestamps to obtain multiple corpus segments, and forming the multiple corpus segments into a first corpus set; training a speech recognition model with the first corpus set to obtain a first speech recognition model; recognizing each corpus segment in the first corpus set through the first speech recognition model to obtain the first word sequence corresponding to each corpus segment; comparing each first word sequence with its corresponding standard word sequence to compute the first word recognition rate of each corpus segment, the first word recognition rate being a word error rate or a word correct rate; judging whether the first word recognition rate of each corpus segment satisfies a preset first-word-recognition-rate condition; and storing the corpus segments whose first word recognition rate satisfies the preset condition to form a screened second corpus set.
In a second aspect, an embodiment of this application further provides a corpus screening apparatus for speech recognition training, including: an annotation unit for annotating a corpus with timestamps to obtain multiple corpus segments and forming them into a first corpus set; a first training unit for training a speech recognition model with the first corpus set to obtain a first speech recognition model; a first decoding unit for recognizing each corpus segment in the first corpus set through the first speech recognition model to obtain the first word sequence corresponding to each corpus segment; a first statistical unit for comparing each first word sequence with its corresponding standard word sequence to compute the first word recognition rate of each corpus segment, the first word recognition rate being a word error rate or a word correct rate; a first judgment unit for judging whether the first word recognition rate of each corpus segment satisfies a preset first-word-recognition-rate condition; and a first screening unit for storing the corpus segments whose first word recognition rate satisfies the preset condition to form a screened second corpus set.
In a third aspect, an embodiment of this application further provides a computer device including a memory and a processor, the memory storing a computer program, the processor implementing the corpus screening method for speech recognition training when executing the computer program.
In a fourth aspect, an embodiment of this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the corpus screening method for speech recognition training.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of this application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario of the corpus screening method for speech recognition training provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of the corpus screening method for speech recognition training provided by an embodiment of this application;
FIG. 3 is a schematic diagram of timestamp annotation of a corpus in the corpus screening for speech recognition training provided by an embodiment of this application;
FIG. 4 is a flowchart of the speech recognition principle in the corpus screening method for speech recognition training provided by an embodiment of this application;
FIG. 5 is a schematic diagram of sound encoding in the corpus screening method for speech recognition training provided by an embodiment of this application;
FIG. 6 is a schematic block diagram of the corpus screening apparatus for speech recognition training provided by an embodiment of this application;
FIG. 7 is another schematic block diagram of the corpus screening apparatus for speech recognition training provided by an embodiment of this application; and
FIG. 8 is a schematic block diagram of a computer device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.
Referring to FIG. 1, FIG. 1 is a schematic diagram of an application scenario of the corpus screening method for speech recognition training provided by an embodiment of this application. The application scenario includes:
(1) A terminal, also called a front end, through which corpora for training the speech recognition model are collected. The terminal may be an electronic device such as a laptop, a smart watch, a tablet or a desktop computer; the terminal in FIG. 1 is connected to the server.
(2) A server, which performs speech recognition. The server may be a single server, a server cluster or a cloud server; a server cluster may further include a master server and slave servers.
Continuing with FIG. 1, as shown in the figure, the embodiments of this application mainly take the server executing the steps of the corpus screening method for speech recognition training as an example to explain the technical solution. The entities in FIG. 1 work as follows: the terminal collects corpora for training the speech recognition model and sends them to the server so that the server screens them; the server annotates the corpus with timestamps to obtain multiple corpus segments and forms them into a first corpus set, trains the speech recognition model with the first corpus set to obtain a first speech recognition model, recognizes each corpus segment in the first corpus set through the first speech recognition model to obtain the first word sequence corresponding to each segment, compares each first word sequence with its corresponding standard word sequence to compute the first word recognition rate of each segment (the first word recognition rate being a word error rate or a word correct rate), judges whether the first word recognition rate of each segment satisfies a preset first-word-recognition-rate condition, and stores the segments whose first word recognition rate satisfies the preset condition to form a screened second corpus set; the second corpus set is then used to train the speech recognition model to improve training accuracy.
It should be noted that the corpus screening method for speech recognition training in the embodiments of this application can be applied to the terminal or to the server, as long as the training corpus is processed before the server recognizes speech. The application environment of the method is not limited to that shown in FIG. 1; the corpus screening method and speech recognition may also be applied together in a computer device such as a terminal, as long as screening is performed before the computer device recognizes speech. The above application scenario is only used to explain the technical solution of this application and is not intended to limit it; other connection relationships are also possible.
FIG. 2 is a schematic flowchart of the corpus screening method for speech recognition training provided by an embodiment of this application. The method is applied to the server in FIG. 1 to complete all or part of the functions of the corpus screening method.
Referring to FIG. 2, as shown in the figure, the method includes the following steps S210-S270:
S210. Annotate a corpus with timestamps to obtain multiple corpus segments, and form the multiple corpus segments into a first corpus set.
A corpus segment, also called a segment (English: Segment), is an annotated section of a corpus obtained by annotating the corpus with timestamps; each annotated section is one Segment. A corpus for training a speech recognition model generally includes speech and the text corresponding to that speech; the word sequence recognized by the model is compared with the corresponding text to judge the model's recognition accuracy. Annotating a corpus, also called labeling a corpus, means matching speech with the text it expresses. Normally a piece of text corresponds to a standard pronunciation, but in actual speech recognition, because every speaker pronounces differently and/or backgrounds vary, different speakers produce different speech for the same text, so text and speech do not always match exactly. For example, when different people speak the same text, differences in pronunciation or background noise produce different speech, and recognition may then output different text for what was originally the same text. To train a speech recognition model with a good recognition effect, corpora in which speech and text match as exactly as possible should be used. The accuracy of the match between the speech in a corpus and the text it expresses is called the labeling accuracy of that corpus.
A timestamp (English: Timestamp) is a complete, verifiable piece of data that can prove a piece of data already existed before a specific time; it is usually a character sequence that uniquely identifies a moment in time.
Specifically, a corpus for training a speech recognition model generally contains speech and its corresponding text, and is usually called an annotated or labeled corpus. For recording convenience, one labeling approach is to annotate a long speech recording with timestamps; each annotated section is one Segment, so the long training corpus is divided into multiple annotated sections, each corresponding to one corpus segment containing speech and its textual description. Multiple corpus segments divided according to the timestamps are thus obtained and formed into the first corpus set of speech recognition training corpora. Referring to FIG. 3, FIG. 3 is a schematic diagram of timestamp annotation of a corpus: as shown, corpus L is annotated into five sections by five timestamps, i.e., timestamps 1 to 5 divide corpus L into five corpus segments, which form the first corpus set.
Further, the methods of timestamping video and audio are respectively:
1) Video timestamp:
pts = inc++ * (1000 / fps);
where pts is the display time; inc is a static counter with an initial value of 0, incremented by 1 after each timestamp; and fps (Frames Per Second) is the frame rate.
FFMpeg (Fast Forward Mpeg) is an open-source computer program that can record and convert digital audio and video and turn them into streams.
2) Audio timestamp:
pts = inc++ * (frame_size * 1000 / sample_rate);
where pts is the display time; inc is a static counter with an initial value of 0, incremented by 1 after each timestamp; frame_size is the number of samples in one audio frame; and sample_rate is the sampling rate (also called the sample rate or sampling speed), i.e., the number of amplitude samples taken from the sound wave per second when digitizing an analog sound waveform.
The current Unix timestamp (also called a Unix timestamp) can be obtained in different programming languages; for example, Java uses the time method, and JavaScript uses:
Math.round(new Date().getTime() / 1000), where getTime() returns a value in milliseconds.
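The two pts formulas above can be sketched as plain functions; the concrete fps, frame_size and sample_rate values below are illustrative, not taken from the patent:

```python
def video_pts(inc, fps):
    # Display timestamp (in ms) of the inc-th video frame
    # at fps frames per second: pts = inc * (1000 / fps).
    return inc * (1000 // fps)

def audio_pts(inc, frame_size, sample_rate):
    # Display timestamp (in ms) of the inc-th audio frame, where each
    # frame holds frame_size samples taken at sample_rate samples/second.
    return inc * (frame_size * 1000 // sample_rate)

# At 25 fps, frame 50 is displayed at the 2-second mark.
print(video_pts(50, 25))             # 2000
# With 1024-sample frames at 16 kHz, frame 125 starts at 8000 ms.
print(audio_pts(125, 1024, 16000))   # 8000
```

Integer division is used here for clarity; a real muxer would carry the fractional remainder to avoid drift.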
S220. Train a speech recognition model with the first corpus set to obtain a first speech recognition model.
Speech recognition (English: Automatic Speech Recognition, generally abbreviated ASR) is the process of converting sound into text.
Specifically, a speech recognition system includes a speech recognition model together with other related components that provide service support for it. All the original timestamp-annotated corpus segments in the first corpus set are used to train the speech recognition model; that is, the Segment-labeled training corpus is input to the model. After obtaining the Segment-labeled training corpus, the model converts it from an analog signal into a digitized speech signal through sampling and quantization, completing feature extraction of the training corpus and realizing the encoding step of speech recognition; the model then converts the obtained digitized speech signal back into an analog speech signal, realizing the decoding step. By converting the training corpus from analog to digital and back to analog, speech recognition is performed and the current speech recognition model ASR is obtained; comparing the original analog signal of the Segment-labeled training corpus with the analog signal converted by the model indicates the model's accuracy in recognizing the sentences. Training a speech recognition model is the process by which the model automatically adjusts its parameters according to the training corpus: the model adapts its parameters to match the data, so every training run with different corpora changes the model's parameters. For example, in acoustic modeling for speech recognition based on deep neural networks (DNN), different network structures and optimization strategies can greatly improve the acoustic model's performance, and the model can be trained with the training corpus through supervised learning; each change of training corpus changes the model's parameters. Therefore, in the embodiments of this application, during iterative training, each round of corpus screening adjusts the model's parameters and thereby optimizes the model.
Further, referring to FIG. 4, FIG. 4 is a flowchart of the speech recognition principle in the corpus screening method for speech recognition training provided by an embodiment of this application. Each training run of the speech recognition model goes through the following process, during which differences in the training corpus change the model's parameters, adjusting and optimizing the model to improve its recognition accuracy.
As shown in FIG. 4, the speech recognition principle includes the following steps: 1) speech input: obtaining speech, e.g. the collected training speech corpus; 2) encoding: encoding the input speech to extract features, e.g. feature extraction from the speech corpus; 3) decoding: decoding the extracted speech features with an acoustic model and a language model, the acoustic model being trained on training data 1 and the language model on training data 2 until they meet requirements; speech recognition converts speech waveforms into text, and given training data for the target speech a statistical recognition model can be trained; 4) text output: converting the speech features decoded by the acoustic and language models into text, e.g. converting the training speech corpus into text, thus realizing speech-to-text recognition.
An acoustic model (English: Acoustic model) in current mainstream systems is mostly built with hidden Markov models.
A language model is an abstract mathematical model of language built on objective linguistic facts; it is a correspondence relationship. The relationship between a language model and objective linguistic facts is like that between an abstract line and a concrete line in mathematics.
Sound encoding is the process of converting an analog speech signal into a digitized speech signal; converting a continuous analog sound signal into a digital signal is called audio digitization. Referring to FIG. 5, FIG. 5 is a schematic diagram of sound encoding in the corpus screening method for speech recognition training; as shown, it generally requires three steps: sampling, quantization and encoding.
Sound decoding is the process of converting a digitized speech signal back into an analog speech signal; in recognition, decoding is the process of finding the most likely corresponding word sequence given the acoustic features.
S230. Recognize each corpus segment in the first corpus set through the first speech recognition model to obtain the first word sequence corresponding to each corpus segment.
Specifically, after the speech recognition model is trained with the first corpus set to obtain the first speech recognition model, each corpus segment in the first corpus set is recognized by the first speech recognition model; that is, given the acoustic features of each segment, the most likely corresponding word sequence is found to obtain the first word sequence of each segment.
S240. Compare each first word sequence with its corresponding standard word sequence to compute the first word recognition rate of each corpus segment, the first word recognition rate being a word error rate or a word correct rate.
The word recognition rate is the proportion of correctly or incorrectly recognized words in a corpus segment relative to the total number of words in the segment's standard word sequence; it includes the word error rate and the word correct rate.
The word error rate (English: Word Error Rate, abbreviated WER) is the proportion of incorrectly recognized words in a segment relative to the total number of words in the segment's standard word sequence.
The word correct rate is the proportion of correctly recognized words in a segment relative to the total number of words in the segment's standard word sequence.
Specifically, each first word sequence is compared with its corresponding standard word sequence to compute the first word error rate or first word correct rate of each corpus segment.
Further, in one embodiment, the step of comparing each first word sequence with its corresponding standard word sequence to compute the first word error rate of each segment includes: comparing the corresponding words of each first word sequence and its standard word sequence one by one in sequence order to obtain the insertion words, substitution words and deletion words needed to adjust the first word sequence into the standard word sequence; and calculating the ratio of the total number of insertion, substitution and deletion words to the number of words in the standard word sequence to obtain the first word error rate.
Specifically, to make the recognized word sequence consistent with the standard word sequence, certain words must be substituted, deleted or inserted; the total number of these inserted, substituted or deleted words, divided by the total number of words in the standard word sequence and expressed as a percentage, is the WER.
The formulas are:
WER = (S + D + I) / N × 100%   (1)
Accuracy = 100% - WER   (2)
where S (Substitution) is the number of words that must be substituted to make the recognized word sequence consistent with the standard word sequence; D (Deletion) is the number of words that must be deleted; I (Insertion) is the number of words that must be inserted; N (Number) is the number of words in the standard word sequence; and Accuracy is the correct rate, i.e. the proportion of words correctly recognized.
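Formulas (1) and (2) can be checked with a minimal sketch that computes the word-level edit distance (the sum S + D + I) by standard dynamic programming; the example sentences are illustrative only:

```python
def word_error_rate(reference, hypothesis):
    # WER = (S + D + I) / N, as in formula (1); Accuracy = 1 - WER (formula (2)).
    # The minimal number of substitutions, deletions and insertions is the
    # word-level edit (Levenshtein) distance between the two sequences.
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # i deletions
    for j in range(m + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / n

# One substitution (sat -> sit) and one deletion (the): 2 errors over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(wer)        # roughly 0.333
print(1 - wer)    # the word correct rate (Accuracy)
```

The fraction is returned directly; multiply by 100 for the percentage form used in formula (1).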
S250. Judge whether the first word recognition rate of each corpus segment satisfies a preset first-word-recognition-rate condition.
S260. Store the corpus segments whose first word recognition rate satisfies the preset condition to form a screened second corpus set.
S270. Filter out the corpus segments whose first word recognition rate does not satisfy the preset condition.
The preset first-word-recognition-rate condition is a condition defined by a preset word recognition rate threshold. For example, if the first word recognition rate is a first word error rate, the preset condition is being less than or equal to a first preset word error rate threshold; if the first word recognition rate is a first word correct rate, the preset condition is being greater than or equal to a first preset word correct rate threshold.
Specifically, a preset word recognition rate threshold is set to filter the corpus segments, so that training corpus segments that do not meet the labeling-accuracy requirement are filtered out and segments that meet it are screened out, yielding an effective training corpus. By judging whether the first word recognition rate of each segment satisfies the preset condition, it can be determined whether the segment is recognized accurately, and hence whether its labeling is accurate, that is, whether it is a high-quality segment. If the first word recognition rate satisfies the preset condition, i.e. the segment meets the labeling-accuracy requirement, the segment is retained and stored into the screened second corpus set as an effective sentence finally obtained by screening, and the screened effective sentences are further used to train the speech recognition model. If the first word recognition rate does not satisfy the preset condition, i.e. the segment does not meet the labeling-accuracy requirement, the segment is filtered out and eliminated, completing the screening of the training corpus for the speech recognition model.
When training the speech recognition model, the embodiments of this application first screen the training corpus: the corpus is annotated with timestamps to obtain multiple corpus segments, which are formed into a first corpus set; the first corpus set is used to train a speech recognition model to obtain a first speech recognition model; each segment in the first corpus set is recognized through the first model to obtain its first word sequence; each first word sequence is compared with its standard word sequence to compute the first word recognition rate (a word error rate or word correct rate) of each segment; whether each segment's first word recognition rate satisfies the preset condition is judged; and the segments that satisfy it are stored to form a screened second corpus set. Through this screening process, training corpora with high labeling accuracy can be effectively screened out, and training the speech recognition model with these effective corpora improves the accuracy of the speech recognition training system.
In one embodiment, after the step of storing the corpus segments whose first word recognition rate satisfies the preset condition to form a screened second corpus set, the method further includes:
training the first speech recognition model with the second corpus set to obtain a second speech recognition model; recognizing each corpus segment in the second corpus set through the second speech recognition model to obtain the second word sequence of each segment; comparing each second word sequence with its corresponding standard word sequence to compute the second word recognition rate (a word error rate or word correct rate) of each segment; judging whether each segment's second word recognition rate satisfies a preset second-word-recognition-rate condition; storing the segments whose second word recognition rate satisfies the preset condition to form a screened third corpus set; and iterating the above steps until all corpus segments satisfying the preset word recognition rate condition are obtained, forming the screened corpus set.
Specifically, the segments that survive the first screening are used to retrain the speech recognition model: the second corpus set trains the first model to obtain a second model; the second model recognizes each segment in the second corpus set to obtain its second word sequence; each second word sequence is compared with its standard word sequence to compute the second word recognition rate of each segment; whether it satisfies the preset second condition is judged; the satisfying segments are stored to form a screened third corpus set; and these steps are iterated until all segments satisfying the preset condition are obtained, finally yielding a corpus that meets the requirements. For example, if the required WER threshold is below 5%, corpora with WER below 5% are screened out; this effectively screens Segment-labeled corpora and yields training corpora whose labeling accuracy meets the requirement, improving training accuracy. The iterative corpus screening method provided by the embodiments of this application alternates speech recognition training and decoding to screen the corpus, retrains the model with the screened corpus, and iterates repeatedly to finally obtain a screened corpus with high accuracy.
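The iterative screen-and-retrain loop described above can be sketched as follows; `train`, `decode` and `wer` are stand-ins for the real training, decoding and scoring components, not functions defined in the patent:

```python
def iterative_screen(corpus, train, decode, wer, threshold, max_rounds=10):
    # corpus: list of (audio, reference_text) segments.
    # Each round: train a model on the surviving segments, decode them,
    # and keep only the segments whose WER is at or below the threshold.
    # Stops once a round removes nothing (or max_rounds is reached).
    for _ in range(max_rounds):
        model = train(corpus)
        kept = [seg for seg in corpus
                if wer(seg[1], decode(model, seg[0])) <= threshold]
        if len(kept) == len(corpus):
            break
        corpus = kept
    return corpus
```

With toy stubs, e.g. `decode=lambda m, audio: audio` and an exact-match `wer`, the loop keeps exactly the segments whose "audio" equals their reference text.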
In one embodiment, the first word recognition rate is a first word error rate. The step of judging whether the first word recognition rate of each corpus segment satisfies the preset condition includes: judging whether the first word error rate of each segment is less than or equal to a first preset word error rate threshold. The step of storing the satisfying segments to form a screened second corpus set includes: storing the segments whose first word error rate is less than or equal to the first preset word error rate threshold to form the screened second corpus set.
Specifically, the first word recognition rate is a first word error rate: whether each segment's first word error rate is less than or equal to the first preset word error rate threshold is judged, and the segments whose first word error rate is at or below the threshold are stored to form the screened second corpus set; if a segment's first word error rate exceeds the threshold, the segment is filtered out to eliminate unqualified segments. The calculation can follow formula (1) in the first embodiment. A WER threshold is set to filter the Segments; for example, with a WER threshold of 25%, segments whose word error rate exceeds 25% are filtered out and the training corpora with word error rate at or below 25% are kept, yielding a corpus that meets the requirement.
In one embodiment, the first word recognition rate is a first word correct rate. The step of judging whether the first word recognition rate of each corpus segment satisfies the preset condition includes: judging whether the first word correct rate of each segment is greater than or equal to a first preset word correct rate threshold. The step of storing the satisfying segments to form a screened second corpus set includes: storing the segments whose first word correct rate is greater than or equal to the first preset word correct rate threshold to form the screened second corpus set.
Specifically, besides filtering out unqualified segments according to the word error rate of the recognized words, qualified segments can also be screened directly according to the word correct rate of the recognized words: the first word recognition rate is the first word correct rate, and whether each segment's first word correct rate is greater than or equal to the first preset threshold is judged. If a segment's first word correct rate is at or above the threshold, the segment is retained and stored to form the screened second corpus set; if it is below the threshold, the segment is filtered out to eliminate unqualified segments, so that effective segments meeting the requirements are screened out as the final training corpus.
In one embodiment, the step of comparing each first word sequence with its corresponding standard word sequence to compute the first word correct rate of each corpus segment includes:
comparing the corresponding words of each first word sequence and its standard word sequence one by one in sequence order to obtain the matching words by which the first word sequence is aligned to the standard word sequence; and calculating the ratio of the number of matching words to the number of words in the standard word sequence to obtain the first word correct rate.
Or, in another embodiment, the step includes: comparing the corresponding words of each first word sequence and its standard word sequence one by one in sequence order to obtain the insertion words, substitution words and deletion words needed to adjust the first word sequence into the standard word sequence; calculating the ratio of their total number to the number of words in the standard word sequence to obtain the first word error rate; and obtaining the first word correct rate of the segment from the first word error rate.
Specifically, the first word correct rate of each corpus segment can be computed in two ways:
(1) Direct computation: the corresponding words of each first word sequence and its standard word sequence are compared one by one in sequence order to obtain the matching words, i.e. the correctly (accurately) recognized words, and the ratio of the number of matching words to the number of words in the standard word sequence gives the first word correct rate.
(2) Indirect computation: the word error rate is computed first and the word correct rate is derived from it. The comparison yields the insertion, substitution and deletion words needed to adjust the first word sequence into the standard word sequence; the ratio of their total number to the number of words in the standard word sequence gives the first word error rate, from which the first word correct rate of the segment is obtained. The calculation can follow formulas (1) and (2) in the first embodiment.
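The direct method (1) can be approximated with Python's standard `difflib.SequenceMatcher` to count matching words; this is one simple alignment strategy, not the exact algorithm specified by the patent:

```python
from difflib import SequenceMatcher

def word_correct_rate(reference, hypothesis):
    # Direct method: count the words of the hypothesis that align with
    # the reference word sequence (the matching words) and divide by
    # the number of words in the reference (standard) sequence.
    ref, hyp = reference.split(), hypothesis.split()
    matched = sum(block.size for block in
                  SequenceMatcher(None, ref, hyp).get_matching_blocks())
    return matched / len(ref)

# 4 of the 6 reference words are matched ("the cat", "on", "mat").
print(word_correct_rate("the cat sat on the mat", "the cat sit on mat"))
```

For this example the result agrees with the indirect method, 1 - WER = 1 - 2/6; the two approaches can disagree on pathological inputs because `SequenceMatcher` does not guarantee a minimal-edit alignment.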
In one embodiment, before the step of annotating the corpus with timestamps to obtain multiple corpus segments and forming them into a first corpus set, the method further includes: acquiring multiple corpus sections carrying preset order identifiers, the corpus sections being obtained by cutting the corpus into pieces of a preset size.
The step of annotating the corpus with timestamps to obtain multiple corpus segments and forming them into a first corpus set includes: using a distributed system to timestamp-annotate each corpus section in parallel to obtain a first corpus set composed of multiple corpus segments that are divided according to the timestamps and carry the preset order identifiers.
A preset order identifier is an identifier describing the position of a corpus section within the entire long speech corpus, such as a sequence number A, B, C or 1, 2, 3.
Specifically, for a long speech corpus, screening it as a single audio file is inefficient because the file is too large; the corpus can be cut into multiple sections of a preset size, each carrying a preset order identifier that records its position in the long corpus so the section can be located later. The corpus is cut into multiple sections carrying preset order identifiers, each section is timestamp-annotated in parallel by a distributed system to obtain the corpus segments of each section divided according to the timestamps and carrying the identifiers, the segments of all sections are combined into a first corpus set, and the segments in the first corpus set are then screened. Different cutting methods can be used in different programming languages: for example, the string-cutting function Split in C, or the CUT method in JAVA.
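Cutting a long corpus into fixed-size sections that carry sequential order identifiers can be sketched as follows; the corpus is modeled as a plain list of samples, while real audio handling would involve container formats and codecs:

```python
def cut_corpus(samples, section_size):
    # Cut a long corpus (here, a list of audio samples) into fixed-size
    # sections, each paired with a sequential order identifier so that
    # its position in the original recording is preserved.
    return [(idx, samples[start:start + section_size])
            for idx, start in enumerate(range(0, len(samples), section_size))]

sections = cut_corpus(list(range(10)), 4)
print([s[0] for s in sections])   # order identifiers: [0, 1, 2]
print(sections[2])                # the last section may be shorter: (2, [8, 9])
```

Each `(identifier, section)` pair can then be sent to a worker of the distributed system for parallel timestamp annotation.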
Further, before cutting the corpus, voice activity detection can be used to remove silent-period signals from it. Voice activity detection (English: Voice Activity Detection, abbreviated VAD) can identify and remove long silent periods from a sound signal stream; introducing VAD to remove silent-period signals from the corpus further improves its accuracy and hence its quality, and the higher-quality effective corpus in turn further improves the accuracy of the trained speech recognition model.
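A minimal energy-based VAD sketch illustrating the idea of dropping silent frames; the frame length and threshold below are illustrative, and production VADs are considerably more sophisticated:

```python
def remove_silence(samples, frame_len, energy_threshold):
    # Minimal energy-based VAD sketch: split the signal into frames and
    # drop every frame whose mean squared amplitude falls below the
    # threshold, treating such frames as silence.
    voiced = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= energy_threshold:
            voiced.extend(frame)
    return voiced

signal = [0.0, 0.0, 0.0, 0.0, 0.5, -0.4, 0.6, -0.5]  # silence, then speech
print(remove_silence(signal, 4, 0.01))  # keeps only the voiced frame
```

Real systems (e.g. the VAD shipped with WebRTC) additionally use spectral features and hangover smoothing so that short pauses inside speech are not clipped.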
It should be noted that, in the corpus screening method for speech recognition training described in the above embodiments, the technical features contained in different embodiments can be recombined as needed to obtain combined implementations, all of which fall within the scope of protection claimed by this application.
请参阅图6,图6为本申请实施例提供的用于语音识别训练的语料筛选装置的示意性框图。对应于上述用于语音识别训练的语料筛选方法,本申请实施例还提供一种用于语音识别训练的语料筛选装置。如图6所示,该用于语音识别训练的语料筛选装置包括用于执行上述用于语音识别训练的语料筛选方法的单元,该装置可以被配置于服务器等计算机设备中。具体地,请参阅图6,该用于语音识别训练的语料筛选装置600包括标注单元601、第一训练单元602、第一解码单元603、第一统计单元604、第一判断单元605及第一筛选单元606。
其中,标注单元601用于对语料进行分时间戳标注以得到多段语料片段,并将多段所述语料片段组成第一语料集;第一训练单元602用于使用所述第一语料集对语音识别模型进行训练以得到第一语音识别模型;第一解码单元603用于通过所述第一语音识别模型对所述第一语料集中的每个所述语料片段进行识别以得到每个所述语料片段对应的第一词序列;第一统计单元604用于将每个所述第一词序列和每个所述第一词序列对应的标准词序列进行比对以统计每个所述语料片段的第一词识别率,所述第一词识别率包括词错误率或者词正确率;第一判断单元605用于对每个所述语料片段的所述第一词识别率是否满足第一词识别率预设条件进行判断;第一筛选单元606用于将满足所述第一词识别率预设条件的所述第一词识别率所对应的所述语料片段进行存储以形成筛选后的第二语料集。
请参阅图7，图7为本申请实施例提供的用于语音识别训练的语料筛选装置的另一个示意性框图。如图7所示，在该实施例中，所述用于语音识别训练的语料筛选装置600还包括：第二训练单元607，用于使用所述第二语料集对所述第一语音识别模型进行训练以得到第二语音识别模型；第二解码单元608，用于通过所述第二语音识别模型对所述第二语料集中的每个所述语料片段进行识别以得到每个所述语料片段的第二词序列；第二统计单元609，用于将每个所述第二词序列和每个所述第二词序列对应的标准词序列进行比对以统计每个所述语料片段的第二词识别率，所述第二词识别率包括词错误率或者词正确率；第二判断单元610，用于对每个所述语料片段的所述第二词识别率是否满足第二词识别率预设条件进行判断；第二筛选单元611，用于将满足所述第二词识别率预设条件的所述第二词识别率所对应的所述语料片段进行存储以形成筛选后的第三语料集；迭代单元612，用于迭代上述步骤直至得到满足预设词识别率预设条件的所有所述语料片段以形成筛选后的语料集。
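上述"训练—解码—统计—筛选"的迭代流程可以用如下Python草图示意。其中train、decode、recog_rate、meets为外部提供的训练、解码、词识别率统计与预设条件判断函数，接口形式均为示例假设：

```python
def iterative_filter(corpus, train, decode, recog_rate, meets, max_rounds=5):
    """迭代筛选流程草图: 每轮用当前语料集训练模型, 对每个语料片段解码并
    统计词识别率, 仅保留满足预设条件的片段, 直至全部片段满足条件或达到轮数上限。"""
    model = None
    for _ in range(max_rounds):
        model = train(model, corpus)  # 用当前语料集(初始为第一语料集)训练
        kept = [seg for seg in corpus
                if meets(recog_rate(decode(model, seg), seg))]
        if len(kept) == len(corpus):
            break  # 所有语料片段均满足词识别率预设条件, 迭代结束
        corpus = kept  # 筛选后的语料集进入下一轮训练
    return corpus, model
```

下面的用法示例中，用桩函数模拟训练与解码，并直接把每个片段当作其词错误率，仅用于演示迭代收敛的控制流。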
在一个实施例中,所述第一词识别率为第一词错误率。所述第一判断单元605,用于对每个所述语料片段的所述第一词错误率是否小于或者等于第一预设词错误率阈值进行判断。所述第一筛选单元606,用于将满足所述第一词错误率小于或者等于所述第一预设词错误率阈值的所述第一词错误率所对应的所述语料片段进行存储以形成筛选后的第二语料集。
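该判断与筛选逻辑可示意如下，阈值0.3仅为示例假设，实际取值由第一预设词错误率阈值决定：

```python
def filter_by_wer(segments, max_wer=0.3):
    """保留第一词错误率小于或等于预设词错误率阈值的语料片段, 形成第二语料集。
    segments为 (语料片段标识, 第一词错误率) 的列表。"""
    return [seg_id for seg_id, wer in segments if wer <= max_wer]
```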
在一个实施例中,所述第一统计单元604包括:第一比对子单元,用于将每个所述第一词序列和每个所述第一词序列对应的标准词序列中对应的词按照词序列的顺序逐个比对以得到所述第一词序列调整成所述标准词序列的插入词、替换词及删除词;计算子单元,用于计算所述插入词、所述替换词及所述删除词的数量之和与所述标准词序列中词数量的比值以得到所述第一词错误率。
在一个实施例中,所述第一词识别率为第一词正确率。所述第一判断单元605,用于对每个所述语料片段的所述第一词正确率是否大于或者等于第一预设词正确率阈值进行判断。所述第一筛选单元606,用于将满足所述第一词正确率大于或者等于所述第一预设词正确率阈值的所述第一词正确率所对应的所述语料片段进行存储以形成筛选后的第二语料集。
在一个实施例中，所述第一统计单元604包括：第二比对子单元，用于将每个所述第一词序列和每个所述第一词序列对应的标准词序列中对应的词按照词序列的顺序逐个比对以得到所述第一词序列调整成所述标准词序列的匹配词；第二计算子单元，用于计算所述匹配词的数量与所述标准词序列中词数量的比值以得到第一词正确率。
在一个实施例中,所述第一统计单元604包括:第三比对子单元,用于将每个所述第一词序列和每个所述第一词序列对应的标准词序列中对应的词按照词序列的顺序逐个比对以得到所述第一词序列调整成所述标准词序列的插入词、替换词及删除词;第三计算子单元,用于计算所述插入词、所述替换词及所述删除词的数量之和与所述标准词序列中词数量的比值以得到第一词错误率;获得子单元,用于根据所述第一词错误率获得对应所述语料片段的第一词正确率。
请继续参阅图7,如图7所示,在该实施例中,所述用于语音识别训练的语料筛选装置600还包括:获取单元613,用于获取多个携带有预设顺序标识的语料段,所述语料段由语料按照预设大小进行切割获取。所述标注单元601,用于采用分布式系统通过并行方式对每个所述语料段分别进行分时间戳标注以得到根据所述时间戳进行分段并且携带有所述预设顺序标识的多段语料片段组成的第一语料集。
需要说明的是,所属领域的技术人员可以清楚地了解到,上述用于语音识别训练的语料筛选装置和各单元的具体实现过程,可以参考前述方法实施例中的相应描述,为了描述的方便和简洁,在此不再赘述。
同时,上述用于语音识别训练的语料筛选装置中各个单元的划分和连接方式仅用于举例说明,在其他实施例中,可将用于语音识别训练的语料筛选装置按照需要划分为不同的单元,也可将用于语音识别训练的语料筛选装置中各单元采取不同的连接顺序和方式,以完成上述用于语音识别训练的语料筛选装置的全部或部分功能。
上述用于语音识别训练的语料筛选装置可以实现为一种计算机程序的形式,该计算机程序可以在如图8所示的计算机设备上运行。
请参阅图8，图8是本申请实施例提供的一种计算机设备的示意性框图。该计算机设备800可以是台式电脑或者服务器等计算机设备，也可以是其他设备中的组件或者部件。
参阅图8,该计算机设备800包括通过系统总线801连接的处理器802、存储器和网络接口805,其中,存储器可以包括非易失性存储介质803和内存储器804。
该非易失性存储介质803可存储操作系统8031和计算机程序8032。该计算机程序8032被执行时,可使得处理器802执行一种上述用于语音识别训练的语料筛选方法。
该处理器802用于提供计算和控制能力,以支撑整个计算机设备800的运行。
该内存储器804为非易失性存储介质803中的计算机程序8032的运行提供环境,该计算机程序8032被处理器802执行时,可使得处理器802执行一种上述用于语音识别训练的语料筛选方法。
该网络接口805用于与其它设备进行网络通信。本领域技术人员可以理解,图8中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备800的限定,具体的计算机设备800可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。例如,在一些实施例中,计算机设备可以仅包括存储器及处理器,在这样的实施例中,存储器及处理器的结构及功能与图8所示实施例一致,在此不再赘述。
其中,所述处理器802用于运行存储在存储器中的计算机程序8032,以实现本申请实施例的用于语音识别训练的语料筛选方法。
应当理解,在本申请实施例中,处理器802可以是中央处理单元(Central Processing Unit,CPU),该处理器802还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。其中,通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
本领域普通技术人员可以理解的是实现上述实施例的方法中的全部或部分流程,是可以通过计算机程序来完成,该计算机程序可存储于一计算机可读存储介质。该计算机程序被该计算机系统中的至少一个处理器执行,以实现上述用于语音识别训练的语料筛选方法的实施例的步骤。
因此，本申请实施例还提供一种计算机可读存储介质。该计算机可读存储介质可以为非易失性的计算机可读存储介质，该计算机可读存储介质存储有计算机程序，该计算机程序被处理器执行时使处理器执行上述用于语音识别训练的语料筛选方法实施例中的步骤。
一种计算机程序产品,当其在计算机上运行时,使得计算机执行以上各实施例中所描述的用于语音识别训练的语料筛选方法的步骤。
所述计算机可读存储介质可以是前述设备的内部存储单元,例如设备的硬盘或内存。所述计算机可读存储介质也可以是所述设备的外部存储设备,例如所述设备上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,所述计算机可读存储介质还可以既包括所述设备的内部存储单元也包括外部存储设备。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的设备、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
所述存储介质为实体的、非瞬时性的存储介质,例如可以是U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、磁碟或者光盘等各种可以存储计算机程序的实体存储介质。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
以上所述，仅为本申请的具体实施方式，但本申请的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本申请揭露的技术范围内，可轻易想到各种等效的修改或替换，这些修改或替换都应涵盖在本申请的保护范围之内。因此，本申请的保护范围应以权利要求的保护范围为准。

Claims (20)

  1. 一种用于语音识别训练的语料筛选方法,包括:
    对语料进行分时间戳标注以得到多段语料片段,并将多段所述语料片段组成第一语料集;
    使用所述第一语料集对语音识别模型进行训练以得到第一语音识别模型;
    通过所述第一语音识别模型对所述第一语料集中的每个所述语料片段进行识别以得到每个所述语料片段对应的第一词序列;
    将每个所述第一词序列和每个所述第一词序列对应的标准词序列进行比对以统计每个所述语料片段的第一词识别率,所述第一词识别率包括词错误率或者词正确率;
    对每个所述语料片段的所述第一词识别率是否满足第一词识别率预设条件进行判断;
    将满足所述第一词识别率预设条件的所述第一词识别率所对应的所述语料片段进行存储以形成筛选后的第二语料集。
  2. 根据权利要求1所述用于语音识别训练的语料筛选方法,其中,所述将满足所述第一词识别率预设条件的所述第一词识别率所对应的所述语料片段进行存储以形成筛选后的第二语料集的步骤之后,还包括:
    使用所述第二语料集对所述第一语音识别模型进行训练以得到第二语音识别模型;
    通过所述第二语音识别模型对所述第二语料集中的每个所述语料片段进行识别以得到每个所述语料片段的第二词序列;
    将每个所述第二词序列和每个所述第二词序列对应的标准词序列进行比对以统计每个所述语料片段的第二词识别率,所述第二词识别率包括词错误率或者词正确率;
    对每个所述语料片段的所述第二词识别率是否满足第二词识别率预设条件进行判断;
    将满足所述第二词识别率预设条件的所述第二词识别率所对应的所述语料片段进行存储以形成筛选后的第三语料集;
    迭代上述步骤直至得到满足预设词识别率预设条件的所有所述语料片段以形成筛选后的语料集。
  3. 根据权利要求1所述用于语音识别训练的语料筛选方法,其中,所述第一词识别率为第一词错误率;
    所述对每个所述语料片段的所述第一词识别率是否满足第一词识别率预设条件进行判断的步骤包括:
    对每个所述语料片段的所述第一词错误率是否小于或者等于第一预设词错误率阈值进行判断;
    所述将满足所述第一词识别率预设条件的所述第一词识别率所对应的所述语料片段进行存储以形成筛选后的第二语料集的步骤包括:
    将满足所述第一词错误率小于或者等于所述第一预设词错误率阈值的所述第一词错误率所对应的所述语料片段进行存储以形成筛选后的第二语料集。
  4. 根据权利要求3所述用于语音识别训练的语料筛选方法,其中,所述将每个所述第一词序列和每个所述第一词序列对应的标准词序列进行比对以统计每个所述语料片段的第一词错误率的步骤包括:
    将每个所述第一词序列和每个所述第一词序列对应的标准词序列中对应的词按照词序列的顺序逐个比对以得到所述第一词序列调整成所述标准词序列的插入词、替换词及删除词;
    计算所述插入词、所述替换词及所述删除词的数量之和与所述标准词序列中词数量的比值以得到所述第一词错误率。
  5. 根据权利要求1所述用于语音识别训练的语料筛选方法,其中,所述第一词识别率为第一词正确率;
    所述对每个所述语料片段的所述第一词识别率是否满足第一词识别率预设条件进行判断的步骤包括:
    对每个所述语料片段的所述第一词正确率是否大于或者等于第一预设词正确率阈值进行判断;
    所述将满足所述第一词识别率预设条件的所述第一词识别率所对应的所述语料片段进行存储以形成筛选后的第二语料集的步骤包括:
    将满足所述第一词正确率大于或者等于所述第一预设词正确率阈值的所述第一词正确率所对应的所述语料片段进行存储以形成筛选后的第二语料集。
  6. 根据权利要求5所述用于语音识别训练的语料筛选方法,其中,所述将每个所述第一词序列和每个所述第一词序列对应的标准词序列进行比对以统计每个所述语料片段的第一词正确率的步骤包括:
    将每个所述第一词序列和每个所述第一词序列对应的标准词序列中对应的词按照词序列的顺序逐个比对以得到所述第一词序列调整成所述标准词序列的匹配词;
    计算所述匹配词的数量与所述标准词序列中词数量的比值以得到第一词正确率；
    或者,所述将每个所述第一词序列和每个所述第一词序列对应的标准词序列进行比对以统计每个所述语料片段的第一词正确率的步骤包括:
    将每个所述第一词序列和每个所述第一词序列对应的标准词序列中对应的词按照词序列的顺序逐个比对以得到所述第一词序列调整成所述标准词序列的插入词、替换词及删除词;
    计算所述插入词、所述替换词及所述删除词的数量之和与所述标准词序列中词数量的比值以得到第一词错误率;
    根据所述第一词错误率获得对应所述语料片段的第一词正确率。
  7. 根据权利要求1所述用于语音识别训练的语料筛选方法,其中,所述对语料进行分时间戳标注以得到多段语料片段,并将多段所述语料片段组成第一语料集的步骤之前,还包括:
    获取多个携带有预设顺序标识的语料段,所述语料段由语料按照预设大小进行切割获取;
    所述对语料进行分时间戳标注以得到多段语料片段,并将多段所述语料片段组成第一语料集的步骤包括:
    采用分布式系统通过并行方式对每个所述语料段分别进行分时间戳标注以得到根据所述时间戳进行分段并且携带有所述预设顺序标识的多段语料片段组成的第一语料集。
  8. 根据权利要求1所述用于语音识别训练的语料筛选方法，其中，所述对语料进行分时间戳标注以得到多段语料片段，并将多段所述语料片段组成第一语料集的步骤之前，还包括：
    通过语音活动检测消除所述语料中的静音期信号。
  9. 一种用于语音识别训练的语料筛选装置,包括:
    标注单元,用于对语料进行分时间戳标注以得到多段语料片段,并将多段所述语料片段组成第一语料集;
    第一训练单元,用于使用所述第一语料集对语音识别模型进行训练以得到第一语音识别模型;
    第一解码单元,用于通过所述第一语音识别模型对所述第一语料集中的每个所述语料片段进行识别以得到每个所述语料片段对应的第一词序列;
    第一统计单元,用于将每个所述第一词序列和每个所述第一词序列对应的标准词序列进行比对以统计每个所述语料片段的第一词识别率,所述第一词识别率包括词错误率或者词正确率;
    第一判断单元,用于对每个所述语料片段的所述第一词识别率是否满足第一词识别率预设条件进行判断;
    第一筛选单元,用于将满足所述第一词识别率预设条件的所述第一词识别率所对应的所述语料片段进行存储以形成筛选后的第二语料集。
  10. 根据权利要求9所述用于语音识别训练的语料筛选装置,其中,所述用于语音识别训练的语料筛选装置还包括:
    第二训练单元,用于使用所述第二语料集对所述第一语音识别模型进行训练以得到第二语音识别模型;
    第二解码单元,用于通过所述第二语音识别模型对所述第二语料集中的每个所述语料片段进行识别以得到每个所述语料片段的第二词序列;
    第二统计单元,用于将每个所述第二词序列和每个所述第二词序列对应的标准词序列进行比对以统计每个所述语料片段的第二词识别率,所述第二词识别率包括词错误率或者词正确率;
    第二判断单元,用于对每个所述语料片段的所述第二词识别率是否满足第二词识别率预设条件进行判断;
    第二筛选单元,用于将满足所述第二词识别率预设条件的所述第二词识别率所对应的所述语料片段进行存储以形成筛选后的第三语料集;
    迭代单元,用于迭代上述步骤直至得到满足预设词识别率预设条件的所有所述语料片段以形成筛选后的语料集。
  11. 一种计算机设备,其包括存储器以及与所述存储器相连的处理器;其中,所述存储器用于存储计算机程序;所述处理器用于运行所述存储器中存储的计算机程序时实现如下步骤:
    对语料进行分时间戳标注以得到多段语料片段,并将多段所述语料片段组成第一语料集;
    使用所述第一语料集对语音识别模型进行训练以得到第一语音识别模型;
    通过所述第一语音识别模型对所述第一语料集中的每个所述语料片段进行识别以得到每个所述语料片段对应的第一词序列;
    将每个所述第一词序列和每个所述第一词序列对应的标准词序列进行比对以统计每个所述语料片段的第一词识别率,所述第一词识别率包括词错误率或者词正确率;
    对每个所述语料片段的所述第一词识别率是否满足第一词识别率预设条件进行判断;
    将满足所述第一词识别率预设条件的所述第一词识别率所对应的所述语料片段进行存储以形成筛选后的第二语料集。
  12. 根据权利要求11所述计算机设备,其中,所述将满足所述第一词识别率预设条件的所述第一词识别率所对应的所述语料片段进行存储以形成筛选后的第二语料集的步骤之后,还包括:
    使用所述第二语料集对所述第一语音识别模型进行训练以得到第二语音识别模型;
    通过所述第二语音识别模型对所述第二语料集中的每个所述语料片段进行识别以得到每个所述语料片段的第二词序列;
    将每个所述第二词序列和每个所述第二词序列对应的标准词序列进行比对以统计每个所述语料片段的第二词识别率，所述第二词识别率包括词错误率或者词正确率；
    对每个所述语料片段的所述第二词识别率是否满足第二词识别率预设条件进行判断;
    将满足所述第二词识别率预设条件的所述第二词识别率所对应的所述语料片段进行存储以形成筛选后的第三语料集;
    迭代上述步骤直至得到满足预设词识别率预设条件的所有所述语料片段以形成筛选后的语料集。
  13. 根据权利要求11所述计算机设备,其中,所述第一词识别率为第一词错误率;
    所述对每个所述语料片段的所述第一词识别率是否满足第一词识别率预设条件进行判断的步骤包括:
    对每个所述语料片段的所述第一词错误率是否小于或者等于第一预设词错误率阈值进行判断;
    所述将满足所述第一词识别率预设条件的所述第一词识别率所对应的所述语料片段进行存储以形成筛选后的第二语料集的步骤包括:
    将满足所述第一词错误率小于或者等于所述第一预设词错误率阈值的所述第一词错误率所对应的所述语料片段进行存储以形成筛选后的第二语料集。
  14. 根据权利要求13所述计算机设备,其中,所述将每个所述第一词序列和每个所述第一词序列对应的标准词序列进行比对以统计每个所述语料片段的第一词错误率的步骤包括:
    将每个所述第一词序列和每个所述第一词序列对应的标准词序列中对应的词按照词序列的顺序逐个比对以得到所述第一词序列调整成所述标准词序列的插入词、替换词及删除词;
    计算所述插入词、所述替换词及所述删除词的数量之和与所述标准词序列中词数量的比值以得到所述第一词错误率。
  15. 根据权利要求11所述计算机设备,其中,所述第一词识别率为第一词正确率;
    所述对每个所述语料片段的所述第一词识别率是否满足第一词识别率预设条件进行判断的步骤包括：
    对每个所述语料片段的所述第一词正确率是否大于或者等于第一预设词正确率阈值进行判断;
    所述将满足所述第一词识别率预设条件的所述第一词识别率所对应的所述语料片段进行存储以形成筛选后的第二语料集的步骤包括:
    将满足所述第一词正确率大于或者等于所述第一预设词正确率阈值的所述第一词正确率所对应的所述语料片段进行存储以形成筛选后的第二语料集。
  16. 根据权利要求15所述计算机设备,其中,所述将每个所述第一词序列和每个所述第一词序列对应的标准词序列进行比对以统计每个所述语料片段的第一词正确率的步骤包括:
    将每个所述第一词序列和每个所述第一词序列对应的标准词序列中对应的词按照词序列的顺序逐个比对以得到所述第一词序列调整成所述标准词序列的匹配词;
    计算所述匹配词的数量与所述标准词序列中词数量的比值以得到第一词正确率；
    或者,所述将每个所述第一词序列和每个所述第一词序列对应的标准词序列进行比对以统计每个所述语料片段的第一词正确率的步骤包括:
    将每个所述第一词序列和每个所述第一词序列对应的标准词序列中对应的词按照词序列的顺序逐个比对以得到所述第一词序列调整成所述标准词序列的插入词、替换词及删除词;
    计算所述插入词、所述替换词及所述删除词的数量之和与所述标准词序列中词数量的比值以得到第一词错误率;
    根据所述第一词错误率获得对应所述语料片段的第一词正确率。
  17. 根据权利要求11所述计算机设备,其中,所述对语料进行分时间戳标注以得到多段语料片段,并将多段所述语料片段组成第一语料集的步骤之前,还包括:
    获取多个携带有预设顺序标识的语料段,所述语料段由语料按照预设大小进行切割获取;
    所述对语料进行分时间戳标注以得到多段语料片段，并将多段所述语料片段组成第一语料集的步骤包括：
    采用分布式系统通过并行方式对每个所述语料段分别进行分时间戳标注以得到根据所述时间戳进行分段并且携带有所述预设顺序标识的多段语料片段组成的第一语料集。
  18. 根据权利要求11所述计算机设备,其中,所述对语料进行分时间戳标注以得到多段语料片段,并将多段所述语料片段组成第一语料集的步骤之前,还包括:
    通过语音活动检测消除所述语料中的静音期信号。
  19. 一种计算机可读存储介质,其中,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时使所述处理器执行如下操作:
    对语料进行分时间戳标注以得到多段语料片段,并将多段所述语料片段组成第一语料集;
    使用所述第一语料集对语音识别模型进行训练以得到第一语音识别模型;
    通过所述第一语音识别模型对所述第一语料集中的每个所述语料片段进行识别以得到每个所述语料片段对应的第一词序列;
    将每个所述第一词序列和每个所述第一词序列对应的标准词序列进行比对以统计每个所述语料片段的第一词识别率,所述第一词识别率包括词错误率或者词正确率;
    对每个所述语料片段的所述第一词识别率是否满足第一词识别率预设条件进行判断;
    将满足所述第一词识别率预设条件的所述第一词识别率所对应的所述语料片段进行存储以形成筛选后的第二语料集。
  20. 根据权利要求19所述计算机可读存储介质,其中,所述将满足所述第一词识别率预设条件的所述第一词识别率所对应的所述语料片段进行存储以形成筛选后的第二语料集的步骤之后,还包括:
    使用所述第二语料集对所述第一语音识别模型进行训练以得到第二语音识别模型;
    通过所述第二语音识别模型对所述第二语料集中的每个所述语料片段进行识别以得到每个所述语料片段的第二词序列；
    将每个所述第二词序列和每个所述第二词序列对应的标准词序列进行比对以统计每个所述语料片段的第二词识别率,所述第二词识别率包括词错误率或者词正确率;
    对每个所述语料片段的所述第二词识别率是否满足第二词识别率预设条件进行判断;
    将满足所述第二词识别率预设条件的所述第二词识别率所对应的所述语料片段进行存储以形成筛选后的第三语料集;
    迭代上述步骤直至得到满足预设词识别率预设条件的所有所述语料片段以形成筛选后的语料集。
PCT/CN2019/103470 2019-05-06 2019-08-30 用于语音识别训练的语料筛选方法、装置及计算机设备 WO2020224121A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910372331.0 2019-05-06
CN201910372331.0A CN110265001B (zh) 2019-05-06 2019-05-06 用于语音识别训练的语料筛选方法、装置及计算机设备

Publications (1)

Publication Number Publication Date
WO2020224121A1 true WO2020224121A1 (zh) 2020-11-12

Family

ID=67914304

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103470 WO2020224121A1 (zh) 2019-05-06 2019-08-30 用于语音识别训练的语料筛选方法、装置及计算机设备

Country Status (2)

Country Link
CN (1) CN110265001B (zh)
WO (1) WO2020224121A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139816A (zh) * 2021-04-26 2021-07-20 北京沃东天骏信息技术有限公司 信息处理方法、装置、电子设备和存储介质
CN113362800A (zh) * 2021-06-02 2021-09-07 深圳云知声信息技术有限公司 用于语音合成语料库的建立方法、装置、设备和介质

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111091812B (zh) * 2019-11-26 2022-05-17 思必驰科技股份有限公司 小语种语料的生成方法及系统
CN111091834B (zh) * 2019-12-23 2022-09-06 科大讯飞股份有限公司 文本与音频对齐方法及相关产品
CN111739519A (zh) * 2020-06-16 2020-10-02 平安科技(深圳)有限公司 基于语音识别的对话管理处理方法、装置、设备及介质
CN112435656B (zh) * 2020-12-11 2024-03-01 平安科技(深圳)有限公司 模型训练方法、语音识别方法、装置、设备及存储介质
CN115240659B (zh) * 2022-09-21 2023-01-06 深圳市北科瑞声科技股份有限公司 分类模型训练方法、装置、计算机设备及存储介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989081A (zh) * 2015-02-11 2016-10-05 联想(北京)有限公司 一种语料处理方法和装置
CN108711421A (zh) * 2017-04-10 2018-10-26 北京猎户星空科技有限公司 一种语音识别声学模型建立方法及装置和电子设备
CN109388743A (zh) * 2017-08-11 2019-02-26 阿里巴巴集团控股有限公司 语言模型的确定方法和装置

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN203456091U (zh) * 2013-04-03 2014-02-26 中金数据系统有限公司 语音语料库的构建系统
CN108305619B (zh) * 2017-03-10 2020-08-04 腾讯科技(深圳)有限公司 语音数据集训练方法和装置
CN110019990B (zh) * 2017-07-14 2023-05-23 阿里巴巴集团控股有限公司 样本筛选的方法和装置、业务对象数据搜索的方法和装置
CN110310623B (zh) * 2017-09-20 2021-12-28 Oppo广东移动通信有限公司 样本生成方法、模型训练方法、装置、介质及电子设备
CN108242234B (zh) * 2018-01-10 2020-08-25 腾讯科技(深圳)有限公司 语音识别模型生成方法及其设备、存储介质、电子设备
CN108389577B (zh) * 2018-02-12 2019-05-31 广州视源电子科技股份有限公司 优化语音识别声学模型的方法、系统、设备及存储介质
CN109637537B (zh) * 2018-12-28 2020-06-30 北京声智科技有限公司 一种自动获取标注数据优化自定义唤醒模型的方法

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989081A (zh) * 2015-02-11 2016-10-05 联想(北京)有限公司 一种语料处理方法和装置
CN108711421A (zh) * 2017-04-10 2018-10-26 北京猎户星空科技有限公司 一种语音识别声学模型建立方法及装置和电子设备
CN109388743A (zh) * 2017-08-11 2019-02-26 阿里巴巴集团控股有限公司 语言模型的确定方法和装置


Also Published As

Publication number Publication date
CN110265001A (zh) 2019-09-20
CN110265001B (zh) 2023-06-23


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19927933

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19927933

Country of ref document: EP

Kind code of ref document: A1