CN114783424A - Text corpus screening method, device, equipment and storage medium

Text corpus screening method, device, equipment and storage medium

Info

Publication number
CN114783424A
Authority
CN
China
Prior art keywords
corpus
text
phoneme
evaluation
voice
Prior art date
Legal status
Pending
Application number
CN202210275587.1A
Other languages
Chinese (zh)
Inventor
张献涛
曾祥永
支涛
Current Assignee
Beijing Yunji Technology Co Ltd
Original Assignee
Beijing Yunji Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yunji Technology Co Ltd
Priority to CN202210275587.1A
Publication of CN114783424A
Legal status: Pending

Classifications

    • G10L 15/06, G10L 15/063 - Speech recognition: creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice; Training
    • G06F 18/21, G06F 18/214 - Pattern recognition: design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/02, G06N 3/04 - Neural networks: architecture, e.g. interconnection topology
    • G06N 3/02, G06N 3/08 - Neural networks: learning methods
    • G10L 15/01 - Assessment or evaluation of speech recognition systems
    • G10L 15/26 - Speech to text systems


Abstract

The disclosure provides a text corpus screening method, apparatus, device, and storage medium. The method comprises the following steps: acquiring a basic text corpus and a recording corpus of a target object; recognizing the voice data with a voice recognition model to obtain a first phoneme sequence, and performing phoneme conversion on the voice text to obtain a second phoneme sequence; generating an evaluation sequence from the first and second phoneme sequences, generating an evaluation training data set based on the evaluation sequence, and training an evaluation model with that data set; sequentially selecting each corpus in the basic text corpus, calculating the gain of adding the corpus to the target corpus set, predicting the phoneme sequence of each corpus with the evaluation model, and scoring each corpus according to the gain and the model's prediction; and generating the target corpus set according to the scoring results and a screening condition. The method and device can generate a personalized text corpus for the target object, improving the quality of the text corpus and the tuning effect of the model.

Description

Text corpus screening method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a text corpus screening method, apparatus, device, and storage medium.
Background
With the deepening of digitization and intelligence in various fields, more and more intelligent devices play a role in daily life. Various intelligent voice devices, such as smart speakers, smart phones, and smart robots, already support voice conversations. Automatic Speech Recognition (ASR) has been widely applied and achieves high accuracy. However, for some speakers with accents the recognition results are not ideal, so personalized speech recognition customized for specific speakers is often required.
In the prior art, a pre-training model is used and then tuned on data of the target speaker: a specific speaker is required to record a certain duration of voice according to provided text corpus data, and parameter training and optimization are performed on the pre-training model. This tuning-based training requires the target speaker to record voice according to text, so the corpus on which the voice recording depends is very important to the tuning of the model. At present, a general corpus is adopted as the text corpus for voice recording; it cannot be personalized for the target speaker, so the target speaker's recording effect on the corpus is poor, and when model training and tuning are performed with text-recorded voice obtained from the general corpus, the resulting model parameters are not accurate enough and the model training and tuning effect is poor.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a text corpus screening method, apparatus, device, and storage medium, to solve the problem in the prior art that a personalized text corpus cannot be generated for a target speaker, so that the target speaker's recording effect on the text corpus is poor and the model training and tuning effect is reduced.
In a first aspect of the embodiments of the present disclosure, a method for screening text corpora is provided, including: acquiring a basic text corpus and a recording corpus of a target object, wherein the recording corpus comprises voice data and a voice text corresponding to the voice data; recognizing the voice data by using a preset voice recognition model to obtain a first phoneme sequence corresponding to the voice data, and performing phoneme conversion operation on a voice text corresponding to the voice data to obtain a second phoneme sequence corresponding to the voice text; generating an evaluation sequence according to the first phoneme sequence and the second phoneme sequence, generating an evaluation training data set based on the evaluation sequence, and training an evaluation model by using the evaluation training data set to obtain a trained evaluation model; sequentially selecting each corpus in the basic text corpus, calculating corresponding gain when each corpus is added into the target corpus, predicting a phoneme sequence corresponding to each corpus by using a trained evaluation model, and scoring each corpus according to the gain and a prediction result of the evaluation model; and adding the corpora which accord with the screening condition into the target corpus set according to the scoring result corresponding to each corpus and the preset screening condition so as to obtain the screened target corpus set.
In a second aspect of the embodiments of the present disclosure, a text corpus screening apparatus is provided, including: the acquisition module is configured to acquire a basic text corpus and a recording corpus of a target object, wherein the recording corpus comprises voice data and a voice text corresponding to the voice data; the recognition module is configured to recognize the voice data by using a preset voice recognition model to obtain a first phoneme sequence corresponding to the voice data, and execute phoneme conversion operation on a voice text corresponding to the voice data to obtain a second phoneme sequence corresponding to the voice text; the training module is configured to generate an evaluation sequence according to the first phoneme sequence and the second phoneme sequence, generate an evaluation training data set based on the evaluation sequence, and train the evaluation model by using the evaluation training data set to obtain a trained evaluation model; the prediction module is configured to sequentially select each corpus in the basic text corpus, calculate corresponding gains when each corpus is added to the target corpus, predict a phoneme sequence corresponding to each corpus by using the trained evaluation model, and score each corpus according to the gains and the prediction result of the evaluation model; and the screening module is configured to add the corpora which accord with the screening condition to the target corpus set according to the scoring result corresponding to each corpus and a preset screening condition so as to obtain the screened target corpus set.
In a third aspect of the disclosed embodiments, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the steps of the method are implemented.
In a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, which stores a computer program, which when executed by a processor, implements the steps of the above-mentioned method.
The embodiment of the present disclosure adopts at least one technical scheme that can achieve the following beneficial effects:
acquiring a basic text corpus and a recording corpus of a target object, wherein the recording corpus comprises voice data and a voice text corresponding to the voice data; recognizing the voice data by using a preset voice recognition model to obtain a first phoneme sequence corresponding to the voice data, and performing phoneme conversion operation on a voice text corresponding to the voice data to obtain a second phoneme sequence corresponding to the voice text; generating an evaluation sequence according to the first phoneme sequence and the second phoneme sequence, generating an evaluation training data set based on the evaluation sequence, and training an evaluation model by using the evaluation training data set to obtain a trained evaluation model; sequentially selecting each corpus in the basic text corpus, calculating corresponding gain when each corpus is added into the target corpus, predicting a phoneme sequence corresponding to each corpus by using a trained evaluation model, and scoring each corpus according to the gain and a prediction result of the evaluation model; and adding the corpora which accord with the screening condition into the target corpus set according to the scoring result corresponding to each corpus and the preset screening condition so as to obtain the screened target corpus set. The method and the device can generate the personalized text corpus aiming at the target speaker, improve the recording effect of the target speaker on the text corpus, and improve the accuracy of model parameters when the model training and tuning are carried out by recording voice by using the text, thereby improving the effect of model training and tuning.
Drawings
To more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present disclosure; other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a schematic flow chart of a text corpus screening method according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a text corpus screening apparatus according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
As mentioned in the background, with the development of artificial intelligence, Automatic Speech Recognition (ASR) has been widely applied and achieves high accuracy. Speech recognition is an artificial intelligence technique that enables machines to translate speech signals into corresponding text or commands through a process of recognition and understanding. Although speech recognition achieves high recognition accuracy for users who speak relatively standard Mandarin Chinese, it is not effective for certain specific targets (e.g., speakers with accents), so speech recognition customized for a specific speaker is usually required.
The current common approach is to take a general pre-training model and then use recorded voice data of the target speaker to perform speaker-specific training and tuning of the model. That is, a specific speaker is required to record a certain duration of voice according to provided text corpus data, and the recorded voice is used to optimize the parameters of a pre-trained model (such as a deep learning model). In this process, the specific speaker must record voice according to the screened text corpora, which is time-consuming and labor-intensive for the target speaker. How to screen out a compact and effective text corpus for voice recording, so that the target speaker can complete the recording with as little time and effort as possible, is therefore one of the problems to be solved urgently in the field of speech recognition.
In the prior art, some corpora are generally extracted at random from an existing text corpus to form a general corpus used for voice recording, and different target speakers record voice based on this general corpus. However, recording with a general corpus cannot provide personalized corpus customization for the target speaker, so the target speaker's recording effect on the general corpus is poor; and when model training and tuning are performed with text-recorded voice obtained from the general corpus, the resulting model parameters are not accurate enough, reducing the model training and tuning effect.
In view of the above problems in the prior art, the present disclosure provides a text corpus screening method: part of the target speaker's daily recordings are obtained; a pre-training model recognizes the speech data, and the speech text corresponding to the speech data is converted into a phoneme sequence; an evaluation model is trained with the resulting speech annotation data; the evaluation model and a selection function then jointly score each corpus in the basic text corpus in turn; and the corpora meeting preset conditions are screened out according to their scores and added to a target corpus set, giving a set containing a fixed number of text corpora, so that customized voice recording of the target speaker can later be performed with the target corpus set, improving the effect of model training and tuning.
Fig. 1 is a schematic flow chart of a text corpus screening method according to an embodiment of the present disclosure. The text corpus filtering method of fig. 1 may be performed by a server. As shown in fig. 1, the text corpus screening method may specifically include:
s101, acquiring a basic text corpus and a recording corpus of a target object, wherein the recording corpus comprises voice data and a voice text corresponding to the voice data;
s102, recognizing voice data by using a preset voice recognition model to obtain a first phoneme sequence corresponding to the voice data, and performing phoneme conversion operation on a voice text corresponding to the voice data to obtain a second phoneme sequence corresponding to the voice text;
s103, generating an evaluation sequence according to the first phoneme sequence and the second phoneme sequence, generating an evaluation training data set based on the evaluation sequence, and training an evaluation model by using the evaluation training data set to obtain a trained evaluation model;
s104, sequentially selecting each corpus in the basic text corpus, calculating corresponding gains when each corpus is added to the target corpus, predicting a phoneme sequence corresponding to each corpus by using the trained evaluation model, and scoring each corpus according to the gains and the prediction result of the evaluation model;
and S105, adding the corpora which accord with the screening condition into the target corpus set according to the scoring result corresponding to each corpus and a preset screening condition to obtain the screened target corpus set.
Specifically, the basic text corpus in the embodiment of the present disclosure refers to a text corpus, generated from existing text corpora, that contains a large number of corpora; the embodiment of the present disclosure does not specifically limit the corpus content in the basic text corpus, and existing common sentences and phrases can serve as its corpora. The target object of the embodiment of the present disclosure may refer to one or several specific target speakers, who usually share similar pronunciation habits and speaking accents; it is for these specific target speakers that the embodiment of the present disclosure generates a personalized text corpus for voice recording.
Further, the speech recognition model adopted in the embodiment of the present disclosure may be an existing speech recognition model, such as the DeepSpeech model; the embodiment of the present disclosure does not improve the speech recognition model itself, so any common speech recognition model may be used in this scheme. A phoneme is the minimum unit of speech divided according to the natural attributes of speech; acoustically, it is the minimum speech unit divided from the aspect of sound quality. A phoneme sequence may be regarded as the string of phonemes that make up a text. In speech recognition technology, it is often necessary to convert the character sequence of a text into the corresponding sequence of pronounced phonemes, a conversion process also referred to as front-end processing in TTS technology.
According to the technical scheme provided by the embodiment of the disclosure, a basic text corpus and a recording corpus of a target object are obtained, wherein the recording corpus comprises voice data and a voice text corresponding to the voice data; recognizing the voice data by using a preset voice recognition model to obtain a first phoneme sequence corresponding to the voice data, and performing phoneme conversion operation on a voice text corresponding to the voice data to obtain a second phoneme sequence corresponding to the voice text; generating an evaluation sequence according to the first phoneme sequence and the second phoneme sequence, generating an evaluation training data set based on the evaluation sequence, and training an evaluation model by using the evaluation training data set to obtain a trained evaluation model; sequentially selecting each corpus in the basic text corpus, calculating corresponding gain when each corpus is added to the target corpus, predicting a phoneme sequence corresponding to each corpus by using the trained evaluation model, and scoring each corpus according to the gain and the prediction result of the evaluation model; and adding the corpora which accord with the screening condition into the target corpus set according to the scoring result corresponding to each corpus and the preset screening condition so as to obtain the screened target corpus set. The method and the device can generate the personalized text corpus aiming at the target speaker, improve the recording effect of the target speaker on the text corpus, and improve the accuracy of model parameters when the model training and tuning are carried out by recording voice by using the text, thereby improving the effect of model training and tuning.
In some embodiments, obtaining the base text corpus and the recording corpus of the target object includes: acquiring a pre-configured basic text corpus, wherein the basic text corpus comprises a plurality of corpora, and each corpus comprises a text and a phoneme sequence corresponding to the text; and sending a recording acquisition request to the target object, responding to the confirmation operation of the target object on the recording acquisition request, acquiring a recording file from the mobile terminal of the target object, and labeling the recording file to obtain a recording corpus.
Specifically, a basic text corpus D contains n corpora and can be written as D = {(t1, p1), (t2, p2), …, (tn, pn)}. Each corpus pair in the corpus consists of a text and the phoneme sequence corresponding to that text; for example, the i-th corpus can be represented as (ti, pi), where the phoneme sequence pi can be composed of a plurality of pinyin units and can be written as pi = (pi1, pi2, …, pim). For example, a corpus pair may be (早餐时间 "breakfast time", "z ao c an sh i j ian"). The purpose of obtaining the basic text corpus is to pick out from D, with the technical scheme provided by the embodiment of the disclosure, a plurality of corpora that form a target corpus set for training a personalized speech recognition model.
Further, besides the basic text corpus, the recording corpus of the target object must also be obtained; the target object of the embodiment of the present disclosure may be regarded as the target speaker, i.e., the object for which the personalized corpus is customized. In practical applications, in order to gain a preliminary picture of the target speaker's voice and to prepare for the evaluation model of the next step, part of the target speaker's daily speech recordings must be obtained. After the user's permission is obtained, recording files may be collected from channels such as the user's telephone, mobile phone, APP, or smart speaker; the format of the recording files may be 16 kHz, 16 bit, mono WAV. The corresponding text is annotated from each recording file to obtain the recording corpus, which can be written as R = {(a1, s1), …, (ak, sk)}; each recording corpus consists of speech data and speech text, e.g., a recording corpus can be represented as (aj, sj).
In some embodiments, recognizing the speech data by using a preset speech recognition model to obtain a first phoneme sequence corresponding to the speech data includes: acquiring a pre-trained voice recognition model, taking voice data in the recorded corpus as the input of the voice recognition model, and recognizing the voice data by using the voice recognition model to obtain a first phoneme sequence corresponding to each piece of voice data.
Specifically, before a personalized speech recognition model is trained for a specific target speaker, a basic pre-training model (i.e., a speech recognition model) is generally required, and its training is completed in combination with the small amount of speech annotation data generated for the target speaker. In practical applications, the pre-trained speech recognition model is denoted Masr; an existing speech recognition model, such as the DeepSpeech model, can be adopted.
Further, the speech data aj in the recording corpus R is used as the input of the trained speech recognition model Masr, and Masr recognizes each piece of speech data to obtain its corresponding phoneme sequence qj (i.e., the first phoneme sequence). For example, assuming a piece of speech is recognized as "ni ya", the phoneme sequence corresponding to the speech is "n i y a".
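As an illustrative sketch of this recognition step (the disclosure does not fix an API): the wrapper recognize_phonemes below is a hypothetical callable standing in for any pre-trained recognizer such as a DeepSpeech-style model, and the soundfile library is used to load the 16 kHz mono WAV recordings described above.

```python
import soundfile as sf

def first_phoneme_sequences(recording_corpus, recognize_phonemes):
    """recording_corpus: list of (wav_path, speech_text) pairs.
    recognize_phonemes: hypothetical callable mapping a 16 kHz mono
    waveform to a list of phoneme tokens, e.g. ["n", "i", "y", "a"]."""
    sequences = []
    for wav_path, _speech_text in recording_corpus:
        audio, sample_rate = sf.read(wav_path)  # 16 kHz, 16 bit, mono WAV
        assert sample_rate == 16000
        sequences.append(recognize_phonemes(audio))
    return sequences
```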
In some embodiments, performing a phoneme conversion operation on the speech text corresponding to the speech data to obtain a second phoneme sequence corresponding to the speech text includes: acquiring a voice text corresponding to each piece of voice data in the recording corpus, and converting each voice text by using a text-to-phoneme conversion tool to obtain a second phoneme sequence corresponding to each voice text.
Specifically, the speech text sj in the recording corpus R is used as the input of a text-to-phoneme conversion tool (such as py2ipa), which performs the conversion operation on the speech text and yields the annotated phoneme sequence pj (i.e., the second phoneme sequence). For example, assuming the speech text corresponding to the speech data is "hello" (你好), the phoneme sequence obtained by converting the speech text with the tool is "n i h a o". The process of converting a speech text into a phoneme sequence with a text-to-phoneme tool can be implemented in a known manner, and the detailed description is omitted here.
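The disclosure names py2ipa as one possible conversion tool; as an illustrative stand-in, the sketch below uses the pypinyin library to split Chinese text into pinyin initial/final units (the exact phoneme inventory is a design choice of the implementation, not fixed here):

```python
from pypinyin import lazy_pinyin, Style

def text_to_phonemes(text):
    """Split each character into its pinyin initial and final,
    e.g. "早餐时间" -> ["z", "ao", "c", "an", "sh", "i", "j", "ian"]."""
    initials = lazy_pinyin(text, style=Style.INITIALS, strict=False)
    finals = lazy_pinyin(text, style=Style.FINALS, strict=False)
    phonemes = []
    for initial, final in zip(initials, finals):
        if initial:                 # some characters have no initial
            phonemes.append(initial)
        if final:
            phonemes.append(final)
    return phonemes
```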
In some embodiments, generating an evaluation sequence from the first phoneme sequence and the second phoneme sequence and generating an evaluation training data set based on the evaluation sequence comprises: calculating the edit distance between each first phoneme sequence and the corresponding second phoneme sequence using an edit distance algorithm, aligning the first and second phoneme sequences according to the edit distance, determining the recognition result corresponding to each phoneme in the second phoneme sequence according to the alignment result, generating the evaluation sequence according to the recognition results, and generating the evaluation training data set from the second phoneme sequences and the evaluation sequences.
In particular, after the speech recognition model Masr has recognized the speech data to obtain the first phoneme sequence, and the text-to-phoneme conversion tool has converted the speech text to obtain the second phoneme sequence, an evaluation training data set for training the evaluation model must be generated from the two sequences.
Further, the disclosed embodiments use an edit distance algorithm to compute the minimal modification operations that convert the first phoneme sequence q into the second phoneme sequence p, and the two sequences are aligned accordingly. The edit distance between two strings is the minimum number of editing operations required to change one string into the other, where the editing operations include, but are not limited to, replacing one character with another, inserting a character, and deleting a character. Generally, the smaller the edit distance, the greater the similarity of the two strings.
In practical applications, continuing the foregoing embodiments, comparing the first phoneme sequence "n i y a" with the second phoneme sequence "n i h a o" shows that the first phoneme sequence was recognized incorrectly: the character y must be replaced by the character h, and a character o must be added, to convert the first phoneme sequence into "n i h a o".
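A minimal sketch of this step, assuming phoneme sequences are lists of string tokens: the standard dynamic-programming edit distance is computed, then a backtrace marks each phoneme of the second (reference) sequence as correctly recognized (1) or not (0), reproducing the "n i y a" vs. "n i h a o" example.

```python
def evaluation_sequence(recognized, reference):
    """Align the recognized (first) phoneme sequence against the
    reference (second) sequence by minimal edit operations and mark
    each reference phoneme 1 if matched exactly, else 0."""
    m, n = len(recognized), len(reference)
    # dp[i][j] = edit distance between recognized[:i] and reference[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if recognized[i - 1] == reference[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a recognized phoneme
                           dp[i][j - 1] + 1,        # reference phoneme missed
                           dp[i - 1][j - 1] + cost) # match / substitute
    # Backtrace: decide, for every reference position, correct or not.
    marks = [0] * n
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1]
                + (0 if recognized[i - 1] == reference[j - 1] else 1)):
            marks[j - 1] = 1 if recognized[i - 1] == reference[j - 1] else 0
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            marks[j - 1] = 0   # missed reference phoneme
            j -= 1
        else:
            i -= 1             # spurious recognized phoneme
    return marks

# Reproduces the example in the text:
print(evaluation_sequence(["n", "i", "y", "a"], ["n", "i", "h", "a", "o"]))
# -> [1, 1, 0, 1, 0]
```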
Further, according to the alignment result determined by the above edit distance, each phoneme in the second phoneme sequence p is judged as correctly recognized or not: a correctly recognized phoneme is marked 1, otherwise 0, which yields an evaluation sequence e consisting of 0s and 1s. For example, in the foregoing embodiment the phonemes n, i, a in "n i h a o" are correctly recognized, h is recognized incorrectly, and o is missed, so the evaluation sequence generated from the recognition result is (1, 1, 0, 1, 0), corresponding to a five-dimensional vector. In this way, each phoneme sequence p yields a 0/1 evaluation sequence e indicating whether each of its phonemes was recognized correctly.
Further, the above operations yield a data set Deval composed of phoneme sequence / evaluation sequence pairs (p, e). Using this data set as the evaluation training data set, an evaluation model is trained to obtain the final evaluation model Meval. For an arbitrary input phoneme sequence p, the evaluation model labels the sequence and outputs the corresponding 0/1 evaluation sequence e, and it can also output the probability of the label 0 or 1 at any position j; for example, Prob0(p, j) represents the probability that the phoneme at the j-th position of the phoneme string p is predicted incorrectly (0). In practical applications, the evaluation model is used to determine which phonemes are more error-prone for the particular target speaker and to make a predictive output. The evaluation model can be trained using a sequence-labeling model (such as a CRF model or a Bi-LSTM model).
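As one concrete possibility for such a sequence-labeling model, the sketch below implements a minimal Bi-LSTM tagger in PyTorch; the embedding and hidden sizes are illustrative assumptions, and a CRF output layer could equally be used, as suggested above.

```python
import torch
import torch.nn as nn

class EvaluationModel(nn.Module):
    def __init__(self, num_phonemes, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, 2)   # logits for labels 0 / 1

    def forward(self, phoneme_ids):               # (batch, seq_len) int64
        h, _ = self.lstm(self.embed(phoneme_ids))
        return self.out(h)                        # (batch, seq_len, 2)

    def error_probs(self, phoneme_ids):
        """Probability that each position is labeled 0 (mis-recognized),
        i.e. the Prob0(p, j) used by the scoring formula below."""
        with torch.no_grad():
            logits = self.forward(phoneme_ids)
            return torch.softmax(logits, dim=-1)[..., 0]

# Training would minimize nn.CrossEntropyLoss between the per-position
# logits and the 0/1 evaluation sequences from the alignment step.
```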
In some embodiments, sequentially selecting each corpus in the base text corpus and calculating the corresponding gain when each corpus is added to the target corpus set includes: starting from a target corpus set that is empty in the initial state, sequentially selecting one corpus from the basic text corpus, determining the number of occurrences in the target corpus set of each phoneme of the corpus, and calculating the gain of adding the corpus to the target corpus set using a preset gain calculation formula based on those occurrence counts.
Specifically, when training a personalized speech recognition model for a target speaker, speech training data of the target speaker is needed; this data should be as comprehensive as possible, and the corpus should be made distinctive with respect to the deficiencies of the pre-training model (the speech recognition model). At the same time, limited by acquisition and application costs, the target speaker cannot be asked to record large amounts of voice data. Therefore, the embodiment of the disclosure defines an evaluation formula and uses it to score and judge candidate sentences, obtaining a compact and effective text corpus to be supplied for model training and tuning.
Further, part of the corpora are selected from the basic text corpus D and taken out to form the target corpus set Yset. During actual screening, suppose i corpora have already been screened into the target corpus set Yset; when the (i + 1)-th corpus y is selected from the basic text corpus D, the corpus with the highest score should be chosen. Two points need to be considered when selecting the corpus y: the first is to use the trained evaluation model to find the corpora that the pre-training model is most prone to mis-recognize; the second is to keep the proportions of the various phonemes as rich as possible, because selecting only certain categories (such as phonemes with a high error count) would unbalance the phoneme proportions in the target corpus set Yset, which could keep the screened corpus from achieving the intended effect.
Further, taking both of the above points into account, before the evaluation formula is used to score candidate sentences, a function is defined to represent the gain produced when a corpus y is added to the target corpus set Yset; that is, for a newly added corpus y, the gain it brings relative to the existing target corpus set Yset is judged while the phoneme balance is controlled. The gain calculation function has the form:
gain(y, Yset) = Σ_{ph ∈ P(y)} f(count(ph, Yset))
where the function count(ph, Yset) represents the number of occurrences of the phoneme ph in the corpus set Yset, P(y) is the set of phonemes of y, and f is a function of this count that stops rewarding a phoneme once its count reaches a threshold T, for example f(c) = 1 if c < T and f(c) = 0 otherwise.
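A direct Python transcription of this gain function, assuming the simple cutoff form of f given above (the threshold value T = 20 is purely illustrative; the disclosure leaves T unspecified):

```python
from collections import Counter

def gain(corpus_phonemes, yset_counts, T=20):
    """gain(y, Yset): each distinct phoneme of the corpus contributes
    until its occurrence count in Yset reaches the threshold T."""
    def f(count):
        return 1.0 if count < T else 0.0
    return sum(f(yset_counts[ph]) for ph in set(corpus_phonemes))

# yset_counts is a Counter over all phonemes of the corpora already in
# Yset; it is updated with yset_counts.update(phonemes) after each pick.
```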
In some embodiments, predicting the phoneme sequence corresponding to each corpus by using the trained evaluation model, and scoring each corpus according to the gain and the prediction result of the evaluation model, including: using a phoneme sequence corresponding to a corpus selected from a basic text corpus as an input of an evaluation model, and predicting the error probability of a phoneme corresponding to each position in the phoneme sequence by using the evaluation model; and scoring the linguistic data according to the error probability of the phoneme corresponding to each position in the phoneme sequence and the gain corresponding to the linguistic data to obtain a scoring result corresponding to the linguistic data.
Specifically, after the gain of adding the corpus y to the target corpus set is calculated, the trained evaluation model is used to predict the phoneme sequence corresponding to the corpus y, obtaining the error probability of each phoneme in the sequence; finally, the corpus y is scored based on the gain calculation result and the output of the evaluation model. That is, the evaluation model and the gain calculation function score jointly, and a fixed number of text corpora are screened out of the basic text corpus D. The evaluation formula provided by the embodiment of the disclosure is defined as follows:
score(y) = (1/n) · Σ_{j=1..n} Prob0(p(y), j) + Σ_{ph ∈ P} f(count(ph, Yset))
where y represents a single corpus extracted from the basic text corpus D to be added to the target corpus set Yset, p(y) represents the phoneme sequence obtained by converting the speech text of the single corpus y, n represents the number of phonemes in p(y), and P represents the phoneme set of p(y); the first term is the average probability, predicted by the evaluation model, that each phoneme of y is mis-recognized, and the second term is the phoneme-balance gain defined above.
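A sketch of the joint scoring under the reconstruction above, reusing gain() and the EvaluationModel sketched earlier; the additive combination of the average error probability and the gain term is an assumption made for illustration, and phoneme_to_id is an assumed vocabulary mapping:

```python
import torch

def score(corpus_phonemes, eval_model, phoneme_to_id, yset_counts, T=20):
    """score(y) = average predicted error probability of the phonemes
    of y (from the evaluation model) + phoneme-balance gain."""
    ids = torch.tensor([[phoneme_to_id[ph] for ph in corpus_phonemes]])
    probs = eval_model.error_probs(ids)[0]  # Prob0(p(y), j) per position j
    avg_error = probs.mean().item()         # (1/n) * sum over positions
    return avg_error + gain(corpus_phonemes, yset_counts, T)
```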
Furthermore, based on the evaluation formula, the basic text corpus D is traversed, each corpus is scored, the corpus with the highest score is obtained, and that corpus is added to the target corpus set Yset as a screened corpus. The process of traversing and screening the basic text corpus D is briefly explained below; the following steps can be adopted:
First, the target corpus set Yset is initialized and set as an empty set. For the basic text corpus D, the scores of all the corpora y are calculated in turn; the corpus y with the highest score value scoreMax is put into the target corpus set Yset and removed from the basic text corpus D. When the size of the target corpus set Yset reaches a preset upper limit X on the number of corpora, or the value of scoreMax falls below a set threshold, the traversal stops; otherwise, the remaining corpora continue to be scored until a target corpus set Yset meeting the screening condition is obtained.
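These steps can be sketched as the following greedy loop, reusing score() from above; max_size_X and score_threshold correspond to the preset upper limit X and the score threshold, with illustrative default values:

```python
from collections import Counter

def screen_corpus(base_corpus, eval_model, phoneme_to_id,
                  max_size_X=500, score_threshold=0.1, T=20):
    """base_corpus: list of (text, phoneme_sequence) pairs.
    Returns the screened target corpus set Yset as a list."""
    remaining = list(base_corpus)
    yset, yset_counts = [], Counter()
    while remaining and len(yset) < max_size_X:
        scored = [(score(ph_seq, eval_model, phoneme_to_id, yset_counts, T), i)
                  for i, (_text, ph_seq) in enumerate(remaining)]
        score_max, best = max(scored)
        if score_max < score_threshold:  # no remaining corpus is worth adding
            break
        text, ph_seq = remaining.pop(best)
        yset.append((text, ph_seq))
        yset_counts.update(ph_seq)       # keep phoneme balance counts current
    return yset
```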
According to the technical scheme provided by the embodiment of the disclosure, a personalized text corpus screening processing method is provided: through a reasonable screening and scoring design, a compact and effective text corpus set can be obtained and supplied for model training and tuning. Part of the target speaker's daily recordings are obtained; a pre-training model recognizes the speech data, and the speech text corresponding to the speech data is converted into a phoneme sequence; an evaluation model is trained from the resulting speech annotation data; the evaluation model and the selection function jointly score each corpus in the basic text corpus in turn; and the corpora meeting preset conditions are screened out according to their scores and added to the target corpus set, giving a set containing a fixed number of text corpora. Customized voice recording of the target speaker can then be performed using the target corpus set, improving the effect of model training and tuning.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
Fig. 2 is a schematic structural diagram of a text corpus screening apparatus according to an embodiment of the present disclosure. As shown in fig. 2, the text corpus filtering apparatus includes:
the obtaining module 201 is configured to obtain a base text corpus and a recording corpus of a target object, where the recording corpus includes voice data and a voice text corresponding to the voice data;
the recognition module 202 is configured to recognize the voice data by using a preset voice recognition model to obtain a first phoneme sequence corresponding to the voice data, and perform a phoneme conversion operation on a voice text corresponding to the voice data to obtain a second phoneme sequence corresponding to the voice text;
a training module 203 configured to generate an evaluation sequence according to the first phoneme sequence and the second phoneme sequence, generate an evaluation training data set based on the evaluation sequence, and train the evaluation model by using the evaluation training data set to obtain a trained evaluation model;
the prediction module 204 is configured to sequentially select each corpus in the basic text corpus, calculate corresponding gains when each corpus is added to the target corpus, predict a phoneme sequence corresponding to each corpus by using the trained evaluation model, and score each corpus according to the gains and the prediction result of the evaluation model;
the filtering module 205 is configured to add, according to the scoring result corresponding to each corpus and a preset filtering condition, the corpus that meets the filtering condition to the target corpus set to obtain a filtered target corpus set.
In some embodiments, the obtaining module 201 in fig. 2 obtains a pre-configured basic text corpus, where the basic text corpus includes a plurality of corpora, and each corpus includes a text and a phoneme sequence corresponding to the text; sending a recording acquisition request to a target object, responding to the confirmation operation of the target object on the recording acquisition request, collecting a recording file from a mobile terminal of the target object, and labeling the recording file to obtain a recording corpus.
In some embodiments, the recognition module 202 of fig. 2 obtains a pre-trained speech recognition model, uses speech data in the recorded corpus as input of the speech recognition model, and recognizes the speech data by using the speech recognition model to obtain a first phoneme sequence corresponding to each piece of speech data.
In some embodiments, the recognition module 202 in fig. 2 obtains a speech text corresponding to each piece of speech data in the recording corpus, and converts each speech text by using a text-to-phoneme conversion tool, so as to obtain a second phoneme sequence corresponding to each speech text.
In some embodiments, the training module 203 of fig. 2 calculates an editing distance between each first phoneme sequence and the second phoneme sequence by using an editing distance algorithm, aligns the first phoneme sequence with the second phoneme sequence according to the editing distance, determines a recognition result corresponding to each phoneme in the second phoneme sequence according to the alignment result, generates an evaluation sequence according to the recognition result, and generates an evaluation training data set by using the second phoneme sequence and the evaluation sequence.
In some embodiments, the prediction module 204 in fig. 2, starting from a target corpus set that is empty in the initial state, sequentially selects one corpus from the basic text corpus, determines the number of occurrences in the target corpus set of each phoneme of the corpus, and calculates the gain of adding the corpus to the target corpus set using a preset gain calculation formula based on those occurrence counts.
In some embodiments, the prediction module 204 of fig. 2 uses a phoneme sequence corresponding to a corpus selected from the base text corpus as an input of an evaluation model, and predicts an error probability of a phoneme corresponding to each position in the phoneme sequence by using the evaluation model; and scoring the corpus according to the error probability of the phoneme corresponding to each position in the phoneme sequence and the gain corresponding to the corpus to obtain a scoring result corresponding to the corpus.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure.
Fig. 3 is a schematic structural diagram of an electronic device 3 provided in the embodiment of the present disclosure. As shown in fig. 3, the electronic apparatus 3 of this embodiment includes: a processor 301, a memory 302, and a computer program 303 stored in the memory 302 and operable on the processor 301. The steps in the various method embodiments described above are implemented when the processor 301 executes the computer program 303. Alternatively, the processor 301 implements the functions of the modules/units in the above-described device embodiments when executing the computer program 303.
Illustratively, the computer program 303 may be partitioned into one or more modules/units, which are stored in the memory 302 and executed by the processor 301 to accomplish the present disclosure. One or more of the modules/units may be a series of computer program instruction segments capable of performing certain functions, the instruction segments being used for describing the execution of the computer program 303 in the electronic device 3.
The electronic device 3 may be a desktop computer, a notebook, a palm computer, a cloud server, or other electronic devices. The electronic device 3 may include, but is not limited to, a processor 301 and a memory 302. Those skilled in the art will appreciate that fig. 3 is merely an example of the electronic device 3, and does not constitute a limitation of the electronic device 3, and may include more or less components than those shown, or combine certain components, or different components, for example, the electronic device may also include input-output devices, network access devices, buses, etc.
The Processor 301 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The storage 302 may be an internal storage unit of the electronic device 3, for example, a hard disk or a memory of the electronic device 3. The memory 302 may also be an external storage device of the electronic device 3, for example, a plug-in hard disk provided on the electronic device 3, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 302 may also include both an internal storage unit of the electronic device 3 and an external storage device. The memory 302 is used for storing computer programs and other programs and data required by the electronic device. The memory 302 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only used for distinguishing one functional unit from another, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In the embodiments provided in the present disclosure, it should be understood that the disclosed apparatus/computer device and method may be implemented in other ways. For example, the above-described apparatus/computer device embodiments are merely illustrative, and for example, a module or a unit may be divided into only one logical function, another division may be made in actual implementation, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the methods in the above embodiments through a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium, and when executed by a processor, the computer program may implement the steps of the above method embodiments. The computer program may comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, and so on. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer readable medium may be subject to suitable additions or deletions in accordance with the requirements of legislation and patent practice within a jurisdiction; for example, in some jurisdictions, computer readable media may not include electrical carrier signals or telecommunications signals in accordance with legislation and patent practice.
The above examples are only intended to illustrate the technical solution of the present disclosure, not to limit it; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present disclosure, and are intended to be included within the scope of the present disclosure.

Claims (10)

1. A text corpus screening method is characterized by comprising the following steps:
acquiring a basic text corpus and a recording corpus of a target object, wherein the recording corpus comprises voice data and a voice text corresponding to the voice data;
recognizing the voice data by using a preset voice recognition model to obtain a first phoneme sequence corresponding to the voice data, and performing phoneme conversion operation on a voice text corresponding to the voice data to obtain a second phoneme sequence corresponding to the voice text;
generating an evaluation sequence according to the first phoneme sequence and the second phoneme sequence, generating an evaluation training data set based on the evaluation sequence, and training an evaluation model by using the evaluation training data set to obtain a trained evaluation model;
sequentially selecting each corpus in the basic text corpus, calculating corresponding gain when each corpus is added into a target corpus, predicting a phoneme sequence corresponding to each corpus by using the trained evaluation model, and scoring each corpus according to the gain and a prediction result of the evaluation model;
and adding the corpora which accord with the screening condition into the target corpus set according to the scoring result corresponding to each corpus and a preset screening condition so as to obtain the screened target corpus set.
2. The method of claim 1, wherein obtaining the base corpus of text and the corpus of recorded sounds of the target object comprises:
acquiring a pre-configured basic text corpus, wherein the basic text corpus comprises a plurality of linguistic data, and each linguistic data comprises a text and a phoneme sequence corresponding to the text;
and sending a recording acquisition request to the target object, responding to the confirmation operation of the target object on the recording acquisition request, acquiring a recording file from a mobile terminal of the target object, and labeling the recording file to obtain the recording corpus.
3. The method according to claim 1, wherein the recognizing the speech data by using a preset speech recognition model to obtain a first phoneme sequence corresponding to the speech data comprises:
and acquiring a pre-trained voice recognition model, taking voice data in the recording corpus as input of the voice recognition model, and recognizing the voice data by using the voice recognition model to obtain the first phoneme sequence corresponding to each voice data.
4. The method of claim 1, wherein performing a phoneme conversion operation on the speech text corresponding to the speech data to obtain a second phoneme sequence corresponding to the speech text comprises:
and acquiring a voice text corresponding to each voice data in the recording corpus, and converting each voice text by using a text-to-phoneme conversion tool to obtain a second phoneme sequence corresponding to each voice text.
5. The method of claim 1, wherein generating the evaluation sequence according to the first phoneme sequence and the second phoneme sequence, and generating the evaluation training data set based on the evaluation sequence comprises:
and calculating an edit distance between each first phoneme sequence and the corresponding second phoneme sequence by using an edit distance algorithm, aligning the first phoneme sequence with the second phoneme sequence according to the edit distance, determining a recognition result corresponding to each phoneme in the second phoneme sequence according to the alignment result, generating the evaluation sequence according to the recognition results, and generating the evaluation training data set by using the second phoneme sequences and the evaluation sequences.
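Claim 5's alignment step is essentially Levenshtein alignment with a backtrace. The sketch below assumes a 0/1 labeling for the evaluation sequence (0 = reference phoneme recognized correctly, 1 = substituted or missed); the concrete label encoding is an assumption.

```python
# A minimal sketch of claim 5: align the recognized (first) sequence to
# the reference (second) sequence with Levenshtein edit distance, then
# mark each reference phoneme as correctly recognized (0) or not (1).
# The 0/1 labeling scheme is an assumption for illustration.
def evaluation_sequence(recognized, reference):
    m, n = len(recognized), len(reference)
    # d[i][j] = edit distance between recognized[:i] and reference[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if recognized[i - 1] == reference[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    # Backtrace to label each reference position.
    labels = [1] * n
    i, j = m, n
    while i > 0 and j > 0:
        if d[i][j] == d[i - 1][j - 1] and recognized[i - 1] == reference[j - 1]:
            labels[j - 1] = 0          # phoneme recognized correctly
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1        # substitution -> stays misrecognized
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1                     # insertion on the recognized side
        else:
            j -= 1                     # reference phoneme missed
    return labels
```

For example, `evaluation_sequence(["n", "i", "h", "au"], ["n", "i", "h", "ao"])` yields `[0, 0, 0, 1]`, marking the final reference phoneme as misrecognized.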
6. The method according to claim 1, wherein sequentially selecting each corpus in the basic text corpus and calculating the corresponding gain when each corpus is added to a target corpus set comprises:
and in an initial state, the target corpus set is an empty set; a corpus is sequentially selected from the basic text corpus, the frequency with which each phoneme in the selected corpus occurs in the target corpus set is determined, and the gain obtained when the corpus is added to the target corpus set is calculated from the occurrence frequencies by using a preset gain calculation formula.
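The claim leaves the gain formula open ("a preset gain calculation formula"). A diminishing-returns form such as 1/(1+count), summed over the corpus's phonemes, is one natural assumption: a phoneme absent from the target corpus set contributes a full point, while an already frequent one contributes almost nothing.

```python
from collections import Counter

# Sketch of the gain computation in claim 6. The 1/(1+count) form below
# is an illustrative assumption, not the patent's formula; it rewards
# phonemes that are still rare in the target corpus set.
def gain(corpus_phonemes, target_set_phoneme_counts: Counter) -> float:
    return sum(1.0 / (1 + target_set_phoneme_counts[p])
               for p in corpus_phonemes)
```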
7. The method according to claim 6, wherein the predicting the phoneme sequence corresponding to each corpus using the trained evaluation model, and scoring each corpus according to the gain and the prediction result of the evaluation model comprises:
selecting a phoneme sequence corresponding to a corpus from the basic text corpus as an input of the evaluation model, and predicting the error probability of a phoneme corresponding to each position in the phoneme sequence by using the evaluation model; and scoring the corpus according to the error probability of the phoneme corresponding to each position in the phoneme sequence and the gain corresponding to the corpus to obtain a scoring result corresponding to the corpus.
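Claim 7 likewise leaves the combination rule open. Weighting the gain by the mean predicted error probability is one plausible reading (an assumption), so that a corpus scores highly when it both adds phoneme coverage and exercises phonemes the recognizer tends to get wrong.

```python
# Sketch of the scoring step in claim 7. How the gain and the predicted
# per-phoneme error probabilities are combined is not specified in the
# claim; weighting the gain by the mean predicted error is one plausible
# reading, assumed here for illustration.
def score_corpus(phoneme_sequence, eval_model, corpus_gain: float) -> float:
    error_probs = eval_model(phoneme_sequence)   # one probability per position
    mean_error = sum(error_probs) / max(len(error_probs), 1)
    return corpus_gain * mean_error
```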
8. A text corpus screening apparatus, comprising:
the acquisition module is configured to acquire a basic text corpus and a recording corpus of a target object, wherein the recording corpus comprises voice data and a voice text corresponding to the voice data;
the recognition module is configured to recognize the voice data by using a preset voice recognition model to obtain a first phoneme sequence corresponding to the voice data, and perform phoneme conversion operation on a voice text corresponding to the voice data to obtain a second phoneme sequence corresponding to the voice text;
the training module is configured to generate an evaluation sequence according to the first phoneme sequence and the second phoneme sequence, generate an evaluation training data set based on the evaluation sequence, and train an evaluation model by using the evaluation training data set to obtain a trained evaluation model;
the prediction module is configured to sequentially select each corpus in the basic text corpus, calculate the corresponding gain when each corpus is added to a target corpus set, predict a phoneme sequence corresponding to each corpus by using the trained evaluation model, and score each corpus according to the gain and the prediction result of the evaluation model;
and the screening module is configured to add the corpora meeting the screening condition to the target corpus set according to the scoring result corresponding to each corpus and a preset screening condition so as to obtain a screened target corpus set.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202210275587.1A 2022-03-21 2022-03-21 Text corpus screening method, device, equipment and storage medium Pending CN114783424A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210275587.1A CN114783424A (en) 2022-03-21 2022-03-21 Text corpus screening method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210275587.1A CN114783424A (en) 2022-03-21 2022-03-21 Text corpus screening method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114783424A (en) 2022-07-22

Family

ID=82425061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210275587.1A Pending CN114783424A (en) 2022-03-21 2022-03-21 Text corpus screening method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114783424A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115440238A (en) * 2022-08-16 2022-12-06 广西壮族自治区通信产业服务有限公司技术服务分公司 Noise screening method and system in voice automatic labeling data

Similar Documents

Publication Title
CN108847241B (en) Method for recognizing conference voice as text, electronic device and storage medium
CN108831439B (en) Voice recognition method, device, equipment and system
EP2700071B1 (en) Speech recognition using multiple language models
CN105374356B (en) Audio recognition method, speech assessment method, speech recognition system and speech assessment system
US6839667B2 (en) Method of speech recognition by presenting N-best word candidates
CN106297800B (en) Self-adaptive voice recognition method and equipment
CN106935239A (en) The construction method and device of a kind of pronunciation dictionary
CN112349289B (en) Voice recognition method, device, equipment and storage medium
CN112634866B (en) Speech synthesis model training and speech synthesis method, device, equipment and medium
CN111402862A (en) Voice recognition method, device, storage medium and equipment
CN110473527B (en) Method and system for voice recognition
CN113284502A (en) Intelligent customer service voice interaction method and system
CN114927122A (en) Emotional voice synthesis method and synthesis device
CN114783424A (en) Text corpus screening method, device, equipment and storage medium
US11615787B2 (en) Dialogue system and method of controlling the same
CN113393841B (en) Training method, device, equipment and storage medium of voice recognition model
CN111640423A (en) Word boundary estimation method and device and electronic equipment
CN116597809A (en) Multi-tone word disambiguation method, device, electronic equipment and readable storage medium
Choi et al. Learning to Maximize Speech Quality Directly Using MOS Prediction for Neural Text-to-Speech
CN113724698B (en) Training method, device, equipment and storage medium of voice recognition model
CN113160801B (en) Speech recognition method, device and computer readable storage medium
CN111489742B (en) Acoustic model training method, voice recognition device and electronic equipment
CN113744727A (en) Model training method, system, terminal device and storage medium
CN112686041A (en) Pinyin marking method and device
CN113744718A (en) Voice text output method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination