CN114333839A - Model training material selection method and device, electronic equipment and storage medium

Model training material selection method and device, electronic equipment and storage medium

Info

Publication number
CN114333839A
Authority
CN
China
Prior art keywords
audio
candidate
sentence
statement
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111665084.7A
Other languages
Chinese (zh)
Inventor
闫影
文博龙
甘文东
陈海涛
李海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu iQIYI Intelligent Innovation Technology Co Ltd
Original Assignee
Chengdu iQIYI Intelligent Innovation Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu iQIYI Intelligent Innovation Technology Co Ltd filed Critical Chengdu iQIYI Intelligent Innovation Technology Co Ltd
Priority to CN202111665084.7A priority Critical patent/CN114333839A/en
Publication of CN114333839A publication Critical patent/CN114333839A/en

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The application relates to a method and device for selecting model training materials, an electronic device, and a storage medium, wherein the method comprises the following steps: segmenting a dry sound material to obtain a plurality of audio sentences, wherein the dry sound material comprises a plurality of audio recordings for model training; extracting the signal-to-noise ratio of each audio sentence, and taking the audio sentences whose signal-to-noise ratio is greater than a target signal-to-noise ratio as candidate audio; and determining a reference audio sentence from the candidate audio and, according to it, determining target audio sentences in the candidate audio. By this method, audio usable for model training is automatically selected from the dry sound material by computing voiceprint similarity and serves as the model training material, which solves the problem of long working cycles caused by manually classifying speakers according to model training requirements and collecting each speaker's audio.

Description

Model training material selection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech conversion, and in particular, to a method and an apparatus for selecting model training materials, an electronic device, and a storage medium.
Background
Currently, speech synthesis (Text-To-Speech, TTS) and speech conversion (VC) are widely used in artificial intelligence systems. For example, TTS is used in audiobooks, navigation, voice assistants, and so on, while VC is used in film and television dubbing, entertainment livestreaming, and so on. TTS and VC therefore have broad application scenarios and great commercial value, both now and in the future.
In speech synthesis and speech conversion systems, deep learning is currently the mainstream solution. Although deep learning effectively improves system quality, it places higher demands on the quantity of training data. At present, training a model by deep learning imposes very strict standards on the volume and quality of training material; for example, tens of hours of audio are often required per speaker. Because the required data volume is so large, the production cycle of model training materials is long and production efficiency is low.
Disclosure of Invention
The application provides a method and device for selecting model training materials, an electronic device, and a storage medium, aiming to solve the problems in the related art that model training materials have a long production cycle and a high production cost.
In a first aspect, the present application provides a method for selecting model training materials, the method comprising: segmenting a dry sound material to obtain a plurality of audio sentences, wherein the dry sound material comprises a plurality of audio recordings for model training; extracting the signal-to-noise ratio of each audio sentence, and taking the audio sentences whose signal-to-noise ratio is greater than a target signal-to-noise ratio as candidate audio; and determining a reference audio sentence from the candidate audio and determining a target audio sentence in the candidate audio according to the reference audio sentence, wherein the target audio sentence is a training material of the model, the similarity between the target audio sentence and the reference audio sentence is greater than a similarity threshold, and the reference audio sentence and the target audio sentence are not the same audio.
In a second aspect, the present application provides a model training material selection device, the device comprising: a segmentation module, configured to segment a dry sound material to obtain a plurality of audio sentences, wherein the dry sound material comprises a plurality of audio recordings for model training; an extraction module, configured to extract the signal-to-noise ratio of each audio sentence and take the audio sentences whose signal-to-noise ratio is greater than a target signal-to-noise ratio as candidate audio; and a determining module, configured to determine a reference audio sentence from the candidate audio and determine a target audio sentence in the candidate audio according to the reference audio sentence, wherein the target audio sentence is a training material for model training, the similarity between the target audio sentence and the reference audio sentence is greater than a similarity threshold, and the reference audio sentence and the target audio sentence are not the same audio.
In a third aspect, an electronic device is provided, which includes a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the steps of the method for selecting model training material according to any one of the embodiments of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the steps of the method for selecting model training material according to any one of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
the method provided by the embodiment of the application comprises the following steps: segmenting a dry sound material to obtain a plurality of audio sentences, wherein the dry sound material comprises a plurality of audios for model training; extracting the signal-to-noise ratio of each audio statement, and taking the audio statement with the signal-to-noise ratio larger than a target signal-to-noise ratio as candidate audio; the method comprises the steps of determining a reference audio statement from the candidate audio, determining a target audio statement in the candidate audio according to the reference audio statement, wherein the target audio statement is a training material of the model, the similarity between the target audio statement and the reference audio statement is larger than a similarity threshold, and the reference audio statement and the target audio statement are not the same audio.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; it is obvious that those skilled in the art can derive other drawings from these drawings without inventive effort.
Fig. 1 is a schematic flowchart of a method for selecting model training materials according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a basic structure of a model training system according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a basic structure of a model training apparatus according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flowchart of a method for selecting model training materials according to an embodiment of the present disclosure, as shown in fig. 1, which includes but is not limited to:
s101, dividing a dry sound material to obtain a plurality of audio sentences;
it is to be understood that dry sound material is original human voice recording material without any post-processing, which may be understood as sound picked up by a microphone. The sound is not processed at all, and the characteristics of the sound during recording can be obviously highlighted. The stem sound material is a collection of multiple-sentence audio and comprises multiple-sentence audio, wherein at least part of the multiple-sentence audio can be used for model training, namely the stem sound material comprises multiple audios used for model training, and when the model is trained, the training data needs to be a single sentence, so that the multiple audios in the stem sound material are segmented to obtain multiple audio sentences, each audio sentence is a single sentence, the stem sound material is an integral audio, and the single sentence is an independent sentence meeting the requirement of the model training.
It is understood that, before segmenting the dry sound material, the model training material selection method further comprises: acquiring the dry sound material. The dry sound material may be obtained from a dry sound material database used to store such materials, or from a network. In some examples, the dry sound material comes from a TV series or a movie. Since the total audio duration of a TV series' dry sound material is greater than that of a movie's, the dry sound material of a TV series is preferred. Moreover, a TV series usually features multiple characters speaking, i.e., it contains audio corresponding to multiple characters; and because its total duration is long, enough audio can be extracted even for a single character, so audio corresponding to multiple speakers can be acquired.
S102, extracting the signal-to-noise ratio of each audio sentence, and taking the audio sentences with a signal-to-noise ratio greater than a target signal-to-noise ratio as candidate audio;
it should be understood that the signal-to-noise ratio (signal-to-noise ratio) is the difference between the speech segment and the silence segment noise signal (power) in an audio sentence. Expressed in dB. For example, the SNR of a certain audio is 80dB, i.e. the power of a speech segment is 10^8 times that of a silence segment, the standard deviation of the speech segment output signal is 10^4 times that of the silence segment, and the higher the SNR is, the smaller the relative noise is. In some examples, the target snr is determined according to the training requirements of the model, it is understood that when the models are different, the training requirements of the models may be different, and therefore, the target snr is determined according to the training requirements of the models, and it is understood that the training requirements of the models are flexibly set by the relevant personnel, for example: when the relevant personnel set the model training requirement to be the audio frequency with the required signal-to-noise ratio exceeding 30db, setting the target signal-to-noise ratio to be 30 db; as another example, when the relevant person sets the model training requirement to an audio requiring a signal-to-noise ratio in excess of 50db, then the target signal-to-noise ratio is set to 50 db.
It should also be appreciated that in some examples the training requirement of the model may instead call for audio sentences whose SNR is below a certain value; in that case, that value is taken as the target SNR and the audio sentences with an SNR less than the target SNR are taken as the candidate audio. A filtering step of this kind is sketched below.
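A minimal sketch of this filtering step, assuming each audio sentence already carries a precomputed SNR (the SNR computation itself is sketched further below); the field and function names are illustrative, not from the patent.

```python
def select_candidates(sentences, target_snr_db=30.0, keep_above=True):
    """Keep audio sentences whose SNR exceeds the target, or, for models
    that require low-SNR data, falls below it."""
    if keep_above:
        return [s for s in sentences if s["snr_db"] > target_snr_db]
    return [s for s in sentences if s["snr_db"] < target_snr_db]

# Example: candidates = select_candidates(all_sentences, target_snr_db=30.0)
```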
S103, determining a reference audio sentence from the candidate audio, and determining a target audio sentence in the candidate audio according to the reference audio sentence;
it should be understood that the target audio sentence is a training material of the model, the similarity between the target audio sentence and the reference audio sentence is greater than a similarity threshold, and the reference audio sentence and the target audio sentence are not the same audio.
In some examples of this embodiment, segmenting the dry sound material into a plurality of audio sentences includes: acquiring the audio track of the dry sound material and a preset audio segmentation range, wherein the preset audio segmentation range is determined according to the model training requirement and limits the longest and shortest segment lengths; and segmenting the track at the silence segments of the dry sound material, constrained by the preset audio segmentation range, so that the dry sound material is split into a plurality of audio sentences whose durations fall within the preset range. It should be understood that the preset audio segmentation range is a range of values: its maximum is the longest allowed length of each audio sentence and its minimum the shortest. Different models require audio sentences of different durations for training, and the training requirement is set flexibly by the relevant personnel. For example, when the model needs audio sentences of 5 s to 10 s for training, 5 s-10 s is taken as the preset audio segmentation range, so the segmented audio sentences fall within 5 s-10 s; when the training requirement is 3 s-5 s, 3 s-5 s is taken as the preset range, so the segmented audio sentences fall within 3 s-5 s.
It should be understood that audio tracks are the parallel "tracks" seen in sequencer software. Each track defines its own attributes, such as timbre, timbre library, number of channels, input/output ports, and volume, and each track has a duration. A dry sound material corresponds to one complete track; segmenting that track yields multiple sub-tracks, each corresponding to one audio sentence, which is how the dry sound material is split into a plurality of audio sentences. The sum of the sub-track durations equals the duration of the dry sound material's track. When segmenting the track, the preset audio segmentation range serves as the constraint and the cuts are made at the silence segments of the dry sound material. For example, suppose the track of a dry sound material is 150 s long and the preset audio segmentation range is 5 s-10 s. The starting points of all silence segments in the dry sound material are obtained and used as candidate segmentation points, and the track is cut at these points into sub-tracks. When the length of a sub-track falls within 5 s-10 s, the cut is accepted; if the length of some sub-track does not fall within 5 s-10 s, that sub-track is cut directly at the maximum value of the range, 10 s, so that every resulting sub-track lies within the preset audio segmentation range.
In some examples, besides requiring the sub-track durations to lie within the preset audio segmentation range, segmentation also requires that each resulting audio sentence contain only one speaker, so that the voiceprint features corresponding to the audio sentence can be extracted more reliably later. It should be understood that the track may be segmented according to the preset audio segmentation length by a sentence segmentation module, splitting the dry sound material into a plurality of single sentences and thereby obtaining a plurality of audio sentences; the sentence segmentation module can be a pyannote.audio-based model. A sketch of the silence-based splitting is given below.
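The following is a minimal sketch of this splitting logic in Python, assuming a mono floating-point signal. The frame size, the energy threshold used to detect silence, and the rule of force-cutting at the maximum length are assumptions chosen to match the 150 s example above, not values from the patent.

```python
import numpy as np

def split_on_silence(signal, sr, min_len_s=5.0, max_len_s=10.0,
                     frame_ms=20, energy_thresh=1e-4):
    """Split a mono track at silence onsets, bounded by a preset range."""
    frame = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame
    # Frame-level energy: frames below the threshold are treated as silence.
    energies = np.array([np.sum(signal[i * frame:(i + 1) * frame] ** 2)
                         for i in range(n_frames)])
    silent = energies < energy_thresh
    # Candidate cut points: the first sample of each silence run.
    cuts = [i * frame for i in range(1, n_frames)
            if silent[i] and not silent[i - 1]]

    segments, start = [], 0
    for cut in cuts + [len(signal)]:
        # Force a cut at the maximum length when no silence arrives in time.
        while (cut - start) / sr > max_len_s:
            end = start + int(max_len_s * sr)
            segments.append(signal[start:end])
            start = end
        # Accept a silence cut only if the segment is long enough; otherwise
        # keep accumulating until the next silence.
        if (cut - start) / sr >= min_len_s:
            segments.append(signal[start:cut])
            start = cut
    return segments
```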
In some examples of this embodiment, extracting the signal-to-noise ratio of each audio sentence and taking the audio sentences with an SNR greater than the target SNR as candidate audio includes: determining the mute section and the voice section in each audio sentence, and extracting the signal energy of the mute section and the signal energy of the voice section; computing the SNR of the audio sentence from the signal energy of the voice section and the signal energy of the mute section; and comparing the SNR of the audio sentence with the target SNR, taking the audio sentence as candidate audio when its SNR is greater than the target SNR. That is, the SNRs of all audio sentences are obtained, each is compared with the target SNR, and whenever an audio sentence's SNR exceeds the target SNR, that audio sentence is taken as candidate audio.
It should be understood that an audio sentence can be divided into mute sections and voice sections. These can be determined from the short-time energy (STE) and the zero-crossing count (ZCC) of the audio sentence: when the energy or the zero-crossing rate at some moment exceeds a preset threshold, a voice section begins and a mute section ends; when the energy or the zero-crossing rate drops below the preset threshold, a mute section begins and a voice section ends. Methods for determining the energy and zero-crossing rate of an audio sentence include, but are not limited to, the following. The audio sentence is framed at 20 ms per frame (since a short-time Fourier transform is usually performed later, time- and frequency-domain resolution must be balanced, and 20 ms is a good trade-off). With an input sampling rate of 8000 Hz, each frame is 160 samples long. STE is computed as the sum of the squared signal values within the frame. ZCC is computed by shifting all samples in a frame by 1, multiplying corresponding points, and treating a negative product as a zero crossing; counting the negative products in the frame gives its zero-crossing count.
After the mute and voice sections of an audio sentence have been determined, the signal energy of the mute section and the signal energy of the voice section are extracted, and the SNR of the audio sentence is derived from them: the signal energy of the mute section represents the noise signal, the signal energy of the voice section represents the speech signal, and the difference between the voice-section energy and the mute-section energy is taken as the SNR of the audio sentence. It should be understood that this embodiment does not limit how the signal energy is acquired; for example, the audio sentence, with its mute and voice sections already distinguished, may be fed into a signal-energy acquisition model that outputs the energies of the two sections. The signal-energy acquisition model may be an open-source pretrained model, which this embodiment does not limit.
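Below is a compact sketch of the STE/ZCC framing and the SNR computation described above, using the 8000 Hz sampling rate, 20 ms frames, and 160-sample frame length from the text; separating voice from silence with a single STE threshold is an assumption (the text also allows a zero-crossing-rate criterion).

```python
import numpy as np

SR = 8000     # sampling rate from the example above
FRAME = 160   # 20 ms at 8000 Hz

def frame_features(x):
    """Per-frame short-time energy (STE) and zero-crossing count (ZCC)."""
    n = len(x) // FRAME
    frames = x[:n * FRAME].reshape(n, FRAME)
    ste = np.sum(frames ** 2, axis=1)
    # Shift by one sample and multiply corresponding points: a negative
    # product marks a zero crossing.
    zcc = np.sum(frames[:, 1:] * frames[:, :-1] < 0, axis=1)
    return ste, zcc

def sentence_snr_db(x, ste_threshold):
    """SNR of one audio sentence: mean voice-frame energy over mean
    mute-frame energy, expressed in dB."""
    ste, _ = frame_features(x)
    voiced = ste > ste_threshold
    if voiced.all() or not voiced.any():
        return float("nan")  # no usable voice/mute split in this sentence
    return 10.0 * np.log10(ste[voiced].mean() / ste[~voiced].mean())
```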
In some examples of this embodiment, determining the reference audio sentence from the candidate audio comprises: determining a candidate speaker, where the candidate speaker is a speaker corresponding to some audio sentence in the dry sound material; and acquiring N candidate audios corresponding to the candidate speaker and taking them as the reference audio sentences. It should be understood that training a model requires different audio from the same person. Therefore, when selecting model training material from the dry sound material, a candidate speaker is first determined from the speakers in the dry sound material, and then N candidate audios corresponding to that speaker are selected from all candidate audio as reference audio sentences, where N is set flexibly by the relevant personnel. Specifically, after the candidate speaker is determined, the speaker's timbre is obtained and compared with the timbre of each candidate audio, and the candidate audios whose timbre matches the candidate speaker's are selected as reference audio. Preferably, about 10 candidate audios corresponding to the candidate speaker are selected from all candidate audio as the reference audio sentences. For example, suppose there are 10000 candidate audios; each candidate audio corresponds to one speaker, and several candidate audios may correspond to the same speaker. After the candidate speaker is determined, 10 candidate audios corresponding to that speaker can be selected from the 10000 as the reference audio sentences.
Continuing the above example, it should be understood that multiple candidate speakers may be determined simultaneously from the speakers of the dry sound material, with the audio sentences corresponding to each candidate speaker selected as that speaker's reference audio sentences. In some examples, once a candidate audio has been taken as a reference audio sentence, it is removed from the candidate audio to avoid redundant comparisons later; for example, given candidate audios A, B, C, D, ..., N, if C is selected as a reference audio sentence, C is removed and only A, B, D, ..., N remain as candidate audio, as shown in the sketch below.
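This is a minimal sketch of the selection and removal just described, assuming each candidate carries a speaker attribution (a "speaker" field here); the text describes that attribution via timbre comparison, so the field is an assumption.

```python
from collections import defaultdict

def pick_reference_sentences(candidates, speaker_id, n=10):
    """Take the first N (about 10 in the text) candidate audios of one
    speaker as reference audio sentences and remove them from the candidate
    pool so a reference is never compared against itself."""
    by_speaker = defaultdict(list)
    for c in candidates:
        by_speaker[c["speaker"]].append(c)
    references = by_speaker[speaker_id][:n]
    remaining = [c for c in candidates if c not in references]
    return references, remaining
```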
In some examples of this embodiment, determining the target audio sentence in the candidate audio according to the reference audio sentence comprises: extracting the voiceprint feature of the reference audio sentence and the voiceprint feature of each candidate audio; computing the voiceprint similarity between the voiceprint feature of the reference audio sentence and the voiceprint feature of each candidate audio; and taking a candidate audio as the target audio sentence when its voiceprint similarity is greater than a voiceprint feature similarity threshold. It should be understood that the sound in an audio sentence is an analog signal; its time-domain waveform only describes how sound pressure changes over time and does not represent the characteristics of the voice well, so the waveform must be converted into an acoustic feature vector to capture the voiceprint characteristics accurately. Methods for extracting the voiceprint feature of each audio sentence include, but are not limited to: Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), the multimedia content description interface (MPEG-7), and the like. MFCC is preferred as the voiceprint extraction algorithm because it is cepstrum-based and better matches human auditory perception. Before MFCC extraction, the sound must be preprocessed, including analog-to-digital conversion, pre-emphasis, and windowing.
After the voiceprint feature of the reference audio sentence and the voiceprint feature of each candidate audio have been extracted by any of the above methods, the voiceprint similarity between them is computed for each candidate audio. Since these voiceprint features are acoustic feature vectors, a cosine similarity algorithm can be used: when the cosine value computed between the two features exceeds the voiceprint feature similarity threshold, the two voices are considered to belong to the same speaker, and the candidate audio is taken as a target audio sentence. It should be understood that the voiceprint feature similarity threshold is obtained from the audio of a large number of speakers by computing the similarity values within each speaker's own data.
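A hedged sketch of the MFCC extraction and cosine comparison described above, using the librosa library; averaging MFCC frames into one vector per sentence is a simplifying assumption (production systems typically use a trained speaker-embedding model), and the threshold value is left to the caller.

```python
import numpy as np
import librosa

def voiceprint(path, n_mfcc=20):
    """Mean MFCC vector as a simple voiceprint sketch."""
    y, sr = librosa.load(path, sr=None, mono=True)
    y = librosa.effects.preemphasis(y)   # pre-emphasis, as described above
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # windowing is internal
    return mfcc.mean(axis=1)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_same_speaker(reference_paths, candidate_path, threshold):
    """Average the cosine similarity against the reference sentences; the
    candidate passes when the mean exceeds the voiceprint threshold."""
    cand = voiceprint(candidate_path)
    sims = [cosine(voiceprint(p), cand) for p in reference_paths]
    return float(np.mean(sims)) > threshold
```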
In some examples of this embodiment, before a candidate audio is taken as the target audio sentence, the method further comprises: acquiring the video file corresponding to the dry sound material, and obtaining from it the face image corresponding to the reference audio sentence and the face image corresponding to the candidate audio; and calculating the face recognition similarity between the face image corresponding to the reference audio sentence and the face image corresponding to the candidate audio. Taking the candidate audio as the target audio sentence when the voiceprint similarity is greater than the voiceprint feature similarity threshold then becomes: taking the candidate audio as the target audio sentence when the voiceprint similarity is greater than the voiceprint feature similarity threshold and the face recognition similarity is greater than a face similarity threshold. In other words, face recognition can assist in deciding whether multiple audios are voiced by the same person. Specifically, the video file corresponding to the dry sound material is acquired first; its duration matches that of the dry sound material. The start time and end time of the reference audio sentence within the dry sound material are determined, and the video images at those times are obtained from the video file; the face in those images is taken as the face image corresponding to the reference audio sentence, i.e., the face image of its speaker, and is then converted into a face feature vector. Similarly, the video image corresponding to the candidate audio is obtained, the face image of the candidate audio's speaker is extracted and converted into a face feature vector, and the similarity between the reference sentence's face vector and the candidate audio's face vector is computed to obtain the face recognition similarity. In some examples, if several face images are found in the video images at the start and end times of the reference audio sentence, they are displayed through an interactive interface, and a confirmation instruction from the user determines which face image corresponds to the reference audio sentence; the same approach may be used when obtaining the video image for the candidate audio, which is not repeated here.
Continuing the above example, after the face recognition similarity is obtained, the candidate audio is taken as the target audio sentence only when the voiceprint similarity is greater than the voiceprint feature similarity threshold and the face recognition similarity is greater than the face similarity threshold. In some examples, the face recognition similarity is computed only once the voiceprint similarity already exceeds the voiceprint feature similarity threshold, as sketched below.
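A small sketch of grabbing a video frame at a sentence's timestamp with OpenCV and of the gating order just described; face_similarity_fn stands in for a face-embedding comparison (e.g. cosine similarity of face vectors) and is a hypothetical callable, not an API named in the patent.

```python
import cv2

def frame_at(video_path, t_seconds):
    """Grab the video frame at a timestamp, e.g. at the start or end time
    of an audio sentence, so a face image can be cropped from it."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, t_seconds * 1000.0)
    ok, frame = cap.read()
    cap.release()
    return frame if ok else None

def accept_candidate(voiceprint_sim, voiceprint_thresh,
                     face_similarity_fn, face_thresh):
    """Gating order from the text: the face check runs only after the
    voiceprint check has already passed."""
    if voiceprint_sim <= voiceprint_thresh:
        return False
    return face_similarity_fn() > face_thresh
```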
In some examples of this embodiment, after the target audio sentences have been determined in the candidate audio according to the reference audio sentences, the method further comprises: training the model with the target audio sentences corresponding to different speakers. Specifically, the model training material selection method provided by this embodiment selects target audio meeting the training requirement from the dry sound material; to train the model better, target audio sentences corresponding to several different speakers are obtained, and the model is trained with them.
The method for selecting model training material provided by this embodiment comprises: segmenting a dry sound material to obtain a plurality of audio sentences, wherein the dry sound material comprises a plurality of audio recordings for model training; extracting the signal-to-noise ratio of each audio sentence, and taking the audio sentences whose SNR is greater than the target SNR as candidate audio; and determining a reference audio sentence from the candidate audio and determining the target audio sentences in the candidate audio according to it, wherein a target audio sentence is a training material of the model, its similarity to the reference audio sentence is greater than the similarity threshold, and the reference audio sentence and the target audio sentence are not the same audio.
Based on the same concept, the present embodiment provides a model training material selection system. As shown in fig. 2, the model training material selection system includes, but is not limited to, the following modules:
sentence dividing module
It will be appreciated that the dry sound material of a film or TV production is often one entire track; for example, the dry sound of one episode of a TV series is a single soundtrack. When training TTS and VC models, the training data must be single-sentence data, and a single sentence generally lasts about 1-10 s. A sentence segmentation module is therefore needed to split a track into single sentences, yielding a plurality of audio sentences. The sentence segmentation module currently adopted is the open-source pyannote.audio, and the splitting range can be set according to the user's requirements; a hedged usage sketch follows.
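A hedged sketch of driving pyannote.audio for voice-activity-based splitting; the pretrained pipeline name, the input file name, and any access-token requirement vary by pyannote.audio version and are assumptions here, not details from the patent.

```python
from pyannote.audio import Pipeline

# Assumed pipeline name; recent pyannote.audio releases may also require an
# authentication token passed to from_pretrained().
pipeline = Pipeline.from_pretrained("pyannote/voice-activity-detection")
speech_regions = pipeline("episode_dry_track.wav")  # hypothetical input file

for segment in speech_regions.get_timeline().support():
    print(f"speech from {segment.start:.2f}s to {segment.end:.2f}s")
```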
Signal-to-noise ratio calculation module
It can be understood that, using the timestamp information recorded during sentence segmentation, this module computes the signal energy difference between the speech segment and the silence segment of each audio sentence, i.e., the signal-to-noise ratio. In TTS and VC model training, the SNR of audio files is generally required to exceed 30 dB, so audio meeting the SNR requirement is screened out as candidate audio according to this threshold.
Voiceprint similarity calculation module
First, according to the user's requirements, about 10 reference audio sentences of a candidate speaker are selected from the candidate audio obtained from the dry sound material, serving as the voiceprint reference for that speaker. Then a voiceprint extraction tool extracts the voiceprint features of the reference audio sentences and of each candidate audio, and the average cosine similarity between them is computed; when the average cosine similarity exceeds a certain threshold, the two voices are considered to belong to the same speaker. The similarity threshold is obtained from the audio of a large number of speakers by computing the similarity values within each speaker's own data.
Based on the same concept, the present embodiment further provides a model training material selecting apparatus, as shown in fig. 3, which includes but is not limited to:
the segmentation module 1 is used for segmenting dry sound materials to obtain a plurality of audio sentences, wherein the dry sound materials comprise a plurality of audios for model training;
the extraction module 2 is used for extracting the signal-to-noise ratio of each audio sentence and taking the audio sentences with a signal-to-noise ratio greater than the target signal-to-noise ratio as candidate audio;
a determining module 3, configured to determine a reference audio sentence from the candidate audio and determine a target audio sentence in the candidate audio according to the reference audio sentence, wherein the target audio sentence is a training material for model training, the similarity between the target audio sentence and the reference audio sentence is greater than a similarity threshold, and the reference audio sentence and the target audio sentence are not the same audio.
It should be understood that the modules of the model training material selection apparatus provided in this embodiment can, in combination, implement the steps of the model training material selection method and achieve the same technical effects, so the details are not repeated here.
As shown in fig. 4, an electronic device according to an embodiment of the present application includes a processor 111, a communication interface 112, a memory 113, and a communication bus 114, where the processor 111, the communication interface 112, and the memory 113 complete mutual communication via the communication bus 114,
a memory 113 for storing a computer program;
in an embodiment of the present application, the processor 111, when configured to execute the program stored in the memory 113, is configured to implement the method for selecting model training materials provided in any one of the foregoing method embodiments, including:
the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the model training material selection method provided in any one of the foregoing method embodiments.
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for selecting model training materials, characterized by comprising the following steps:
segmenting a dry sound material to obtain a plurality of audio sentences, wherein the dry sound material comprises a plurality of audios for model training;
extracting the signal-to-noise ratio of each audio sentence, and taking the audio sentences with a signal-to-noise ratio greater than a target signal-to-noise ratio as candidate audio;
determining a reference audio sentence from the candidate audio, and determining a target audio sentence in the candidate audio according to the reference audio sentence, wherein the target audio sentence is a training material of the model, the similarity between the target audio sentence and the reference audio sentence is greater than a similarity threshold, and the reference audio sentence and the target audio sentence are not the same audio.
2. The method of claim 1, wherein segmenting the dry sound material to obtain a plurality of audio sentences comprises:
acquiring a sound track of the dry sound material and a preset audio segmentation range, wherein the preset audio segmentation range is used for limiting the longest segmentation length and the shortest segmentation length;
and segmenting the sound track according to the mute segment of the dry sound material by taking the preset audio segmentation range as a limit so as to segment the dry sound material into a plurality of audio sentences.
3. The method of claim 1, wherein extracting a signal-to-noise ratio of each audio sentence, and using the audio sentence with a signal-to-noise ratio greater than a target signal-to-noise ratio as a candidate audio comprises:
determining a mute section and a voice section in each audio sentence, and extracting the signal energy of the mute section and the signal energy of the voice section;
solving the signal-to-noise ratio of the audio sentence according to the signal energy of the voice section and the signal energy of the mute section;
and comparing the signal-to-noise ratio of the audio sentence with the target signal-to-noise ratio, and, when the signal-to-noise ratio of the audio sentence is greater than the target signal-to-noise ratio, taking the audio sentence as the candidate audio.
4. The method of claim 1, wherein determining a reference audio sentence from the candidate audio comprises:
determining a candidate speaker, wherein the candidate speaker is a speaker corresponding to any audio sentence in the dry sound material;
and acquiring N candidate audios corresponding to the candidate speakers, and taking the acquired candidate audios corresponding to the candidate speakers as the reference audio sentences.
5. The method of claim 1, wherein determining the target audio sentence in the candidate audio from the reference audio sentence comprises:
extracting the voiceprint feature of the reference audio sentence and extracting the voiceprint feature of each candidate audio;
respectively obtaining the voiceprint similarity between the voiceprint feature of the reference audio sentence and the voiceprint feature of each candidate audio;
and when the voiceprint similarity is greater than a voiceprint feature similarity threshold, taking the candidate audio as the target audio sentence.
6. The method of claim 5, wherein, before taking the candidate audio as the target audio sentence, the method further comprises:
acquiring a video file corresponding to the dry sound material, acquiring a face image corresponding to the reference audio sentence from the video file, and acquiring a face image corresponding to the candidate audio from the video file;
calculating the face recognition similarity between the face image corresponding to the reference audio sentence and the face image corresponding to the candidate audio;
and wherein taking the candidate audio as the target audio sentence when the voiceprint similarity is greater than a voiceprint feature similarity threshold comprises:
taking the candidate audio as the target audio sentence when the voiceprint similarity is greater than the voiceprint feature similarity threshold and the face recognition similarity is greater than a face similarity threshold.
7. The method of claim 1, wherein after determining the target audio sentence in the candidate audio from the reference audio sentence, the method further comprises:
and training the model through the target audio sentences corresponding to different speakers.
8. A model training material selection device, characterized by comprising:
the segmentation module is used for segmenting dry sound materials to obtain a plurality of audio sentences, and the dry sound materials comprise a plurality of audios for model training;
an extraction module, configured to extract the signal-to-noise ratio of each audio sentence and take the audio sentences with a signal-to-noise ratio greater than a target signal-to-noise ratio as candidate audio;
and a determining module, configured to determine a reference audio sentence from the candidate audio and determine a target audio sentence in the candidate audio according to the reference audio sentence, wherein the target audio sentence is a training material for model training, the similarity between the target audio sentence and the reference audio sentence is greater than a similarity threshold, and the reference audio sentence and the target audio sentence are not the same audio.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the method of selecting model training material of any one of claims 1 to 7 when executing a program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for model training material selection according to any one of claims 1 to 7.
CN202111665084.7A 2021-12-31 2021-12-31 Model training material selection method and device, electronic equipment and storage medium Pending CN114333839A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111665084.7A CN114333839A (en) 2021-12-31 2021-12-31 Model training material selection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114333839A true CN114333839A (en) 2022-04-12

Family

ID=81021056

Country Status (1)

Country Link
CN (1) CN114333839A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination