CN111833842A - Synthetic sound template discovery method, device and equipment


Info

Publication number
CN111833842A
Authority
CN
China
Prior art keywords
audio
voice
synthetic
speech
library
Prior art date
Legal status
Granted
Application number
CN202010621981.7A
Other languages
Chinese (zh)
Other versions
CN111833842B (en)
Inventor
钟奥
王建社
冯祥
Current Assignee
Iflytek Information Technology Co Ltd
Original Assignee
Iflytek Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Iflytek Information Technology Co Ltd
Priority to CN202010621981.7A
Publication of CN111833842A
Application granted
Publication of CN111833842B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 15/00: Speech recognition
    • G10L 15/005: Language recognition
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/04: Training, enrolment or model building


Abstract

The invention discloses a synthetic sound template discovery method, device, and equipment. The invention exploits the repetitive character of synthetic sound templates: pronunciation similarity is first compared across a large quantity of voice material, suspected synthetic sound templates are preliminarily selected from it, and the selected voice material is cut; the cut voice segments are then classified according to the pronunciation characteristics of synthetic sound templates, and finally the required synthetic sound template is found according to the number of voice segments contained in the same class. The method supplements subsequent synthetic sound detection with reliable synthetic sound template samples, saves subsequent work such as manual labeling and identification, and, by analyzing suspected speech segments independently, can resolve the confusion between natural speech and synthesized speech in the corpus, so the accuracy of subsequent synthetic sound detection can be effectively improved while cost is kept under control.

Description

Synthetic sound template discovery method, device and equipment
Technical Field
The invention relates to the field of synthetic sound identification, and in particular to a synthetic sound template discovery method, device, and equipment.
Background
Along with the convenience that network communication has brought to our lives, technologies for intelligently detecting and intercepting synthesized sound have emerged as required.
Existing synthetic sound detection is usually based on a neural network, with a synthetic sound recognition model trained in advance. The training process requires manually distinguishing audio samples and labeling the synthesized ones, and all synthesized sound data are converted into synthetic sound features through audio feature extraction. Then, combining the synthesized speech and natural human speech respectively, taking a cross-entropy function as the objective function, model training is performed with the Adam algorithm, so that the synthetic speech recognition model can output a classification result of natural human speech versus synthesized speech.
It can be seen that the recognition effect of existing synthetic speech recognition technology depends on the quantity and quality of the training data. As described above, training samples based on manual labeling not only consume a large amount of labor cost, but the labeling accuracy is also limited by manual experience and processing capability. In addition, most synthetic sound templates used as training data in the prior art are recorded for specific scenes and specific requirements, whereas in a real test environment the input data may differ from the training data in channels, synthesis algorithms, synthesis styles, and other aspects, so the training data does not match the real test data, and the synthetic sound recognition effect is greatly affected.
Disclosure of Invention
In view of the foregoing, the present invention aims to provide a synthetic sound template discovery method, apparatus, and device, together with a corresponding computer-readable storage medium and computer program product, which automatically discover and acquire synthetic sound templates from large-scale audio material in an unsupervised manner, so as to greatly improve the back-end synthetic sound recognition effect while effectively controlling labor cost.
The technical scheme adopted by the invention is as follows:
in a first aspect, the present invention provides a synthetic speech template discovery method, including:
a voice material library is constructed in advance;
extracting mean value super vectors of all audio to be processed in the voice material library;
on the basis of the mean value super vector, mutually comparing the similarity of the audio to be processed and screening out a plurality of approximate audios;
cutting the approximate audio into a plurality of voice segments, and classifying the voice segments based on acoustic information of synthetic voice and natural voice and a clustering strategy;
and acquiring a synthetic voice template according to the number of the voice fragments under each category.
In at least one possible implementation manner, the classifying each of the speech segments based on acoustic information of synthetic speech and natural speech and a clustering strategy includes:
presetting a plurality of audio categories based on acoustic information of synthetic voice and natural voice;
and determining the audio category of each voice segment according to the probability score of each voice segment relative to each audio category.
In at least one possible implementation manner, the determining the audio category of each voice segment according to the probability score of each voice segment with respect to each audio category includes:
based on the similarity of the mean value supervectors of all the voice segments and all the audio categories, solving the prior probability of all the voice segments relative to all the audio categories;
and solving and iteratively updating the posterior probability of each voice segment according to the prior probability, the mean value super vector of each voice segment and a pre-constructed clustering model, and finally determining the audio class to which each voice segment belongs.
In at least one possible implementation manner, the obtaining a synthesized sound template according to the number of the speech segments under each category includes:
and selecting at least one voice fragment from the categories of which the number of the voice fragments is greater than or equal to a preset target number threshold value as the synthetic voice template.
In at least one possible implementation manner, the comparing the similarity of the to-be-processed audio and screening out a plurality of approximate audios based on the mean value supervector includes:
constructing a confusion audio library for the audio to be processed which meets the similarity standard based on a preset library comparison strategy;
and using the audio to be processed within the confusion audio library as the approximate audio.
In at least one possible implementation manner, the constructing a confusing audio library for the to-be-processed audio meeting the similarity criterion based on a preset sub-library comparison policy includes:
splitting the voice material library into two sub-libraries according to the audio time length;
comparing the audio to be processed in the two sub-libraries one by one based on the mean value super vector;
constructing a confusion audio library for the audio to be processed meeting a first similarity threshold;
if the total number of the audios in the confusion audio library exceeds a preset upper limit of the number, the confusion audio library is split and then compared with each other again, and screening is performed based on a second similarity threshold, and so on until the total number of the audios in the confusion audio library is less than or equal to the upper limit of the number.
In at least one possible implementation manner, the extracting the mean supervector of all the audio to be processed in the speech material library includes:
extracting acoustic features of the audio to be processed based on cepstrum coefficients of a cochlear filter;
and estimating the mean value super vector of the audio to be processed by utilizing the acoustic features and a pre-trained general background model.
In at least one possible implementation, the general background model is a Gaussian mixture model that characterizes a neutral speaker and is trained based on the acoustic features and a specific joint algorithm.
In a second aspect, the present invention provides a synthesized sound template discovery apparatus, including:
the material collection module is used for constructing a voice material library in advance;
the mean value super vector extraction module is used for extracting mean value super vectors of all audio frequencies to be processed in the voice material library;
the similar audio screening module is used for mutually comparing the similarity of the audio to be processed and screening out a plurality of similar audios based on the mean value super vector;
the segmentation and clustering module is used for segmenting the approximate audio into a plurality of voice segments and classifying the voice segments based on acoustic information of synthesized voice and natural voice and a clustering strategy;
and the synthetic voice template finding module is used for obtaining the synthetic voice template according to the number of the voice fragments under each category.
In at least one possible implementation manner, the segmentation and clustering module includes:
an audio category setting unit for presetting a plurality of audio categories based on acoustic information of the synthetic speech and the natural speech;
and the segment classification unit is used for determining the audio category of each voice segment according to the probability score of each voice segment relative to each audio category.
In at least one possible implementation manner, the segment classifying unit includes:
the first clustering component is used for solving the prior probability of each voice fragment relative to each audio category based on the similarity of the mean value super-vectors of each voice fragment and each audio category;
and the second clustering component is used for solving and iteratively updating the posterior probability of each voice segment according to the prior probability, the mean value super vector of each voice segment and a pre-constructed clustering model, and finally determining the audio class to which each voice segment belongs.
In at least one possible implementation manner, the synthesized sound template discovery module is specifically configured to select at least one of the speech segments as the synthesized sound template from a category in which the number of the speech segments is greater than or equal to a preset target number threshold.
In at least one possible implementation manner, the similar audio screening module includes:
the database dividing comparison unit is used for constructing a confusion audio database for the audio to be processed which accords with the similarity standard based on a preset database dividing comparison strategy;
an approximate audio determining unit for using the audio to be processed within the confusion audio library as the approximate audio.
In at least one possible implementation manner, the database dividing and comparing unit includes:
the database dividing component is used for dividing the voice material database into two sub-databases according to the audio time length;
the similarity comparison component is used for comparing the audio to be processed in the two sub-libraries one by one based on the mean value super vector;
the confusion audio library construction component is used for constructing a confusion audio library from the audio to be processed that satisfies a first similarity threshold;
and the circulating component is used for splitting the confusion audio library and then comparing it with itself again if the total number of audios in the confusion audio library exceeds a preset upper limit on the number, screening based on a second similarity threshold, and so on until the total number of audios in the confusion audio library is less than or equal to the upper limit.
In at least one possible implementation manner, the mean value super vector extraction module includes:
the acoustic feature extraction unit is used for extracting acoustic features of the audio to be processed based on the cepstral coefficients of a cochlear filter;
and the mean value super vector estimation unit is used for estimating the mean value super vector of the audio to be processed by utilizing the acoustic features and a pre-trained general background model.
In a third aspect, the present invention provides a synthesized sound template discovery apparatus including:
one or more processors, memory which may employ a non-volatile storage medium, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions which, when executed by the apparatus, cause the apparatus to perform the method as in the first aspect or any possible implementation form of the first aspect.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which, when run on a computer, causes the computer to perform the method as described in the first aspect or any possible implementation manner of the first aspect.
In a fifth aspect, the present invention also provides a computer program product for performing the method of the first aspect or any of its possible implementation forms, when the computer program product is executed by a computer.
In a possible design of the fifth aspect, the relevant program related to the product may be stored in whole or in part on a memory packaged with the processor, or may be stored in part or in whole on a storage medium not packaged with the processor.
The invention is characterized in that it exploits the repetitive character of synthetic sound templates: pronunciation similarity is first compared across a large quantity of voice material, suspected synthetic sound templates are preliminarily selected from it, and the selected voice material is cut; the cut voice segments are then classified according to the pronunciation characteristics of synthetic sound templates, and finally the required synthetic sound template is found according to the number of voice segments contained in the same class. The method supplements subsequent synthetic sound detection with reliable synthetic sound template samples, saves subsequent work such as manual labeling and identification, and, by analyzing suspected speech segments independently, can resolve the confusion between natural speech and synthesized speech in the corpus, so the accuracy of subsequent synthetic sound detection can be effectively improved while cost is kept under control.
Drawings
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of an embodiment of a synthetic speech template discovery method provided by the present invention;
FIG. 2 is a flowchart illustrating a preferred embodiment of the present invention;
FIG. 3 is a block diagram of an embodiment of a synthesized sound template discovery apparatus provided by the present invention;
fig. 4 is a schematic diagram of an embodiment of a synthesized sound template discovery apparatus provided by the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.
The present invention provides at least one embodiment of a synthetic speech template discovery method, as shown in fig. 1, which may include the steps of:
and step S1, pre-constructing a voice material library.
The invention collects voice material in the form of a library, with the aim of finding the finally required synthetic sound template from existing large-scale voice data spanning multiple channels and multiple scenes. A synthetic sound template here refers to synthesized voice material with a formed format and standard, which has certain characteristics both at the application scale and at the acoustic level. There are many options for the collection source in this step, such as, but not limited to, an internet public-opinion audio library or a specific audio library in which synthesized sound templates need to be detected.
And step S2, extracting the mean value super vector of all the audio to be processed in the voice material library.
Speaker-discrimination technology usually involves similar means such as mean supervector extraction. It should be noted, however, that for the synthetic speech template discovery with which the present invention is concerned, and building on techniques related to speaker recognition, some preferred embodiments provide further specific, targeted implementations for the needs of the present invention. For example, in some embodiments, the aforementioned extraction of the mean supervectors of all audio to be processed in the speech material library may specifically include extracting acoustic features of the audio to be processed based on the cepstral coefficients of a cochlear filter, and then estimating the mean supervector of the audio to be processed using the acoustic features and a pre-trained general background model.
It may be noted that the aforementioned acoustic features could conventionally include characteristic parameters from the speaker's acoustic angle, such as MFCC, PLP, and BN. In the aforementioned preferred embodiment, however, the present invention proposes acoustic features different from the conventional ones, based on the cepstral coefficients of a cochlear filter. These are obtained by an algorithm simulating the response of the basilar membrane in the human ear to the input signal, for example by performing an auditory transformation on the input speech signal, followed by windowing, a nonlinearity, a discrete cosine transform, and so on. In practice, the auditory transformation may process the original input signal with a Gammatone filter to simulate the whole process of sound transmission from the outer ear to the basilar membrane. The principle is as follows: when a speech signal entering the human ear causes the basilar membrane to move up and down, a shear stress is generated between the basilar membrane and the tectorial membrane; this shear stress displaces the hair cells on the uppermost layer, which then generate nerve signals. The hair cells fire only when the basilar membrane moves in one direction; when it moves in the other direction, the hair cells are not excited and no nerve signal is produced. Different hair-cell excitation functions can be tried to better simulate this response. The acoustic feature itself is not the focus of the invention; the improvement lies in applying an acoustic feature with strong robustness to environmental noise to the specific technical requirement of synthetic audio template discovery, so that the subsequently extracted mean supervector reflects the characteristics of each audio material more truly and reliably.
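As a concrete illustration of the feature pipeline just described, the following is a minimal sketch of cochlear-filter cepstral feature extraction (Gammatone filterbank, half-wave rectification imitating the one-directional firing of the hair cells, a loudness nonlinearity, and a discrete cosine transform). The filterbank spacing, frame sizes, and cube-root nonlinearity are illustrative assumptions, not values taken from this description.

```python
import numpy as np
from scipy.fft import dct
from scipy.signal import fftconvolve

def gammatone_ir(fc, sr, dur=0.04, order=4):
    """Impulse response of a 4th-order gammatone filter centered at fc (Hz)."""
    t = np.arange(0, dur, 1.0 / sr)
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)  # equivalent rectangular bandwidth
    b = 1.019 * erb
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def cochlear_cepstral_features(x, sr, n_bands=32, n_ceps=13,
                               frame_len=0.025, frame_shift=0.010):
    # Center frequencies on a log scale (assumed range 100 Hz to 0.9 * Nyquist).
    fcs = np.geomspace(100.0, 0.9 * sr / 2.0, n_bands)
    # Each cochlear channel is filtered then half-wave rectified: hair cells
    # fire only when the basilar membrane moves in one direction.
    channels = np.stack(
        [np.maximum(fftconvolve(x, gammatone_ir(fc, sr))[:len(x)], 0.0)
         for fc in fcs])
    flen, fshift = int(frame_len * sr), int(frame_shift * sr)
    n_frames = max(1, (len(x) - flen) // fshift + 1)
    feats = np.empty((n_frames, n_ceps))
    for i in range(n_frames):
        frame = channels[:, i * fshift: i * fshift + flen]
        loud = np.cbrt(frame.mean(axis=1) + 1e-12)   # cube-root loudness nonlinearity
        feats[i] = dct(loud, norm='ortho')[:n_ceps]  # cepstral coefficients
    return feats
```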
A general background model is then trained on the extracted acoustic features. It should be noted that the synthetic sound template discovery scheme provided by the present invention does not customize a synthetic sound detection model for the collected speech material; instead, an unsupervised training mode is adopted directly to design an acoustic feature detection mechanism for a large number of speech samples. That is, the general background model is designed to cover speech data across various channels and application scenarios, and the subsequent similarity comparison between materials can be performed on that basis. Thus, in some embodiments of the present invention, the core of the aforementioned general background model may use a Gaussian mixture model (GMM) to model the phoneme distributions of different speech sources, so that by comparing the GMM phoneme distributions of different speech samples (objectively, a synthesized speech template also has unique and extremely rich phoneme information), natural speech data and synthesized speech data can be distinguished without labeling which is which.
The process of establishing the general background model is briefly described as follows. The input is the cochlear-filter cepstral feature vectors mentioned above, and an initial model is computed with the LBG algorithm. Because the LBG algorithm uses hard decisions (each point is wholly judged to belong, or not to belong, to a class), only a rough model can be obtained from it, so after the LBG training is completed the model can be further optimized with an iterative EM algorithm. A global difference space T is then estimated by combining the Gaussian mixture components and feature parameters in the general background model, after which the above-mentioned mean supervector (e.g., an i-vector) can be further estimated. The estimation itself is a conventional technique that can be implemented by reference to the prior art, and is not described in detail here.
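The following is a minimal sketch of the general background model and mean supervector step under stated assumptions: scikit-learn's k-means initialization of a Gaussian mixture stands in for the LBG initialization, GaussianMixture.fit runs the EM iterations, and the mean supervector is obtained here by relevance-MAP adaptation of the UBM means (a common alternative to the i-vector estimation mentioned above, not the only option).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_features, n_components=256):
    """pooled_features: (N, D) frames pooled from many utterances."""
    ubm = GaussianMixture(n_components=n_components, covariance_type='diag',
                          init_params='kmeans',  # stand-in for LBG hard decisions
                          max_iter=100)          # EM refinement
    ubm.fit(pooled_features)
    return ubm

def mean_supervector(ubm, feats, relevance=16.0):
    """MAP-adapt the UBM means toward one utterance and concatenate them."""
    post = ubm.predict_proba(feats)          # (T, C) soft responsibilities
    n_c = post.sum(axis=0)                   # zeroth-order statistics
    f_c = post.T @ feats                     # (C, D) first-order statistics
    alpha = (n_c / (n_c + relevance))[:, None]
    adapted = alpha * (f_c / np.maximum(n_c[:, None], 1e-8)) \
              + (1.0 - alpha) * ubm.means_
    return adapted.ravel()                   # (C*D,) mean supervector
```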
And step S3, comparing the similarity of the audio to be processed with one another and screening out a number of approximate audios based on the mean supervectors.
As mentioned above, the aim of the present invention is not to detect whether speech material is synthesized, but to find synthesized speech templates by itself. This step therefore compares the closeness of the mean supervectors between all speech materials, which in some embodiments can be understood as a degree of repetition: the analysis of the present invention assumes that a synthesized speech template present in a large amount of speech material will have repetitive characteristics, i.e., a sender may send one synthesized speech template to all of its users. The present invention is therefore premised on collecting a large amount of speech material over a wide range, in which a highly repeated synthesized speech template exists with high probability. Of course, in real scenes this "duplication" is not a literal verbatim copy, which is why the invention proposes to compare similarities between materials.
There are many options for the comparison, such as internal comparison within the original voice material library. Considering the breadth and scale of voice material collection, however, the comparison is preferably carried out based on a preset library-splitting comparison strategy: the original speech material library is split according to a predetermined criterion, such as but not limited to the duration of each speech material, forming smaller and/or contaminant-removed sub-libraries, and the comparison is then carried out between the sub-libraries. Further, the aforementioned audio to be processed that meets the similarity standard is built into a confusion audio library (i.e., a library that may include multiple voice materials that are relatively close, similar, and repeated), and each audio to be processed stored in the confusion audio library can serve as the approximate audio, namely the processing object of the subsequent links of the invention.
For example, the original speech material library may be divided into two parts, library A and library B, according to audio duration and material quantity, and the mean supervector of each material to be processed in library A is compared one by one with that of each material to be processed in library B. The comparison method can refer to, but is not limited to, the following: for the two mean supervectors $w_1$ and $w_2$ being compared, calculate the cosine distance between them,

$\mathrm{score}(w_1, w_2) = \dfrac{w_1^{\top} w_2}{\lVert w_1 \rVert \, \lVert w_2 \rVert}$

and the calculated similarity score can be designed to lie in the interval [0, 1]; the more similar the acoustic elements (such as phonemes) and content information of the two audios, the higher the score. High-scoring similar audio to be processed can then be selected according to a preset threshold and placed into the confusion audio library.
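A minimal sketch of this pairwise screening follows. Squaring the cosine similarity is an assumption made here so that the score lies in [0, 1] regardless of sign; it is consistent with the squared-cosine prior used in the clustering step later in this description, and the 0.8 threshold is illustrative only.

```python
import numpy as np

def similarity(w1: np.ndarray, w2: np.ndarray) -> float:
    cos = float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2) + 1e-12))
    return cos ** 2  # squared cosine keeps the score in [0, 1]

def build_confusion_library(lib_a, lib_b, threshold=0.8):
    """lib_a, lib_b: lists of (audio_id, mean_supervector). Returns the set of
    ids whose best cross-library match reaches the similarity threshold."""
    confusing = set()
    for id_a, wa in lib_a:
        for id_b, wb in lib_b:
            if similarity(wa, wb) >= threshold:
                confusing.update((id_a, id_b))
    return confusing
```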
Considering that the size and source of the constructed speech material library are not absolutely limited when the present invention is implemented, and that the amount of audio to be processed in the library may therefore be relatively large, the foregoing library comparison strategy may be further refined into the manner shown in fig. 2, which includes:
step S31, splitting the voice material library into two sub-libraries according to the audio time length;
step S32, comparing the audio to be processed in the two sub-libraries based on the mean value super vector;
step S33, constructing a confusion audio library for the audio to be processed meeting a first similarity threshold;
and step S34, if the total number of audios in the confusion audio library exceeds a preset upper limit on the number, splitting the confusion audio library, comparing again, and screening based on a second similarity threshold, and so on until the total number of audios in the confusion audio library is less than or equal to the upper limit.
That is, based on a preset library-size standard, the large-scale original voice material is filtered and screened step by step through multiple rounds of library-splitting operations combined with adjustment of the similarity-score threshold, so that the more similar approximate audios are selected in a manageable quantity. Of course, those skilled in the art will understand that whichever sub-library comparison method is adopted, the splitting operation itself can occur at any of the foregoing steps; that is, splitting can be executed immediately after the original speech material library is constructed and need not be synchronized with the comparison operation, so the present invention does not limit the timing of the splitting operation.
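The loop of fig. 2 might then be sketched as follows, reusing build_confusion_library from the earlier sketch; the tightening threshold schedule and the size cap are illustrative assumptions.

```python
def screen_confusion_library(materials, thresholds=(0.7, 0.8, 0.9),
                             max_size=10_000):
    """materials: list of (audio_id, duration_s, mean_supervector)."""
    pool = list(materials)
    for th in thresholds:              # first, second, ... similarity thresholds
        pool.sort(key=lambda m: m[1])  # split the library by audio duration
        half = len(pool) // 2
        lib_a = [(i, w) for i, _, w in pool[:half]]
        lib_b = [(i, w) for i, _, w in pool[half:]]
        kept = build_confusion_library(lib_a, lib_b, threshold=th)
        pool = [m for m in pool if m[0] in kept]
        if len(pool) <= max_size:      # stop once the upper limit is respected
            break
    return pool
```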
Step S4, the approximate audio is cut into a plurality of voice segments, and each voice segment is classified based on the acoustic information of the synthesized voice and the natural voice and the clustering strategy.
It should be noted that step S4 essentially comprises two operations: one is to segment the audio to be processed to obtain shorter speech fragments; the other is to label each speech segment with a category.
The segmentation design takes into account that the voice materials are collected broadly, so the materials may contain long-sentence samples as well as samples in which natural speech and synthesized speech are mixed. Cutting each approximate audio yields relatively complete, short single-sentence samples and separates natural speech from synthesized speech. The specific audio segmentation technique is not limited by the present invention; mature related algorithms, such as but not limited to VAD (voice activity detection), may be consulted.
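Any mature VAD can fill this role; as a self-contained stand-in, the following sketch segments on frame energy. The frame length, energy ratio, and minimum segment duration are assumptions for illustration.

```python
import numpy as np

def segment_speech(x, sr, frame_ms=30, energy_ratio=0.1, min_seg_s=0.3):
    """Return (start, end) sample indices of speech-like segments of x."""
    flen = int(sr * frame_ms / 1000)
    n = max(1, len(x) // flen)
    energy = np.array([np.mean(x[i * flen:(i + 1) * flen] ** 2) for i in range(n)])
    active = energy > energy_ratio * (energy.max() + 1e-12)
    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i * flen                        # segment opens
        elif not is_active and start is not None:
            if i * flen - start >= min_seg_s * sr:  # keep only long-enough cuts
                segments.append((start, i * flen))
            start = None
    if start is not None:
        segments.append((start, n * flen))
    return segments
```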
The purpose of the category labeling design is to assign each of the aforementioned voice segments a category from the acoustic perspective, that is, to determine the acoustic category of each voice segment. Thus, in some embodiments of the present invention, a plurality of audio categories, each representing a set of similar acoustic features, may be preset based on acoustic information of synthesized speech and natural speech; the audio category of each speech segment can then be determined by considering the probability score of that segment with respect to each audio category.
In the specific operation of this process, a short-time speech clustering algorithm can be adopted to classify the screened, highly confusable audio library. Conventional short-time speech clustering has good descriptive capacity with a single Gaussian; however, the invention finds that as hierarchical clustering proceeds, the data duration grows and a single Gaussian alone no longer accurately describes the distribution of the pronunciation characteristics of different speech. On this basis, the invention considers a combined realization using a multi-clustering idea and algorithm. For example, in some preferred embodiments, the prior probability of each speech fragment relative to each audio category can first be obtained based on the similarity of the mean supervectors of each speech fragment and each audio category; then, according to the prior probabilities, the mean supervector of each speech segment, and a pre-constructed clustering model, the posterior probability of each speech segment is solved and iteratively updated, finally determining the audio class to which each speech segment belongs.
In particular, short-term clustering and long-term clustering can be fused, making full use of the respective reliability and advantages of each in the classification and labeling operations, such as but not limited to the following:
(1) Initialize a number of preset audio categories and extract the mean supervector corresponding to each category as its class center; then extract a mean supervector for each relatively short speech segment (in practice this can be obtained directly from the earlier mean supervector extraction step, which simplifies the operation).
(2) Compute the square of the cosine distance between the mean supervector of each speech segment and the mean supervector of each class center, and take it as the prior probability, so that each short-time speech segment obtains a probability score relative to each audio category.
(3) Design a clustering threshold and combine it with a mature, reliable long-term clustering algorithm, such as but not limited to a PLDA (probabilistic linear discriminant analysis) model. Solve the objective function value of the clustering model using the mean supervectors and the corresponding prior probabilities, update the posterior probability score of each speech fragment belonging to each category according to the result of this operation, merge similar speech fragments, re-extract the mean supervectors, and iterate over multiple cycles until the objective function value no longer increases.
The joint clustering method, clustering algorithm, clustering threshold, and so on mentioned above are only schematic; in actual clustering operation the clustering threshold can be adjusted according to the data scale. In short, the core idea of the above is to perform a preliminary front-end clustering first and then apply back-end enhancement to the clustering effect until a sufficiently accurate and reliable class label is obtained for each voice segment.
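In that spirit, the following is one possible sketch of steps (1) to (3): squared-cosine priors against the class centers, followed by iterative posterior refinement in which a shared-covariance Gaussian log-likelihood stands in for the PLDA scoring mentioned above. The iteration count and convergence test are assumptions.

```python
import numpy as np

def cos2(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)) ** 2

def cluster_segments(seg_vecs, centers, n_iters=10):
    """seg_vecs: (N, D) segment mean supervectors; centers: (K, D) class centers
    of the preset audio categories. Returns a hard label per segment."""
    centers = centers.copy()
    labels = np.zeros(len(seg_vecs), dtype=int)
    for _ in range(n_iters):
        # Steps (1)-(2): prior = squared cosine to each class center.
        prior = np.array([[cos2(v, c) for c in centers] for v in seg_vecs])
        prior /= prior.sum(axis=1, keepdims=True)
        # Step (3): weight the prior by a Gaussian likelihood (PLDA stand-in)
        # and renormalize to get a posterior score per category.
        dist2 = ((seg_vecs[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        dist2 -= dist2.min(axis=1, keepdims=True)   # stabilize the exponentials
        post = prior * np.exp(-0.5 * dist2)
        post /= post.sum(axis=1, keepdims=True)
        labels = post.argmax(axis=1)
        # Merge similar segments by re-estimating each class center.
        new_centers = np.array([seg_vecs[labels == k].mean(axis=0)
                                if np.any(labels == k) else centers[k]
                                for k in range(len(centers))])
        if np.allclose(new_centers, centers):
            break                                   # objective stops improving
        centers = new_centers
    return labels
```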
And step S5, acquiring a synthetic voice template according to the number of the voice fragments under each category.
In specific implementation, each voice segment labeled with a category may be clustered again based on the result of step S4; that is, the result of step S4 is sorted so as to count the number of voice segments under each category. The implementation may be, but is not limited to, hierarchical clustering with an AP (affinity propagation) clustering technique, and this clustering can likewise hook into the aforementioned mean supervectors, preset categories, and clustering thresholds, following ideas similar to those described above. Further, referring to the foregoing analysis, a synthesized sound template has strongly repetitive characteristics in the original speech material; that is, the number of speech segments of the same category is positively correlated with the confidence that the category corresponds to a synthesized sound template. Therefore, after the above clustering, the method can be realized by setting a target number threshold, with the effect that different synthesized sound templates can be distinguished, interference from non-synthesized-sound data can be filtered out, and the synthesized sound template required by the method can be conveniently found. For example, a category in which the number of speech segments is greater than or equal to the preset target number threshold may be determined as a category in which all speech segments are synthesized sound templates, and at least one of the speech segments may be selected as the desired synthesized sound template ("template" implies a repetitive, close nature, so preferably one segment of the category may be randomly selected as its representative, i.e., a synthesized sound template found from the large volume of original material via the present invention).
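A minimal sketch of this final pick-out is below; counting segments per category and keeping one random representative per sufficiently populated category follows the suggestion above, and the target count of 20 is an illustrative assumption.

```python
import random
from collections import Counter

def find_templates(segment_ids, labels, target_count=20, seed=0):
    """segment_ids: identifiers parallel to labels (one category id per segment).
    Returns one representative segment per category meeting the threshold."""
    rng = random.Random(seed)
    counts = Counter(labels)
    templates = {}
    for cat, n in counts.items():
        if n >= target_count:  # heavy repetition marks a synthesized template
            members = [s for s, l in zip(segment_ids, labels) if l == cat]
            templates[cat] = rng.choice(members)
    return templates
```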
In summary, the concept of the present invention is to use the repetitive character of synthesized voice templates: pronunciation similarity is first compared among a large quantity of voice material, suspected synthesized voice templates are initially selected, the selected voice material is cut, the cut voice segments are classified according to the pronunciation characteristics of synthesized voice templates, and finally the required synthesized voice template is found according to the number of voice segments contained in the same class. The invention supplements subsequent synthetic sound detection with reliable synthetic sound template samples, saves subsequent work such as manual labeling and identification, and, by analyzing suspected speech segments independently, can resolve the confusion between natural speech and synthesized speech in the corpus, thereby effectively improving the accuracy of subsequent synthetic sound detection while controlling cost.
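To show how the pieces fit together, a minimal end-to-end sketch of steps S1 to S5 follows, reusing the helper sketches above; the initialization of class centers from the first few segment vectors, and all parameter defaults, are assumptions for illustration only.

```python
import numpy as np

def discover_synthetic_templates(library, sr, ubm, n_categories=8):
    """library: list of (audio_id, waveform). Returns found template segments."""
    # Step S2: one mean supervector per audio to be processed.
    mats = [(aid, len(x) / sr,
             mean_supervector(ubm, cochlear_cepstral_features(x, sr)))
            for aid, x in library]
    # Step S3: screen approximate audio into the confusion library.
    approx_ids = {aid for aid, _, _ in screen_confusion_library(mats)}
    # Step S4: cut the approximate audio and classify the segments.
    seg_ids, seg_vecs = [], []
    for aid, x in library:
        if aid not in approx_ids:
            continue
        for k, (s, e) in enumerate(segment_speech(x, sr)):
            seg_ids.append((aid, k))
            seg_vecs.append(mean_supervector(
                ubm, cochlear_cepstral_features(x[s:e], sr)))
    seg_vecs = np.asarray(seg_vecs)
    centers = seg_vecs[:n_categories].copy()  # assumed category initialization
    labels = cluster_segments(seg_vecs, centers)
    # Step S5: pick templates from sufficiently populated categories.
    return find_templates(seg_ids, labels)
```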
Corresponding to the above embodiments and preferred schemes, the present invention further provides an embodiment of a synthesized sound template discovery apparatus, as shown in fig. 3, which may specifically include the following components:
the material collection module 1 is used for constructing a voice material library in advance;
the mean value super vector extraction module 2 is used for extracting mean value super vectors of all audio frequencies to be processed in the voice material library;
the similar audio screening module 3 is used for comparing the similarity of the audio to be processed with each other and screening out a plurality of similar audios based on the mean value super vector;
the segmentation and clustering module 4 is used for segmenting the approximate audio into a plurality of voice segments and classifying the voice segments based on acoustic information of synthetic voice and natural voice and a clustering strategy;
and the synthetic voice template finding module 5 is used for obtaining the synthetic voice template according to the number of the voice fragments under each category.
In at least one possible implementation manner, the segmentation and clustering module includes:
an audio category setting unit for presetting a plurality of audio categories based on acoustic information of the synthetic speech and the natural speech;
and the segment classification unit is used for determining the audio category of each voice segment according to the probability score of each voice segment relative to each audio category.
In at least one possible implementation manner, the segment classifying unit includes:
the first clustering component is used for solving the prior probability of each voice fragment relative to each audio category based on the similarity of the mean value super-vectors of each voice fragment and each audio category;
and the second clustering component is used for solving and iteratively updating the posterior probability of each voice segment according to the prior probability, the mean value super vector of each voice segment and a pre-constructed clustering model, and finally determining the audio class to which each voice segment belongs.
In at least one possible implementation manner, the synthesized sound template discovery module is specifically configured to select at least one of the speech segments as the synthesized sound template from a category in which the number of the speech segments is greater than or equal to a preset target number threshold.
In at least one possible implementation manner, the similar audio screening module includes:
the database dividing comparison unit is used for constructing a confusion audio database for the audio to be processed which accords with the similarity standard based on a preset database dividing comparison strategy;
an approximate audio determining unit for using the audio to be processed within the confusion audio library as the approximate audio.
In at least one possible implementation manner, the database dividing and comparing unit includes:
the database dividing component is used for dividing the voice material database into two sub-databases according to the audio time length;
the similarity comparison component is used for comparing the audio to be processed in the two sub-libraries one by one based on the mean value super vector;
the confusion audio library construction component is used for constructing a confusion audio library from the audio to be processed that satisfies a first similarity threshold;
and the circulating component is used for splitting the confusion audio library and then comparing it with itself again if the total number of audios in the confusion audio library exceeds a preset upper limit on the number, screening based on a second similarity threshold, and so on until the total number of audios in the confusion audio library is less than or equal to the upper limit.
In at least one possible implementation manner, the mean value super vector extraction module includes:
the acoustic feature extraction unit is used for extracting acoustic features of the audio to be processed based on the cepstral coefficients of a cochlear filter;
and the mean value super vector estimation unit is used for estimating the mean value super vector of the audio to be processed by utilizing the acoustic features and a pre-trained general background model.
It should be understood that the division of components in the synthetic speech template discovery apparatus shown in fig. 3 is merely a logical division; in actual implementation they may be wholly or partially integrated into one physical entity or kept physically separate. All of these components may be implemented in software invoked by a processing element, or entirely in hardware, or some in software invoked by a processing element and some in hardware. For example, a certain module may be a separately established processing element, or may be integrated into a chip of the electronic device. The other components are implemented similarly. In addition, all or part of these components can be integrated together or implemented independently. In implementation, each step of the above method, or each component above, may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the above components may be one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs). For another example, these components may be integrated together and implemented in the form of a system-on-a-chip (SOC).
In view of the foregoing examples and their preferred embodiments, those skilled in the art will appreciate that, in practice, the invention may be based on a variety of embodiments, which are illustrated schematically by the following carriers:
(1) a synthesized sound template discovery apparatus may include:
one or more processors, memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions, which when executed by the apparatus, cause the apparatus to perform the steps/functions of the foregoing embodiments or equivalent implementations.
Fig. 4 is a schematic structural diagram of an embodiment of a synthesized sound template discovery device according to the present invention, wherein the device may be an electronic device or a circuit device built in the electronic device. The electronic device can be a PC, a server, a smart terminal (a mobile phone, a tablet, a watch, glasses and the like) and the like. The embodiment does not limit the specific form of the synthesized sound template discovery device.
As shown in fig. 4 in particular, the synthetic tone template discovery apparatus 900 includes a processor 910 and a memory 930. Wherein, the processor 910 and the memory 930 can communicate with each other and transmit control and/or data signals through the internal connection path, the memory 930 is used for storing computer programs, and the processor 910 is used for calling and running the computer programs from the memory 930. The processor 910 and the memory 930 may be combined into a single processing device, or more generally, separate components, and the processor 910 is configured to execute the program code stored in the memory 930 to implement the functions described above. In particular implementations, the memory 930 may be integrated with the processor 910 or may be separate from the processor 910.
In addition, in order to make the function of the synthesized sound template discovery apparatus 900 more complete, the apparatus 900 may further include one or more of an input unit 960, a display unit 970, an audio circuit 980 which may further include a speaker 982, a microphone 984, and the like, a camera 990, a sensor 901, and the like. The display unit 970 may include a display screen, among others.
Further, the above-described synthetic tone template discovery apparatus 900 may further include a power supply 950 for supplying power to various devices or circuits in the apparatus 900.
It is to be understood that the synthetic tone template discovery apparatus 900 shown in fig. 4 can implement the respective processes of the method provided by the foregoing embodiments. The operations and/or functions of the various components of the apparatus 900 may each be configured to implement the corresponding flow in the above-described method embodiments. Reference is made in detail to the foregoing description of embodiments of the method, apparatus, etc., and a detailed description is omitted here as appropriate to avoid redundancy.
It should be understood that the processor 910 of the synthesized sound template discovery apparatus 900 shown in fig. 4 may be a system-on-a-chip (SOC), and the processor 910 may include a central processing unit (CPU) and may further include other types of processors, such as a graphics processing unit (GPU), as described further below.
In summary, various portions of the processors or processing units within the processor 910 may cooperate to implement the foregoing method flows, and corresponding software programs for the various portions of the processors or processing units may be stored in the memory 930.
(2) A readable storage medium, on which a computer program or the above-mentioned apparatus is stored, which, when executed, causes the computer to perform the steps/functions of the above-mentioned embodiments or equivalent implementations.
In the several embodiments provided by the present invention, any function, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the part of the technical solution of the present invention that contributes to the art may be embodied in the form of a software product, as described below.
(3) A computer program product (which may include the above-described apparatus) which, when run on a terminal device, causes the terminal device to perform the synthetic tone template discovery method of the preceding embodiment or equivalent embodiments.
From the above description of the embodiments, it is clear to those skilled in the art that all or part of the steps of the above method can be implemented by software plus a necessary general-purpose hardware platform. With this understanding, the above computer program product may include, but is not limited to, an APP; the aforementioned device/terminal may be a computer device, for example a mobile phone, a PC terminal, a cloud platform, a server, a server cluster, or a network communication device such as a media gateway. Moreover, the hardware structure of the computer device may further specifically include at least one processor, at least one communication interface, at least one memory, and at least one communication bus; the processor, the communication interface, and the memory can all communicate with one another through the communication bus. The processor may be a central processing unit (CPU), a DSP, a microcontroller, or a digital signal processor, and may further include a GPU, an embedded neural-network processing unit (NPU), and an image signal processor (ISP); it may further include an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The processor can run one or more software programs, which may be stored in a storage medium such as the memory. The aforementioned memory/storage medium may comprise non-volatile memories such as non-removable magnetic disks, USB flash drives, removable hard disks, and optical disks, as well as read-only memories (ROM) and random access memories (RAM).
In the embodiments of the present invention, "at least one" means one or more, "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, and means that there may be three relationships, for example, a and/or B, and may mean that a exists alone, a and B exist simultaneously, and B exists alone. Wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" and similar expressions refer to any combination of these items, including any combination of singular or plural items. For example, at least one of a, b, and c may represent: a, b, c, a and b, a and c, b and c or a and b and c, wherein a, b and c can be single or multiple.
Those of skill in the art will appreciate that the various modules, elements and method steps described in the embodiments disclosed in the present specification can be implemented as electronic hardware, a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In addition, the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments may be referred to each other. In particular, for embodiments of devices, apparatuses, etc., since they are substantially similar to the method embodiments, reference may be made to the description of the method embodiments for relevant points. The above-described embodiments of apparatuses, devices, etc. are merely illustrative, and modules, units, etc. described as separate components may or may not be physically separate, i.e., may be located in one place, or may be distributed in multiple places, e.g., nodes of a system network. Some or all of the modules and units can be selected according to actual needs to achieve the purpose of the solution of the above embodiment. Can be understood and implemented by those skilled in the art without inventive effort.
The structure, features, and effects of the present invention have been described in detail with reference to the embodiments shown in the drawings. The above embodiments, however, are merely preferred embodiments of the present invention, and technical features of the above embodiments and preferred modes can be reasonably combined and configured into various equivalent schemes by those skilled in the art without departing from or changing the design idea and technical effects of the present invention. Therefore, the invention is not limited to the embodiments shown in the drawings, and all modifications made according to the concept of the invention, and all equivalent embodiments, should fall within the scope of the invention so long as they do not depart from the spirit covered by the description and drawings.

Claims (12)

1. A synthetic speech template discovery method, comprising:
a voice material library is constructed in advance;
extracting mean value super vectors of all audio to be processed in the voice material library;
comparing the similarity of the audio to be processed with each other and screening out a plurality of approximate audios based on the mean value super vector;
cutting the approximate audio into a plurality of voice segments, and classifying the voice segments based on acoustic information of synthetic voice and natural voice and a clustering strategy;
and acquiring a synthetic voice template according to the number of the voice fragments under each category.
2. The synthetic speech template discovery method according to claim 1, wherein said classifying each of said speech segments based on acoustic information of synthetic speech and natural speech and a clustering strategy comprises:
presetting a plurality of audio categories based on acoustic information of synthetic voice and natural voice;
and determining the audio category of each voice segment according to the probability score of each voice segment relative to each audio category.
3. The synthetic speech template discovery method of claim 2 wherein said determining an audio category for each of said speech segments based on a probability score for each of said speech segments with respect to each audio category comprises:
based on the similarity of the mean value super-vectors of the voice segments and the audio categories, solving the prior probability of the voice segments relative to the audio categories;
and solving and iteratively updating the posterior probability of each voice segment according to the prior probability, the mean value super vector of each voice segment and a pre-constructed clustering model, and finally determining the audio class to which each voice segment belongs.
4. The synthetic speech template discovery method according to claim 1, wherein said obtaining a synthetic speech template based on the number of speech segments in each category includes:
and selecting at least one voice fragment from the categories of which the number of the voice fragments is greater than or equal to a preset target number threshold value as the synthetic voice template.
5. The synthetic audio template discovery method according to claim 1, wherein the comparing the similarity of the audio to be processed with each other and screening out a plurality of approximate audios based on the mean supervector comprises:
constructing a confusion audio library for the audio to be processed which meets the similarity standard based on a preset library comparison strategy;
using the audio to be processed within the confusion audio library as the approximate audio.
6. The synthetic sound template discovery method according to claim 5, wherein constructing the confusion audio library based on the preset split-library comparison strategy comprises:
splitting the voice material library into two sub-libraries according to audio duration;
comparing the audios to be processed across the two sub-libraries one by one based on the mean supervectors;
constructing the confusion audio library from the audios to be processed that meet a first similarity threshold;
and if the total number of audios in the confusion audio library exceeds a preset upper limit, splitting the confusion audio library, comparing its audios with one another again, and screening based on a second similarity threshold, and so on, until the total number of audios in the confusion audio library is less than or equal to the upper limit.
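A minimal sketch of the split-compare-screen loop of claims 5-6, assuming cosine similarity over mean supervectors and a sequence of increasingly strict thresholds; the function signature, the duration-based median split, and the max_size default are illustrative assumptions, not values from the patent.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_confusion_library(vecs, durations, thresholds, max_size=1000):
    # vecs:       {audio_id: mean supervector} for the material library
    # durations:  {audio_id: duration in seconds}
    # thresholds: e.g. [0.85, 0.90, 0.95], one per split-and-compare pass
    pool = set(vecs)
    for thresh in thresholds:
        # Split the current pool into two sub-libraries at the median
        # duration, then compare members across the two halves only.
        ids = sorted(pool, key=durations.get)
        half_a, half_b = ids[:len(ids) // 2], ids[len(ids) // 2:]
        pool = {i for a in half_a for b in half_b
                if cosine(vecs[a], vecs[b]) >= thresh
                for i in (a, b)}
        if len(pool) <= max_size:
            break                # within the upper limit; stop tightening
    return pool                  # the confusion audio library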
7. The synthetic sound template discovery method according to any one of claims 1 to 6, wherein extracting the mean supervectors of all audios to be processed in the voice material library comprises:
extracting acoustic features of each audio to be processed based on cochlear filter cepstral coefficients;
and estimating the mean supervector of the audio to be processed by using the acoustic features and a pre-trained universal background model.
8. The synthetic sound template discovery method according to claim 7, wherein the universal background model is a Gaussian mixture model characterizing neutral speakers, trained on the basis of the acoustic features and a specific joint algorithm.
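Claims 7-8 describe a GMM-UBM front end: frame-level acoustic features (assumed here to be cochlear filter cepstral coefficient matrices of shape (n_frames, n_dims)), a Gaussian-mixture universal background model, and one mean supervector per audio. The sketch below uses scikit-learn's EM-trained GaussianMixture with MAP mean adaptation as one plausible realization; the claims' "specific joint algorithm" is not spelled out, so the training choice is an assumption.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_frames, n_components=64):
    # Universal background model: one GMM over pooled frame-level features
    # from many speakers; sklearn's EM training stands in for the patent's
    # unspecified joint algorithm.
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ubm.fit(background_frames)             # shape (n_frames, n_dims)
    return ubm

def mean_supervector(ubm, frames, relevance=16.0):
    # MAP-adapt the UBM component means toward this audio's frames, then
    # concatenate the adapted means into a single mean supervector of
    # dimension n_components * n_dims.
    post = ubm.predict_proba(frames)       # responsibilities, (n_frames, K)
    n_k = post.sum(axis=0)                 # soft frame count per component
    f_k = post.T @ frames                  # first-order statistics, (K, D)
    alpha = (n_k / (n_k + relevance))[:, None]
    adapted = alpha * (f_k / np.maximum(n_k, 1e-8)[:, None]) \
              + (1.0 - alpha) * ubm.means_
    return adapted.ravel()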
9. A synthetic sound template discovery apparatus, comprising:
a material collection module, configured to construct a voice material library in advance;
a mean supervector extraction module, configured to extract a mean supervector for each audio to be processed in the voice material library;
a similar audio screening module, configured to compare the audios to be processed with one another for similarity based on the mean supervectors and screen out a plurality of approximate audios;
a segmentation and clustering module, configured to cut the approximate audios into a plurality of voice segments and classify the voice segments based on acoustic information of synthetic speech and natural speech and on a clustering strategy;
and a synthetic sound template finding module, configured to obtain a synthetic sound template according to the number of voice segments under each category.
10. The synthetic sound template discovery apparatus according to claim 9, wherein the segmentation and clustering module comprises:
an audio category setting unit, configured to preset a plurality of audio categories based on the acoustic information of synthetic speech and natural speech;
and a segment classifying unit, configured to determine the audio category of each voice segment according to a probability score of the voice segment relative to each audio category.
11. The synthetic sound template discovery apparatus according to claim 10, wherein the segment classifying unit comprises:
a first clustering component, configured to compute a prior probability of each voice segment relative to each audio category based on the similarity between the mean supervector of the voice segment and that of the audio category;
and a second clustering component, configured to compute and iteratively update a posterior probability for each voice segment according to the prior probability, the mean supervector of the voice segment, and a pre-constructed clustering model, thereby finally determining the audio category to which each voice segment belongs.
12. A synthetic sound template discovery device, comprising:
one or more processors, a memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory and comprise instructions which, when executed by the device, cause the device to perform the synthetic sound template discovery method of any one of claims 1 to 8.
CN202010621981.7A 2020-06-30 2020-06-30 Synthetic sound template discovery method, device and equipment Active CN111833842B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010621981.7A CN111833842B (en) Synthetic sound template discovery method, device and equipment


Publications (2)

Publication Number   Publication Date
CN111833842A         2020-10-27
CN111833842B         2023-11-03

Family

ID=72901012

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010621981.7A Active CN111833842B (en) Synthetic sound template discovery method, device and equipment

Country Status (1)

Country Link
CN (1) CN111833842B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0953968A2 (en) * 1998-04-30 1999-11-03 Matsushita Electric Industrial Co., Ltd. Speaker and environment adaptation based on eigenvoices including maximum likelihood method
US6185533B1 (en) * 1999-03-15 2001-02-06 Matsushita Electric Industrial Co., Ltd. Generation and synthesis of prosody templates
JP2003271171A (en) * 2002-03-14 2003-09-25 Matsushita Electric Ind Co Ltd Method, device and program for voice synthesis
WO2005034084A1 (en) * 2003-09-29 2005-04-14 Motorola, Inc. Improvements to an utterance waveform corpus
JP2007249050A (en) * 2006-03-17 2007-09-27 Nippon Telegr & Teleph Corp <Ntt> Language model generating device, language model generating method, program thereof, and recording medium thereof
US20130080172A1 (en) * 2011-09-22 2013-03-28 General Motors Llc Objective evaluation of synthesized speech attributes
CN106952643A * 2017-02-24 2017-07-14 South China University of Technology Recording device clustering method based on Gaussian mean supervectors and spectral clustering
JP2018180459A * 2017-04-21 2018-11-15 Hitachi ULSI Systems Co., Ltd. Speech synthesis system, speech synthesis method, and speech synthesis program
CN108922544A * 2018-06-11 2018-11-30 Ping An Technology (Shenzhen) Co., Ltd. General vector training method, voice clustering method, device, equipment and medium
CN109493882A * 2018-11-04 2019-03-19 National Computer Network and Information Security Administration Center Automatic labeling system and method for fraudulent call speech

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HEIGA ZEN et al.: "Statistical parametric speech synthesis", Speech Communication *
JIN-HUI YANG et al.: "Multitier non-uniform unit selection for corpus-based speech synthesis", Proc. Blizzard Challenge Workshop (2006) *
GAO LI: "Research on fundamental frequency modeling and generation methods in statistical parametric speech synthesis", China Masters' Theses Full-text Database *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113257226A (en) * 2021-03-28 2021-08-13 昆明理工大学 Improved characteristic parameter language identification method based on GFCC
CN113221990A (en) * 2021-04-30 2021-08-06 平安科技(深圳)有限公司 Information input method and device and related equipment
CN113221990B (en) * 2021-04-30 2024-02-23 平安科技(深圳)有限公司 Information input method and device and related equipment

Also Published As

Publication number Publication date
CN111833842B (en) 2023-11-03

Similar Documents

Publication Publication Date Title
JP6902010B2 (en) Audio evaluation methods, devices, equipment and readable storage media
JP6774551B2 (en) Speech recognition processing method and equipment
JP6938784B2 (en) Object identification method and its computer equipment and computer equipment readable storage medium
CN103503060B (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
JP4220449B2 (en) Indexing device, indexing method, and indexing program
CN105788592A (en) Audio classification method and apparatus thereof
CN111145737A (en) Voice test method and device and electronic equipment
CN108899033B (en) Method and device for determining speaker characteristics
CN112133277B (en) Sample generation method and device
CN110797032B (en) Voiceprint database establishing method and voiceprint identification method
CN108922543A (en) Model library method for building up, audio recognition method, device, equipment and medium
CN111640456A (en) Overlapped sound detection method, device and equipment
CN111833842A (en) Synthetic sound template discovery method, device and equipment
CN112259101A (en) Voice keyword recognition method and device, computer equipment and storage medium
CN111081223A (en) Voice recognition method, device, equipment and storage medium
CN114973086A (en) Video processing method and device, electronic equipment and storage medium
CN114613387A (en) Voice separation method and device, electronic equipment and storage medium
CN112052686B (en) Voice learning resource pushing method for user interactive education
Abidin et al. Local binary pattern with random forest for acoustic scene classification
CN113297412B (en) Music recommendation method, device, electronic equipment and storage medium
KR20130068624A (en) Apparatus and method for recognizing speech based on speaker group
CN110739006B (en) Audio processing method and device, storage medium and electronic equipment
CN111785236A (en) Automatic composition method based on motivational extraction model and neural network
CN114822557A (en) Method, device, equipment and storage medium for distinguishing different sounds in classroom
Gao et al. Duration refinement by jointly optimizing state and longer unit likelihood.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant