CN112614479B - Training data processing method and device and electronic equipment - Google Patents

Training data processing method and device and electronic equipment

Info

Publication number
CN112614479B
Authority
CN
China
Prior art keywords
training data
training
preset
audio
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011349252.7A
Other languages
Chinese (zh)
Other versions
CN112614479A (en)
Inventor
刘龙飞
陈昌滨
高占杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011349252.7A
Publication of CN112614479A
Application granted
Publication of CN112614479B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application discloses a training data processing method and device and electronic equipment, relating to artificial intelligence fields such as voice technology, big data and deep learning. The scheme is as follows: acquire training data; check the training data against a preset standard; and add the training data that passes the check to the training data set of a zero-shot speech synthesis model. Because the training data are mined and screened according to preset standards without manual intervention, fully automatic, industrial-scale processing of zero-shot speech synthesis training data can be achieved, improving the efficiency and reliability of the training data processing process.

Description

Training data processing method and device and electronic equipment
Technical Field
Embodiments of the present application relate generally to the field of data processing technology, and more particularly, to the field of artificial intelligence technology such as voice technology, big data, deep learning, and the like.
Background
With the rise of the mobile internet, user demand for personalized speech customization services is growing stronger, and personalized TTS (Text to Speech) technology has emerged in response. In particular, personalized TTS based on Zero-shot learning, which can synthesize the voice of a speaker that does not appear in the training data without fine-tuning the model, is receiving more and more attention.
A Zero-shot TTS system with outstanding performance depends on a well-trained base model built on an ultra-large-scale database; thanks to the abundance of training data, such a model can cover various types and styles of timbre and has a degree of stability and generalization capability.
However, existing training data processing methods often suffer from long processing times, low efficiency and low accuracy when building the training data set of a Zero-shot TTS model. How to improve the efficiency and reliability of training data processing has therefore become an important research direction.
Disclosure of Invention
The application provides a training data processing method and device and electronic equipment.
According to a first aspect, there is provided a method for processing training data, comprising:
acquiring training data;
checking the training data according to a preset standard;
and adding the training data that passes the check to a training data set of a zero-shot speech synthesis model.
According to a second aspect, there is provided a processing apparatus of training data, comprising:
the acquisition module is used for acquiring training data;
the inspection module is used for inspecting the training data according to a preset standard;
and the adding module is used for adding the training data that passes the check to the training data set of the zero-shot speech synthesis model.
According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of processing training data according to the first aspect of the present application.
According to a fourth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of processing training data of the first aspect of the present application.
According to a fifth aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method of processing training data according to the first aspect of the application.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present application;
FIG. 2 is a schematic diagram according to a second embodiment of the present application;
fig. 3 is a block diagram of a training data processing apparatus for implementing a method of processing training data according to an embodiment of the present application;
fig. 4 is a block diagram of a training data processing apparatus for implementing a method of processing training data according to an embodiment of the present application;
FIG. 5 is a block diagram of an electronic device used to implement the processing of training data of an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The following briefly describes the technical field to which the solution of the present application relates:
data processing (data processing) is the collection, storage, retrieval, processing, transformation and transmission of data. The basic purpose of data processing is to extract and derive valuable, meaningful data for certain people from large, possibly chaotic, unintelligible amounts of data. Data processing is the basic link of system engineering and automatic control. Data processing is throughout various fields of social production and social life. The development of data processing technology and the breadth and depth of its application have greatly influenced the progress of human society development.
AI (Artificial Intelligence) is a subject for studying a computer to simulate some thinking processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.) of a human being, and has a technology at a hardware level and a technology at a software level. Artificial intelligence hardware techniques generally include computer vision techniques, speech recognition techniques, natural language processing techniques, and learning/deep learning thereof, big data processing techniques, knowledge-graph techniques, and the like.
DL (Deep Learning), a new research direction in the field of ML Machine Learning (Machine Learning), is introduced into Machine Learning to make it closer to the original target, artificial intelligence. Deep learning is the intrinsic law and representation hierarchy of learning sample data, and information obtained in the learning process is very helpful for interpretation of data such as characters, images and sounds. The final aim of the method is to enable the machine to have the analysis and learning capability like a human, and to recognize data such as characters, images and sounds. Deep learning is a complex machine learning algorithm, and achieves the effect in speech and image recognition far exceeding the prior related art.
The Speech technology refers to key technologies in the field of computers, such as ASR (Automatic Speech Recognition) and TTS (Text to Speech). The speech synthesis technology is needed for computer speaking, the core of the speech synthesis technology is TTS technology, the speech synthesis is even applied to an information system of an automobile, and an automobile owner can convert text files, e-mails, network news or novels downloaded into a system computer into speech to listen in the automobile.
Big Data (Big Data) is a Data set which cannot be captured, managed and processed by a conventional software tool within a certain time range, and is a massive, high-growth-rate and diversified information asset which can have stronger decision-making power, insight discovery power and flow optimization capability only by a new processing mode.
The following describes a training data processing method, a training data processing device, and an electronic device according to an embodiment of the present application with reference to the drawings.
Fig. 1 is a schematic diagram according to a first embodiment of the present application. It should be noted that the execution body of the training data processing method of this embodiment is a training data processing device, which may be a hardware device, such as a terminal device or a server, or software running in a hardware device. As shown in fig. 1, the method for processing training data according to this embodiment includes the following steps:
and S101, acquiring training data.
The training data may be data newly added since the training data was last acquired.
It should be noted that the specific manner of acquiring the training data is not limited in the present application and may be selected according to the actual situation.
Optionally, whether a preset fixed directory of the zero-shot speech synthesis model is empty may be determined in real time or according to a preset period. If the preset fixed directory is empty, no training data has been added since the last acquisition; if it is not empty, new training data exists and can be acquired at this time.
The preset fixed directory is a directory for storing newly added training data and may be set according to the actual situation; for example, it may be set to an external data source directory.
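The periodic directory check described above can be sketched as follows. This is a minimal illustration, not part of the patent: the helper name, the polling period and the assumption that each training-data item is a file in a flat directory are all illustrative.

```python
import os
import time


def poll_new_training_data(directory, period_seconds=3600, max_polls=None):
    """Yield a batch of newly added training-data file paths each time the
    watched preset fixed directory is found non-empty (one check per period)."""
    polls = 0
    while max_polls is None or polls < max_polls:
        entries = sorted(os.listdir(directory)) if os.path.isdir(directory) else []
        if entries:
            # Non-empty directory: training data has been added since the last check.
            yield [os.path.join(directory, name) for name in entries]
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(period_seconds)
```

In practice the consumer of each yielded batch would move or delete the files so the next poll sees only genuinely new data.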
And S102, checking the training data according to a preset standard.
The preset standard refers to a standard for checking whether the training data is qualified, and can be set according to actual conditions.
And S103, adding the training data that passes the check to the training data set of the zero-shot speech synthesis model.
In the embodiment of the application, the qualified training data can be added to the training data set of the zero-shot speech synthesis model; in this case, the qualified training data and the original data in the training data set together form a new training data set.
According to the training data processing method, training data can be obtained, checked against the preset standard, and, if qualified, added to the training data set of the zero-shot speech synthesis model. Because the training data are mined and screened according to preset standards without manual intervention, fully automatic, industrial-scale processing of zero-shot speech synthesis training data can be achieved, improving the efficiency and reliability of the training data processing process.
It should be noted that, in the present application, the training data may be acquired periodically. Alternatively, the training data may be periodically extracted from a preset training data directory, removed from that directory, and sent on to the data verification process.
As a possible implementation, whether the preset fixed directory of the zero-shot speech synthesis model is empty may be judged according to a preset period. If the preset fixed directory is empty, no training data has been added since the last acquisition; if it is not empty, new training data exists and can be extracted at this time.
Further, when checking the training data against the preset standard, training data that does not satisfy any preset disqualification criterion may be determined to be qualified.
The training data includes training audio and corresponding text. In this case, the preset disqualification criteria may include the following four:
Disqualification criterion 1: training data of a speaker whose number of training audios is smaller than a preset first number threshold.
The first number threshold may be set according to the actual situation; for example, it may be set to 20, in which case the training data of any speaker with fewer than 20 training audios is unqualified.
Disqualification criterion 2: training data in which the alignment likelihood value of the training audio and its corresponding text is smaller than a preset alignment likelihood threshold.
As a possible implementation, the training audio and the corresponding text may be input to a pre-trained alignment recognition model to obtain their alignment likelihood value. This value may then be compared with a preset alignment likelihood threshold; if it is smaller than the threshold, the training audio is of poor quality or cannot be aligned with its text, and the training data is unqualified.
The alignment likelihood threshold may be set according to the actual situation.
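The alignment check can be sketched as a simple filter. The alignment model is abstracted as a callable scoring function, and the threshold value is an illustrative assumption; the patent does not prescribe a particular model or score range.

```python
def filter_by_alignment(samples, alignment_model, likelihood_threshold):
    """Split (audio, text) pairs into qualified/rejected according to the
    alignment likelihood scored by a pre-trained alignment model."""
    qualified, rejected = [], []
    for audio, text in samples:
        likelihood = alignment_model(audio, text)
        if likelihood >= likelihood_threshold:
            qualified.append((audio, text))   # audio and text align well enough
        else:
            rejected.append((audio, text))    # poor audio or misaligned transcript
    return qualified, rejected
```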
Disqualification criterion 3: training data in which the embedding vector value of the training audio is farther than a preset distance threshold from the average embedding vector value of the same speaker.
As a possible implementation, the training audio may be input to a pre-trained voiceprint model to obtain its embedding vector value. For each speaker, the average of the embedding vector values and the distance between each embedding vector value and that average may then be calculated. If the distance between an embedding vector value and the average exceeds the preset distance threshold, the style of the corresponding training audio is inconsistent with the speaker's other audio, and that training data may be determined to be unqualified.
Disqualification criterion 4: all training data of a speaker whose number of training audios with distance greater than the distance threshold exceeds a preset second number threshold.
Optionally, the number of training audios of the same speaker whose distance exceeds the distance threshold may be counted and compared with the preset second number threshold. If that number exceeds the second number threshold, the speaker may be determined to be unqualified, and accordingly all training data of that speaker are regarded as unqualified.
The second number threshold may be set according to the actual situation; for example, it may be set to 5. If, say, 7 training audios of the first speaker lie farther than the preset distance threshold from that speaker's average embedding vector value, the first speaker is determined to be unqualified and all of that speaker's training data are rejected.
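Criteria 3 and 4 can be sketched together on pre-computed voiceprint embeddings. The voiceprint model itself is out of scope here; embeddings are plain lists of floats and Euclidean distance is an assumption (the patent does not name a distance metric).

```python
import math


def flag_outlier_speakers(embeddings_by_speaker, distance_threshold, second_number_threshold):
    """Criterion 3: flag each audio whose embedding is farther than
    `distance_threshold` from its speaker's mean embedding.
    Criterion 4: disqualify any speaker with more than
    `second_number_threshold` such outlier audios."""
    outlier_audios = {}           # speaker -> indices of outlier audios
    unqualified_speakers = set()  # speakers whose entire data is rejected
    for speaker, embeddings in embeddings_by_speaker.items():
        dim = len(embeddings[0])
        mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
        outliers = [
            idx for idx, e in enumerate(embeddings)
            if math.dist(e, mean) > distance_threshold
        ]
        outlier_audios[speaker] = outliers
        if len(outliers) > second_number_threshold:
            unqualified_speakers.add(speaker)
    return outlier_audios, unqualified_speakers
```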
It should be noted that, in practical applications, the training data set is not static; provided the quality of the training data is assured, adding training data yields a positive gain in the synthesis effect of zero-shot speech synthesis. Therefore, as training data is continuously supplemented and accumulated, the zero-shot speech synthesis model gains better generalization capability once the number of training audios in the training data set reaches a preset third number threshold. In this case, the preset disqualification criteria may be extended to include the following:
training data in which the similarity between the embedding vector values of the training audio and the corresponding target audio is greater than a preset similarity threshold, where the target audio is the audio obtained by passing the text corresponding to the training audio through the zero-shot speech synthesis model.
Optionally, the text corresponding to the training audio may be input to the zero-shot speech synthesis model to obtain the target audio, and the similarity between the embedding vector values of the training audio and the target audio may then be calculated and compared with the preset similarity threshold. If the similarity is smaller than or equal to the threshold, the speaker is not yet well generalized by the model and the training data may be determined to be qualified; if it is larger than the threshold, the speaker is already well generalized and the training data may be determined to be unqualified.
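This extra criterion can be sketched as follows. The synthesis model, the embedding function and the use of cosine similarity are all placeholder assumptions; the patent specifies only "similarity of the embedding vector values", not a particular measure.

```python
def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def filter_by_generalization(samples, synthesize, embed, similarity_threshold):
    """Keep only samples whose speakers the model does NOT yet generalize well:
    a sample is rejected when the embedding of its training audio is too similar
    to the embedding of audio synthesized from the same text."""
    qualified = []
    for audio, text in samples:
        target_audio = synthesize(text)
        sim = cosine_similarity(embed(audio), embed(target_audio))
        if sim <= similarity_threshold:   # not well generalized yet -> useful data
            qualified.append((audio, text))
    return qualified
```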
It should be noted that, in the present application, if newly added training data passes the check, it may be added to the training data set of the zero-shot speech synthesis model; if it fails, a matching processing strategy may be determined according to the actual situation. For example, failing training data may be discarded, or it may be stored in a corresponding second preset fixed directory for later use.
According to this processing method, the training data can be checked against the preset standard and, once determined to be qualified, added to the training data set of the zero-shot speech synthesis model. Because the preset standard is flexible and easy to extend, the universality of the training data processing process is improved; furthermore, matching preset standards can be set for different scales of training data, further improving the efficiency and reliability of the process.
Fig. 2 is a schematic diagram according to a second embodiment of the present application. As shown in fig. 2, on the basis of the previous embodiment, a specific process of the training data processing method according to the embodiment of the present application includes the following steps:
S201, judging whether the preset fixed directory of the zero-shot speech synthesis model is empty.
Optionally, whether the preset fixed directory is empty may be determined according to a preset period. If it is empty, the judgment is repeated in the next period; if it is not empty, step S202 is executed.
S202, extracting training data from the preset training data directory.
S203, checking the training data against the preset standard and judging whether it is qualified.
Optionally, if the training data is qualified, step S204 is executed; if it is not, the training data may be discarded.
S204, adding the qualified training data to the training data set of the zero-shot speech synthesis model.
According to this training data processing method, the training data are mined and screened according to the preset standard without manual intervention, enabling fully automatic, industrial-scale processing of zero-shot speech synthesis training data and improving the efficiency and reliability of the process. Because the preset standard is flexible and easy to extend, the universality of the process is improved; furthermore, matching preset standards can be set for different scales of training data, further improving efficiency and reliability.
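One cycle of the flow in fig. 2 can be tied together as a single function. The listing, extraction, checking and storage callables are illustrative placeholders standing in for the directory operations and preset-standard checks described above.

```python
def run_processing_cycle(list_pending, extract, check, training_set, on_reject=None):
    """One cycle of the fig. 2 flow: S201 skip if the directory is empty,
    S202 extract pending data, S203 check against the preset standard,
    S204 add qualified data to the training data set."""
    if not list_pending():                 # S201: preset fixed directory is empty
        return 0
    added = 0
    for sample in extract():               # S202: pull data out of the directory
        if check(sample):                  # S203: preset-standard inspection
            training_set.append(sample)    # S204: join the training data set
            added += 1
        elif on_reject is not None:
            on_reject(sample)              # e.g. discard, or archive for review
    return added
```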
Corresponding to the methods for processing training data provided in the foregoing embodiments, an embodiment of the present application further provides a device for processing training data. Since the device corresponds to the methods provided above, the method embodiments are also applicable to the device and are not described in detail here.
Fig. 3 is a schematic structural diagram of a training data processing apparatus according to an embodiment of the present application.
As shown in fig. 3, the apparatus 300 for processing training data includes: an acquisition module 310, a verification module 320, and a joining module 330. Wherein:
an obtaining module 310, configured to obtain training data;
a checking module 320, configured to check the training data according to a preset standard;
and an adding module 330, configured to add the training data that passes the check to a training data set of the zero-shot speech synthesis model.
Fig. 4 is a schematic structural diagram of a training data processing apparatus according to another embodiment of the present application.
As shown in fig. 4, the apparatus 400 for processing training data includes: an acquisition module 410, a verification module 420, and a joining module 430. Wherein:
an acquisition module 410, comprising:
the obtaining sub-module 411 is configured to periodically obtain the training data.
Wherein, obtaining submodule 411 includes:
an obtaining unit 4111, configured to periodically extract the training data from a preset training data directory.
A verification module 420, comprising:
a determining sub-module 421, configured to determine training data that does not meet any preset disqualification criterion as training data that passes the check.
Optionally, the training data includes training audio and corresponding text, and the preset disqualification criteria include:
training data of a speaker whose number of training audios is smaller than a preset first number threshold;
training data in which the alignment likelihood value of the training audio and the corresponding text is smaller than a preset alignment likelihood threshold;
training data in which the distance between the embedding vector value of the training audio and the average embedding vector value of the same speaker is larger than a preset distance threshold; and
all training data of a speaker whose number of training audios with distance greater than the distance threshold exceeds a preset second number threshold.
Optionally, if the number of training audios in the training data set reaches a preset third number threshold, the preset disqualification criteria further include:
training data in which the similarity between the embedding vector values of the training audio and the corresponding target audio is greater than the preset similarity threshold, where the target audio is obtained by passing the text corresponding to the training audio through the zero-shot speech synthesis model.
It should be noted that the joining module 430 has the same function and structure as the joining module 330.
Fig. 5 is a schematic structural diagram of a training data processing apparatus according to another embodiment of the present application.
As shown in fig. 5, the apparatus 500 for processing training data includes: the obtaining module 510, the verifying module 520 and the joining module 530 further include: a first determination module 540 and a second determination module 550. Wherein:
the first determining module 540 is configured to input the training audio and the corresponding text into an alignment recognition model, so as to obtain the alignment likelihood value.
A second determining module 550, configured to input the training audio to a voiceprint model, so as to obtain the embedding vector value.
It should be noted that the acquiring module 510, the verifying module 520, and the joining module 530 have the same functions and structures as the acquiring module 310, the verifying module 320, and the joining module 330.
According to the training data processing device, the training data are mined and screened according to the preset standard without manual intervention, enabling fully automatic, industrial-scale processing of zero-shot speech synthesis training data and improving the efficiency and reliability of the training data processing process. Because the preset standard is flexible and easy to extend, the universality of the process is improved; furthermore, matching preset standards can be set for different scales of training data, further improving efficiency and reliability.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 5, a block diagram of an electronic device for processing training data according to an embodiment of the present application is shown. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 5, the electronic apparatus includes: one or more processors 510, a memory 520, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing a portion of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 5, one processor 510 is taken as an example.
Memory 520 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the method for processing training data provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the method of processing training data provided by the present application.
The memory 520, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the training data processing method in the embodiments of the present application (e.g., the acquisition module 310, the verification module 320, and the joining module 330 shown in fig. 3). The processor 510 executes the various functional applications and data processing of the server by running the non-transitory software programs, instructions, and modules stored in the memory 520, thereby implementing the training data processing method of the above method embodiments.
The memory 520 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created according to the use of the electronic device, and the like. Further, the memory 520 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 520 may optionally include memory located remotely from the processor 510, and such remote memory may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for processing training data may further comprise: an input device 530 and an output device 540. The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 5.
The input device 530 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device; it may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or another input device. The output device 540 may include a display device, an auxiliary lighting device (e.g., an LED), a haptic feedback device (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of high management difficulty and weak service scalability in traditional physical host and VPS (Virtual Private Server) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
According to an embodiment of the present application, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements a method of processing training data according to an embodiment of the present application.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present application is not limited thereto, as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (14)

1. A method of processing training data, comprising:
acquiring training data;
checking the training data according to a preset standard;
adding the qualified training data into a training data set of a zero-shot learning speech synthesis model;
wherein the checking the training data according to a preset standard comprises:
determining training data which do not meet a preset disqualification criterion as the qualified training data;
wherein the training data comprise training audio and corresponding text, and the preset disqualification criterion comprises:
training data of a speaker whose number of training audios is smaller than a preset first number threshold;
training data whose alignment likelihood value between the training audio and the corresponding text is smaller than a preset alignment likelihood threshold;
training data in which the distance between the embedding vector value of the training audio and the average embedding vector value of the same speaker is larger than a preset distance threshold; and
training data of a speaker for whom the number of training audios with the distance greater than the distance threshold is greater than a preset second number threshold.
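As an illustrative sketch only (the claim does not prescribe an implementation; the threshold values, field names, and the `Sample` helper type below are hypothetical), the four disqualification criteria of claim 1 can be applied to pre-computed per-sample statistics as follows:

```python
from dataclasses import dataclass

# Hypothetical threshold values; the patent leaves them unspecified.
MIN_AUDIO_COUNT = 5          # preset first number threshold
MIN_ALIGN_LIKELIHOOD = -8.0  # preset alignment likelihood threshold
MAX_EMBED_DISTANCE = 0.6     # preset distance threshold
MAX_BAD_AUDIO_COUNT = 3      # preset second number threshold

@dataclass
class Sample:
    speaker: str
    align_likelihood: float  # from an alignment recognition model (claim 5)
    embed_distance: float    # distance to the speaker's mean embedding (claim 6)

def filter_training_data(samples):
    """Keep only samples that fail none of the preset disqualification criteria."""
    # Per-speaker statistics: total audios, and audios whose embedding
    # lies too far from the speaker's average embedding.
    totals, bad = {}, {}
    for s in samples:
        totals[s.speaker] = totals.get(s.speaker, 0) + 1
        if s.embed_distance > MAX_EMBED_DISTANCE:
            bad[s.speaker] = bad.get(s.speaker, 0) + 1

    qualified = []
    for s in samples:
        if totals[s.speaker] < MIN_AUDIO_COUNT:          # speaker has too few audios
            continue
        if s.align_likelihood < MIN_ALIGN_LIKELIHOOD:    # audio/text poorly aligned
            continue
        if s.embed_distance > MAX_EMBED_DISTANCE:        # outlier embedding
            continue
        if bad.get(s.speaker, 0) > MAX_BAD_AUDIO_COUNT:  # speaker has too many outliers
            continue
        qualified.append(s)
    return qualified
```

Note that the third criterion discards individual outlier audios, while the fourth discards an entire speaker once that speaker accumulates too many outliers.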
2. The processing method of claim 1, wherein the acquiring training data comprises:
the training data is acquired periodically.
3. The processing method of claim 2, wherein the periodically acquiring the training data comprises:
periodically extracting the training data from a preset training data catalogue.
4. The processing method according to claim 1, wherein when the number of training audios in the training data set reaches a preset third number threshold, the preset disqualification criterion further comprises:
training data in which the similarity between the embedding vector value of the training audio and that of the corresponding target audio is greater than a preset similarity threshold, wherein the target audio is audio obtained by passing the text corresponding to the training audio through the zero-shot learning speech synthesis model.
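Claim 4 adds a self-check once the data set is large enough: training audio whose embedding nearly coincides with the audio the model itself synthesizes from the same text is disqualified. A sketch using cosine similarity (both the metric and the 0.95 threshold are assumptions; the claim speaks only of "similarity" of embedding vector values):

```python
import math

# Hypothetical value for the preset similarity threshold.
SIMILARITY_THRESHOLD = 0.95

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def too_similar_to_synthesized(train_embed, target_embed):
    """Disqualify training audio whose embedding nearly matches the audio the
    model synthesizes from the same text, per the extra criterion of claim 4."""
    return cosine_similarity(train_embed, target_embed) > SIMILARITY_THRESHOLD
```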
5. The processing method of claim 1, further comprising:
inputting the training audio and the corresponding text into an alignment recognition model to obtain the alignment likelihood value.
6. The processing method of claim 1, further comprising:
inputting the training audio into a voiceprint model to obtain the embedding vector value.
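Claims 5-6 obtain the per-sample quantities from auxiliary models: an alignment recognition model yields the likelihood value, and a voiceprint model yields the embedding. One way the distance to a speaker's average embedding might then be computed (Euclidean distance is an assumption; the patent does not name a metric):

```python
import math

def mean_embedding(embeddings):
    """Average embedding vector over one speaker's training audios."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]

def distance_to_speaker_mean(embedding, speaker_embeddings):
    """Euclidean distance between one audio's embedding and the average
    embedding of the same speaker, as used by the distance-based criteria
    of claim 1."""
    mean = mean_embedding(speaker_embeddings)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(embedding, mean)))
```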
7. An apparatus for processing training data, comprising:
the acquisition module is used for acquiring training data;
the inspection module is used for inspecting the training data according to a preset standard;
the adding module is used for adding the qualified training data into a training data set of the zero-shot learning speech synthesis model;
the inspection module includes:
a determination submodule, configured to determine training data which do not meet a preset disqualification criterion as the qualified training data;
wherein the training data comprise training audio and corresponding text, and the preset disqualification criterion comprises:
training data of a speaker whose number of training audios is smaller than a preset first number threshold;
training data whose alignment likelihood value between the training audio and the corresponding text is smaller than a preset alignment likelihood threshold;
training data in which the distance between the embedding vector value of the training audio and the average embedding vector value of the same speaker is larger than a preset distance threshold; and
training data of a speaker for whom the number of training audios with the distance greater than the distance threshold is greater than a preset second number threshold.
8. The processing apparatus of claim 7, wherein the acquisition module comprises:
the acquisition submodule is used for periodically acquiring the training data.
9. The processing apparatus of claim 8, wherein the acquisition submodule comprises:
the acquisition unit is used for periodically extracting the training data from a preset training data catalogue.
10. The processing apparatus according to claim 7, wherein when the number of training audios in the training data set reaches a preset third number threshold, the preset disqualification criterion further comprises:
training data in which the similarity between the embedding vector value of the training audio and that of the corresponding target audio is greater than a preset similarity threshold, wherein the target audio is audio obtained by passing the text corresponding to the training audio through the zero-shot learning speech synthesis model.
11. The processing apparatus of claim 7, further comprising:
the first determining module is used for inputting the training audio and the corresponding text into an alignment recognition model to obtain the alignment likelihood value.
12. The processing apparatus of claim 7, further comprising:
the second determining module is used for inputting the training audio into a voiceprint model to obtain the embedding vector value.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of processing training data as claimed in any one of claims 1 to 6.
14. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of processing training data according to any one of claims 1 to 6.
CN202011349252.7A 2020-11-26 2020-11-26 Training data processing method and device and electronic equipment Active CN112614479B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011349252.7A CN112614479B (en) 2020-11-26 2020-11-26 Training data processing method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN112614479A CN112614479A (en) 2021-04-06
CN112614479B true CN112614479B (en) 2022-03-25

Family

ID=75225343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011349252.7A Active CN112614479B (en) 2020-11-26 2020-11-26 Training data processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112614479B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114582487A (en) * 2022-01-26 2022-06-03 北京博瑞彤芸科技股份有限公司 Traditional Chinese medicine diagnosis and treatment assisting method and system based on traditional Chinese medicine knowledge graph

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101840699A (en) * 2010-04-30 2010-09-22 中国科学院声学研究所 Voice quality evaluation method based on pronunciation model
CN109033378A (en) * 2018-07-27 2018-12-18 北京中关村科金技术有限公司 A kind of application method of Zero-shot Learning in intelligent customer service system
CN109036375A (en) * 2018-07-25 2018-12-18 腾讯科技(深圳)有限公司 Phoneme synthesizing method, model training method, device and computer equipment
WO2019046463A1 (en) * 2017-08-29 2019-03-07 Zhoa Tiancheng System and method for defining dialog intents and building zero-shot intent recognition models
CN110473525A (en) * 2019-09-16 2019-11-19 百度在线网络技术(北京)有限公司 The method and apparatus for obtaining voice training sample
WO2019222591A1 (en) * 2018-05-17 2019-11-21 Google Llc Synthesis of speech from text in a voice of a target speaker using neural networks
CN110556093A (en) * 2019-09-17 2019-12-10 浙江核新同花顺网络信息股份有限公司 Voice marking method and system
CN111144140A (en) * 2019-12-23 2020-05-12 语联网(武汉)信息技术有限公司 Zero-learning-based Chinese and Tai bilingual corpus generation method and device
CN111326136A (en) * 2020-02-13 2020-06-23 腾讯科技(深圳)有限公司 Voice processing method and device, electronic equipment and storage medium
CN111488925A (en) * 2020-04-07 2020-08-04 北京百度网讯科技有限公司 Data labeling method and device, electronic equipment and storage medium
CN111859995A (en) * 2020-06-16 2020-10-30 北京百度网讯科技有限公司 Training method and device of machine translation model, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150310862A1 (en) * 2014-04-24 2015-10-29 Microsoft Corporation Deep learning for semantic parsing including semantic utterance classification
US10621991B2 (en) * 2018-05-06 2020-04-14 Microsoft Technology Licensing, Llc Joint neural network for speaker recognition
CN109949797B (en) * 2019-03-11 2021-11-12 北京百度网讯科技有限公司 Method, device, equipment and storage medium for generating training corpus
US11404050B2 (en) * 2019-05-16 2022-08-02 Samsung Electronics Co., Ltd. Electronic apparatus and method for controlling thereof
CN111179905A (en) * 2020-01-10 2020-05-19 北京中科深智科技有限公司 Rapid dubbing generation method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zero-Shot Multi-Speaker Text-To-Speech with State-Of-The-Art Neural Speaker Embeddings; Erica Cooper, et al.; 2020 IEEE International Conference on Acoustics, Speech and Signal Processing; IEEE; 2020-05-14; pp. 6184-6188 *
Chinese Speech Synthesis Based on Deep Learning; Chen Mengnan; China Masters' Theses Full-text Database, Information Science and Technology; China Academic Journals (CD Edition) Electronic Publishing House; 2020-11-15 (No. 11); pp. I136-7 *

Also Published As

Publication number Publication date
CN112614479A (en) 2021-04-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant