CN112530399A - Method and system for expanding voice data, electronic equipment and storage medium - Google Patents

Method and system for expanding voice data, electronic equipment and storage medium

Info

Publication number
CN112530399A
Authority
CN
China
Prior art keywords
voice
data
dialogue
text
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011369921.7A
Other languages
Chinese (zh)
Inventor
金炎驰
梁志婷
韩振龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Minglue Artificial Intelligence Group Co Ltd
Original Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority to CN202011369921.7A
Publication of CN112530399A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers

Abstract

The invention provides a method, a system, an electronic device and a storage medium for expanding voice data. The technical scheme of the method comprises a dialogue design step, in which dialogue content is designed in text form according to a preset scene; a text conversion step, in which the dialogue content is converted into dialogue speech; and an audio mixing step, in which the dialogue speech is mixed with noise of the preset scene to obtain and output simulated audio data. The invention addresses the complexity and low reliability of existing methods for expanding voice data.

Description

Method and system for expanding voice data, electronic equipment and storage medium
Technical Field
The present invention relates to the field of language processing, and more particularly to a method and system for expanding voice data, an electronic device, and a storage medium, which are applicable to speech recognition technology.
Background
With the development of information technology and the popularization of the Internet, realizing intelligent human-computer interaction and building an efficient, natural human-computer communication environment have become urgent needs for the application and development of information technology.
In the last two decades, speech recognition technology has advanced significantly and has begun to move from the laboratory to the market. With its rapid development, online speech functions such as voice input, voice expansion, voice recognition, voice judgment, voice playback and speech-rate adjustment are receiving more and more attention, and speech recognition technology is expected to enter fields such as industry, household appliances, communications, automotive electronics, medical care, home services and consumer electronics within the next ten years. The application of speech recognition dictation machines in certain fields was rated by the U.S. news community as one of the ten major computer developments of 1997. Many experts consider speech recognition to be one of the most important technological developments in the information technology field between 2000 and 2010. The fields involved in speech recognition technology include signal processing, pattern recognition, probability and information theory, sound and hearing mechanisms, artificial intelligence, and so on.
At present, processing audio often requires training, and during training, speech different from the target speaker's, or different characteristics of a single speaker, is added as an interference item.
Disclosure of Invention
The embodiments of the present application provide a method, a system, an electronic device and a storage medium for expanding voice data, so as to at least solve the problems that existing methods for expanding voice data are complex and unreliable.
In a first aspect, an embodiment of the present application provides a method for expanding voice data, including: a dialogue design step of designing dialogue content in text form according to a preset scene; a text conversion step of converting the dialogue content into dialogue speech; and an audio mixing step of mixing the dialogue speech with noise of the preset scene to obtain and output simulated audio data.
Preferably, the dialogue design step further includes: acquiring text data of the preset scene according to the preset scene, and designing the dialogue content from the acquired text data.
Preferably, the text conversion step further comprises: converting the dialogue content into dialogue speech through a TTS system.
Preferably, the audio mixing step includes: playing the dialogue speech in a real scene while simultaneously collecting real noise data and the dialogue speech in the real scene.
Preferably, the audio mixing step includes: acquiring existing simulated noise data of the preset scene, and directly mixing the dialogue speech with the simulated noise data.
In a second aspect, an embodiment of the present application provides a system for expanding voice data, suitable for the above method for expanding voice data, including: a dialogue design unit, configured to acquire text data of a preset scene according to the preset scene and to design dialogue content from the acquired text data; a text conversion unit, configured to convert the dialogue content into dialogue speech through a TTS system; and an audio mixing unit, configured to mix the dialogue speech with noise of the preset scene to obtain and output simulated audio data.
In some of these embodiments, the audio mixing unit is configured to: play the dialogue speech in a real scene while simultaneously collecting real noise data and the dialogue speech in the real scene.
In some of these embodiments, the audio mixing unit is configured to: acquire existing simulated noise data of the preset scene and directly mix the dialogue speech with the simulated noise data.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the processor implements the method for expanding voice data according to the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements a method for expanding speech data as described in the first aspect above.
Compared with the related art, the method for expanding voice data provided by the embodiments of the present application offers a more complete way of acquiring voice data: voice data for a variety of scenes can be obtained quickly without losing the timing information, environmental information, speaker-emotion information and content information of the audio. By designing the dialogue content and dialogue form, a large amount of data simulating real scenes can be obtained in a short time, which alleviates the early-stage data shortage faced by data analysis modules such as speech recognition.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart of a method for expanding voice data according to the present invention;
FIG. 2 is a block diagram of an expansion system for voice data according to the present invention;
FIG. 3 is a block diagram of an electronic device of the present invention;
in the above figures:
1. a dialog design unit; 2. a text conversion unit; 3. an audio mixing unit; 60. a bus; 61. a processor; 62. a memory; 63. a communication interface.
Detailed Description
In order to make the purpose, technical solution and advantages of the present application more apparent, the present application will be described and illustrated with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that a person skilled in the art can also apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that such a development effort might be complex and tedious, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The use of the terms "including," "comprising," "having," and any variations thereof herein, is intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include additional steps or elements not listed, or may include additional steps or elements inherent to such process, method, article, or apparatus.
Speech recognition is a cross-disciplinary field. In the last two decades, speech recognition technology has advanced significantly and has begun to move from the laboratory to the market. It is expected that within the next ten years, speech recognition technology will enter fields such as industry, household appliances, communications, automotive electronics, medical care, home services and consumer electronics. The application of speech recognition dictation machines in certain fields was rated by the U.S. news community as one of the ten major computer developments of 1997. Many experts consider speech recognition to be one of the ten most important technological developments in the information technology field between 2000 and 2010. The fields involved in speech recognition technology include signal processing, pattern recognition, probability and information theory, sound and hearing mechanisms, artificial intelligence, and so on.
Automatic Speech Recognition (ASR) is a technology for converting human speech into text. Speech recognition is a multidisciplinary field closely connected with acoustics, phonetics, linguistics, digital signal processing theory, information theory, computer science and other disciplines. Due to the diversity and complexity of speech signals, speech recognition systems can achieve satisfactory performance only under certain constraints, or only in certain specific applications.
TTS is an abbreviation of Text To Speech, i.e. "from text to speech"; it is the part of human-machine dialogue that enables a machine to speak.
Drawing on achievements in linguistics and psychology, and with the support of a built-in chip, TTS can intelligently convert text into a natural speech stream through the design of neural networks. TTS technology converts text files in real time, and the conversion time can be measured in seconds. Under the action of a dedicated intelligent speech controller, the rhythm of the spoken output is smooth, so that the listener hears natural speech without the flat, stilted quality of machine output. TTS speech synthesis covers the national-standard first- and second-level Chinese character sets, has an English interface, automatically distinguishes Chinese from English, and supports mixed Chinese-English reading. All voices use standard Mandarin pronunciation, achieving fast speech synthesis of 120-150 Chinese characters per minute and a reading speed of 3-4 Chinese characters per second, with clear, pleasant timbre and coherent, smooth intonation. TTS is a type of speech synthesis application that converts files stored in a computer, such as help files or web pages, into natural speech output. TTS can not only help visually impaired people read information on a computer, but also increase the readability of text documents. TTS applications include voice-driven mail and voice response systems, and are often used together with speech recognition programs.
TTS text-to-speech conversion has a wide range of applications, including reading e-mail aloud and providing voice prompts for IVR systems, which have been widely used in industries such as telecommunications and transportation. The key technology used in TTS is speech synthesis. Early TTS was typically implemented with dedicated chips, such as the Texas Instruments TMS50C10/TMS50C57 or the Philips PH84H36, but these were mainly used in home appliances or children's toys. TTS for microcomputer applications is generally implemented purely in software and mainly includes the following parts. Text analysis: linguistic analysis of the input text, performing lexical, grammatical and semantic analysis sentence by sentence to determine the low-level structure of each sentence and the phoneme composition of each word, including sentence breaking, word segmentation, polyphone processing, digit processing and abbreviation processing. Speech synthesis: extracting the single characters or phrases corresponding to the processed text from a speech synthesis library and converting the linguistic description into a speech waveform. Prosodic processing: synthetic speech quality refers to the quality of the speech output by a speech synthesis system, and is generally evaluated subjectively in terms of clarity (or intelligibility), naturalness and coherence. Clarity is the percentage of meaningful words that are correctly heard; naturalness evaluates whether the timbre of the synthesized speech is close to a human voice and whether the intonation of synthesized words is natural; coherence evaluates whether the synthesized sentences are fluent.
To synthesize high-quality speech, the algorithms used are extremely complex and therefore place heavy demands on the machine. The complexity of the algorithm determines how many TTS channels a microcomputer can handle simultaneously.
A Chinese TTS system comprises Chinese text processing and speech synthesis. Using knowledge of Chinese prosody and related areas, it performs word segmentation, part-of-speech judgment, phonetic notation and digit/symbol conversion on Chinese sentences, and the speech synthesis stage obtains the speech by querying a Chinese speech library. Better-known Chinese TTS systems include those of NUANCE, IBM, Microsoft, Fujitsu, scientific news, agilawood, and others. The difficulty is that many problems remain in Chinese prosody processing and in handling symbols, digits, polyphones and word formation, which require continued research before Chinese speech synthesis can reach a high degree of naturalness.
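By way of example and not limitation, the text-analysis front end described above (word segmentation plus phonetic notation of a Chinese sentence) can be sketched as follows. The open-source jieba and pypinyin libraries are used purely as stand-ins, and the sample sentence is invented for illustration; the patent does not prescribe any particular toolkit.

```python
# A minimal sketch of the Chinese text-analysis front end: word segmentation
# plus phonetic notation. jieba and pypinyin are stand-in open-source tools,
# not tools named by the patent; the sentence is a made-up example.
import jieba
from pypinyin import lazy_pinyin, Style

sentence = "欢迎光临，请问您需要什么？"

words = jieba.lcut(sentence)                       # word segmentation
phones = lazy_pinyin(sentence, style=Style.TONE3)  # pinyin with tone numbers

print(words)   # e.g. ['欢迎', '光临', '，', '请问', '您', '需要', '什么', '？']
print(phones)  # e.g. ['huan1', 'ying2', 'guang1', 'lin2', ...]
```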
At present, processing audio often requires training, and during training, speech different from the target speaker's, or different characteristics of a single speaker, is added as an interference item.
The invention can be applied to expanding voice data for training and improving speech recognition technology.
Embodiments of the invention are described in detail below with reference to the accompanying drawings:
Fig. 1 is a flowchart of a method for expanding voice data according to the present invention. Referring to Fig. 1, the method for expanding voice data of the present invention includes the following steps:
S1: design the dialogue content in text form according to a preset scene.
Optionally, the text data in a preset scene may be acquired according to the preset scene, and the dialog content may be designed according to the acquired text data.
In a specific implementation, a specified scene is preset; in this embodiment, an offline sales scene is used. First, a large amount of text data related to conversations in offline sales scenes is acquired, and the dialogue content for the offline sales scene is designed from this text data; the dialogue content may consist of continuous, related dialogue passages or of discontinuous, unrelated passages. The text data may be obtained by, but is not limited to, collecting data from the network, purchasing it from a data supplier, or natural language generation.
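By way of example and not limitation, the dialogue design step can be sketched as below. The slot-filling templates and slot values are hypothetical stand-ins for the text data that would actually be collected for the offline sales scene; the patent does not specify how the dialogue passages are authored.

```python
# Hypothetical template-based dialogue design for an offline sales scene.
# The templates and slot values stand in for text data gathered from the
# network, purchased from a supplier, or produced by natural language generation.
import random

TEMPLATES = [
    ("Clerk: Welcome! Are you looking for {product} today?",
     "Customer: Yes, do you have {product} in {attribute}?"),
    ("Clerk: This {product} is on promotion, {discount} off.",
     "Customer: How does it compare with the other {product} you have?"),
]
PRODUCTS = ["running shoes", "a winter coat", "a smartphone"]
ATTRIBUTES = ["a larger size", "another colour", "stock"]
DISCOUNTS = ["20%", "30%"]

def design_dialogues(n):
    """Return n short dialogue passages in text form."""
    passages = []
    for _ in range(n):
        turns = random.choice(TEMPLATES)
        passages.append("\n".join(
            t.format(product=random.choice(PRODUCTS),
                     attribute=random.choice(ATTRIBUTES),
                     discount=random.choice(DISCOUNTS))
            for t in turns))
    return passages

if __name__ == "__main__":
    for p in design_dialogues(3):
        print(p, end="\n\n")
```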
S2: convert the dialogue content into dialogue speech.
Optionally, the dialogue content is converted into dialogue speech by a TTS system.
TTS is one type of speech synthesis application that converts files stored in a computer, such as help files or web pages, into natural speech output. TTS can not only help visually impaired people read information on a computer, but also increase the readability of text documents. TTS applications include voice-driven mail and voice-sensitive systems, and are often used with voice recognition programs.
Optionally, this embodiment of the invention adopts the MaryTTS open-source library.
In this embodiment, the dialog contents designed in step S1 are converted into data in the form of dialog speech by an existing TTS system.
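By way of example and not limitation, the text conversion step can be sketched as below. The embodiment above uses the MaryTTS library; for a self-contained sketch, the offline pyttsx3 engine is swapped in, so the engine choice and file names are assumptions rather than the patent's configuration.

```python
# Convert designed dialogue text into dialogue speech (audio files).
# pyttsx3 is used here as a stand-in TTS engine; the embodiment itself
# relies on MaryTTS. File names are illustrative only.
import pyttsx3

def dialogue_to_speech(dialogue_texts, out_prefix="dialogue"):
    engine = pyttsx3.init()
    for i, text in enumerate(dialogue_texts):
        # queue each dialogue passage for synthesis into its own audio file
        engine.save_to_file(text, f"{out_prefix}_{i:03d}.wav")
    engine.runAndWait()  # run all queued synthesis jobs

if __name__ == "__main__":
    dialogue_to_speech([
        "Clerk: Welcome! Are you looking for running shoes today?",
        "Customer: Yes, do you have them in a larger size?",
    ])
```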
S3: mix the dialogue speech with noise of the preset scene to obtain and output simulated audio data.
Optionally, the dialogue voice can be played in a real scene, and real noise data and the dialogue voice in the real scene are collected at the same time.
In a specific implementation, the dialogue speech data converted in step S2 is played in a real offline sales scene through an audio playback device, with relevant audio parameters such as volume adjusted; at the same time, audio containing the played dialogue speech, the real offline conversations and the environmental noise is recorded. This yields simulated audio data containing real-scene noise, which is output as voice data for training and improving speech recognition.
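By way of example and not limitation, this play-and-record variant can be sketched as follows, assuming the synthesized dialogue is available as a WAV file and using the sounddevice and soundfile libraries (neither is named in the patent); the loudspeaker/microphone set-up, volume calibration and file names are assumptions.

```python
# Play synthesized dialogue speech through a loudspeaker in the real scene
# while recording the room (played dialogue + real conversations + ambient
# noise) with a microphone. sounddevice/soundfile are stand-in libraries.
import sounddevice as sd
import soundfile as sf

speech, fs = sf.read("dialogue_000.wav", dtype="float32")  # illustrative file name

# simultaneous playback and capture on the default audio devices
recorded = sd.playrec(speech, samplerate=fs, channels=1)
sd.wait()  # block until playback/recording has finished

sf.write("simulated_scene_audio.wav", recorded, fs)
```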
Optionally, existing simulated noise data for the preset scene is obtained, and the dialogue speech is directly mixed with the simulated noise data.
In a specific implementation, existing recorded or simulated noise data is obtained and directly mixed with the dialogue speech data in a certain proportion; the audio parameters of the noise data and the dialogue speech are adjusted, and the mixed simulated audio data is obtained directly and output as voice data for training and improving speech recognition.
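By way of example and not limitation, the direct-mixing variant can be sketched with numpy and soundfile as below. Mixing "in a certain proportion" is interpreted here as scaling the noise to a target signal-to-noise ratio; the SNR value, library choice and file names are assumptions, not details given by the patent.

```python
# Mix dialogue speech with pre-recorded/simulated scene noise at a target SNR.
# Assumes mono signals at the same sample rate; file names and the 10 dB
# default are illustrative only.
import numpy as np
import soundfile as sf

def mix_at_snr(speech, noise, snr_db=10.0):
    # loop the noise if it is shorter than the speech, then trim to length
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))

    mixed = speech + scale * noise
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed  # avoid clipping

if __name__ == "__main__":
    speech, fs = sf.read("dialogue_000.wav", dtype="float32")
    noise, fs_n = sf.read("offline_store_noise.wav", dtype="float32")
    assert fs == fs_n, "resample the noise to the speech sample rate first"
    sf.write("mixed_simulated_audio.wav", mix_at_snr(speech, noise), fs)
```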
It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system such as a set of computer-executable instructions and that, while the logic order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.
The embodiment of the application provides an expansion system of voice data, which is suitable for the above expansion method of voice data. As used below, the terms "unit," "module," and the like may implement a combination of software and/or hardware of predetermined functions. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware or a combination of software and hardware is also possible and contemplated.
Fig. 2 is a block diagram of an expansion system of voice data according to the present invention, please refer to fig. 2, which includes:
Dialogue design unit 1: designs the dialogue content in text form according to a preset scene.
Optionally, the text data in a preset scene may be acquired according to the preset scene, and the dialog content may be designed according to the acquired text data.
In a specific implementation, a specified scene is preset; in this embodiment, an offline sales scene is used. First, a large amount of text data related to conversations in offline sales scenes is acquired, and the dialogue content for the offline sales scene is designed from this text data; the dialogue content may consist of continuous, related dialogue passages or of discontinuous, unrelated passages. The text data may be obtained by, but is not limited to, collecting data from the network, purchasing it from a data supplier, or natural language generation.
Text conversion unit 2: converts the dialogue content into dialogue speech.
Optionally, the dialog content is converted into a dialog voice by a TTS system.
TTS is one type of speech synthesis application that converts files stored in a computer, such as help files or web pages, into natural speech output. TTS can not only help visually impaired people read information on a computer, but also increase the readability of text documents. TTS applications include voice-driven mail and voice-sensitive systems, and are often used with voice recognition programs.
In the implementation, the dialog contents designed in the dialog design unit 1 are converted into data in the form of dialog speech by an existing TTS system.
Audio mixing unit 3: mixes the dialogue speech with noise of the preset scene to obtain and output simulated audio data.
Optionally, the dialogue voice can be played in a real scene, and real noise data and the dialogue voice in the real scene are collected at the same time.
In a specific implementation, the dialogue speech data converted by the text conversion unit 2 is played in a real offline sales scene through an audio playback device, with relevant audio parameters such as volume adjusted; at the same time, audio containing the played dialogue speech, the real offline conversations and the environmental noise is recorded. This yields simulated audio data containing real-scene noise, which is output as voice data for training and improving speech recognition.
Optionally, existing simulated noise data for the preset scene is obtained, and the dialogue speech is directly mixed with the simulated noise data.
In a specific implementation, existing recorded or simulated noise data is obtained and directly mixed with the dialogue speech data in a certain proportion; the audio parameters of the noise data and the dialogue speech are adjusted, and the mixed simulated audio data is obtained directly and output as voice data for training and improving speech recognition.
In addition, the method for expanding voice data described in conjunction with Fig. 1 may be implemented by an electronic device. Fig. 3 is a block diagram of an electronic device of the present invention.
The electronic device may comprise a processor 61 and a memory 62 in which computer program instructions are stored.
Specifically, the processor 61 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.
The memory 62 may include, among other things, mass storage for data or instructions. By way of example and not limitation, the memory 62 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 62 may include removable or non-removable (or fixed) media, where appropriate, and may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 62 is non-volatile memory. In particular embodiments, the memory 62 includes Read-Only Memory (ROM) and Random Access Memory (RAM). Where appropriate, the ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), electrically rewritable ROM, or flash memory, or a combination of two or more of these. The RAM may be Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), where the DRAM may be Fast Page Mode DRAM (FPM DRAM), Extended Data Out DRAM (EDO DRAM), Synchronous DRAM (SDRAM), and the like.
The memory 62 may be used to store or cache various data files that need to be processed and/or used for communication, as well as possible computer program instructions executed by the processor 61.
The processor 61 realizes any one of the above-described embodiments of the method of augmenting voice data by reading and executing computer program instructions stored in the memory 62.
In some of these embodiments, the electronic device may also include a communication interface 63 and a bus 60. As shown in fig. 3, the processor 61, the memory 62, and the communication interface 63 are connected via a bus 60 to complete communication therebetween.
The communication interface 63 is used for data communication with other components, such as external devices, image/data acquisition equipment, databases, external storage, image/data processing workstations, and the like.
The bus 60 includes hardware, software, or both, and couples the components of the electronic device to one another. The bus 60 includes, but is not limited to, at least one of: a data bus, an address bus, a control bus, an expansion bus, and a local bus. By way of example and not limitation, the bus 60 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), another suitable bus, or a combination of two or more of these. The bus 60 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the present application, any suitable bus or interconnect is contemplated.
The electronic device can execute the method for expanding the voice data in the embodiment of the application.
In addition, in combination with the method for expanding voice data in the foregoing embodiments, the embodiments of the present application may provide a computer-readable storage medium to implement the method. The computer readable storage medium having stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any one of the above-described methods of augmenting voice data.
The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not every possible combination of these technical features is described, but any combination that contains no contradiction should be considered within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for expanding voice data, comprising:
a dialogue design step, wherein dialogue content in text form is designed according to a preset scene;
a text conversion step, wherein the dialogue content is converted into dialogue speech;
and an audio mixing step, wherein the dialogue speech is mixed with noise of the preset scene to obtain and output simulated audio data.
2. The method for expanding voice data according to claim 1, wherein the dialogue design step further comprises: acquiring text data of the preset scene according to the preset scene, and designing the dialogue content from the acquired text data.
3. The method for expanding voice data according to claim 1, wherein the text conversion step further comprises: converting the dialogue content into dialogue speech through a TTS system.
4. The method for expanding voice data according to claim 1, wherein the audio mixing step comprises:
playing the dialogue speech in a real scene while simultaneously collecting real noise data and the dialogue speech in the real scene.
5. The method for expanding voice data according to claim 1, wherein the audio mixing step comprises:
acquiring existing simulated noise data of the preset scene, and directly mixing the dialogue speech with the simulated noise data.
6. A system for expanding voice data, comprising:
a dialogue design unit, configured to acquire text data of a preset scene according to the preset scene and to design dialogue content from the acquired text data;
a text conversion unit, configured to convert the dialogue content into dialogue speech through a TTS system;
and an audio mixing unit, configured to mix the dialogue speech with noise of the preset scene to obtain and output simulated audio data.
7. The system for expanding voice data according to claim 6, wherein the audio mixing unit is configured to:
play the dialogue speech in a real scene while simultaneously collecting real noise data and the dialogue speech in the real scene.
8. The system for expanding voice data according to claim 6, wherein the audio mixing unit is configured to:
acquire existing simulated noise data of the preset scene and directly mix the dialogue speech with the simulated noise data.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for expanding voice data according to any one of claims 1 to 5.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for expanding voice data according to any one of claims 1 to 5.
CN202011369921.7A 2020-11-30 2020-11-30 Method and system for expanding voice data, electronic equipment and storage medium Pending CN112530399A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011369921.7A CN112530399A (en) 2020-11-30 2020-11-30 Method and system for expanding voice data, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112530399A true CN112530399A (en) 2021-03-19

Family

ID=74994863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011369921.7A Pending CN112530399A (en) 2020-11-30 2020-11-30 Method and system for expanding voice data, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112530399A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109285538A (en) * 2018-09-19 2019-01-29 宁波大学 A kind of mobile phone source title method under the additive noise environment based on normal Q transform domain
WO2020207375A1 (en) * 2019-04-12 2020-10-15 腾讯科技(深圳)有限公司 Instant messaging application-based data processing method, apparatus, device, and storage medium
CN110544469A (en) * 2019-09-04 2019-12-06 秒针信息技术有限公司 Training method and device of voice recognition model, storage medium and electronic device
CN110807333A (en) * 2019-10-30 2020-02-18 腾讯科技(深圳)有限公司 Semantic processing method and device of semantic understanding model and storage medium
CN110807332A (en) * 2019-10-30 2020-02-18 腾讯科技(深圳)有限公司 Training method of semantic understanding model, semantic processing method, semantic processing device and storage medium
CN110853672A (en) * 2019-11-08 2020-02-28 山东师范大学 Data expansion method and device for audio scene classification
CN111326174A (en) * 2019-12-31 2020-06-23 四川长虹电器股份有限公司 Method for automatically synthesizing test corpus in far-field voice interference scene
CN111816187A (en) * 2020-07-03 2020-10-23 中国人民解放军空军预警学院 Deep neural network-based voice feature mapping method in complex environment
CN111859092A (en) * 2020-07-29 2020-10-30 苏州思必驰信息科技有限公司 Text corpus amplification method and device, electronic equipment and storage medium
CN111883137A (en) * 2020-07-31 2020-11-03 龙马智芯(珠海横琴)科技有限公司 Text processing method and device based on voice recognition

Similar Documents

Publication Publication Date Title
JP7280386B2 (en) Multilingual speech synthesis and cross-language voice cloning
US20180349495A1 (en) Audio data processing method and apparatus, and computer storage medium
Isewon et al. Design and implementation of text to speech conversion for visually impaired people
CN110797006B (en) End-to-end speech synthesis method, device and storage medium
US9761219B2 (en) System and method for distributed text-to-speech synthesis and intelligibility
US6751592B1 (en) Speech synthesizing apparatus, and recording medium that stores text-to-speech conversion program and can be read mechanically
JP2005502102A (en) Speech-speech generation system and method
CN104899192B (en) For the apparatus and method interpreted automatically
Abushariah et al. Phonetically rich and balanced text and speech corpora for Arabic language
CN112786004A (en) Speech synthesis method, electronic device, and storage device
CN112365878A (en) Speech synthesis method, device, equipment and computer readable storage medium
Panda et al. An efficient model for text-to-speech synthesis in Indian languages
Sangeetha et al. Text to speech synthesis system for tamil
Alam et al. Text to speech for Bangla language using festival
CN109859746B (en) TTS-based voice recognition corpus generation method and system
CN113421571B (en) Voice conversion method and device, electronic equipment and storage medium
CN113409761B (en) Speech synthesis method, speech synthesis device, electronic device, and computer-readable storage medium
CN112530399A (en) Method and system for expanding voice data, electronic equipment and storage medium
Soman et al. Corpus driven malayalam text-to-speech synthesis for interactive voice response system
Ghimire et al. Enhancing the quality of nepali text-to-speech systems
JP2004347732A (en) Automatic language identification method and system
CN113870833A (en) Speech synthesis related system, method, device and equipment
Ravi et al. Text-to-speech synthesis system for Kannada language
Dessai et al. Development of Konkani TTS system using concatenative synthesis
WO2023197206A1 (en) Personalized and dynamic text to speech voice cloning using incompletely trained text to speech models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination