CN113096639A - Method and device for generating voice map - Google Patents
Method and device for generating voice map Download PDFInfo
- Publication number
- CN113096639A (application CN201911319032.7A)
- Authority
- CN
- China
- Prior art keywords
- map
- text
- voice
- speech
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/027—Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Human Computer Interaction (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- General Business, Economics & Management (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method and a device for generating a voice map: a segment of text is converted into speech through a text-to-speech model, and the speech and a map are integrated into a voice map.
Description
Technical Field
The present invention relates to map generation technology and, more particularly, to a method and an apparatus for generating a voice map (voice sticker).
Background
To add interest to the communication process, existing communication software (such as Line, WeChat, etc.) further provides users with voice maps. At present, using a voice map requires the user to purchase a listed voice-map product from the mall of the communication software, and the picture and the corresponding voice of such a product are fixed, offering no flexibility in use.
Disclosure of Invention
In view of the above, the embodiment of the invention provides a method and a device for generating a voice map.
In one embodiment, the method for generating a voice map comprises: obtaining a segment of text; converting the segment of text into speech via a text-to-speech model; obtaining a map; and integrating the speech and the map.
In one embodiment, the voice map generating device comprises a text input module, a text-to-speech module, and a map integration module. The text input module is used for obtaining a segment of text. The text-to-speech module carries a text-to-speech model to convert the segment of text into speech. The map integration module integrates the map and the speech into a voice map.

In summary, according to the embodiments of the present invention, speech in the voice of a person specified by the user can be synthesized by machine and combined with a map specified by the user to form a voice map, and the spoken content can also be written by the user.
Drawings
Fig. 1 is a schematic diagram of a hardware architecture of a voice map generating device according to an embodiment of the present invention.
Fig. 2 is a schematic software architecture diagram of a voice map generating device according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method for generating a voice map according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a text-to-speech model according to an embodiment of the present invention.
FIG. 5 is a block diagram of a text encoder according to an embodiment of the present invention.
FIG. 6 is a block diagram of an audio encoder according to an embodiment of the present invention.
Fig. 7 is a block diagram of a decoder according to an embodiment of the invention.
Wherein, the reference numbers:
voice mapping generating apparatus 100
Non-transitory computer readable recording medium 123
Text-to-speech module 270
Steps S301, S302, S303, S304
Non-causal convolutional layer 4112
First causal convolution layer 431
Second causal convolutional layer 433
Detailed Description
Referring to fig. 1, a hardware architecture of a voice map generating apparatus 100 according to an embodiment of the invention is shown. The voice map generating apparatus 100 is one or more computer systems with computing capability (here, a processing apparatus 120 is taken as an example), such as a personal computer, a notebook computer, a smart phone, a tablet computer, a server cluster, and so on. The voice map generating apparatus 100 can generate a voice map for the user to use, for example, by sending it to an interlocutor in communication software.
The hardware of the processing device 120 of the voice map generating device 100 includes a processor 121, a memory 122, a non-transitory computer readable recording medium 123, a peripheral interface 124, and a bus 125 for the above components to communicate with each other. The bus 125 includes, but is not limited to, one or more combinations of a system bus, a memory bus, a peripheral bus, and the like. The processor 121 includes, but is not limited to, a Central Processing Unit (CPU) 1213 and a Neural Network Processor (NPU) 1215. The memory 122 includes, but is not limited to, volatile memory 1224 (such as Random Access Memory (RAM)) and non-volatile memory 1226 (such as Read Only Memory (ROM)). The non-transitory computer readable recording medium 123 may be, for example, a hard disk, a solid state disk, etc., for storing a computer program product (hereinafter, "software") comprising instructions that, when executed by the processor 121 of the computer system, cause the computer system to perform the voice map generating method.
The peripheral interface 124 is used for connecting the sound receiving device 110 and the input device 130. The sound receiving device 110 is used for capturing the voice of the user and includes a single microphone or a plurality of microphones (e.g. a microphone array). The microphone may be of the type such as a moving coil microphone, a condenser microphone, a microelectromechanical microphone, etc. The input device 130 is used for inputting characters by a user, such as a keyboard, a touch pad (in cooperation with handwriting recognition software), a writing pad, a mouse (in cooperation with a virtual keyboard), and the like.
In some embodiments, any two of the sound receiving apparatus 110, the processing apparatus 120, and the input device 130 may be implemented in a single unitary form. For example, the sound receiving device 110 and the processing device 120 are implemented as a single device of a tablet computer, and are connected to an external input device 130 (e.g., a keyboard). Or, for example, the sound receiving device 110, the processing device 120 and the input device 130 are implemented as a single device of a notebook computer.
In some embodiments, the sound receiving apparatus 110, the processing apparatus 120 and the input device 130 may be separate individuals. For example, the processing device 120 is a personal computer, and is connected to the external sound receiving device 110 and the input device 130.
In some embodiments, the processing device 120 includes two or more computer systems, such as a personal computer and a server. The server performs the voice map generation process. The personal computer has a built-in or externally connected sound receiving device 110 and input device 130, so as to transmit the user's voice and input text to the server through the network and receive the voice map returned by the server through the network.
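As a rough illustration of this client-server split, the following sketch shows a client uploading the text (and a recorded corpus) to the server and receiving the generated voice map; the endpoint path, field names, and response format are illustrative assumptions, not defined by this disclosure.

```python
# Client-side sketch of the split described above. The "/voice-map" endpoint,
# field names, and MP4 response are assumptions for illustration only.
import requests

def request_voice_map(server_url, text, sticker_id, corpus_wav_path, out_path):
    with open(corpus_wav_path, "rb") as corpus:
        resp = requests.post(
            f"{server_url}/voice-map",                       # hypothetical endpoint
            data={"text": text, "sticker_id": sticker_id},   # user's text and map choice
            files={"corpus": corpus},                        # user's recorded voice
            timeout=120,
        )
    resp.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(resp.content)                              # returned voice map file
    return out_path
```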
Referring to fig. 2, a software architecture of the voice map generating apparatus 100 according to an embodiment of the invention is shown. As shown in fig. 2, the software of the voice map generating apparatus 100 includes: a recording module 210, a corpus 220, a model training module 230, a weight database 240, a text input module 250, a map library 260, a text-to-speech module 270, and a map integration module 280. The recording module 210, the corpus 220, the model training module 230, and the weight database 240 are used for training a text-to-speech neural network model (hereinafter, the "text-to-speech model"); the text input module 250, the map library 260, the text-to-speech module 270, and the map integration module 280 use the trained weight database 240 to generate voice maps.
First, a part of training is explained. The recording module 210 and the corpus 220 are used to provide a corpus of one or more persons, which refers to voice data, i.e. voice files spoken by the person. For example, the user may use the recording module 210 to record the voice of the user received by the sound receiving device 110 into the corpus. Corpus 220 stores prerecorded corpora of a person or persons. In some embodiments, corpus 220 also stores text corresponding to the content of each corpus. The person may be the user himself, or a friend or a family thereof, a public figure, or the like.
The model training module 230 inputs a plurality of corpora belonging to a person, together with the corresponding text, into the text-to-speech model to obtain a model weight corresponding to that person. The model weights are stored in the weight database 240 for retrieval by the text-to-speech module 270. Here, the text-to-speech model is a Sequence-to-Sequence model.
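A minimal training sketch (PyTorch) of this step is shown below; the model is assumed to be any sequence-to-sequence module mapping (text, Mel target) to a predicted Mel spectrum, the batch format and L1 spectrogram loss are assumptions, and the weight database is simplified to a dictionary keyed by person.

```python
# Training sketch: fit the text-to-speech model on one person's corpus and
# store the resulting weights in a per-person weight database. The model,
# batch format, and loss function are assumptions for illustration.
import os
import torch

def train_person_weights(model, person_id, batches, weight_db_path, epochs=10):
    # batches: list of (text_ids, mel_target) tensor pairs for one person
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.L1Loss()
    for _ in range(epochs):
        for text_ids, mel_target in batches:
            optimizer.zero_grad()
            mel_pred = model(text_ids, mel_target)   # teacher-forced prediction
            loss = loss_fn(mel_pred, mel_target)
            loss.backward()
            optimizer.step()
    # Weight database: one model-weight entry per person.
    weight_db = torch.load(weight_db_path) if os.path.exists(weight_db_path) else {}
    weight_db[person_id] = model.state_dict()
    torch.save(weight_db, weight_db_path)
```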
In some embodiments, the model training module 230 may perform pre-processing on the corpus to be input, such as filtering, adjusting volume, time-domain-to-frequency-domain conversion, dynamic compression, denoising, making the audio format consistent, and the like. The text corresponding to the corpus may be stored in the corpus 220 or input via the input device 130.
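A possible pre-processing sketch using librosa is given below; the library choice and parameter values are assumptions, since the patent does not name specific tools.

```python
# Corpus pre-processing sketch: load, trim silence (light denoising), adjust
# volume, and convert to log-Mel features so all corpora share one format.
import librosa
import numpy as np

def preprocess_corpus(wav_path, sr=22050, n_mels=80):
    audio, _ = librosa.load(wav_path, sr=sr)           # resample to a common rate
    audio, _ = librosa.effects.trim(audio, top_db=30)  # strip leading/trailing silence
    audio = librosa.util.normalize(audio)              # adjust volume
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    return np.log(mel + 1e-6)                          # F x T Mel features (log scale)
```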
In some embodiments, the user's corpus may be obtained by using only the recording module 210 in conjunction with the sound receiving device 110, and thus the corpus 220 may not be available. In other embodiments, only the corpus stored in the corpus 220 may be used, and the recording module 210 and the sound receiving device 110 may not be provided.
Next, how to generate the voice map will be described. Referring to fig. 2 and fig. 3 together, fig. 3 is a flowchart of a voice map generating method according to an embodiment of the invention. In step S301, the user inputs text by operating the input device 130: the text input module 250 displays an input screen (for example, provides an input field) and then obtains the segment of text entered by the user in the input screen. In step S302, after the text-to-speech module 270 loads the text-to-speech model, the text is fed into the input end of the text-to-speech model, and the text-to-speech module 270 then takes the converted speech from the output end of the model. In step S303, the map integration module 280 obtains a map from the map library 260; the map can be a still picture or a moving picture (e.g., an APNG file). In step S304, the map integration module 280 integrates the speech and the map into a voice map.
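Putting steps S301-S304 together, a minimal sketch of the flow might look as follows; tts_model.synthesize(), the map-library dictionary, and the integrate() helper (sketched after the next paragraph) are illustrative names, not interfaces defined by the patent.

```python
# End-to-end sketch of steps S301-S304. All names are illustrative.
def generate_voice_map(text, sticker_id, map_library, tts_model, out_path):
    # S301: a segment of text obtained from the input screen (passed in here)
    # S302: convert the text into speech via the text-to-speech model
    speech_wav = tts_model.synthesize(text)
    # S303: obtain a map (still picture or APNG) from the map library
    sticker_path = map_library[sticker_id]
    # S304: integrate the speech and the map into a voice map
    return integrate(speech_wav, sticker_path, out_path)
```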
In some embodiments, the integration produces a voice map that combines the speech and the map in a single file, for example in a movie format. In other embodiments, the voice map consists of separate files, such as a voice file and a map file, and the integration associates them so that the corresponding voice and map are played simultaneously when the voice map is played.
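For the single-file case, one way to realize the integration is to mux the speech with the sticker image into a short movie using the ffmpeg command-line tool; this is an assumption about tooling, not something the patent prescribes. For the separate-file case, the module could instead simply record the association between the voice file and the map file.

```python
# Integration sketch for the single-file case: repeat the sticker image as
# video frames and attach the synthesized speech as the audio track.
import subprocess

def integrate(speech_wav, sticker_image, out_mp4):
    subprocess.run([
        "ffmpeg", "-y",
        "-loop", "1", "-i", sticker_image,   # still image looped as video
        "-i", speech_wav,                    # synthesized speech track
        "-c:v", "libx264", "-tune", "stillimage",
        "-c:a", "aac",
        "-pix_fmt", "yuv420p",
        "-shortest",                         # stop when the speech ends
        out_mp4,
    ], check=True)
    return out_mp4
```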
In some embodiments, the map can be obtained by the map integration module 280 providing a selection screen (e.g., providing a map menu), and the user can select the map in the map library by operating the input device 130. Thus, the map integration module 280 receives the user's map selection and retrieves the corresponding map from the map library according to the map selection.
In some embodiments, the text-to-speech module 270 provides another selection screen (e.g., a menu of persons) for the user to operate the input device 130 and select whose voice is to be used to synthesize the speech. The text-to-speech module 270 thus receives a voice selection corresponding to a person and retrieves that person's model weight from the weight database 240 according to the voice selection. The text-to-speech module 270 then applies the retrieved model weight to the text-to-speech model, so that the generated speech for the text segment sounds as if it were uttered by that person.
The text-to-speech model is explained next. Fig. 4 is a schematic diagram of a structure of a text-to-speech model according to an embodiment of the present invention. The text-to-speech model includes an encoder 410, Attention mechanism (Attention)420, decoder 430, post network (PostNet)440, and Vocoder (Vocoder) 450.
The encoder 410 includes a text encoder (TextEncoder) 411 and an audio encoder (AudioEncoder) 412. Referring to fig. 5 and fig. 6, fig. 5 is a schematic diagram of the text encoder 411 according to an embodiment of the invention, and fig. 6 is a schematic diagram of the audio encoder 412 according to an embodiment of the invention. In one embodiment, the text encoder 411 includes a Character Embedding layer 4111, a Non-causal Convolution layer 4112, and four Highway Convolution layers 4113. In one embodiment, the audio encoder 412 includes three Causal Convolution layers 4121 and four Highway Convolution layers 4122. However, the text encoder 411 and the audio encoder 412 of the embodiments of the present invention are not limited to the above-mentioned structures.
Referring to fig. 7, a schematic diagram of the decoder 430 (or audio decoder) according to an embodiment of the present invention is shown. In one embodiment, the decoder 430 includes a first causal convolutional layer 431, four highway convolutional layers 432, two second causal convolutional layers 433, and a logistic function (Sigmoid) layer 434. The decoder 430 of the embodiments of the present invention is not limited to the above-mentioned components.
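A compact PyTorch sketch of these building blocks (character embedding, non-causal/causal convolutions, highway convolutions, and the Sigmoid-terminated decoder) follows; the channel sizes, kernel widths, and the exact gating form of the highway convolution are assumptions, since the patent only names the layer types.

```python
# Sketch (PyTorch) of the layer types named in Figs. 5-7. Channel sizes,
# kernel widths, and the highway gating details are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that only sees current and past frames (left padding)."""
    def __init__(self, ch_in, ch_out, kernel=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel - 1) * dilation
        self.conv = nn.Conv1d(ch_in, ch_out, kernel, dilation=dilation)
    def forward(self, x):                        # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))

class HighwayConv1d(nn.Module):
    """Highway convolution: a learned gate blends new features with the input."""
    def __init__(self, ch, kernel=3, causal=False):
        super().__init__()
        self.left_pad = (kernel - 1) if causal else 0
        same_pad = 0 if causal else (kernel - 1) // 2
        self.conv = nn.Conv1d(ch, 2 * ch, kernel, padding=same_pad)
    def forward(self, x):
        y = self.conv(F.pad(x, (self.left_pad, 0)) if self.left_pad else x)
        h, gate = y.chunk(2, dim=1)
        gate = torch.sigmoid(gate)
        return gate * h + (1.0 - gate) * x       # gated mix of conv output and input

class TextEncoder(nn.Module):
    """Fig. 5: character embedding, one non-causal conv, four highway convs."""
    def __init__(self, vocab_size, emb=128, ch=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.conv = nn.Conv1d(emb, 2 * ch, kernel_size=3, padding=1)   # non-causal
        self.highways = nn.ModuleList([HighwayConv1d(2 * ch) for _ in range(4)])
    def forward(self, text_ids):                 # (batch, n_chars)
        x = self.embed(text_ids).transpose(1, 2)
        x = self.conv(x)
        for hw in self.highways:
            x = hw(x)
        return x.chunk(2, dim=1)                 # keys K and values V (Equation 1)

class AudioEncoder(nn.Module):
    """Fig. 6: three causal convs, four causal highway convs -> queries Q."""
    def __init__(self, n_mels=80, ch=256):
        super().__init__()
        self.convs = nn.ModuleList(
            [CausalConv1d(n_mels, ch), CausalConv1d(ch, ch), CausalConv1d(ch, ch)])
        self.highways = nn.ModuleList([HighwayConv1d(ch, causal=True) for _ in range(4)])
    def forward(self, mel):                      # (batch, n_mels, time)
        x = mel
        for conv in self.convs:
            x = torch.relu(conv(x))
        for hw in self.highways:
            x = hw(x)
        return x                                 # Q (Equation 2)

class AudioDecoder(nn.Module):
    """Fig. 7: first causal conv, four highway convs, two causal convs, Sigmoid."""
    def __init__(self, ch=256, n_mels=80):
        super().__init__()
        self.first = CausalConv1d(ch, ch)
        self.highways = nn.ModuleList([HighwayConv1d(ch, causal=True) for _ in range(4)])
        self.second = nn.ModuleList([CausalConv1d(ch, ch), CausalConv1d(ch, n_mels)])
        self.out = nn.Sigmoid()
    def forward(self, r):                        # r: output of the attention mechanism
        x = self.first(r)
        for hw in self.highways:
            x = hw(x)
        for conv in self.second:
            x = conv(x)
        return self.out(x)                       # predicted Mel frames (Equation 5)
```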
In one embodiment, given a query and a key-value table, the attention mechanism 420 maps the query to the correct input, and the output is a weighted sum whose weights are determined by the query, the keys, and the values. Referring to Equation 1, the output of the text encoder 411 is a key-value pair, where L is the input text, K is the key, and V is the value. Referring to Equation 2, the output of the audio encoder 412 is the query Q, where M1:F,1:T is the Mel cepstrum of the input corpus audio, a two-dimensional array of size F × T; F is the number of Mel filter banks and T is the number of audio time frames. The matching degree between the text and the speech is QKᵀ/√d; after SoftMax normalization, the attention weight A is obtained, as shown in Equation 3, where d is the dimension and Kᵀ is the transpose of K. The product of the values and the attention weights (Equation 4) is input to the audio decoder 430 to obtain the speech feature vector, as shown in Equation 5, where Y1:F,2:T+1 is the speech feature vector, F is the number of Mel filter banks, T is the number of audio time frames, and R′ is the output of the attention mechanism.

(K, V) = TextEncoder(L)   (Equation 1)

Q = AudioEncoder(M1:F,1:T)   (Equation 2)

A = SoftMax(QKᵀ/√d)   (Equation 3)

R′ = V · A   (Equation 4)

Y1:F,2:T+1 = AudioDec(R′)   (Equation 5)
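A sketch of Equations 1-5 in PyTorch, using the tensor layout of the architecture sketch above (channels on dimension 1); the shapes are assumptions.

```python
# Scaled dot-product attention following Equations 1-5: K, V from the text
# encoder, Q from the audio encoder, SoftMax-normalized matching, weighted values.
import torch

def attend(K, Q, V):
    # K, V: (batch, d, N_text);  Q: (batch, d, T_audio)
    d = K.size(1)
    scores = torch.bmm(Q.transpose(1, 2), K) / d ** 0.5   # (batch, T, N): Q·Kᵀ/√d
    A = torch.softmax(scores, dim=-1)                     # attention weights (Eq. 3)
    R = torch.bmm(V, A.transpose(1, 2))                   # weighted values (Eq. 4)
    return R, A

# Usage with the modules sketched above:
# K, V = text_encoder(text_ids); Q = audio_encoder(mel)
# R, A = attend(K, Q, V); mel_pred = audio_decoder(R)     # Equation 5
```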
The attention mechanism 420 is not limited to the above embodiment. In another embodiment, given queries and a key-value table, the attention mechanism 420 maps each query to the correct input, and the output is a weighted sum whose weights are determined by the queries, the keys, and the values. Referring to Equation 6, the output of the text encoder 411 is a plurality of keys and values, where L is the input text, K = [K1, ..., Kn] are n keys, and V = [V1, ..., Vn] are the corresponding n values. Referring to Equation 7, the output of the audio encoder 412 is n queries Q = [Q1, ..., Qn], where M1:F,1:T is the Mel cepstrum of the input corpus audio, a two-dimensional array of size F × T; F is the number of Mel filter banks and T is the number of audio time frames. For the i-th key-query pair, the matching degree between the text and the speech is QiKiᵀ/√d; after SoftMax normalization, the attention weight of the i-th group is obtained, as shown in Equation 8, where d is the dimension, Kiᵀ is the transpose of Ki, and Ai is the attention weight of the i-th group. Each group's value is multiplied by its attention weight (Equation 9), the results are concatenated, and the result is input to the audio decoder 430 to obtain the speech feature vector, as shown in Equation 10, where Y1:F,2:T+1 is the speech feature vector, F is the number of Mel filter banks, T is the number of audio time frames, and R is the output of the attention mechanism.

(K, V) = TextEncoder(L)   (Equation 6)

where K and V are n keys and n values respectively; n may be, for example, 10 or 20, but is not limited thereto.

Q = AudioEncoder(M1:F,1:T)   (Equation 7)

where Q is n queries; n may be, for example, 10 or 20, but is not limited thereto.

Ai = SoftMax(QiKiᵀ/√d)   (Equation 8)

where Ai is computed from the i-th of the n keys in Equation 6 and the i-th of the n queries in Equation 7; there are n attention weights Ai, the same number as K, V, and Q.

R = Concatenate(Vi · Ai)   (Equation 9)

where Ai is the i-th of the n attention weights in Equation 8 and Vi is the i-th of the n values in Equation 6. Each paired Vi and Ai are matrix-multiplied, and the products are concatenated to obtain the final R.

Y1:F,2:T+1 = AudioDec(R)   (Equation 10)
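The second embodiment can be sketched in the same style; splitting K, V, and Q into n groups along the channel dimension is an assumption about how the n keys, values, and queries are formed.

```python
# Grouped attention following Equations 6-10: per-group scaled dot-product
# attention, with the per-group results concatenated before the audio decoder.
import torch

def grouped_attend(K, Q, V, n=10):
    # K, V: (batch, d, N_text);  Q: (batch, d, T_audio);  n groups (e.g. 10 or 20)
    Ks, Vs, Qs = K.chunk(n, dim=1), V.chunk(n, dim=1), Q.chunk(n, dim=1)
    outs = []
    for Ki, Vi, Qi in zip(Ks, Vs, Qs):
        di = Ki.size(1)
        Ai = torch.softmax(torch.bmm(Qi.transpose(1, 2), Ki) / di ** 0.5, dim=-1)  # Eq. 8
        outs.append(torch.bmm(Vi, Ai.transpose(1, 2)))     # Vi · Ai (Equation 9)
    return torch.cat(outs, dim=1)                          # Concatenate -> R (Eq. 9)
```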
The post network (PostNet) 440 optimizes the speech feature vectors; in other words, the post network 440 refines the speech feature vectors output by the decoder 430, thereby reducing noise and pops in the output audio and improving its quality.
The vocoder (Vocoder) 450 converts the speech feature vectors into the speech output. The vocoder 450 can be implemented with the open-source software "WORLD" or "STRAIGHT", but embodiments of the present invention are not limited thereto.
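If the open-source WORLD vocoder is used, its Python binding pyworld can perform the final synthesis; the sketch below assumes the speech feature vectors have already been mapped back to WORLD's parameters (fundamental frequency, spectral envelope, aperiodicity), a conversion not detailed here.

```python
# Vocoder sketch using pyworld (the WORLD binding). Inputs are assumed to be
# WORLD parameters with frames along the first axis, as float64 arrays.
import numpy as np
import pyworld
import soundfile as sf

def vocode(f0, spectral_envelope, aperiodicity, sr=22050, out_wav="speech.wav"):
    wav = pyworld.synthesize(f0.astype(np.float64),
                             spectral_envelope.astype(np.float64),
                             aperiodicity.astype(np.float64), sr)
    sf.write(out_wav, wav, sr)   # write the synthesized speech to disk
    return out_wav
```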
In some embodiments, the text may be pre-processed before being input into the text-to-speech model, for example: Chinese characters may be converted into coded strings corresponding to their phonetic symbols, the segment of text may be word-segmented (for example, with the jieba software or the CKIP Chinese word segmentation system of Academia Sinica), and, for polyphonic characters, the correct tone may be found by table lookup or adjusted according to the third-tone sandhi rule.
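A small sketch of such pre-processing, using jieba for word segmentation (named in the text) and pypinyin for the phonetic coding (an assumed substitute for whatever phonetic-symbol coding is actually used):

```python
# Text pre-processing sketch: segment the Chinese text and convert each word
# to tone-numbered phonetic codes. Polyphonic characters could afterwards be
# corrected with a lookup table or tone-sandhi rules.
import jieba
from pypinyin import lazy_pinyin, Style

def text_to_phonetic_codes(text):
    words = jieba.lcut(text)                            # word segmentation
    # Tone numbers appended to each syllable, e.g. ["ni3", "hao3"]
    return [lazy_pinyin(w, style=Style.TONE3) for w in words]
```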
In summary, according to the embodiments of the present invention, speech in the voice of a person specified by the user can be synthesized by machine and combined with a map specified by the user to form a voice map, and the spoken content can also be written by the user.
Although the present invention has been described with reference to the above embodiments, it should be understood that various changes and modifications can be made therein by those skilled in the art without departing from the spirit and scope of the invention.
Claims (10)
1. A method for generating a voice map, comprising:
obtaining a segment of characters;
converting the segment of text into speech via a text-to-speech model;
obtaining a map; and
integrating the voice and the map.
2. The method of generating a voice map according to claim 1, further comprising:
receiving a voice selection corresponding to a person;
extracting a model weight corresponding to the person from a weight database according to the voice selection; and
applying the model weight to the text-to-speech model.
3. The method of generating a voice map according to claim 1, further comprising:
receiving a corpus of a corresponding person; and
inputting the corpus to the text-to-speech model to obtain a model weight corresponding to the person.
4. The method of claim 1, wherein the text-to-speech model comprises an encoder, attention mechanism, decoder, post-network and vocoder connected in sequence.
5. The method as claimed in claim 1, wherein the obtaining of the map is receiving a map selection, and the map is retrieved from a map library according to the map selection.
6. A voice map generating apparatus, comprising:
a text input module for obtaining a segment of text;
a text-to-speech module carrying a text-to-speech model for converting the segment of text to speech; and
a map integration module, which integrates the map and the speech into a voice map.
7. The apparatus of claim 6, wherein the map integration module receives a voice selection corresponding to a person, the apparatus further comprises a weight database, and the text-to-speech module extracts a model weight corresponding to the person from the weight database according to the voice selection and applies the model weight to the text-to-speech model.
8. The apparatus of claim 6, further comprising a model training module for receiving the corpus of the corresponding person and inputting the corpus into the text-to-speech model to obtain the model weight of the corresponding person.
9. The apparatus of claim 6, wherein the text-to-speech model comprises an encoder, attention mechanism, decoder, post-network and vocoder connected in sequence.
10. The apparatus of claim 6, further comprising a map library, wherein the map integration module receives a map selection and retrieves the map from the map library according to the map selection.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911319032.7A CN113096639B (en) | 2019-12-19 | 2019-12-19 | Voice map generation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911319032.7A CN113096639B (en) | 2019-12-19 | 2019-12-19 | Voice map generation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113096639A true CN113096639A (en) | 2021-07-09 |
CN113096639B CN113096639B (en) | 2024-05-31 |
Family
ID=76662749
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911319032.7A Active CN113096639B (en) | 2019-12-19 | 2019-12-19 | Voice map generation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113096639B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW201042987A (en) * | 2008-10-17 | 2010-12-01 | Commw Intellectual Property Holdings Inc | Intuitive voice navigation |
CN103095685A (en) * | 2012-12-18 | 2013-05-08 | 上海量明科技发展有限公司 | Instant messaging composite icon recording method, client terminal and system |
CN106339201A (en) * | 2016-09-14 | 2017-01-18 | 北京金山安全软件有限公司 | Map processing method and device and electronic equipment |
TW201737663A (en) * | 2016-04-13 | 2017-10-16 | Zheng Cai Shen Cloud Computing Co Ltd | Personalized audio sticker generation system applied in instant messaging and method thereof capable of linking up speech audio signal to a sticker for increasing the interest of instant messaging |
TW201738720A (en) * | 2016-04-25 | 2017-11-01 | 剛谷科技股份有限公司 | Communication platform providing picture with sound |
CN107330961A (en) * | 2017-07-10 | 2017-11-07 | 湖北燿影科技有限公司 | A kind of audio-visual conversion method of word and system |
KR20180084469A (en) * | 2017-01-17 | 2018-07-25 | 네이버 주식회사 | Apparatus and method for providing voice data |
CN110379411A (en) * | 2018-04-11 | 2019-10-25 | 阿里巴巴集团控股有限公司 | For the phoneme synthesizing method and device of target speaker |
- 2019-12-19 CN CN201911319032.7A patent/CN113096639B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW201042987A (en) * | 2008-10-17 | 2010-12-01 | Commw Intellectual Property Holdings Inc | Intuitive voice navigation |
CN103095685A (en) * | 2012-12-18 | 2013-05-08 | 上海量明科技发展有限公司 | Instant messaging composite icon recording method, client terminal and system |
TW201737663A (en) * | 2016-04-13 | 2017-10-16 | Zheng Cai Shen Cloud Computing Co Ltd | Personalized audio sticker generation system applied in instant messaging and method thereof capable of linking up speech audio signal to a sticker for increasing the interest of instant messaging |
CN107294836A (en) * | 2016-04-13 | 2017-10-24 | 正财神云端科技有限公司 | Personalized audio map generation system and method applied to instant messaging |
TW201738720A (en) * | 2016-04-25 | 2017-11-01 | 剛谷科技股份有限公司 | Communication platform providing picture with sound |
CN106339201A (en) * | 2016-09-14 | 2017-01-18 | 北京金山安全软件有限公司 | Map processing method and device and electronic equipment |
KR20180084469A (en) * | 2017-01-17 | 2018-07-25 | 네이버 주식회사 | Apparatus and method for providing voice data |
CN107330961A (en) * | 2017-07-10 | 2017-11-07 | 湖北燿影科技有限公司 | A kind of audio-visual conversion method of word and system |
CN110379411A (en) * | 2018-04-11 | 2019-10-25 | 阿里巴巴集团控股有限公司 | For the phoneme synthesizing method and device of target speaker |
Also Published As
Publication number | Publication date |
---|---|
CN113096639B (en) | 2024-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112735373B (en) | Speech synthesis method, device, equipment and storage medium | |
JP7106680B2 (en) | Text-to-Speech Synthesis in Target Speaker's Voice Using Neural Networks | |
CN111276120B (en) | Speech synthesis method, apparatus and computer-readable storage medium | |
WO2020232860A1 (en) | Speech synthesis method and apparatus, and computer readable storage medium | |
CN112151030B (en) | Multi-mode-based complex scene voice recognition method and device | |
CN111949784A (en) | Outbound method and device based on intention recognition | |
CN112786007A (en) | Speech synthesis method, device, readable medium and electronic equipment | |
CN112786008A (en) | Speech synthesis method, device, readable medium and electronic equipment | |
CN114038484B (en) | Voice data processing method, device, computer equipment and storage medium | |
CN112927674A (en) | Voice style migration method and device, readable medium and electronic equipment | |
CN113205793B (en) | Audio generation method and device, storage medium and electronic equipment | |
CN114882862A (en) | Voice processing method and related equipment | |
CN110781329A (en) | Image searching method and device, terminal equipment and storage medium | |
CN114999443A (en) | Voice generation method and device, storage medium and electronic equipment | |
CN118696371A (en) | Optimizing Conformer inference performance | |
CN113077783A (en) | Method and device for amplifying Chinese speech corpus, electronic equipment and storage medium | |
CN113345410A (en) | Training method of general speech and target speech synthesis model and related device | |
CN113096639B (en) | Voice map generation method and device | |
US20230298565A1 (en) | Using Non-Parallel Voice Conversion for Speech Conversion Models | |
TWI732390B (en) | Device and method for producing a voice sticker | |
CN114464163A (en) | Method, device, equipment, storage medium and product for training speech synthesis model | |
CN113744369A (en) | Animation generation method, system, medium and electronic terminal | |
Kiran Reddy et al. | DNN-based cross-lingual voice conversion using Bottleneck Features | |
KR20220138669A (en) | Electronic device and method for providing personalized audio information | |
CN113096633B (en) | Information film generation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |