CN111402887A - Method and device for converting speech into text - Google Patents
Method and device for converting speech into text
- Publication number
- CN111402887A (application CN201811542192.3A)
- Authority
- CN
- China
- Prior art keywords
- voice
- signal
- model
- sequence
- processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/04—Training, enrolment or model building
Abstract
The invention discloses a method and a device for converting speech into text, comprising the following steps: preprocessing the acquired voice to obtain voice characteristics corresponding to the voice; storing the voice characteristics into a data model base, and performing matching processing to obtain a voice signal; processing the voice signal based on a preset language model, and determining a voice sequence corresponding to the voice signal, wherein the preset language model represents a processing model for polyphones in the voice signal; and decoding the voice sequence based on a preset acoustic model to obtain text information corresponding to the voice sequence. The invention makes voice recognition and text conversion more accurate and better meets users' requirements.
Description
Technical Field
The present invention relates to the field of audio processing technologies, and in particular to a method and an apparatus for converting speech into text.
Background
In many scenarios, such as conferences, training sessions, interviews, and speech contests, the audio content must be put on record. It is usually captured as an audio recording and then transcribed by professionals who listen to the audio and convert it into the corresponding text. This manual transcription is time-consuming and inefficient.
With the development of intelligent technology, methods for converting speech into text have emerged, in which the recorded audio is converted into text output. However, these techniques impose certain requirements on the environment and on the speaker's manner and characteristics: they handle speech containing multiple languages, low-frequency words, or specialized terms poorly, and accents, dialects, and differing timbres reduce the accuracy of the conversion, so the resulting text may not reflect the essential meaning of the user's speech.
Disclosure of Invention
In view of the above problems, the present invention provides a method and an apparatus for converting speech into text that make voice recognition and text conversion more accurate and better meet users' requirements.
To this end, the invention provides the following technical solution:
a method for converting speech into text, comprising:
preprocessing the acquired voice to obtain voice characteristics corresponding to the voice;
storing the voice characteristics into a data model base, and performing matching processing to obtain a voice signal;
processing the voice signal based on a preset language model, and determining a voice sequence corresponding to the voice signal, wherein the preset language model represents a processing model for polyphones in the voice signal;
and decoding the voice sequence based on a preset acoustic model to obtain text information corresponding to the voice sequence.
Optionally, preprocessing the acquired voice to obtain voice characteristics corresponding to the voice includes:
acquiring voice;
and extracting the voice characteristics of the voice passing through a unit area perpendicular to the sound wave propagation direction in unit time, wherein the voice characteristics comprise one or more of average sound energy, sound intensity, audio characteristics, pitch period, signal-to-noise ratio and harmonic-to-noise ratio.
Optionally, storing the voice characteristics into a data model base and performing matching processing to obtain a voice signal includes:
storing the voice features into a data model base;
and matching the voice characteristics with the voice standard in the data model base to obtain the voice signal with interference sound filtered.
Optionally, processing the voice signal based on a preset language model and determining a voice sequence corresponding to the voice signal includes:
creating a language model based on the set of speech signals;
framing the voice signal through the language model, and determining a phoneme matched with each frame of the voice signal;
calculating and obtaining left and right phonemes of each frame according to the phonemes matched with each frame of the voice signal;
and determining a voice sequence corresponding to the voice signal according to each frame of phoneme and the left and right phonemes of the voice signal.
Optionally, decoding the voice sequence based on the preset acoustic model to obtain text information corresponding to the voice sequence includes:
processing the voice sequence based on a preset acoustic modeling model to obtain acoustic representation information corresponding to the voice sequence;
decoding the acoustic representation information to obtain a character search path corresponding to the acoustic representation information;
and determining an optimal search path in the text search paths, and determining text information corresponding to the voice sequence according to the optimal search path.
An apparatus for converting speech into text, comprising:
the preprocessing unit is used for preprocessing the acquired voice to obtain voice characteristics corresponding to the voice;
the matching processing unit is used for storing the voice characteristics into a database model base and performing matching processing to obtain voice signals;
the language model processing unit is used for processing the voice signal based on a preset language model and determining a voice sequence corresponding to the voice signal, wherein the preset language model represents a processing model for polyphones in the voice signal;
and the acoustic modeling processing unit is used for decoding the voice sequence based on a preset acoustic modeling to obtain the character information corresponding to the voice sequence.
Optionally, the pre-processing unit comprises:
an acquisition subunit configured to acquire a voice;
and the extracting subunit is used for extracting the voice characteristics of the voice passing through a unit area perpendicular to the sound wave propagation direction in unit time, wherein the voice characteristics comprise one or more of average sound energy, sound intensity, audio characteristics, pitch period, signal-to-noise ratio and harmonic-to-noise ratio.
Optionally, the matching processing unit includes:
the storage subunit is used for storing the voice characteristics into a data model base;
and the matching subunit is used for matching the voice characteristics with the voice standard in the data model base to obtain the voice signal with interference sound filtered.
Optionally, the language model processing unit includes:
a model creation subunit configured to create a language model based on the speech signal set;
the framing subunit is used for framing the voice signal through the language model and determining a phoneme matched with each frame of the voice signal;
the calculating subunit is used for calculating and obtaining left and right phonemes of each frame according to the phonemes matched with each frame of the voice signal;
and the determining subunit is used for determining a voice sequence corresponding to the voice signal according to each frame of phoneme and the left and right phonemes of the voice signal.
Optionally, the acoustic modeling processing unit includes:
the voice sequence processing subunit is used for processing the voice sequence based on a preset acoustic modeling model to obtain acoustic representation information corresponding to the voice sequence;
the decoding processing subunit is used for decoding the acoustic representation information to obtain a text search path corresponding to the acoustic representation information;
and the text determining subunit is used for determining an optimal search path in the text search paths and determining text information corresponding to the voice sequence according to the optimal search path.
Compared with the prior art, the present invention provides a method and a device for converting speech into text: the acquired voice is processed to obtain a voice signal; the voice signal is then processed based on a language model, including polyphone processing, so that the resulting voice sequence better fits the context and the speaker's emotion; and the voice sequence is then converted into text based on an acoustic model generated by training on massive data. The text converted from the voice is therefore more accurate, matches the speaker's emotion more closely, and meets users' requirements.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing them are briefly introduced below. Obviously, the drawings described below show only embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for converting speech into text according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of another method for converting speech into text according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a device for converting speech into text according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
The terms "first" and "second" and the like in the description, the claims, and the drawings are used to distinguish between different objects, not to describe a particular order. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements, and may include steps or elements not listed.
An embodiment of the present invention provides a method for converting speech into text; referring to fig. 1, the method includes:
and S11, preprocessing the acquired voice to obtain the voice characteristics corresponding to the voice.
Because the acquired speech varies with the environment and is constrained by the speaker's accent, dialect, and timbre, it can exhibit different characteristics. To process the speech more accurately, it must be preprocessed and the corresponding speech features extracted, specifically including:
acquiring voice;
and extracting the voice characteristics of the voice passing through a unit area perpendicular to the sound wave propagation direction in unit time, wherein the voice characteristics comprise one or more of average sound energy, sound intensity, audio characteristics, pitch period, signal-to-noise ratio and harmonic-to-noise ratio.
After the speech is input, its features are extracted, such as the average sound energy, sound intensity, pitch, estimated pitch period, signal-to-noise ratio, and harmonic-to-noise ratio of the speech passing through a unit area perpendicular to the direction of sound-wave propagation in unit time. The features are measured per unit time because the speaker may be influenced by emotion or context, so the features differ from moment to moment; sampling per unit time therefore makes the subsequent processing more accurate, and measuring over a unit area narrows the analysis range.
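The per-unit-time feature extraction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the frame length, the autocorrelation-based pitch estimate, and the function and field names are all assumptions made for this example.

```python
import numpy as np

def frame_features(signal, sr, frame_ms=25):
    """Split a mono signal into fixed-length frames and compute, per frame,
    the average sound energy and a crude pitch-period estimate.
    Illustrative only; the patent does not specify its algorithms."""
    frame_len = int(sr * frame_ms / 1000)
    feats = []
    for i in range(len(signal) // frame_len):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = float(np.mean(frame ** 2))            # average sound energy
        # pitch period: lag of the autocorrelation peak in the 50-400 Hz range
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        lag_lo, lag_hi = int(sr / 400), min(int(sr / 50), frame_len - 1)
        pitch_lag = lag_lo + int(np.argmax(ac[lag_lo:lag_hi]))
        feats.append({"energy": energy, "pitch_period": pitch_lag / sr})
    return feats
```

For a pure 100 Hz tone at an 8 kHz sampling rate, each 25 ms frame yields an energy near 0.5 and a pitch period near 0.01 s.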
And S12, storing the voice characteristics into a database model base, and performing matching processing to obtain a voice signal.
After the speech features are obtained, they, or the data values corresponding to them, are stored in a data model base, and through repeated training, matching, and comparison a speech signal meeting the preset requirements is finally obtained. That is, the interfering information has been filtered out of the speech signal, such as background noise, sighs from the speaker, or applause from the audience.
The sound data can be classified in the data model base to distinguish interfering sounds from the speaker's normal speech.
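As a toy stand-in for this classification step (the patent does not detail how interference is distinguished), frames can be labelled by comparing their energy to an estimated noise floor; the percentile and margin below are assumed values, not from the patent:

```python
import numpy as np

def filter_interference(frame_energies, noise_percentile=20, margin=4.0):
    """Return a boolean mask marking frames kept as speech; frames whose
    energy is close to the estimated noise floor are treated as
    interference and dropped. Illustrative heuristic only."""
    energy = np.asarray(frame_energies, dtype=float)
    noise_floor = np.percentile(energy, noise_percentile)
    return energy > margin * noise_floor
```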
S13, processing the voice signal based on a preset language model, and determining a voice sequence corresponding to the voice signal;
the preset language model represents a processing model for polyphones in the voice signal.
Since the language model can be understood as resolving polyphones, the processing based on the language model may specifically include:
creating a language model based on the set of speech signals;
framing a voice signal through a language model, and determining a phoneme matched with each frame of the voice signal;
calculating and obtaining left and right phonemes of each frame according to the phonemes matched with each frame of the voice signal;
and determining a voice sequence corresponding to the voice signal according to each frame of phoneme and the left and right phonemes of the voice signal.
The language model is obtained by training and prediction on a historically collected set of speech signals. Its main purpose is to compute the probability that a sentence occurs; that is, the language model performs semantic analysis on the speech signal to obtain the context corresponding to it. Concretely, the language model is built mainly as a probability model for the occurrence of a sentence: it determines which word sequence is most likely, or, given several words, predicts how likely the next word is. The speech signal is framed with a preset algorithm; for each phoneme, the frames corresponding to it are found and the phoneme-model parameters are estimated from those frames' features. Then, for a frame lying between adjacent phonemes, it only remains to decide whether it belongs to the left or the right phoneme. Repeating this training yields a more accurate language-model base, from which the speech sequence can be determined.
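The sentence-probability idea can be illustrated with a count-based bigram model, the simplest form such a model can take; the patent does not state which form it uses, and the add-one smoothing and boundary tokens here are assumptions of this sketch:

```python
from collections import Counter

def train_bigram(corpus):
    """Count-based bigram language model with add-one smoothing.
    Returns a function giving P(word | previous word)."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    v = len(vocab)
    return lambda prev, word: (bigrams[(prev, word)] + 1) / (unigrams[prev] + v)

def sentence_prob(prob, sent):
    """Probability of a sentence as the product of its bigram probabilities."""
    toks = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for prev, word in zip(toks[:-1], toks[1:]):
        p *= prob(prev, word)
    return p
```

Ranking candidate word sequences this way is also how an ambiguity such as a polyphone is resolved: the reading whose word sequence the model scores highest wins.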
And S14, decoding the voice sequence based on the preset acoustic model to obtain the text information corresponding to the voice sequence.
After the speech signal is processed, the speech must be converted into text, for which acoustic modeling is needed; this may include:
processing the voice sequence based on a preset acoustic modeling model to obtain acoustic representation information corresponding to the voice sequence;
decoding the acoustic representation information to obtain a text search path corresponding to the acoustic representation information;
and determining an optimal search path in the text search paths, and determining text information corresponding to the voice sequence according to the optimal search path.
Specifically, the created acoustic model can be understood as modeling the utterance so as to convert the speech input into an acoustic representation, or as giving the probability that the speech belongs to each acoustic symbol; that is, it is a model describing the conversion between speech and states.
In this process, the transition probability densities of a hidden Markov model are combined with a recurrent neural network for modeling. Once acoustic modeling is complete, speech recognition can be performed on an unknown speech sequence based on the acoustic model to convert it into text.
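In the HMM setting named above, selecting the optimal search path amounts to finding the most likely state sequence; a textbook Viterbi sketch follows, where the log-probability inputs are assumed to be given (e.g. produced by the neural network just mentioned):

```python
import numpy as np

def viterbi(log_trans, log_emit, log_init):
    """Most likely HMM state path for a sequence of observations.
    log_trans[i][j]: log P(state j | state i); log_emit[t][s]: frame t's
    emission log-prob under state s; log_init[s]: initial log-prob."""
    T, N = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # (prev state, cur state)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(N)] + log_emit[t]
    path = [int(np.argmax(delta))]                   # best final state
    for t in range(T - 1, 0, -1):                    # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With two sticky states and emissions that favor state 0 for the first two frames and state 1 for the last two, the decoded path is [0, 0, 1, 1].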
The invention provides a method for converting speech into text: the acquired voice is processed to obtain a voice signal; the voice signal is processed based on a language model, including polyphone processing, so that the resulting voice sequence better fits the context and the speaker's emotion; and the voice sequence is then converted into text based on an acoustic model generated by training on massive data, so that the text converted from the voice is more accurate, matches the speaker's emotion more closely, and meets users' requirements.
Referring to fig. 2, an embodiment of the present invention provides another method for converting speech into text.
After the speech is input, speech features such as average sound energy, sound intensity, pitch, estimated pitch period, signal-to-noise ratio, and harmonic-to-noise ratio are extracted per unit area perpendicular to the direction of sound-wave propagation; the feature data values are stored in a data model base, and through repeated training, matching, and comparison the expected speech signal is finally obtained.
A language model is then constructed, mainly as a probability model for computing the occurrence of a sentence: the language model determines which word sequence is more likely, or, given several words, predicts how likely the next word is. The obtained speech signal is framed using a preset EM algorithm with an E step and an M step. E step: optimize the neural-network parameters with the BPTT algorithm. M step: re-search for the optimal alignment using the network's output. Specifically, the speech signal is framed; the E step locates the phoneme in each frame, and the M step finds all frames corresponding to each phoneme and estimates the phoneme-model parameters from those frames' features. After alignment, a GMM is trained for each state, and the E and M steps are then looped; at that point the E step only needs to decide whether a frame between adjacent phonemes belongs to the left or the right phoneme. Repeated training in this way yields a more accurate language-model base.
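A bare-bones version of that align/re-estimate loop is shown below as hard EM over scalar frame features with single-mean units. This is an illustrative reduction under stated assumptions: real training uses GMMs, BPTT, and monotonic alignments, none of which are modelled here.

```python
import numpy as np

def hard_em_align(frames, n_units, n_iter=10):
    """Alternate (E) assigning each frame to the closest unit mean and
    (M) re-estimating each unit's mean from its assigned frames."""
    frames = np.asarray(frames, dtype=float)
    # initialize means from an even split of the sequence
    means = np.array([chunk.mean() for chunk in np.array_split(frames, n_units)])
    for _ in range(n_iter):
        # E step: assign each frame to its nearest unit
        assign = np.argmin(np.abs(frames[:, None] - means[None, :]), axis=1)
        # M step: re-estimate each unit's mean from its frames
        for k in range(n_units):
            if np.any(assign == k):
                means[k] = frames[assign == k].mean()
    return assign, means
```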
An acoustic model is then created; it can be understood as modeling the utterance so as to convert the speech input into an acoustic representation, or as giving the probability that the speech belongs to each acoustic symbol, i.e., a model describing the conversion between speech and states. The transition probability densities of a hidden Markov model are combined with a recurrent neural network for modeling. Once the acoustic model is complete, speech recognition is performed on an unknown sequence of speech frames based on it; this process is commonly called search decoding. The decoder is given a search network (each node of which may be a phrase) built by connecting Markov models according to the grammar and the dictionary, and from all possible search paths it selects one or more optimal paths, generally those maximizing the posterior probability, as the recognition result to be converted into text; each such path can be understood as a string of phrases appearing in the dictionary.
In a hidden Markov model (HMM), the internal states of the Markov model are invisible to the outside; only the output value at each moment can be observed. For a speech recognition system, the output values are typically acoustic features computed from individual frames. When an HMM is used to describe a speech signal, two assumptions are made: that the transition to an internal state depends only on the previous state, and that an output value depends only on the current state; these assumptions reduce the complexity of the model.
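Those two assumptions are exactly what make the forward recursion tractable; a minimal sketch computing the total probability of an observation sequence follows (the matrices in the usage note are toy values, not from the patent):

```python
import numpy as np

def forward_prob(init, trans, emit_prob):
    """Total probability of an observation sequence under an HMM.
    emit_prob[t][s] is P(observation at time t | state s); the recursion
    uses only the previous alpha (Markov assumption) and the current
    state's emission (output assumption)."""
    init, trans, emit_prob = map(np.asarray, (init, trans, emit_prob))
    alpha = init * emit_prob[0]                  # alpha_0(s)
    for t in range(1, len(emit_prob)):
        alpha = (alpha @ trans) * emit_prob[t]   # propagate, then emit
    return float(alpha.sum())
```

For instance, starting surely in state 0 with transitions [[0.5, 0.5], [0.0, 1.0]], an observation seen only from state 0 then only from state 1 has total probability 0.5.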
In the embodiment provided by the invention, recognition based on artificial intelligence and algorithm training still suffers some inaccuracy for imprecise pronunciations, and speech carrying emotion differs in sentence punctuation; the speech can therefore be understood from its context, and its approximate meaning analyzed by artificial intelligence. Training the acoustic model on large databases improves the recognition rate of low-frequency words and the handling of different timbres, and also improves the recognition of emotional coloring, so that for emotionally colored speech the recognition system can add the corresponding expressions and symbols to enhance recognition accuracy.
An embodiment of the present invention further provides a device for converting speech into text; referring to fig. 3, the device includes:
the preprocessing unit is used for preprocessing the acquired voice to obtain voice characteristics corresponding to the voice;
the matching processing unit is used for storing the voice characteristics into a database model base and performing matching processing to obtain a voice signal;
the language model processing unit is used for processing the voice signal based on a preset language model and determining a voice sequence corresponding to the voice signal, wherein the preset language model represents a processing model for polyphones in the voice signal;
and the acoustic modeling processing unit is used for decoding the voice sequence based on a preset acoustic modeling to obtain the character information corresponding to the voice sequence.
Optionally, the pre-processing unit comprises:
an acquisition subunit configured to acquire a voice;
and the extracting subunit is used for extracting the voice characteristics of the voice passing through a unit area perpendicular to the sound wave propagation direction in unit time, wherein the voice characteristics comprise one or more of average sound energy, sound intensity, audio characteristics, pitch period, signal-to-noise ratio and harmonic-to-noise ratio.
Optionally, the matching processing unit includes:
the storage subunit is used for storing the voice characteristics into a data model base;
and the matching subunit is used for matching the voice characteristics with the voice standard in the data model base to obtain the voice signal with interference sound filtered.
Optionally, the language model processing unit includes:
a model creation subunit configured to create a language model based on the speech signal set;
the framing subunit is used for framing the voice signals through the language model and determining phonemes matched with each frame of the voice signals;
the calculating subunit is used for calculating and obtaining left and right phonemes of each frame according to the phonemes matched with each frame of the voice signal;
and the determining subunit is used for determining a voice sequence corresponding to the voice signal according to each frame of phoneme and the left and right phonemes of the voice signal.
Optionally, the acoustic modeling processing unit includes:
the voice sequence processing subunit is used for processing the voice sequence based on a preset acoustic modeling model to obtain acoustic representation information corresponding to the voice sequence;
the decoding processing subunit is used for decoding the acoustic representation information to obtain a text search path corresponding to the acoustic representation information;
and the character determining subunit is used for determining an optimal search path in the character search paths and determining character information corresponding to the voice sequence according to the optimal search path.
The invention provides a device for converting speech into text. The voice acquired by the preprocessing unit and the matching processing unit is processed to obtain a voice signal; the voice signal is then processed based on the language model in the language model processing unit and the acoustic modeling processing unit, including polyphone processing, so that the resulting voice sequence better fits the context and the speaker's emotion; the voice sequence is then converted into text based on an acoustic model generated by training on massive data, so that the text converted from the voice is more accurate, matches the speaker's emotion more closely, and meets users' requirements.
The embodiments in this description are described progressively; each embodiment focuses on its differences from the others, and the same or similar parts can be cross-referenced between them. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief; for relevant details, refer to the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method for converting speech into text, characterized by comprising the following steps:
preprocessing the acquired voice to obtain voice characteristics corresponding to the voice;
storing the voice characteristics into a database model base, and performing matching processing to obtain a voice signal;
processing the voice signal based on a preset language model, and determining a voice sequence corresponding to the voice signal, wherein the preset language model represents a processing model for polyphones in the voice signal;
and decoding the voice sequence based on a preset acoustic modeling model to obtain character information corresponding to the voice sequence.
2. The method according to claim 1, wherein the preprocessing the acquired speech to obtain speech features corresponding to the speech comprises:
acquiring voice;
and extracting the voice characteristics of the voice passing through a unit area perpendicular to the sound wave propagation direction in unit time, wherein the voice characteristics comprise one or more of average sound energy, sound intensity, audio characteristics, pitch period, signal-to-noise ratio and harmonic-to-noise ratio.
3. The method of claim 1, wherein storing the speech features in a database model and performing a matching process to obtain a speech signal comprises:
storing the voice features into a data model base;
and matching the voice characteristics with the voice standard in the data model base to obtain the voice signal with interference sound filtered.
4. The method according to claim 1, wherein the processing the speech signal based on a preset language model to determine a speech sequence corresponding to the speech signal comprises:
creating a language model based on the set of speech signals;
framing the voice signal through the language model, and determining a phoneme matched with each frame of the voice signal;
calculating and obtaining left and right phonemes of each frame according to the phonemes matched with each frame of the voice signal;
and determining a voice sequence corresponding to the voice signal according to each frame of phoneme and the left and right phonemes of the voice signal.
5. The method according to claim 1, wherein the decoding the speech sequence based on the preset acoustic modeling to obtain text information corresponding to the speech sequence comprises:
processing the voice sequence based on a preset acoustic modeling model to obtain acoustic representation information corresponding to the voice sequence;
decoding the acoustic representation information to obtain a character search path corresponding to the acoustic representation information;
and determining an optimal search path in the text search paths, and determining text information corresponding to the voice sequence according to the optimal search path.
6. An apparatus for converting speech into text, comprising:
the preprocessing unit is used for preprocessing the acquired voice to obtain voice characteristics corresponding to the voice;
the matching processing unit is used for storing the voice characteristics into a database model base and performing matching processing to obtain voice signals;
the language model processing unit is used for processing the voice signal based on a preset language model and determining a voice sequence corresponding to the voice signal, wherein the preset language model represents a processing model for polyphones in the voice signal;
and the acoustic modeling processing unit is used for decoding the voice sequence based on a preset acoustic modeling to obtain the character information corresponding to the voice sequence.
7. The apparatus of claim 6, wherein the preprocessing unit comprises:
an acquisition subunit configured to acquire speech;
and an extraction subunit configured to extract speech features of the speech, measured per unit time over a unit area perpendicular to the direction of sound-wave propagation, wherein the speech features comprise one or more of average sound energy, sound intensity, audio features, pitch period, signal-to-noise ratio, and harmonics-to-noise ratio.
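Two of the features listed in this claim, average sound energy and signal-to-noise ratio, can be sketched directly from sample arrays. The separate noise recording assumed below is an illustrative simplification; the patent does not say how the noise floor is estimated:

```python
import math

def average_energy(samples):
    """Mean squared amplitude of the signal (average sound energy)."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, given separate signal and
    noise sample arrays (an assumed, simplified setup)."""
    return 10 * math.log10(average_energy(signal) / average_energy(noise))
```

Pitch period and harmonics-to-noise ratio would need autocorrelation or cepstral analysis and are omitted here.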
8. The apparatus of claim 6, wherein the matching processing unit comprises:
a storage subunit configured to store the speech features in the data model base;
and a matching subunit configured to match the speech features against speech references in the data model base to obtain a speech signal with interference sounds filtered out.
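The matching of extracted features against stored references to filter interference can be sketched as a nearest-reference comparison with a distance threshold: feature frames that lie too far from every reference are treated as interference and dropped. The threshold value and the flat vector representation are assumptions for illustration, not taken from the patent:

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def filter_interference(frames, references, threshold=1.0):
    """Keep only feature frames that lie within `threshold` of at
    least one reference feature vector in the data model base."""
    return [f for f in frames
            if min(euclidean(f, r) for r in references) <= threshold]
```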
9. The apparatus according to claim 6, wherein the language model processing unit comprises:
a model creation subunit configured to create a language model based on a set of speech signals;
a framing subunit configured to frame the speech signal through the language model and determine a phoneme matched to each frame of the speech signal;
a calculation subunit configured to calculate the left and right phonemes of each frame according to the phoneme matched to each frame of the speech signal;
and a determination subunit configured to determine the speech sequence corresponding to the speech signal according to the phoneme of each frame and its left and right phonemes.
10. The apparatus of claim 6, wherein the acoustic modeling processing unit comprises:
a speech sequence processing subunit configured to process the speech sequence based on the preset acoustic model to obtain acoustic representation information corresponding to the speech sequence;
a decoding processing subunit configured to decode the acoustic representation information to obtain text search paths corresponding to the acoustic representation information;
and a text determination subunit configured to determine an optimal search path among the text search paths and determine text information corresponding to the speech sequence according to the optimal search path.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811542192.3A CN111402887A (en) | 2018-12-17 | 2018-12-17 | Method and device for escaping characters by voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111402887A (en) | 2020-07-10 |
Family
ID=71435820
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811542192.3A Pending CN111402887A (en) | 2018-12-17 | 2018-12-17 | Method and device for escaping characters by voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111402887A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103578464A (en) * | 2013-10-18 | 2014-02-12 | 威盛电子股份有限公司 | Language model establishing method, speech recognition method and electronic device |
CN103903619A (en) * | 2012-12-28 | 2014-07-02 | 安徽科大讯飞信息科技股份有限公司 | Method and system for improving accuracy of speech recognition |
WO2016101577A1 (en) * | 2014-12-24 | 2016-06-30 | 中兴通讯股份有限公司 | Voice recognition method, client and terminal device |
WO2017076222A1 (en) * | 2015-11-06 | 2017-05-11 | 阿里巴巴集团控股有限公司 | Speech recognition method and apparatus |
CN107705787A (en) * | 2017-09-25 | 2018-02-16 | 北京捷通华声科技股份有限公司 | A kind of audio recognition method and device |
US20180137109A1 (en) * | 2016-11-11 | 2018-05-17 | The Charles Stark Draper Laboratory, Inc. | Methodology for automatic multilingual speech recognition |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200410292A1 (en) * | 2019-06-28 | 2020-12-31 | International Business Machines Corporation | Machine learned historically accurate temporal classification of objects |
US11636282B2 (en) * | 2019-06-28 | 2023-04-25 | International Business Machines Corporation | Machine learned historically accurate temporal classification of objects |
CN114125506A (en) * | 2020-08-28 | 2022-03-01 | 上海哔哩哔哩科技有限公司 | Voice auditing method and device |
CN114125506B (en) * | 2020-08-28 | 2024-03-19 | 上海哔哩哔哩科技有限公司 | Voice auditing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||