CN111402887A - Method and device for escaping characters by voice - Google Patents

Method and device for escaping characters by voice

Info

Publication number
CN111402887A
Authority
CN
China
Prior art keywords
voice
signal
model
sequence
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811542192.3A
Other languages
Chinese (zh)
Inventor
陈长伟 (Chen Changwei)
杨晓亮 (Yang Xiaoliang)
田丹 (Tian Dan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Evomedia Technology Co ltd
Original Assignee
Beijing Evomedia Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Evomedia Technology Co ltd filed Critical Beijing Evomedia Technology Co ltd
Priority to CN201811542192.3A
Publication of CN111402887A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/04 Training, enrolment or model building

Abstract

The invention discloses a method and a device for converting speech to text, comprising the following steps: preprocessing acquired speech to obtain speech features corresponding to the speech; storing the speech features in a data model library and performing matching processing to obtain a speech signal; processing the speech signal based on a preset language model to determine a speech sequence corresponding to the speech signal, wherein the preset language model is a model for handling polyphonic characters in the speech signal; and decoding the speech sequence based on a preset acoustic model to obtain text information corresponding to the speech sequence. The invention achieves more accurate speech recognition and speech-to-text conversion that better meets user needs.

Description

Method and device for escaping characters by voice (speech-to-text conversion)
Technical Field
The present invention relates to the field of audio processing technologies, and in particular to a method and an apparatus for converting speech to text.
Background
In many scenarios, such as conferences, training sessions, interviews, and speech contests, spoken content must be recorded. The content is usually captured as audio and then transcribed by professionals who listen to the recording and convert it into the corresponding text. Such manual transcription is time-consuming and inefficient.
With the development of intelligent technology, automatic speech-to-text methods have emerged that convert recorded audio into text output. However, existing speech-to-text techniques impose requirements on the environment and on the speaker's manner and characteristics: they struggle with multilingual speech, low-frequency words, and specialized terms, and their accuracy is degraded by accents, dialects, and differences in timbre, so the output may fail to reflect the true meaning of the user's speech.
Disclosure of Invention
In view of the above problems, the present invention provides a method and an apparatus for converting speech to text, which achieve more accurate speech recognition and text conversion and better meet user needs.
To achieve this, the invention provides the following technical solution:
A method for converting speech to text, comprising:
preprocessing acquired speech to obtain speech features corresponding to the speech;
storing the speech features in a data model library and performing matching processing to obtain a speech signal;
processing the speech signal based on a preset language model to determine a speech sequence corresponding to the speech signal, wherein the preset language model is a model for handling polyphonic characters in the speech signal; and
decoding the speech sequence based on a preset acoustic model to obtain text information corresponding to the speech sequence.
Optionally, preprocessing the acquired speech to obtain speech features corresponding to the speech comprises:
acquiring speech; and
extracting the speech features of the speech passing through a unit area perpendicular to the direction of sound-wave propagation per unit time, wherein the speech features comprise one or more of average sound energy, sound intensity, audio frequency characteristics, pitch period, signal-to-noise ratio, and harmonics-to-noise ratio.
Optionally, storing the speech features in a data model library and performing matching processing to obtain a speech signal comprises:
storing the speech features in the data model library; and
matching the speech features against reference speech in the data model library to obtain a speech signal with interfering sounds filtered out.
Optionally, processing the speech signal based on a preset language model to determine a speech sequence corresponding to the speech signal comprises:
creating a language model based on a set of speech signals;
framing the speech signal with the language model and determining the phoneme matched to each frame of the speech signal;
computing each frame's left and right neighboring phonemes from the matched phonemes; and
determining the speech sequence corresponding to the speech signal from each frame's phoneme and its left and right neighbors.
Optionally, decoding the speech sequence based on the preset acoustic model to obtain text information corresponding to the speech sequence comprises:
processing the speech sequence based on the preset acoustic model to obtain acoustic representation information corresponding to the speech sequence;
decoding the acoustic representation information to obtain text search paths corresponding to the acoustic representation information; and
determining the optimal search path among the text search paths and determining the text information corresponding to the speech sequence according to that path.
A device for converting speech to text, comprising:
a preprocessing unit, configured to preprocess acquired speech to obtain speech features corresponding to the speech;
a matching processing unit, configured to store the speech features in a data model library and perform matching processing to obtain a speech signal;
a language model processing unit, configured to process the speech signal based on a preset language model and determine a speech sequence corresponding to the speech signal, wherein the preset language model is a model for handling polyphonic characters in the speech signal; and
an acoustic model processing unit, configured to decode the speech sequence based on a preset acoustic model to obtain the text information corresponding to the speech sequence.
Optionally, the preprocessing unit comprises:
an acquisition subunit, configured to acquire speech; and
an extraction subunit, configured to extract the speech features of the speech passing through a unit area perpendicular to the direction of sound-wave propagation per unit time, wherein the speech features comprise one or more of average sound energy, sound intensity, audio frequency characteristics, pitch period, signal-to-noise ratio, and harmonics-to-noise ratio.
Optionally, the matching processing unit comprises:
a storage subunit, configured to store the speech features in the data model library; and
a matching subunit, configured to match the speech features against reference speech in the data model library to obtain a speech signal with interfering sounds filtered out.
Optionally, the language model processing unit comprises:
a model creation subunit, configured to create a language model based on a set of speech signals;
a framing subunit, configured to frame the speech signal with the language model and determine the phoneme matched to each frame of the speech signal;
a calculation subunit, configured to compute each frame's left and right neighboring phonemes from the matched phonemes; and
a determination subunit, configured to determine the speech sequence corresponding to the speech signal from each frame's phoneme and its left and right neighbors.
Optionally, the acoustic model processing unit comprises:
a speech sequence processing subunit, configured to process the speech sequence based on the preset acoustic model to obtain acoustic representation information corresponding to the speech sequence;
a decoding subunit, configured to decode the acoustic representation information to obtain text search paths corresponding to the acoustic representation information; and
a text determination subunit, configured to determine the optimal search path among the text search paths and determine the text information corresponding to the speech sequence according to that path.
Compared with the prior art, the invention provides a method and a device for converting speech to text. The acquired speech is processed to obtain a speech signal; the speech signal is then processed with a language model, whose polyphonic-character handling makes the resulting speech sequence better fit the context and the speaker's emotion; the speech sequence is then converted into text by an acoustic model generated through training on massive data. The text converted from speech is therefore more accurate, better matches the speaker's emotion, and meets user needs.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; a person skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a method for converting speech to text according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of another method for converting speech to text according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a device for converting speech to text according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first" and "second," and the like in the description and claims of the present invention and the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not set forth for a listed step or element but may include steps or elements not listed.
In an embodiment of the present invention, a method for converting speech to text is provided; referring to Fig. 1, the method includes:
S11, preprocessing the acquired speech to obtain the speech features corresponding to the speech.
Because the acquired speech varies with the environment and is constrained by the speaker's accent, dialect, and timbre, the speech can exhibit different characteristics. To process the speech more accurately, it must be preprocessed and the corresponding speech features extracted. Specifically, this includes:
acquiring speech; and
extracting the speech features of the speech passing through a unit area perpendicular to the direction of sound-wave propagation per unit time, wherein the speech features comprise one or more of average sound energy, sound intensity, audio frequency characteristics, pitch period, signal-to-noise ratio, and harmonics-to-noise ratio.
After the speech is input, its features are extracted, such as the average sound energy, sound intensity, audio frequency, estimated pitch period, signal-to-noise ratio, and harmonics-to-noise ratio of the speech passing through a unit area perpendicular to the direction of sound-wave propagation per unit time. Features are measured per unit time because the speaker may be influenced by emotion or context, so the features differ from moment to moment; collecting them per unit time makes subsequent processing more accurate. Restricting measurement to a unit area likewise narrows the analysis range. A minimal frame-level sketch of such feature extraction is given below.
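The patent publishes no code; the following is a hedged illustration of frame-level extraction of a few of the named features (average energy, intensity, pitch period) for a mono 16 kHz signal. Function names, the frame size, and the autocorrelation pitch estimate are assumptions, not the patent's procedure.

```python
# Illustrative sketch only: frame-level feature extraction for a mono
# PCM signal, assuming 16 kHz sampling and 25 ms frames.
import numpy as np

def frame_features(signal: np.ndarray, sr: int = 16000,
                   frame_ms: float = 25.0) -> list[dict]:
    frame_len = int(sr * frame_ms / 1000)
    features = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len].astype(np.float64)
        energy = np.mean(frame ** 2)                  # average sound energy
        intensity_db = 10 * np.log10(energy + 1e-12)  # sound intensity (dB)
        # Pitch period via the autocorrelation peak in a plausible lag
        # range (50-400 Hz); a crude stand-in for "estimated pitch period".
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        lo, hi = sr // 400, sr // 50
        pitch_lag = lo + int(np.argmax(ac[lo:hi]))
        features.append({
            "energy": energy,
            "intensity_db": intensity_db,
            "pitch_period_s": pitch_lag / sr,
        })
    return features

# Example: features for one second of a 120 Hz sine test tone.
t = np.arange(16000) / 16000
feats = frame_features(np.sin(2 * np.pi * 120 * t))
print(len(feats), feats[0])
```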
S12, storing the speech features in a data model library and performing matching processing to obtain a speech signal.
After the speech features are obtained, the features (or their corresponding data values) are stored in a data model library, and through continuous training, matching, and comparison, a speech signal meeting the preset requirements is finally obtained. That is, interfering information has been filtered out of the speech signal, such as environmental noise, sighs, or applause from a live audience.
Sound data can be classified within the data model library to distinguish interfering sounds from the speaker's normal speech, as in the sketch below.
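As a hedged illustration of that classification step, the sketch below keeps only frames whose features lie closer to a stored "speech" reference than to an "interference" reference; the library layout and the distance rule are assumptions, not the patent's matching procedure.

```python
# Illustrative sketch: filtering interference by matching frame features
# against labelled reference entries in a "data model library".
import numpy as np

def filter_interference(frames: np.ndarray,
                        library: dict[str, np.ndarray]) -> np.ndarray:
    """Keep frames whose feature vector is closer to the 'speech'
    reference centroid than to the 'interference' centroid."""
    speech_ref = library["speech"].mean(axis=0)
    noise_ref = library["interference"].mean(axis=0)
    keep = [np.linalg.norm(f - speech_ref) < np.linalg.norm(f - noise_ref)
            for f in frames]
    return frames[np.array(keep)]

# Toy library: 2-D feature vectors (e.g. energy, pitch period) per class.
library = {
    "speech": np.array([[0.8, 0.005], [0.9, 0.006]]),
    "interference": np.array([[0.1, 0.0], [0.2, 0.0]]),
}
frames = np.array([[0.85, 0.005], [0.15, 0.0], [0.7, 0.004]])
print(filter_interference(frames, library))  # interference frame dropped
```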
S13, processing the speech signal based on a preset language model and determining the speech sequence corresponding to the speech signal.
The preset language model is a model for handling polyphonic characters in the speech signal.
Since the language model can be understood as resolving polyphonic characters, processing based on the language model may specifically include:
creating a language model based on a set of speech signals;
framing the speech signal with the language model and determining the phoneme matched to each frame of the speech signal;
computing each frame's left and right neighboring phonemes from the matched phonemes; and
determining the speech sequence corresponding to the speech signal from each frame's phoneme and its left and right neighbors.
The language model is obtained by training and prediction on a historically collected set of speech signals. Its main purpose is to model the probability of a sentence occurring: within the language model, the speech signal is semantically analyzed to recover the context in which it was produced. Concretely, the language model is built mainly as a probability model of sentence occurrence; it determines which word sequence is most probable or, given several preceding words, predicts the probability of the next word. The speech signal is framed with a preset algorithm; for each phoneme, the corresponding frames are found and the phoneme model parameters are estimated from those frames' features; then each frame adjacent to a phoneme boundary only needs to be judged as belonging to the left or the right phoneme. Repeated training in this way yields a more accurate language model library, from which the speech sequence can be determined. A minimal n-gram sketch of the sentence-probability idea follows.
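The patent does not specify the language model's form; an n-gram model is the classic way to score sentence probability, so the sketch below uses a tiny add-alpha-smoothed bigram model. The corpus, names, and smoothing constant are illustrative assumptions.

```python
# Minimal bigram language-model sketch: P(sentence) as a product of
# smoothed bigram probabilities.
from collections import Counter

corpus = [["speech", "to", "text"], ["speech", "recognition", "works"]]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def sentence_prob(sent: list[str], alpha: float = 0.1) -> float:
    """Add-alpha-smoothed bigram probability of a word sequence."""
    vocab = len(unigrams)
    p = 1.0
    toks = ["<s>"] + sent
    for prev, cur in zip(toks, toks[1:]):
        p *= (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab)
    return p

# A seen word sequence scores higher than an unseen one, which is how
# the model ranks competing transcriptions (e.g. polyphonic readings).
print(sentence_prob(["speech", "to", "text"]))
print(sentence_prob(["text", "to", "speech"]))
```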
S14, decoding the speech sequence based on the preset acoustic model to obtain the text information corresponding to the speech sequence.
After the speech signal has been processed, the speech must be converted into text. Acoustic modeling is required here, and the processing may include:
processing the speech sequence based on the preset acoustic model to obtain acoustic representation information corresponding to the speech sequence;
decoding the acoustic representation information to obtain text search paths corresponding to the acoustic representation information; and
determining the optimal search path among the text search paths and determining the text information corresponding to the speech sequence according to that path.
In particular, the created acoustic model can be understood as modeling the utterance so as to convert the speech input into an acoustic representation, or as giving the probability that the speech belongs to a given acoustic symbol, that is, a model describing the conversion between speech and states.
In this process, recurrent-neural-network modeling is combined with the transition probability densities of a hidden Markov model. Once acoustic modeling is complete, speech recognition can be performed on an unknown speech sequence based on the acoustic model to convert it into text. A toy sketch of the network half of this hybrid follows.
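As an illustration of the hybrid idea, the sketch below runs a toy, untrained recurrent network over feature frames to produce per-frame posteriors over HMM states, which a decoder would later combine with HMM transition probabilities. Dimensions, weights, and names are assumptions, not the patent's architecture.

```python
# Toy RNN emitting per-frame posteriors over HMM states.
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_hidden, n_states = 3, 8, 4            # toy dimensions
Wx = rng.normal(size=(n_hidden, n_feat)) * 0.1
Wh = rng.normal(size=(n_hidden, n_hidden)) * 0.1
Wo = rng.normal(size=(n_states, n_hidden)) * 0.1

def state_posteriors(frames: np.ndarray) -> np.ndarray:
    """Run a single-layer RNN over feature frames; softmax over states."""
    h = np.zeros(n_hidden)
    out = []
    for x in frames:
        h = np.tanh(Wx @ x + Wh @ h)
        logits = Wo @ h
        e = np.exp(logits - logits.max())
        out.append(e / e.sum())
    return np.array(out)                         # shape (T, n_states)

posteriors = state_posteriors(rng.normal(size=(10, n_feat)))
print(posteriors.shape, posteriors[0].round(3))
```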
The invention thus provides a method for converting speech to text: the acquired speech is processed to obtain a speech signal; the speech signal is processed with a language model, including polyphonic-character handling, so that the resulting speech sequence better fits the context and the speaker's emotion; and the speech sequence is then converted into text by an acoustic model generated through training on massive data. The text converted from speech is therefore more accurate, better matches the speaker's emotion, and better meets user needs.
Referring to Fig. 2, an embodiment of the present invention provides another method for converting speech to text.
After speech input, speech features such as the average sound energy, sound intensity, audio frequency, estimated pitch period, signal-to-noise ratio, and harmonics-to-noise ratio of the speech passing through a unit area perpendicular to the direction of sound-wave propagation are extracted; the feature data values of the speech are stored in a data model library; and the expected speech signal is finally obtained through continuous training, matching, and comparison.
A language model is then constructed, mainly as a probability model of sentence occurrence: the language model determines which word sequence is more likely or, given several words, predicts the likelihood of the next word. The acquired speech signal is framed using a preset EM algorithm, which comprises an E step and an M step: in the E step, the neural-network parameters are optimized with the BPTT algorithm; in the M step, the optimal alignment is re-searched using the output of the network. Specifically, the speech signal is framed; the E step locates the phoneme within each frame, and the M step finds all frames corresponding to each phoneme and estimates the phoneme model parameters from the features of those frames. After alignment, GMM training is performed for each state, and the E and M steps are then looped; at that point the E step only needs to judge whether a frame adjacent to a phoneme boundary belongs to the left or the right phoneme. Repeated training in this way yields a more accurate language model library. A much-simplified hard-EM alignment sketch is given below.
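The patent gives no code for this loop; the following hedged sketch shows only the E/M alternation under strong simplifications (one Gaussian mean per phoneme, 1-D features, hard assignments, no BPTT or GMM mixtures). All names are hypothetical.

```python
# Hard-EM alignment sketch: assign frames to phoneme models (E step),
# then re-estimate each model from its frames (M step), and repeat.
import numpy as np

def hard_em_align(frames: np.ndarray, n_phones: int, iters: int = 10):
    means = np.linspace(frames.min(), frames.max(), n_phones)  # init
    for _ in range(iters):
        # E step: assign each frame to its closest phoneme model.
        assign = np.argmin(np.abs(frames[:, None] - means[None, :]), axis=1)
        # M step: re-estimate each phoneme's parameter from its frames.
        for p in range(n_phones):
            if np.any(assign == p):
                means[p] = frames[assign == p].mean()
    return assign, means

# Toy data: frames drawn around two distinct phoneme "centers".
rng = np.random.default_rng(1)
frames = np.concatenate([rng.normal(0, .1, 50), rng.normal(3, .1, 50)])
assign, means = hard_em_align(frames, n_phones=2)
print(means.round(2), assign[:5], assign[-5:])
```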
An acoustic model is then created. It can be understood as modeling the utterance so as to convert the speech input into an acoustic representation, or as giving the probability that the speech belongs to a given acoustic symbol, that is, a model describing the conversion between speech and states. Recurrent-neural-network modeling is performed using the transition probability densities of a hidden Markov model; once the acoustic model is complete, speech recognition is carried out on an unknown sequence of speech frames based on it. This process is commonly called search decoding. Decoding is generally given a search network formed by connecting Markov models according to a grammar and a dictionary (each node of the network may be a phrase); one or more optimal paths are then selected among all candidate search paths as the recognition result to be rendered as text, where the optimality condition is usually maximum posterior probability. A path can be understood as a string of phrases appearing in the dictionary. A standard Viterbi-style sketch of this best-path search follows.
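The patent names no specific search algorithm; Viterbi decoding is the standard way to find a maximum-probability path through an HMM, so the sketch below uses it as a stand-in, with toy two-state scores (names and numbers are illustrative).

```python
# Viterbi best-path sketch over per-frame log-emission scores.
import numpy as np

def viterbi(log_emit: np.ndarray, log_trans: np.ndarray,
            log_init: np.ndarray) -> list[int]:
    """Most probable state path given per-frame log-emission scores."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans       # indexed (prev, cur)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                 # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Two states ("phrase A", "phrase B"), three frames of toy scores.
log_emit = np.log(np.array([[.9, .1], [.8, .2], [.2, .8]]))
log_trans = np.log(np.array([[.7, .3], [.3, .7]]))
log_init = np.log(np.array([.5, .5]))
print(viterbi(log_emit, log_trans, log_init))  # -> [0, 0, 1]
```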
A hidden Markov model (HMM) is a Markov model whose internal states are invisible from the outside; only the output value at each moment can be observed. For speech recognition systems, the output values are typically the acoustic features computed from individual frames. Describing a speech signal with an HMM requires two assumptions: first, that the transition to an internal state depends only on the previous state; and second, that each output value depends only on the current state (or the current transition). These assumptions reduce the complexity of the model, as the factorization below makes explicit.
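Under these two assumptions, the joint probability of an observation sequence and a state sequence factorizes into per-step terms (standard HMM notation, supplied here for illustration; the patent states the assumptions only in prose):

```latex
P(o_1, \dots, o_T, q_1, \dots, q_T)
  = \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t)
```

where \pi_{q_1} is the initial-state probability, a_{q_{t-1} q_t} the state-transition probability (first assumption), and b_{q_t}(o_t) the probability of emitting the frame's acoustic features from the current state (second assumption).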
In the embodiments provided by the invention, artificial-intelligence-based recognition and algorithm training still show some inaccuracy on imprecisely pronounced speech, and emotionally colored speech differs in sentence punctuation; such speech can nevertheless be understood from its context, and its approximate meaning analyzed through artificial intelligence. Training the acoustic model on a large number of databases improves the recognition rate of low-frequency words and the handling of different timbres, and also improves the recognition of emotional coloring; for speech whose emotional coloring (for example, in mother-and-child interactions) is recognized by the voice system, corresponding expressions and symbols can be generated to enhance recognition accuracy.
An embodiment of the present invention further provides a device for converting speech to text; referring to Fig. 3, the device includes:
a preprocessing unit, configured to preprocess acquired speech to obtain speech features corresponding to the speech;
a matching processing unit, configured to store the speech features in a data model library and perform matching processing to obtain a speech signal;
a language model processing unit, configured to process the speech signal based on a preset language model and determine a speech sequence corresponding to the speech signal, wherein the preset language model is a model for handling polyphonic characters in the speech signal; and
an acoustic model processing unit, configured to decode the speech sequence based on a preset acoustic model to obtain the text information corresponding to the speech sequence.
Optionally, the preprocessing unit comprises:
an acquisition subunit, configured to acquire speech; and
an extraction subunit, configured to extract the speech features of the speech passing through a unit area perpendicular to the direction of sound-wave propagation per unit time, wherein the speech features comprise one or more of average sound energy, sound intensity, audio frequency characteristics, pitch period, signal-to-noise ratio, and harmonics-to-noise ratio.
Optionally, the matching processing unit comprises:
a storage subunit, configured to store the speech features in the data model library; and
a matching subunit, configured to match the speech features against reference speech in the data model library to obtain a speech signal with interfering sounds filtered out.
Optionally, the language model processing unit comprises:
a model creation subunit, configured to create a language model based on a set of speech signals;
a framing subunit, configured to frame the speech signal with the language model and determine the phoneme matched to each frame of the speech signal;
a calculation subunit, configured to compute each frame's left and right neighboring phonemes from the matched phonemes; and
a determination subunit, configured to determine the speech sequence corresponding to the speech signal from each frame's phoneme and its left and right neighbors.
Optionally, the acoustic model processing unit comprises:
a speech sequence processing subunit, configured to process the speech sequence based on the preset acoustic model to obtain acoustic representation information corresponding to the speech sequence;
a decoding subunit, configured to decode the acoustic representation information to obtain text search paths corresponding to the acoustic representation information; and
a text determination subunit, configured to determine the optimal search path among the text search paths and determine the text information corresponding to the speech sequence according to that path.
The invention provides a device for converting speech to text. The speech acquired by the preprocessing unit and the matching processing unit is processed to obtain a speech signal; the speech signal is then processed in the language model processing unit and the acoustic model processing unit, where polyphonic-character handling makes the processed speech sequence better fit the context and the speaker's emotion; the speech sequence is then converted into text by an acoustic model generated through training on massive data. The text converted from speech is therefore more accurate, better matches the speaker's emotion, and meets user needs.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be cross-referenced. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief; for relevant details, refer to the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for converting speech to text, characterized by comprising:
preprocessing acquired speech to obtain speech features corresponding to the speech;
storing the speech features in a data model library and performing matching processing to obtain a speech signal;
processing the speech signal based on a preset language model to determine a speech sequence corresponding to the speech signal, wherein the preset language model is a model for handling polyphonic characters in the speech signal; and
decoding the speech sequence based on a preset acoustic model to obtain text information corresponding to the speech sequence.
2. The method according to claim 1, wherein preprocessing the acquired speech to obtain speech features corresponding to the speech comprises:
acquiring speech; and
extracting the speech features of the speech passing through a unit area perpendicular to the direction of sound-wave propagation per unit time, wherein the speech features comprise one or more of average sound energy, sound intensity, audio frequency characteristics, pitch period, signal-to-noise ratio, and harmonics-to-noise ratio.
3. The method according to claim 1, wherein storing the speech features in a data model library and performing matching processing to obtain a speech signal comprises:
storing the speech features in the data model library; and
matching the speech features against reference speech in the data model library to obtain a speech signal with interfering sounds filtered out.
4. The method according to claim 1, wherein processing the speech signal based on a preset language model to determine a speech sequence corresponding to the speech signal comprises:
creating a language model based on a set of speech signals;
framing the speech signal with the language model and determining the phoneme matched to each frame of the speech signal;
computing each frame's left and right neighboring phonemes from the matched phonemes; and
determining the speech sequence corresponding to the speech signal from each frame's phoneme and its left and right neighbors.
5. The method according to claim 1, wherein decoding the speech sequence based on the preset acoustic model to obtain text information corresponding to the speech sequence comprises:
processing the speech sequence based on the preset acoustic model to obtain acoustic representation information corresponding to the speech sequence;
decoding the acoustic representation information to obtain text search paths corresponding to the acoustic representation information; and
determining the optimal search path among the text search paths and determining the text information corresponding to the speech sequence according to that path.
6. A device for converting speech to text, characterized by comprising:
a preprocessing unit, configured to preprocess acquired speech to obtain speech features corresponding to the speech;
a matching processing unit, configured to store the speech features in a data model library and perform matching processing to obtain a speech signal;
a language model processing unit, configured to process the speech signal based on a preset language model and determine a speech sequence corresponding to the speech signal, wherein the preset language model is a model for handling polyphonic characters in the speech signal; and
an acoustic model processing unit, configured to decode the speech sequence based on a preset acoustic model to obtain the text information corresponding to the speech sequence.
7. The device according to claim 6, wherein the preprocessing unit comprises:
an acquisition subunit, configured to acquire speech; and
an extraction subunit, configured to extract the speech features of the speech passing through a unit area perpendicular to the direction of sound-wave propagation per unit time, wherein the speech features comprise one or more of average sound energy, sound intensity, audio frequency characteristics, pitch period, signal-to-noise ratio, and harmonics-to-noise ratio.
8. The device according to claim 6, wherein the matching processing unit comprises:
a storage subunit, configured to store the speech features in the data model library; and
a matching subunit, configured to match the speech features against reference speech in the data model library to obtain a speech signal with interfering sounds filtered out.
9. The device according to claim 6, wherein the language model processing unit comprises:
a model creation subunit, configured to create a language model based on a set of speech signals;
a framing subunit, configured to frame the speech signal with the language model and determine the phoneme matched to each frame of the speech signal;
a calculation subunit, configured to compute each frame's left and right neighboring phonemes from the matched phonemes; and
a determination subunit, configured to determine the speech sequence corresponding to the speech signal from each frame's phoneme and its left and right neighbors.
10. The device according to claim 6, wherein the acoustic model processing unit comprises:
a speech sequence processing subunit, configured to process the speech sequence based on the preset acoustic model to obtain acoustic representation information corresponding to the speech sequence;
a decoding subunit, configured to decode the acoustic representation information to obtain text search paths corresponding to the acoustic representation information; and
a text determination subunit, configured to determine the optimal search path among the text search paths and determine the text information corresponding to the speech sequence according to that path.
CN201811542192.3A 2018-12-17 2018-12-17 Method and device for escaping characters by voice Pending CN111402887A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811542192.3A CN111402887A (en) 2018-12-17 2018-12-17 Method and device for escaping characters by voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811542192.3A CN111402887A (en) 2018-12-17 2018-12-17 Method and device for escaping characters by voice

Publications (1)

Publication Number Publication Date
CN111402887A true CN111402887A (en) 2020-07-10

Family

ID=71435820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811542192.3A Pending CN111402887A (en) 2018-12-17 2018-12-17 Method and device for escaping characters by voice

Country Status (1)

Country Link
CN (1) CN111402887A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200410292A1 (en) * 2019-06-28 2020-12-31 International Business Machines Corporation Machine learned historically accurate temporal classification of objects
CN114125506A (en) * 2020-08-28 2022-03-01 上海哔哩哔哩科技有限公司 Voice auditing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578464A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Language model establishing method, speech recognition method and electronic device
CN103903619A (en) * 2012-12-28 2014-07-02 安徽科大讯飞信息科技股份有限公司 Method and system for improving accuracy of speech recognition
WO2016101577A1 (en) * 2014-12-24 2016-06-30 中兴通讯股份有限公司 Voice recognition method, client and terminal device
WO2017076222A1 (en) * 2015-11-06 2017-05-11 阿里巴巴集团控股有限公司 Speech recognition method and apparatus
CN107705787A (en) * 2017-09-25 2018-02-16 北京捷通华声科技股份有限公司 A kind of audio recognition method and device
US20180137109A1 (en) * 2016-11-11 2018-05-17 The Charles Stark Draper Laboratory, Inc. Methodology for automatic multilingual speech recognition

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103903619A (en) * 2012-12-28 2014-07-02 安徽科大讯飞信息科技股份有限公司 Method and system for improving accuracy of speech recognition
CN103578464A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Language model establishing method, speech recognition method and electronic device
WO2016101577A1 (en) * 2014-12-24 2016-06-30 中兴通讯股份有限公司 Voice recognition method, client and terminal device
WO2017076222A1 (en) * 2015-11-06 2017-05-11 阿里巴巴集团控股有限公司 Speech recognition method and apparatus
US20180137109A1 (en) * 2016-11-11 2018-05-17 The Charles Stark Draper Laboratory, Inc. Methodology for automatic multilingual speech recognition
CN107705787A (en) * 2017-09-25 2018-02-16 北京捷通华声科技股份有限公司 A kind of audio recognition method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200410292A1 (en) * 2019-06-28 2020-12-31 International Business Machines Corporation Machine learned historically accurate temporal classification of objects
US11636282B2 (en) * 2019-06-28 2023-04-25 International Business Machines Corporation Machine learned historically accurate temporal classification of objects
CN114125506A (en) * 2020-08-28 2022-03-01 上海哔哩哔哩科技有限公司 Voice auditing method and device
CN114125506B (en) * 2020-08-28 2024-03-19 上海哔哩哔哩科技有限公司 Voice auditing method and device

Similar Documents

Publication Publication Date Title
CN109147758B (en) Speaker voice conversion method and device
CN109410914B (en) Method for identifying Jiangxi dialect speech and dialect point
WO2018121757A1 (en) Method and system for speech broadcast of text
CN102779508B (en) Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
KR101688240B1 (en) System and method for automatic speech to text conversion
KR100815115B1 (en) An Acoustic Model Adaptation Method Based on Pronunciation Variability Analysis for Foreign Speech Recognition and apparatus thereof
Athanaselis et al. ASR for emotional speech: clarifying the issues and enhancing performance
CN111105785B (en) Text prosody boundary recognition method and device
CN112767958A (en) Zero-learning-based cross-language tone conversion system and method
JPH09500223A (en) Multilingual speech recognition system
CN109243460A (en) A method of automatically generating news or interrogation record based on the local dialect
CN112349289B (en) Voice recognition method, device, equipment and storage medium
CN110853616A (en) Speech synthesis method, system and storage medium based on neural network
CN111489743A (en) Operation management analysis system based on intelligent voice technology
Chittaragi et al. Acoustic-phonetic feature based Kannada dialect identification from vowel sounds
CN111402887A (en) Method and device for escaping characters by voice
US20230252971A1 (en) System and method for speech processing
CN111583965A (en) Voice emotion recognition method, device, equipment and storage medium
JP2001109490A (en) Method for constituting voice recognition device, its recognition device and voice recognition method
Cahyaningtyas et al. Synthesized speech quality of Indonesian natural text-to-speech by using HTS and CLUSTERGEN
CN111833869B (en) Voice interaction method and system applied to urban brain
Woods et al. A robust ensemble model for spoken language recognition
JP3727436B2 (en) Voice original optimum collation apparatus and method
Akesh et al. Real-Time Subtitle Generator for Sinhala Speech
CN113990288B (en) Method for automatically generating and deploying voice synthesis model by voice customer service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination