CN111402887A - Method and device for escaping characters by voice - Google Patents

Method and device for escaping characters by voice

Info

Publication number
CN111402887A
Authority
CN
China
Prior art keywords
voice
signal
model
sequence
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811542192.3A
Other languages
Chinese (zh)
Inventor
陈长伟 (Chen Changwei)
杨晓亮 (Yang Xiaoliang)
田丹 (Tian Dan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Evomedia Technology Co ltd
Original Assignee
Beijing Evomedia Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Evomedia Technology Co ltd filed Critical Beijing Evomedia Technology Co ltd
Priority to CN201811542192.3A
Publication of CN111402887A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification
    • G10L 17/04 Training, enrolment or model building

Abstract

The invention discloses a method and a device for converting speech to text, comprising the following steps: preprocessing acquired speech to obtain speech features corresponding to the speech; storing the speech features in a data model library and performing matching processing to obtain a speech signal; processing the speech signal based on a preset language model to determine a speech sequence corresponding to the speech signal, wherein the preset language model is a model for handling polyphonic characters in the speech signal; and decoding the speech sequence based on a preset acoustic model to obtain text information corresponding to the speech sequence. The invention achieves more accurate speech recognition and speech-to-text conversion that better meets user needs.

Description

Method and device for escaping characters by voice (speech-to-text conversion)
Technical Field
The present invention relates to the field of audio processing technologies, and in particular to a method and an apparatus for converting speech to text.
Background
In many scenarios, such as conferences, training sessions, interviews, and speech contests, spoken content must be recorded. The content is usually captured as audio and then transcribed by professionals who listen to the recording and convert it into the corresponding text. Such manual transcription is time-consuming and inefficient.
With the development of intelligent technology, automatic speech-to-text methods have emerged that convert recorded audio into text output. However, existing speech-to-text techniques impose requirements on the environment and on the speaker's manner and characteristics: they struggle with multilingual speech, low-frequency words, and specialized terms, and their accuracy is degraded by accents, dialects, and differences in timbre, so the output may fail to reflect the true meaning of the user's speech.
Disclosure of Invention
In view of the above problems, the present invention provides a method and an apparatus for converting speech to text, which achieve more accurate speech recognition and text conversion and better meet user needs.
To achieve this, the invention provides the following technical solution:
A method for converting speech to text, comprising:
preprocessing acquired speech to obtain speech features corresponding to the speech;
storing the speech features in a data model library and performing matching processing to obtain a speech signal;
processing the speech signal based on a preset language model to determine a speech sequence corresponding to the speech signal, wherein the preset language model is a model for handling polyphonic characters in the speech signal; and
decoding the speech sequence based on a preset acoustic model to obtain text information corresponding to the speech sequence.
Optionally, preprocessing the acquired speech to obtain speech features corresponding to the speech comprises:
acquiring speech; and
extracting the speech features of the speech passing through a unit area perpendicular to the direction of sound-wave propagation per unit time, wherein the speech features comprise one or more of average sound energy, sound intensity, audio frequency characteristics, pitch period, signal-to-noise ratio, and harmonics-to-noise ratio.
Optionally, storing the speech features in a data model library and performing matching processing to obtain a speech signal comprises:
storing the speech features in the data model library; and
matching the speech features against reference speech in the data model library to obtain a speech signal with interfering sounds filtered out.
Optionally, processing the speech signal based on a preset language model to determine a speech sequence corresponding to the speech signal comprises:
creating a language model based on a set of speech signals;
framing the speech signal with the language model and determining the phoneme matched to each frame of the speech signal;
computing each frame's left and right neighboring phonemes from the matched phonemes; and
determining the speech sequence corresponding to the speech signal from each frame's phoneme and its left and right neighbors.
Optionally, decoding the speech sequence based on the preset acoustic model to obtain text information corresponding to the speech sequence comprises:
processing the speech sequence based on the preset acoustic model to obtain acoustic representation information corresponding to the speech sequence;
decoding the acoustic representation information to obtain text search paths corresponding to the acoustic representation information; and
determining the optimal search path among the text search paths and determining the text information corresponding to the speech sequence according to that path.
A device for converting speech to text, comprising:
a preprocessing unit, configured to preprocess acquired speech to obtain speech features corresponding to the speech;
a matching processing unit, configured to store the speech features in a data model library and perform matching processing to obtain a speech signal;
a language model processing unit, configured to process the speech signal based on a preset language model and determine a speech sequence corresponding to the speech signal, wherein the preset language model is a model for handling polyphonic characters in the speech signal; and
an acoustic model processing unit, configured to decode the speech sequence based on a preset acoustic model to obtain the text information corresponding to the speech sequence.
Optionally, the preprocessing unit comprises:
an acquisition subunit, configured to acquire speech; and
an extraction subunit, configured to extract the speech features of the speech passing through a unit area perpendicular to the direction of sound-wave propagation per unit time, wherein the speech features comprise one or more of average sound energy, sound intensity, audio frequency characteristics, pitch period, signal-to-noise ratio, and harmonics-to-noise ratio.
Optionally, the matching processing unit comprises:
a storage subunit, configured to store the speech features in the data model library; and
a matching subunit, configured to match the speech features against reference speech in the data model library to obtain a speech signal with interfering sounds filtered out.
Optionally, the language model processing unit comprises:
a model creation subunit, configured to create a language model based on a set of speech signals;
a framing subunit, configured to frame the speech signal with the language model and determine the phoneme matched to each frame of the speech signal;
a calculation subunit, configured to compute each frame's left and right neighboring phonemes from the matched phonemes; and
a determination subunit, configured to determine the speech sequence corresponding to the speech signal from each frame's phoneme and its left and right neighbors.
Optionally, the acoustic model processing unit comprises:
a speech sequence processing subunit, configured to process the speech sequence based on the preset acoustic model to obtain acoustic representation information corresponding to the speech sequence;
a decoding subunit, configured to decode the acoustic representation information to obtain text search paths corresponding to the acoustic representation information; and
a text determination subunit, configured to determine the optimal search path among the text search paths and determine the text information corresponding to the speech sequence according to that path.
Compared with the prior art, the invention provides a method and a device for converting speech to text. The acquired speech is processed to obtain a speech signal; the speech signal is then processed with a language model, whose polyphonic-character handling makes the resulting speech sequence better fit the context and the speaker's emotion; the speech sequence is then converted into text by an acoustic model generated through training on massive data. The text converted from speech is therefore more accurate, better matches the speaker's emotion, and meets user needs.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; a person skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a method for converting speech to text according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of another method for converting speech to text according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a device for converting speech to text according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first" and "second," and the like in the description and claims of the present invention and the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not set forth for a listed step or element but may include steps or elements not listed.
In an embodiment of the present invention, a method for converting speech to text is provided; referring to Fig. 1, the method includes:
S11, preprocessing the acquired speech to obtain the speech features corresponding to the speech.
Because the acquired speech varies with the environment and is constrained by the speaker's accent, dialect, and timbre, the speech can exhibit different characteristics. To process the speech more accurately, it must be preprocessed and the corresponding speech features extracted. Specifically, this includes:
acquiring speech; and
extracting the speech features of the speech passing through a unit area perpendicular to the direction of sound-wave propagation per unit time, wherein the speech features comprise one or more of average sound energy, sound intensity, audio frequency characteristics, pitch period, signal-to-noise ratio, and harmonics-to-noise ratio.
After the speech is input, its features are extracted, such as the average sound energy, sound intensity, audio frequency, estimated pitch period, signal-to-noise ratio, and harmonics-to-noise ratio of the speech passing through a unit area perpendicular to the direction of sound-wave propagation per unit time. Features are measured per unit time because the speaker may be influenced by emotion or context, so the features differ from moment to moment; collecting them per unit time makes subsequent processing more accurate. Restricting measurement to a unit area likewise narrows the analysis range. A minimal frame-level sketch of such feature extraction is given below.
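The patent publishes no code; the following is a hedged illustration of frame-level extraction of a few of the named features (average energy, intensity, pitch period) for a mono 16 kHz signal. Function names, the frame size, and the autocorrelation pitch estimate are assumptions, not the patent's procedure.

```python
# Illustrative sketch only: frame-level feature extraction for a mono
# PCM signal, assuming 16 kHz sampling and 25 ms frames.
import numpy as np

def frame_features(signal: np.ndarray, sr: int = 16000,
                   frame_ms: float = 25.0) -> list[dict]:
    frame_len = int(sr * frame_ms / 1000)
    features = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len].astype(np.float64)
        energy = np.mean(frame ** 2)                  # average sound energy
        intensity_db = 10 * np.log10(energy + 1e-12)  # sound intensity (dB)
        # Pitch period via the autocorrelation peak in a plausible lag
        # range (50-400 Hz); a crude stand-in for "estimated pitch period".
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        lo, hi = sr // 400, sr // 50
        pitch_lag = lo + int(np.argmax(ac[lo:hi]))
        features.append({
            "energy": energy,
            "intensity_db": intensity_db,
            "pitch_period_s": pitch_lag / sr,
        })
    return features

# Example: features for one second of a 120 Hz sine test tone.
t = np.arange(16000) / 16000
feats = frame_features(np.sin(2 * np.pi * 120 * t))
print(len(feats), feats[0])
```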
S12, storing the speech features in a data model library and performing matching processing to obtain a speech signal.
After the speech features are obtained, the features (or their corresponding data values) are stored in a data model library, and through continuous training, matching, and comparison, a speech signal meeting the preset requirements is finally obtained. That is, interfering information has been filtered out of the speech signal, such as environmental noise, sighs, or applause from a live audience.
Sound data can be classified within the data model library to distinguish interfering sounds from the speaker's normal speech, as in the sketch below.
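As a hedged illustration of that classification step, the sketch below keeps only frames whose features lie closer to a stored "speech" reference than to an "interference" reference; the library layout and the distance rule are assumptions, not the patent's matching procedure.

```python
# Illustrative sketch: filtering interference by matching frame features
# against labelled reference entries in a "data model library".
import numpy as np

def filter_interference(frames: np.ndarray,
                        library: dict[str, np.ndarray]) -> np.ndarray:
    """Keep frames whose feature vector is closer to the 'speech'
    reference centroid than to the 'interference' centroid."""
    speech_ref = library["speech"].mean(axis=0)
    noise_ref = library["interference"].mean(axis=0)
    keep = [np.linalg.norm(f - speech_ref) < np.linalg.norm(f - noise_ref)
            for f in frames]
    return frames[np.array(keep)]

# Toy library: 2-D feature vectors (e.g. energy, pitch period) per class.
library = {
    "speech": np.array([[0.8, 0.005], [0.9, 0.006]]),
    "interference": np.array([[0.1, 0.0], [0.2, 0.0]]),
}
frames = np.array([[0.85, 0.005], [0.15, 0.0], [0.7, 0.004]])
print(filter_interference(frames, library))  # interference frame dropped
```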
S13, processing the speech signal based on a preset language model and determining the speech sequence corresponding to the speech signal.
The preset language model is a model for handling polyphonic characters in the speech signal.
Since the language model can be understood as resolving polyphonic characters, processing based on the language model may specifically include:
creating a language model based on a set of speech signals;
framing the speech signal with the language model and determining the phoneme matched to each frame of the speech signal;
computing each frame's left and right neighboring phonemes from the matched phonemes; and
determining the speech sequence corresponding to the speech signal from each frame's phoneme and its left and right neighbors.
The language model is obtained by training and prediction on a historically collected set of speech signals. Its main purpose is to model the probability of a sentence occurring: within the language model, the speech signal is semantically analyzed to recover the context in which it was produced. Concretely, the language model is built mainly as a probability model of sentence occurrence; it determines which word sequence is most probable or, given several preceding words, predicts the probability of the next word. The speech signal is framed with a preset algorithm; for each phoneme, the corresponding frames are found and the phoneme model parameters are estimated from those frames' features; then each frame adjacent to a phoneme boundary only needs to be judged as belonging to the left or the right phoneme. Repeated training in this way yields a more accurate language model library, from which the speech sequence can be determined. A minimal n-gram sketch of the sentence-probability idea follows.
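The patent does not specify the language model's form; an n-gram model is the classic way to score sentence probability, so the sketch below uses a tiny add-alpha-smoothed bigram model. The corpus, names, and smoothing constant are illustrative assumptions.

```python
# Minimal bigram language-model sketch: P(sentence) as a product of
# smoothed bigram probabilities.
from collections import Counter

corpus = [["speech", "to", "text"], ["speech", "recognition", "works"]]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def sentence_prob(sent: list[str], alpha: float = 0.1) -> float:
    """Add-alpha-smoothed bigram probability of a word sequence."""
    vocab = len(unigrams)
    p = 1.0
    toks = ["<s>"] + sent
    for prev, cur in zip(toks, toks[1:]):
        p *= (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab)
    return p

# A seen word sequence scores higher than an unseen one, which is how
# the model ranks competing transcriptions (e.g. polyphonic readings).
print(sentence_prob(["speech", "to", "text"]))
print(sentence_prob(["text", "to", "speech"]))
```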
S14, decoding the speech sequence based on the preset acoustic model to obtain the text information corresponding to the speech sequence.
After the speech signal has been processed, the speech must be converted into text. Acoustic modeling is required here, and the processing may include:
processing the speech sequence based on the preset acoustic model to obtain acoustic representation information corresponding to the speech sequence;
decoding the acoustic representation information to obtain text search paths corresponding to the acoustic representation information; and
determining the optimal search path among the text search paths and determining the text information corresponding to the speech sequence according to that path.
In particular, the created acoustic model can be understood as modeling the utterance so as to convert the speech input into an acoustic representation, or as giving the probability that the speech belongs to a given acoustic symbol, that is, a model describing the conversion between speech and states.
In this process, recurrent-neural-network modeling is combined with the transition probability densities of a hidden Markov model. Once acoustic modeling is complete, speech recognition can be performed on an unknown speech sequence based on the acoustic model to convert it into text. A toy sketch of the network half of this hybrid follows.
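As an illustration of the hybrid idea, the sketch below runs a toy, untrained recurrent network over feature frames to produce per-frame posteriors over HMM states, which a decoder would later combine with HMM transition probabilities. Dimensions, weights, and names are assumptions, not the patent's architecture.

```python
# Toy RNN emitting per-frame posteriors over HMM states.
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_hidden, n_states = 3, 8, 4            # toy dimensions
Wx = rng.normal(size=(n_hidden, n_feat)) * 0.1
Wh = rng.normal(size=(n_hidden, n_hidden)) * 0.1
Wo = rng.normal(size=(n_states, n_hidden)) * 0.1

def state_posteriors(frames: np.ndarray) -> np.ndarray:
    """Run a single-layer RNN over feature frames; softmax over states."""
    h = np.zeros(n_hidden)
    out = []
    for x in frames:
        h = np.tanh(Wx @ x + Wh @ h)
        logits = Wo @ h
        e = np.exp(logits - logits.max())
        out.append(e / e.sum())
    return np.array(out)                         # shape (T, n_states)

posteriors = state_posteriors(rng.normal(size=(10, n_feat)))
print(posteriors.shape, posteriors[0].round(3))
```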
The invention thus provides a method for converting speech to text: the acquired speech is processed to obtain a speech signal; the speech signal is processed with a language model, including polyphonic-character handling, so that the resulting speech sequence better fits the context and the speaker's emotion; and the speech sequence is then converted into text by an acoustic model generated through training on massive data. The text converted from speech is therefore more accurate, better matches the speaker's emotion, and better meets user needs.
Referring to Fig. 2, an embodiment of the present invention provides another method for converting speech to text.
After speech input, speech features such as the average sound energy, sound intensity, audio frequency, estimated pitch period, signal-to-noise ratio, and harmonics-to-noise ratio of the speech passing through a unit area perpendicular to the direction of sound-wave propagation are extracted; the feature data values of the speech are stored in a data model library; and the expected speech signal is finally obtained through continuous training, matching, and comparison.
A language model is then constructed, mainly as a probability model of sentence occurrence: the language model determines which word sequence is more likely or, given several words, predicts the likelihood of the next word. The acquired speech signal is framed using a preset EM algorithm, which comprises an E step and an M step: in the E step, the neural-network parameters are optimized with the BPTT algorithm; in the M step, the optimal alignment is re-searched using the output of the network. Specifically, the speech signal is framed; the E step locates the phoneme within each frame, and the M step finds all frames corresponding to each phoneme and estimates the phoneme model parameters from the features of those frames. After alignment, GMM training is performed for each state, and the E and M steps are then looped; at that point the E step only needs to judge whether a frame adjacent to a phoneme boundary belongs to the left or the right phoneme. Repeated training in this way yields a more accurate language model library. A much-simplified hard-EM alignment sketch is given below.
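The patent gives no code for this loop; the following hedged sketch shows only the E/M alternation under strong simplifications (one Gaussian mean per phoneme, 1-D features, hard assignments, no BPTT or GMM mixtures). All names are hypothetical.

```python
# Hard-EM alignment sketch: assign frames to phoneme models (E step),
# then re-estimate each model from its frames (M step), and repeat.
import numpy as np

def hard_em_align(frames: np.ndarray, n_phones: int, iters: int = 10):
    means = np.linspace(frames.min(), frames.max(), n_phones)  # init
    for _ in range(iters):
        # E step: assign each frame to its closest phoneme model.
        assign = np.argmin(np.abs(frames[:, None] - means[None, :]), axis=1)
        # M step: re-estimate each phoneme's parameter from its frames.
        for p in range(n_phones):
            if np.any(assign == p):
                means[p] = frames[assign == p].mean()
    return assign, means

# Toy data: frames drawn around two distinct phoneme "centers".
rng = np.random.default_rng(1)
frames = np.concatenate([rng.normal(0, .1, 50), rng.normal(3, .1, 50)])
assign, means = hard_em_align(frames, n_phones=2)
print(means.round(2), assign[:5], assign[-5:])
```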
An acoustic model is then created. It can be understood as modeling the utterance so as to convert the speech input into an acoustic representation, or as giving the probability that the speech belongs to a given acoustic symbol, that is, a model describing the conversion between speech and states. Recurrent-neural-network modeling is performed using the transition probability densities of a hidden Markov model; once the acoustic model is complete, speech recognition is carried out on an unknown sequence of speech frames based on it. This process is commonly called search decoding. Decoding is generally given a search network formed by connecting Markov models according to a grammar and a dictionary (each node of the network may be a phrase); one or more optimal paths are then selected among all candidate search paths as the recognition result to be rendered as text, where the optimality condition is usually maximum posterior probability. A path can be understood as a string of phrases appearing in the dictionary. A standard Viterbi-style sketch of this best-path search follows.
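The patent names no specific search algorithm; Viterbi decoding is the standard way to find a maximum-probability path through an HMM, so the sketch below uses it as a stand-in, with toy two-state scores (names and numbers are illustrative).

```python
# Viterbi best-path sketch over per-frame log-emission scores.
import numpy as np

def viterbi(log_emit: np.ndarray, log_trans: np.ndarray,
            log_init: np.ndarray) -> list[int]:
    """Most probable state path given per-frame log-emission scores."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans       # indexed (prev, cur)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                 # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Two states ("phrase A", "phrase B"), three frames of toy scores.
log_emit = np.log(np.array([[.9, .1], [.8, .2], [.2, .8]]))
log_trans = np.log(np.array([[.7, .3], [.3, .7]]))
log_init = np.log(np.array([.5, .5]))
print(viterbi(log_emit, log_trans, log_init))  # -> [0, 0, 1]
```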
A hidden Markov model (HMM) is a Markov model whose internal states are invisible from the outside; only the output value at each moment can be observed. For speech recognition systems, the output values are typically the acoustic features computed from individual frames. Describing a speech signal with an HMM requires two assumptions: first, that the transition to an internal state depends only on the previous state; and second, that each output value depends only on the current state (or the current transition). These assumptions reduce the complexity of the model, as the factorization below makes explicit.
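Under these two assumptions, the joint probability of an observation sequence and a state sequence factorizes into per-step terms (standard HMM notation, supplied here for illustration; the patent states the assumptions only in prose):

```latex
P(o_1, \dots, o_T, q_1, \dots, q_T)
  = \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t)
```

where \pi_{q_1} is the initial-state probability, a_{q_{t-1} q_t} the state-transition probability (first assumption), and b_{q_t}(o_t) the probability of emitting the frame's acoustic features from the current state (second assumption).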
In the embodiments provided by the invention, artificial-intelligence-based recognition and algorithm training still show some inaccuracy on imprecisely pronounced speech, and emotionally colored speech differs in sentence punctuation; such speech can nevertheless be understood from its context, and its approximate meaning analyzed through artificial intelligence. Training the acoustic model on a large number of databases improves the recognition rate of low-frequency words and the handling of different timbres, and also improves the recognition of emotional coloring; for speech whose emotional coloring (for example, in mother-and-child interactions) is recognized by the voice system, corresponding expressions and symbols can be generated to enhance recognition accuracy.
An embodiment of the present invention further provides a device for converting speech to text; referring to Fig. 3, the device includes:
a preprocessing unit, configured to preprocess acquired speech to obtain speech features corresponding to the speech;
a matching processing unit, configured to store the speech features in a data model library and perform matching processing to obtain a speech signal;
a language model processing unit, configured to process the speech signal based on a preset language model and determine a speech sequence corresponding to the speech signal, wherein the preset language model is a model for handling polyphonic characters in the speech signal; and
an acoustic model processing unit, configured to decode the speech sequence based on a preset acoustic model to obtain the text information corresponding to the speech sequence.
Optionally, the preprocessing unit comprises:
an acquisition subunit, configured to acquire speech; and
an extraction subunit, configured to extract the speech features of the speech passing through a unit area perpendicular to the direction of sound-wave propagation per unit time, wherein the speech features comprise one or more of average sound energy, sound intensity, audio frequency characteristics, pitch period, signal-to-noise ratio, and harmonics-to-noise ratio.
Optionally, the matching processing unit comprises:
a storage subunit, configured to store the speech features in the data model library; and
a matching subunit, configured to match the speech features against reference speech in the data model library to obtain a speech signal with interfering sounds filtered out.
Optionally, the language model processing unit comprises:
a model creation subunit, configured to create a language model based on a set of speech signals;
a framing subunit, configured to frame the speech signal with the language model and determine the phoneme matched to each frame of the speech signal;
a calculation subunit, configured to compute each frame's left and right neighboring phonemes from the matched phonemes; and
a determination subunit, configured to determine the speech sequence corresponding to the speech signal from each frame's phoneme and its left and right neighbors.
Optionally, the acoustic model processing unit comprises:
a speech sequence processing subunit, configured to process the speech sequence based on the preset acoustic model to obtain acoustic representation information corresponding to the speech sequence;
a decoding subunit, configured to decode the acoustic representation information to obtain text search paths corresponding to the acoustic representation information; and
a text determination subunit, configured to determine the optimal search path among the text search paths and determine the text information corresponding to the speech sequence according to that path.
The invention provides a device for converting speech to text. The speech acquired by the preprocessing unit and the matching processing unit is processed to obtain a speech signal; the speech signal is then processed in the language model processing unit and the acoustic model processing unit, where polyphonic-character handling makes the processed speech sequence better fit the context and the speaker's emotion; the speech sequence is then converted into text by an acoustic model generated through training on massive data. The text converted from speech is therefore more accurate, better matches the speaker's emotion, and meets user needs.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be cross-referenced. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief; for relevant details, refer to the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for converting speech to text, characterized by comprising:
preprocessing acquired speech to obtain speech features corresponding to the speech;
storing the speech features in a data model library and performing matching processing to obtain a speech signal;
processing the speech signal based on a preset language model to determine a speech sequence corresponding to the speech signal, wherein the preset language model is a model for handling polyphonic characters in the speech signal; and
decoding the speech sequence based on a preset acoustic model to obtain text information corresponding to the speech sequence.
2. The method according to claim 1, wherein preprocessing the acquired speech to obtain speech features corresponding to the speech comprises:
acquiring speech; and
extracting the speech features of the speech passing through a unit area perpendicular to the direction of sound-wave propagation per unit time, wherein the speech features comprise one or more of average sound energy, sound intensity, audio frequency characteristics, pitch period, signal-to-noise ratio, and harmonics-to-noise ratio.
3. The method according to claim 1, wherein storing the speech features in a data model library and performing matching processing to obtain a speech signal comprises:
storing the speech features in the data model library; and
matching the speech features against reference speech in the data model library to obtain a speech signal with interfering sounds filtered out.
4. The method according to claim 1, wherein processing the speech signal based on a preset language model to determine a speech sequence corresponding to the speech signal comprises:
creating a language model based on a set of speech signals;
framing the speech signal with the language model and determining the phoneme matched to each frame of the speech signal;
computing each frame's left and right neighboring phonemes from the matched phonemes; and
determining the speech sequence corresponding to the speech signal from each frame's phoneme and its left and right neighbors.
5. The method according to claim 1, wherein decoding the speech sequence based on the preset acoustic model to obtain text information corresponding to the speech sequence comprises:
processing the speech sequence based on the preset acoustic model to obtain acoustic representation information corresponding to the speech sequence;
decoding the acoustic representation information to obtain text search paths corresponding to the acoustic representation information; and
determining the optimal search path among the text search paths and determining the text information corresponding to the speech sequence according to that path.
6. A device for converting speech to text, characterized by comprising:
a preprocessing unit, configured to preprocess acquired speech to obtain speech features corresponding to the speech;
a matching processing unit, configured to store the speech features in a data model library and perform matching processing to obtain a speech signal;
a language model processing unit, configured to process the speech signal based on a preset language model and determine a speech sequence corresponding to the speech signal, wherein the preset language model is a model for handling polyphonic characters in the speech signal; and
an acoustic model processing unit, configured to decode the speech sequence based on a preset acoustic model to obtain the text information corresponding to the speech sequence.
7. The device according to claim 6, wherein the preprocessing unit comprises:
an acquisition subunit, configured to acquire speech; and
an extraction subunit, configured to extract the speech features of the speech passing through a unit area perpendicular to the direction of sound-wave propagation per unit time, wherein the speech features comprise one or more of average sound energy, sound intensity, audio frequency characteristics, pitch period, signal-to-noise ratio, and harmonics-to-noise ratio.
8. The device according to claim 6, wherein the matching processing unit comprises:
a storage subunit, configured to store the speech features in the data model library; and
a matching subunit, configured to match the speech features against reference speech in the data model library to obtain a speech signal with interfering sounds filtered out.
9. The device according to claim 6, wherein the language model processing unit comprises:
a model creation subunit, configured to create a language model based on a set of speech signals;
a framing subunit, configured to frame the speech signal with the language model and determine the phoneme matched to each frame of the speech signal;
a calculation subunit, configured to compute each frame's left and right neighboring phonemes from the matched phonemes; and
a determination subunit, configured to determine the speech sequence corresponding to the speech signal from each frame's phoneme and its left and right neighbors.
10. The device according to claim 6, wherein the acoustic model processing unit comprises:
a speech sequence processing subunit, configured to process the speech sequence based on the preset acoustic model to obtain acoustic representation information corresponding to the speech sequence;
a decoding subunit, configured to decode the acoustic representation information to obtain text search paths corresponding to the acoustic representation information; and
a text determination subunit, configured to determine the optimal search path among the text search paths and determine the text information corresponding to the speech sequence according to that path.
CN201811542192.3A 2018-12-17 2018-12-17 Method and device for escaping characters by voice Pending CN111402887A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811542192.3A CN111402887A (en) 2018-12-17 2018-12-17 Method and device for escaping characters by voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811542192.3A CN111402887A (en) 2018-12-17 2018-12-17 Method and device for escaping characters by voice

Publications (1)

Publication Number Publication Date
CN111402887A true CN111402887A (en) 2020-07-10

Family

ID=71435820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811542192.3A Pending CN111402887A (en) 2018-12-17 2018-12-17 Method and device for escaping characters by voice

Country Status (1)

Country Link
CN (1) CN111402887A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200410292A1 (en) * 2019-06-28 2020-12-31 International Business Machines Corporation Machine learned historically accurate temporal classification of objects
CN114125506A (en) * 2020-08-28 2022-03-01 上海哔哩哔哩科技有限公司 Voice auditing method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578464A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Language model establishing method, speech recognition method and electronic device
CN103903619A (en) * 2012-12-28 2014-07-02 安徽科大讯飞信息科技股份有限公司 Method and system for improving accuracy of speech recognition
WO2016101577A1 (en) * 2014-12-24 2016-06-30 中兴通讯股份有限公司 Voice recognition method, client and terminal device
WO2017076222A1 (en) * 2015-11-06 2017-05-11 阿里巴巴集团控股有限公司 Speech recognition method and apparatus
CN107705787A (en) * 2017-09-25 2018-02-16 北京捷通华声科技股份有限公司 A kind of audio recognition method and device
US20180137109A1 (en) * 2016-11-11 2018-05-17 The Charles Stark Draper Laboratory, Inc. Methodology for automatic multilingual speech recognition

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103903619A (en) * 2012-12-28 2014-07-02 安徽科大讯飞信息科技股份有限公司 Method and system for improving accuracy of speech recognition
CN103578464A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Language model establishing method, speech recognition method and electronic device
WO2016101577A1 (en) * 2014-12-24 2016-06-30 中兴通讯股份有限公司 Voice recognition method, client and terminal device
WO2017076222A1 (en) * 2015-11-06 2017-05-11 阿里巴巴集团控股有限公司 Speech recognition method and apparatus
US20180137109A1 (en) * 2016-11-11 2018-05-17 The Charles Stark Draper Laboratory, Inc. Methodology for automatic multilingual speech recognition
CN107705787A (en) * 2017-09-25 2018-02-16 北京捷通华声科技股份有限公司 A kind of audio recognition method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200410292A1 (en) * 2019-06-28 2020-12-31 International Business Machines Corporation Machine learned historically accurate temporal classification of objects
US11636282B2 (en) * 2019-06-28 2023-04-25 International Business Machines Corporation Machine learned historically accurate temporal classification of objects
CN114125506A (en) * 2020-08-28 2022-03-01 上海哔哩哔哩科技有限公司 Voice auditing method and device
CN114125506B (en) * 2020-08-28 2024-03-19 上海哔哩哔哩科技有限公司 Voice auditing method and device

Similar Documents

Publication Publication Date Title
CN109147758B (en) Speaker voice conversion method and device
CN109410914B (en) Method for identifying Jiangxi dialect speech and dialect point
WO2018121757A1 (en) Method and system for speech broadcast of text
CN102779508B (en) Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
KR101688240B1 (en) System and method for automatic speech to text conversion
KR100815115B1 (en) An Acoustic Model Adaptation Method Based on Pronunciation Variability Analysis for Foreign Speech Recognition and apparatus thereof
Athanaselis et al. ASR for emotional speech: clarifying the issues and enhancing performance
CN111105785B (en) Text prosody boundary recognition method and device
CN112767958A (en) Zero-learning-based cross-language tone conversion system and method
JPH09500223A (en) Multilingual speech recognition system
CN109243460A (en) A method of automatically generating news or interrogation record based on the local dialect
CN112349289B (en) Voice recognition method, device, equipment and storage medium
CN110853616A (en) Speech synthesis method, system and storage medium based on neural network
CN111489743A (en) Operation management analysis system based on intelligent voice technology
Chittaragi et al. Acoustic-phonetic feature based Kannada dialect identification from vowel sounds
CN111402887A (en) Method and device for escaping characters by voice
US20230252971A1 (en) System and method for speech processing
CN111583965A (en) Voice emotion recognition method, device, equipment and storage medium
JP2001109490A (en) Method for constituting voice recognition device, its recognition device and voice recognition method
Cahyaningtyas et al. Synthesized speech quality of Indonesian natural text-to-speech by using HTS and CLUSTERGEN
CN111833869B (en) Voice interaction method and system applied to urban brain
Woods et al. A robust ensemble model for spoken language recognition
JP3727436B2 (en) Voice original optimum collation apparatus and method
Akesh et al. Real-Time Subtitle Generator for Sinhala Speech
CN113990288B (en) Method for automatically generating and deploying voice synthesis model by voice customer service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination