CN110349567A - Speech signal recognition method and apparatus, storage medium, and electronic apparatus - Google Patents

Speech signal recognition method and apparatus, storage medium, and electronic apparatus

Info

Publication number
CN110349567A
CN110349567A (application CN201910741238.2A)
Authority
CN
China
Prior art keywords
target
phoneme
voice signal
language
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910741238.2A
Other languages
Chinese (zh)
Other versions
CN110349567B (en)
Inventor
韦林煊
董文伟
林炳怀
张劲松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Tencent Technology Shenzhen Co Ltd
Original Assignee
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING LANGUAGE AND CULTURE UNIVERSITY, Tencent Technology Shenzhen Co Ltd filed Critical BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority to CN201910741238.2A priority Critical patent/CN110349567B/en
Publication of CN110349567A publication Critical patent/CN110349567A/en
Application granted granted Critical
Publication of CN110349567B publication Critical patent/CN110349567B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 — Training
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/025 — Phonemes, fenemes or fenones being the recognition units
    • G10L 2015/225 — Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a speech signal recognition method and apparatus, a storage medium, and an electronic apparatus. The method includes: acquiring, in a target application, a first speech signal in a first target language corresponding to a target text in the first target language; acquiring, in the target application, a recognition result obtained by recognizing the first speech signal with a target recognition model, where the target acoustic model in the target recognition model is obtained by training an initial acoustic model with first training data in the first target language and second training data in a second target language, and the target acoustic model is used to output, for each frame of the first speech signal, the probability that the frame corresponds to a target phoneme in the first target language; and, when the recognition result indicates that the first speech signal contains a mispronounced phoneme, marking, in the target application, the character in the target text corresponding to the mispronounced phoneme. The invention solves the technical problem of inaccurate mispronunciation detection in the related art.

Description

Speech signal recognition method and apparatus, storage medium, and electronic apparatus
Technical field
The present invention relates to the field of speech processing, and in particular to a speech signal recognition method and apparatus, a storage medium, and an electronic apparatus.
Background art
In the prior art, applications for detecting mispronunciations treat a mispronunciation as the substitution of a similar phoneme within a single corpus. Because pronunciation varies greatly across speakers and exhibits pronounced acoustic differences, the acoustic model used for automatic mispronunciation detection is poorly robust under conditions of insufficient pronunciation data.
No effective solution to the above problem has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a speech signal recognition method and apparatus, a storage medium, and an electronic apparatus, so as at least to solve the technical problem of inaccurate mispronunciation detection in the related art.
According to an aspect of the embodiments of the present invention, a speech signal recognition method is provided, including: acquiring, in a target application, a first speech signal in a first target language corresponding to a target text in the first target language; acquiring, in the target application, a recognition result obtained by recognizing the first speech signal with a target recognition model, where the target acoustic model in the target recognition model is obtained by training an initial acoustic model with first training data in the first target language and second training data in a second target language, and the target acoustic model is used to output, for each frame of the first speech signal, the probability that the frame corresponds to a target phoneme in the first target language; and, when the recognition result indicates that the first speech signal contains a mispronounced phoneme, marking, in the target application, the character in the target text corresponding to the mispronounced phoneme.
According to another aspect of the embodiments of the present invention, a speech signal recognition apparatus is further provided, including: a first acquisition module, configured to acquire, in a target application, a first speech signal in a first target language corresponding to a target text in the first target language; a second acquisition module, configured to acquire, in the target application, a recognition result obtained by recognizing the first speech signal with a target recognition model, where the target acoustic model in the target recognition model is obtained by training an initial acoustic model with first training data in the first target language and second training data in a second target language, and the target acoustic model is configured to output, for each frame of the first speech signal, the probability that the frame corresponds to a target phoneme in the first target language; and a marking module, configured to mark, in the target application, the character in the target text corresponding to a mispronounced phoneme when the recognition result indicates that the first speech signal contains a mispronounced phoneme.
Optionally, the apparatus further includes: a third acquisition module, configured to acquire, before the first speech signal in the first target language corresponding to the target text in the first target language is acquired in the target application, the first training data in the first target language and the second training data in the second target language, where the first training data includes first real training data and first simulated training data in the first target language, and the second training data includes second real training data and second simulated training data in the second target language; and a first determining module, configured to train the initial acoustic model with the first training data in the first target language and the second training data in the second target language to obtain the target acoustic model.
Optionally, the first determining module includes: a first determination unit, configured to input a first phoneme in the first training data of the first target language into a fully connected layer of the initial acoustic model, and obtain a first probability, output by the fully connected layer, that the first phoneme in the first training data is a first target phoneme in the first target language; a second determination unit, configured to input a second phoneme in the second training data of the second target language into the fully connected layer, and obtain a second probability, output by the fully connected layer, that the second phoneme in the second training data is a second target phoneme in the second target language; a first acquisition unit, configured to acquire the first feature shared by the first phoneme and the second phoneme when the first target phoneme is similar to the second target phoneme, the first probability is greater than a first threshold, and the second probability is greater than a second threshold; and a third determination unit, configured to determine the initial acoustic model as the target acoustic model when the similarity between the first feature and a second feature is greater than a third threshold, where the second feature is the feature shared by the first target phoneme and the second target phoneme.
Optionally, the third acquisition module includes: a second acquisition unit, configured to acquire first real speech information produced by a first subject in the first target language, where the first real training data includes the first real speech information; a third acquisition unit, configured to acquire second real speech information produced by a second subject in the first target language, where the vocal tract length of the second real speech information is greater than that of the first real speech information; a fourth determination unit, configured to apply vocal tract length normalization (VTLN) to the speech features of the second real speech information to perform vocal tract conversion and obtain the first simulated training data, where the vocal tract length of the speech information in the first simulated training data equals that of the first real speech information; a fourth acquisition unit, configured to acquire third real speech information produced by a third subject in the second target language, where the second real training data includes the third real speech information; a fifth acquisition unit, configured to acquire fourth real speech information produced by a fourth subject in the second target language, where the vocal tract length of the fourth real speech information is greater than that of the third real speech information; and a fifth determination unit, configured to apply the VTLN algorithm to the speech features of the fourth real speech information to perform vocal tract conversion and obtain the second simulated training data, where the vocal tract length of the speech information in the second simulated training data equals that of the third real speech information.
Optionally, the second acquisition module includes: a sixth determination unit, configured to perform feature extraction on the first speech signal to obtain frame-level feature information of the first speech signal; a seventh determination unit, configured to input the frame-level feature information into the target acoustic model and obtain the posterior probability, output by the target acoustic model, of each frame of the first speech signal, where the posterior probability indicates, for each frame, the probability that the frame corresponds to a target phoneme in the first target language; and an eighth determination unit, configured to determine, with the Goodness of Pronunciation (GOP) algorithm and the posterior probability of each frame of the first speech signal, whether the phoneme corresponding to each frame of the first speech signal deviates from the target phoneme, to obtain the recognition result.
Optionally, the sixth determination unit includes: a first determining subunit, configured to enhance the acquired current speech signal according to a preset algorithm to obtain a first enhanced speech signal; a second determining subunit, configured to window the first enhanced signal to obtain a first windowed speech signal; a third determining subunit, configured to apply a fast Fourier transform (FFT) to each frame of the first windowed speech signal to obtain a frequency-domain signal corresponding to the first windowed speech signal; and a fourth determining subunit, configured to filter the frequency-domain signal frame by frame to extract the frame-level feature information of the first speech signal.
Optionally, the apparatus further includes: an alignment module, configured to align, after it is determined with the Goodness of Pronunciation (GOP) algorithm and the posterior probability of each frame of the first speech signal whether the phoneme corresponding to each frame deviates from the target phoneme and the recognition result is obtained, the phoneme corresponding to each frame of the first speech signal in the recognition result with the target phoneme.
According to another aspect of the embodiments of the present invention, a storage medium is further provided, the storage medium storing a computer program, where the computer program is configured to execute the above speech signal recognition method when run.
According to another aspect of the embodiments of the present invention, an electronic apparatus is further provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor executes the above speech signal recognition method through the computer program.
In the embodiments of the present invention, a first speech signal in a first target language corresponding to a target text in the first target language is acquired in a target application; a recognition result obtained by recognizing the first speech signal with a target recognition model is acquired in the target application, where the target acoustic model in the target recognition model is obtained by training an initial acoustic model with first training data in the first target language and second training data in a second target language, and outputs, for each frame of the first speech signal, the probability that the frame corresponds to a target phoneme in the first target language; and, when the recognition result indicates that the first speech signal contains a mispronounced phoneme, the character in the target text corresponding to the mispronounced phoneme is marked in the target application. Because the target acoustic model used to recognize the first speech signal is trained with training data from both the first and the second target language, the training data is more diverse, achieving the technical effect of accurately identifying whether the speech signal is mispronounced, and thereby solving the technical problem of inaccurate mispronunciation detection in the related art.
Brief description of the drawings
The accompanying drawings described herein are provided for a further understanding of the present invention and constitute a part of this application; the illustrative embodiments of the present invention and their description are used to explain the present invention and do not constitute an improper limitation of the present invention. In the drawings:
Fig. 1 is a schematic diagram of the application environment of an optional speech signal recognition method according to an embodiment of the present invention;
Fig. 2 is a flowchart of an optional speech signal recognition method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of optional mispronunciation-detection software for speech signals according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of optionally training an acoustic model according to an embodiment of the present invention;
Fig. 5 is an interaction framework diagram of a user optionally performing pronunciation exercises with English-learning software according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of an optional speech signal transform according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of an optional speech signal hierarchy according to an embodiment of the present invention;
Fig. 8 is a structural schematic diagram of an optional speech signal recognition apparatus according to an embodiment of the present invention;
Fig. 9 is a structural schematic diagram of an optional electronic apparatus according to an embodiment of the present invention.
Detailed description of the embodiments
To enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", and the like in the specification, claims, and accompanying drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or device.
According to an aspect of the embodiments of the present invention, a speech signal recognition method is provided. Optionally, as an optional implementation, the speech signal recognition method can be, but is not limited to being, applied in the environment shown in Fig. 1.
In Fig. 1, a target application runs on user equipment 102, and the first speech signal can be acquired through the target application. The user equipment 102 includes a memory 104 for storing the first speech signal and a processor 106 for processing the first speech signal. The user equipment 102 and a server 112 exchange data over a network 110. The server 112 includes a database 114 for storing operational data and a processing engine 116 for processing the operational data. As shown in Fig. 1, the first speech signal, produced in the first target language, is acquired in the target application installed on the user equipment 102. The user equipment 102 acquires, in the target application, the recognition result obtained by recognizing the first speech signal with the target recognition model, where the target acoustic model in the target recognition model is obtained by training an initial acoustic model with first training data in the first target language and second training data in a second target language, and the target acoustic model outputs, for each frame of the first speech signal, the probability that the frame corresponds to a target phoneme in the first target language. When the recognition result indicates that the first speech signal contains a mispronounced phoneme, the user equipment 102 marks, in the target application, the character in the target text corresponding to the mispronounced phoneme.
Optionally, the speech signal recognition method can be, but is not limited to being, applied in a client running on user equipment 102 capable of computing data. The user equipment 102 may be a mobile phone, tablet, laptop, PC, or the like. The network 110 may include, but is not limited to, a wireless network or a wired network, where the wireless network includes WIFI and other networks implementing wireless communication, and the wired network may include, but is not limited to, wide area networks, metropolitan area networks, and local area networks. The server 112 may include, but is not limited to, any hardware device capable of computation.
Optionally, as an optional implementation, as shown in Fig. 2, the speech signal recognition method includes the following steps (a minimal sketch follows the list):
S202: acquiring, in a target application, a first speech signal in a first target language corresponding to a target text in the first target language;
S204: acquiring, in the target application, a recognition result obtained by recognizing the first speech signal with a target recognition model, where the target acoustic model in the target recognition model is obtained by training an initial acoustic model with first training data in the first target language and second training data in a second target language, and the target acoustic model outputs, for each frame of the first speech signal, the probability that the frame corresponds to a target phoneme in the first target language;
S206: when the recognition result indicates that the first speech signal contains a mispronounced phoneme, marking, in the target application, the character in the target text corresponding to the mispronounced phoneme.
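A minimal sketch of how steps S202-S206 chain together, assuming Python; the callables `extract_features`, `acoustic_model`, and `score_phones`, and the decision threshold, are hypothetical stand-ins for the concrete components described later in this text, not the patent's actual interfaces.

```python
# Hypothetical orchestration of S202-S206; every callable here is an
# assumed stand-in, not the patent's actual interface.
def recognize(signal, target_text, extract_features, acoustic_model,
              score_phones, threshold=-1.0):
    feats = extract_features(signal)       # S202: frame-level features
    posteriors = acoustic_model(feats)     # S204: per-frame phone posteriors
    scores = score_phones(posteriors)      # GOP-style score per phone
    # S206: mark characters whose phone score indicates a deviation
    # (assumes one score per character for simplicity).
    return [ch for ch, s in zip(target_text, scores) if s < threshold]
```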
Optionally, the speech signal recognition method can be, but is not limited to being, applied in the field of speech recognition, for example in mispronunciation detection, and is applicable to K12 phoneme-level mispronunciation detection for any foreign language being learned.
Optionally, the method in this embodiment can be applied on, but is not limited to, PCs and mobile terminals (mobile phones, tablets, in-vehicle systems, and the like).
Optionally, the first target language and the second target language include, but are not limited to, English and Chinese. For example, the speech "my name is Linda" is input in the target application.
Optionally, the target application includes, but is not limited to, speech-checking applications such as English-learning or Chinese-learning software. Fig. 3 shows the process of detecting English mispronunciations in English-learning software: during pronunciation practice, the learner first reads the specified text aloud; the backend of the English-learning software then detects mispronunciations in the learner's speech and feeds the result back to the speaker. The greyed phonemes mark where the algorithm detected the speaker's mispronunciations.
Optionally, in this embodiment, the target acoustic model includes, but is not limited to, a neural network model. In the scenario of detecting which phonemes a learner has mispronounced, the performance of the target acoustic model is critical: its quality directly affects subsequent detection performance. As a statistical model, its performance depends on whether the training corpus can accurately characterize the overall distribution of standard pronunciation, and the lack of enough suitable training corpora is one of the main obstacles to high-performance mispronunciation detection. Since learners' pronunciations vary with speaker identity and language proficiency, the shortage of training data must be made up by appropriate technical means. For example, the training of the target acoustic model can draw on the following theories:
1) Second-language acquisition theory: when learning the pronunciation of a second language (L2), learners tend to substitute phonemes of their mother tongue (first language, L1) for similar L2 phonemes; this is one of the important sources of mispronounced phonemes.
2) Transfer-learning theory in deep learning: different data and tasks may have inherent correlations; by using the implicit hierarchical parameters of a deep neural network to capture these correlations, knowledge obtained from one task can be applied to solving another.
Optionally, in an English speech-checking scenario for example, transfer learning is used to bring in data as highly correlated as possible with the target task of K12 English mispronunciation detection, building a detection system with robust performance. The specific strategy is as follows (a model sketch follows this list):
1) Use a time-delay neural network (TDNN) model as the acoustic modeling method.
2) Map adult English-L1 speech feature parameters with vocal tract length normalization (VTLN) to generate a simulated K12 English feature parameter library.
3) Map adult Chinese-L1 speech feature parameters with the VTLN method to generate a simulated K12 Mandarin feature parameter library.
4) Use multi-task learning: introduce the Chinese K12 training data (real and simulated) and the English K12 training data (real and simulated) at the input layer, and set up separate Chinese and English speech recognition tasks at the output layer. Through the latent transfer-learning mechanism, obtain a Chinese/English pronunciation acoustic model that is highly robust to the large variability of K12 speech.
5) The obtained English phone output nodes can be used to perform K12 English mispronunciation detection.
6) Compare the representations of the obtained English phone output nodes and Chinese phone output nodes to obtain highly robust K12 English mispronunciation detection.
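A minimal sketch of such a multi-task TDNN, assuming PyTorch; the layer sizes, dilations, and the two phone-inventory sizes are illustrative assumptions rather than values from the patent. The 27-dimensional input matches the 23-dim FBANK + 3 pitch + 1 energy features described later in this text.

```python
import torch
import torch.nn as nn

class MultiTaskTDNN(nn.Module):
    """Shared TDNN trunk with one output layer per language task."""
    def __init__(self, feat_dim=27, hidden=512, en_phones=42, zh_phones=60):
        super().__init__()
        # TDNN layers are 1-D convolutions over time with growing context.
        self.trunk = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3), nn.ReLU(),
        )
        # Task-specific heads: English and Chinese phone posteriors.
        self.en_head = nn.Conv1d(hidden, en_phones, kernel_size=1)
        self.zh_head = nn.Conv1d(hidden, zh_phones, kernel_size=1)

    def forward(self, feats, lang):
        # feats: (batch, feat_dim, frames)
        h = self.trunk(feats)
        head = self.en_head if lang == "en" else self.zh_head
        return torch.log_softmax(head(h), dim=1)  # frame-level log-posteriors
```

Minibatches from the English and Chinese corpora (real plus VTLN-simulated) each update the shared trunk together with their own output head; the shared trunk is what carries the transfer between the two tasks.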
Optionally, to realize a robust K12 English mispronunciation detection algorithm, the acoustic modeling scheme of this embodiment is shown schematically in Fig. 4 (training part): the first training data of English pronunciation and the second training data of Chinese pronunciation serve as the corpus for training the target acoustic model, and the first speech signal is detected by combining the features of the two kinds of training data.
Through this embodiment, the target acoustic model is obtained by training the initial acoustic model with the first training data in the first target language and the second training data in the second target language; the training corpus thus covers two target languages rather than a single corpus. This increases the accuracy of the target acoustic model's output and improves the robustness of the resulting mispronunciation detection model.
In an optional embodiment, before the first speech signal in the first target language corresponding to the target text in the first target language is acquired in the target application, the method further includes:
S1: acquiring the first training data in the first target language and the second training data in the second target language, where the first training data includes first real training data and first simulated training data in the first target language, and the second training data includes second real training data and second simulated training data in the second target language;
S2: training the initial acoustic model with the first training data in the first target language and the second training data in the second target language to obtain the target acoustic model.
Optionally, as shown in Fig. 4, in a children's English-learning scenario for example, children's English pronunciation can serve as the first real training data in the first target language, and adult English pronunciation as the first simulated training data; children's Chinese pronunciation serves as the second real training data in the second target language, and adult Chinese pronunciation as the second simulated training data. The target acoustic model is obtained by training on the common features of the two corpora.
Through this embodiment, the target acoustic model is trained with the speech of different speakers in two target languages, which improves the robustness of the target acoustic model, increases the accuracy of mispronunciation detection, and makes the model applicable to more kinds of speakers.
In an optional embodiment, training the initial acoustic model with the first training data in the first target language and the second training data in the second target language to obtain the target acoustic model includes:
S1: inputting a first phoneme in the first training data of the first target language into a fully connected layer of the initial acoustic model, and obtaining a first probability, output by the fully connected layer, that the first phoneme in the first training data is a first target phoneme in the first target language;
Optionally, the initial acoustic model may include multiple fully connected layers, for example six. The phonemes in the first training data are input into the initial acoustic model in sequence, and for each phoneme the probability output by the fully connected layer that it is a target phoneme in the first target language is obtained.
S2: inputting a second phoneme in the second training data of the second target language into the fully connected layer, and obtaining a second probability, output by the fully connected layer, that the second phoneme in the second training data is a second target phoneme in the second target language;
Optionally, the phonemes in the second training data are input into the initial acoustic model in sequence, and for each phoneme the probability output by the fully connected layer that it is a target phoneme in the second target language is obtained.
S3: when the first target phoneme is similar to the second target phoneme, the first probability is greater than a first threshold, and the second probability is greater than a second threshold, acquiring the first feature shared by the first phoneme and the second phoneme;
S4: when the similarity between the first feature and a second feature is greater than a third threshold, determining the initial acoustic model as the target acoustic model, where the second feature is the feature shared by the first target phoneme and the second target phoneme.
Optionally, in this embodiment, suppose for example that the first target phoneme and the second target phoneme are similar (e.g. both pronounced like "p"), the probability that the first phoneme is pronounced as "p" is 90%, and the probability that the second phoneme is pronounced as "p" is 85%. The first phoneme and the second phoneme are then considered to share a first feature. When the similarity between this shared first feature and the second feature reaches the third threshold, the target acoustic model is judged to have reached a sufficient number of iterations and converged, so that its detection of the first speech signal is sufficiently accurate.
Optionally, when the similarity between the first feature and the second feature is less than the third threshold, the target acoustic model has not converged, and training continues with the training data until convergence is reached.
Through this embodiment, the target acoustic model is obtained by training on the shared features extracted between phonemes, which increases the robustness of the target acoustic model. A sketch of the convergence check follows.
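A minimal sketch of this convergence check, assuming Python/numpy; the thresholds, the use of cosine similarity, and the averaging of hidden-layer activations as the shared "first feature" are all illustrative assumptions, since the patent does not fix these choices.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def model_converged(p1, p2, feat1, feat2, target_shared_feat,
                    t1=0.8, t2=0.8, t3=0.9):
    """p1, p2: probabilities that the first/second training phonemes are the
    (mutually similar) first/second target phonemes; feat1, feat2: hidden
    activations for the two phonemes; target_shared_feat: the feature shared
    by the two target phonemes (the 'second feature')."""
    if p1 <= t1 or p2 <= t2:
        return False                      # phoneme probabilities too low
    shared = (feat1 + feat2) / 2          # assumed 'first feature'
    return cosine(shared, target_shared_feat) > t3
```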
In an optional embodiment, acquiring the first training data in the first target language and the second training data in the second target language includes:
S1: acquiring first real speech information produced by a first subject in the first target language, where the first real training data includes the first real speech information;
S2: acquiring second real speech information produced by a second subject in the first target language, where the vocal tract length of the second real speech information is greater than that of the first real speech information;
S3: applying vocal tract length normalization (VTLN) to the speech features of the second real speech information to perform vocal tract conversion, obtaining the first simulated training data, where the vocal tract length of the speech information in the first simulated training data equals that of the first real speech information;
S4: acquiring third real speech information produced by a third subject in the second target language, where the second real training data includes the third real speech information;
S5: acquiring fourth real speech information produced by a fourth subject in the second target language, where the vocal tract length of the fourth real speech information is greater than that of the third real speech information;
S6: applying the VTLN algorithm to the speech features of the fourth real speech information to perform vocal tract conversion, obtaining the second simulated training data, where the vocal tract length of the speech information in the second simulated training data equals that of the third real speech information.
Optionally, in this embodiment, in a children's English-learning scenario for example, children's English pronunciation can serve as the first real training data in the first target language and converted adult English pronunciation as the first simulated training data; children's Chinese pronunciation serves as the second real training data in the second target language and converted adult Chinese pronunciation as the second simulated training data. The target acoustic model is trained on the common features of the two corpora. Because an adult's vocal tract is longer than a child's, in the scenario of detecting children's speech the real adult speech must be converted into speech whose vocal tract length equals the child's, thereby increasing the amount of training data.
Through this embodiment, converting adult speech with the VTLN algorithm turns the speech of different speakers into training corpus, which increases the amount of training data, improves the accuracy of model training, and improves the robustness of the target acoustic model. A sketch of the warping step follows.
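A minimal sketch of a VTLN-style piecewise-linear frequency warp on a single spectrum frame, assuming Python/numpy; the warp factor, knee frequency, and the choice of warping the magnitude spectrum directly are illustrative assumptions, not the patent's exact procedure.

```python
import numpy as np

def vtln_warp_frame(spec_frame, alpha=0.85, f_knee=0.8):
    """Warp one magnitude-spectrum frame along normalized frequency [0, 1].

    The output value at frequency f is read from the input at w(f), where
    w is linear with slope alpha below the knee and continues linearly so
    that w(1) = 1. With alpha < 1 the spectrum is read from lower
    frequencies, shifting spectral content upward, which approximates a
    shorter (e.g. child) vocal tract from adult speech."""
    n = len(spec_frame)
    f = np.linspace(0.0, 1.0, n)
    warped = np.where(
        f < f_knee,
        alpha * f,
        alpha * f_knee + (1 - alpha * f_knee) * (f - f_knee) / (1 - f_knee),
    )
    # Sample the original spectrum at the warped frequencies.
    return np.interp(warped, f, spec_frame)
```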
In an optional embodiment, acquiring, in the target application, the recognition result obtained by recognizing the first speech signal with the target recognition model includes:
S1: performing feature extraction on the first speech signal to obtain frame-level feature information of the first speech signal;
S2: inputting the frame-level feature information into the target acoustic model to obtain the posterior probability, output by the target acoustic model, of each frame of the first speech signal, where the posterior probability indicates, for each frame, the probability that the frame corresponds to a target phoneme in the first target language;
S3: determining, with the Goodness of Pronunciation (GOP) algorithm and the posterior probability of each frame of the first speech signal, whether the phoneme corresponding to each frame of the first speech signal deviates from the target phoneme, to obtain the recognition result.
Optionally, in this embodiment, in an English speech-checking scenario for example, Fig. 5 shows the interaction framework of a user performing pronunciation exercises with English-learning software, divided into a client part and a server part. In the client part, the user performs pronunciation exercises with the English-learning software (for example, inputs the first speech signal). After recording the audio of the first speech signal produced by the user, the English-learning software sends it to the server; after the server has detected the mispronunciations, it returns them to the user together with suggestions for correction. The server part covers the whole process of phone-level mispronunciation detection on the user's pronunciation after receiving the audio of the user's practice, and returns the detected mispronunciation information to the client so that the user can proceed to the next exercise.
Optionally, the server-side detection process includes the following steps:
S501: performing feature extraction on the first speech signal to obtain the frame-level feature information of the first speech signal.
S502: inputting the frame-level feature information into the target acoustic model to obtain the posterior probability of each frame of the first speech signal output by the target acoustic model; the posterior probability represents the phoneme that the learner most probably intended to produce in each frame.
Since the target acoustic model is usually trained on native-speaker data, it can be viewed as judging, from a native speaker's perspective, what the learner has produced. The target acoustic model adopted in this embodiment is a model based on the HMM-TDNN speech recognition framework, whose principle is the standard decoding criterion:

w* = argmax_w P(w|x) = argmax_w p(x|w)·P(w)

where p(x|w) is the target acoustic model part, w is the pronunciation text of the first speech signal, and x is the learner's current pronunciation; the probability p(x|w) characterizes how well the learner produced the phonemes represented by the current text.
S503: determining, with the GOP algorithm and the posterior probability of each frame of the first speech signal, whether the phoneme corresponding to each frame of the first speech signal deviates from the target phoneme, to obtain the recognition result.
The GOP algorithm combines the frame-level posterior probabilities output by the target acoustic model with phone-level alignment information (which phone the user should produce) to aggregate the frame-level posteriors into phone-level posteriors (which phone the user actually produced); by comparing the probability of the phone the user should have produced with that of the phone actually produced, each pronounced phone can be judged deviant or not. The GOP used in this embodiment is:

GOP(p) = log( P(p | o; t_s, t_e) / max_{q∈Q} P(q | o; t_s, t_e) ) / (t_e − t_s)

where o denotes the sampled speech signal; p denotes a phoneme in the first speech signal o; t_s and t_e denote the start and end frame indices of the phoneme; P(·) denotes the posterior probability of a phoneme in the first speech signal; and Q denotes the phone set.
After GOP scoring, which phonemes in the current pronunciation are mispronounced, and what was actually produced instead, are returned to the user at the client.
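A minimal sketch of GOP scoring from frame-level log-posteriors, assuming Python/numpy; the alignment format (phone id, start frame, end frame) and the decision threshold are assumed conventions, not the patent's exact data structures, and the per-frame maximum is a common practical stand-in for the maximum over competing phones.

```python
import numpy as np

def gop_scores(log_post, alignment):
    """log_post: (frames, phones) frame-level log-posteriors from the
    acoustic model; alignment: list of (phone_id, t_start, t_end) giving
    the canonical phone each frame should realize.

    Returns one GOP score per aligned phone: the average log-posterior of
    the canonical phone minus that of the best competing phone. Scores near
    zero mean the canonical phone was the likeliest; strongly negative
    scores suggest a mispronunciation."""
    scores = []
    for phone_id, ts, te in alignment:
        seg = log_post[ts:te]                 # frames of this phone
        canonical = seg[:, phone_id].mean()   # phone the user should produce
        best = seg.max(axis=1).mean()         # most likely phone per frame
        scores.append(float(canonical - best))
    return scores
```

A phone whose score falls below a tuned threshold is flagged, and the corresponding character of the target text is marked in the client.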
Through this embodiment, following the process in Fig. 5, the target acoustic model is used to detect whether the first speech signal contains mispronunciations, increasing detection accuracy.
In an optional embodiment, performing feature extraction on the first speech signal to obtain the frame-level feature information of the first speech signal includes:
S1: enhancing the acquired current speech signal according to a preset algorithm to obtain a first enhanced speech signal;
S2: windowing the first enhanced signal to obtain a first windowed speech signal;
S3: applying a fast Fourier transform (FFT) to each frame of the first windowed speech signal to obtain a frequency-domain signal corresponding to the first windowed speech signal;
S4: filtering the frequency-domain signal frame by frame to extract the frame-level feature information of the first speech signal.
Optionally, in this embodiment, enhancing the acquired current speech signal according to the preset algorithm is a pre-emphasis applied to the current speech signal: the high frequencies of the speech signal are boosted to a certain degree to remove the influence of radiation at the mouth, per the following formula:
y(n) = x(n) − α·x(n−1);
where y(n) is the first enhanced speech signal, x(n) is the current speech sample, x(n−1) is the sample at the previous time point, and α is a preset parameter (e.g. 0.98).
Optionally, the first enhanced signal is framed with a frame length of 25 ms and a frame shift of 10 ms, decomposing several seconds of the first enhanced signal into a sequence of 25 ms speech segments, and each segment in this sequence is windowed, generally with a Hamming window.
Optionally, an FFT is applied to each short speech segment, transforming the speech signal from the time domain to the frequency domain, as shown in Fig. 6.
Optionally, the frequency-domain signal is filtered frame by frame: mel filtering is applied to this group of frequency-domain speech frames to extract features usable by the subsequent model, which is in essence a process of information compression and abstraction. Many kinds of features can be extracted at this stage, such as spectral features (mel-frequency cepstral coefficients (MFCC), FBANK, PLP, etc.), frequency features (pitch, formants, etc.), temporal features (duration), and energy features. The features used in this embodiment are 23-dimensional FBANK features plus 3 pitch features and a 1-dimensional energy feature. After this module, a segment of the learner's pronunciation becomes a group of feature sequences representing that pronunciation, i.e. the frame-level features shown in Fig. 7.
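A minimal sketch of this frame-level pipeline (pre-emphasis, framing, Hamming window, FFT, mel filterbank), assuming Python/numpy; the sample rate, FFT size, and the standard mel-filterbank construction are assumptions, and the pitch and energy features are omitted.

```python
import numpy as np

def fbank_features(signal, sr=16000, n_mels=23, alpha=0.98,
                   frame_ms=25, shift_ms=10, n_fft=512):
    # 1) Pre-emphasis: y(n) = x(n) - alpha * x(n-1)
    y = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # 2) Framing: 25 ms frames with a 10 ms shift, Hamming-windowed
    flen, fshift = sr * frame_ms // 1000, sr * shift_ms // 1000
    n_frames = 1 + (len(y) - flen) // fshift
    win = np.hamming(flen)
    frames = np.stack([y[i*fshift : i*fshift+flen] * win
                       for i in range(n_frames)])
    # 3) FFT: time domain -> frequency domain, power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 4) Mel filterbank: compress the spectrum into 23 log-energy bands
    mel = np.linspace(0, 2595 * np.log10(1 + sr / 2 / 700), n_mels + 2)
    hz = 700 * (10 ** (mel / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(power @ fbank.T + 1e-10)   # (frames, 23) FBANK features
```

In the embodiment these 23 FBANK dimensions are concatenated with 3 pitch features and 1 energy feature, giving 27-dimensional frame vectors for the acoustic model.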
Through this embodiment, by processing the signal features of the first speech signal, the target acoustic model detects whether the first speech signal contains mispronunciations, increasing detection accuracy.
In an optional embodiment, after determining, with the GOP algorithm and the posterior probability of each frame of the first speech signal, whether the phoneme corresponding to each frame of the first speech signal deviates from the target phoneme and obtaining the recognition result, the method further includes:
S1: aligning the phoneme corresponding to each frame of the first speech signal in the recognition result with the target phoneme.
Optionally, in this embodiment, based on the speech recognition framework and forced-alignment techniques, the text of the first speech signal is aligned at the phone level, giving the position of each phoneme within the speech segment and, at that position, which phoneme the user should produce.
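A minimal sketch of forced alignment as monotone dynamic programming over the frame log-posteriors, assuming Python/numpy; real systems align with an HMM decoder and a pronunciation lexicon, so this simplified variant is illustrative only.

```python
import numpy as np

def force_align(log_post, phone_seq):
    """Return (phone_id, t_start, t_end) spans maximizing the summed
    log-posterior, visiting phone_seq in order, each phone >= 1 frame.
    Assumes len(log_post) >= len(phone_seq)."""
    T, _ = log_post.shape
    P = len(phone_seq)
    score = np.full((T, P), -np.inf)
    back = np.zeros((T, P), dtype=int)       # 0 = stay, 1 = advance
    score[0, 0] = log_post[0, phone_seq[0]]
    for t in range(1, T):
        for p in range(P):
            stay = score[t - 1, p]
            adv = score[t - 1, p - 1] if p > 0 else -np.inf
            back[t, p] = int(adv > stay)
            score[t, p] = max(stay, adv) + log_post[t, phone_seq[p]]
    # Trace back to recover phone boundaries.
    spans, p, end = [], P - 1, T
    for t in range(T - 1, 0, -1):
        if back[t, p]:
            spans.append((phone_seq[p], t, end))
            end, p = t, p - 1
    spans.append((phone_seq[0], 0, end))
    return spans[::-1]
```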
In conclusion the present embodiment multitask transfer learning (Time Delay Neural Network, referred to as TDNN) under acoustic model modelling support, introduce the database of various target langua0s comprehensively, for example, American English children, American English it is adult, in Four kinds of mother tongue pronunciation libraries such as state children and Chinese Adult.It can be adapted for the inclined error detection of K12 pronunciation of any language, and can be effective The problem for alleviating its task related data deficiency, to further increase its detection performance.
In addition, the present embodiment in the detection in state K12 children English pronunciation phonemes error rate index, only makes compared to tradition Improve 20% or more with a kind of system of children's corpus of target langua0 is opposite.Can effectively it overcome in the inclined erroneous detection of K12 children pronunciation The problem of lacking enough suitable training datas in examining system, improves the robustness for obtaining inclined error detection model.
Method combining with pronunciation inspection software in the present embodiment can more accurately detected pair in K12 children pronunciation With wrong pronunciation, the marking based on voice quality is allowed more to have something to base on.It can accurately prompt and most be answered in K12 children pronunciation The phoneme of the correction misses partially, so that children limited focus on most important can be changed by mistake on just partially.In this way They can more efficiently improve oracy with more confidence.
It should be noted that, for the foregoing method embodiments, for simplicity of description they are expressed as a series of action combinations; however, those skilled in the art should understand that the present invention is not limited by the described order of actions, since according to the present invention some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
According to another aspect of the embodiments of the present invention, a speech signal recognition apparatus for implementing the above speech signal recognition method is further provided. As shown in Fig. 8, the apparatus includes:
a first acquisition module 82, configured to acquire, in a target application, a first speech signal in a first target language corresponding to a target text in the first target language;
a second acquisition module 84, configured to acquire, in the target application, a recognition result obtained by recognizing the first speech signal with a target recognition model, where the target acoustic model in the target recognition model is obtained by training an initial acoustic model with first training data in the first target language and second training data in a second target language, and the target acoustic model is configured to output, for each frame of the first speech signal, the probability that the frame corresponds to a target phoneme in the first target language;
a marking module 86, configured to mark, in the target application, the character in the target text corresponding to a mispronounced phoneme when the recognition result indicates that the first speech signal contains a mispronounced phoneme.
Optionally, the apparatus further includes:
a third acquisition module, configured to acquire, before the first speech signal in the first target language corresponding to the target text in the first target language is acquired in the target application, the first training data in the first target language and the second training data in the second target language, where the first training data includes first real training data and first simulated training data in the first target language, and the second training data includes second real training data and second simulated training data in the second target language;
a first determining module, configured to train the initial acoustic model with the first training data in the first target language and the second training data in the second target language to obtain the target acoustic model.
Optionally, the first determining module includes:
a first determination unit, configured to input a first phoneme in the first training data of the first target language into a fully connected layer of the initial acoustic model, and obtain a first probability, output by the fully connected layer, that the first phoneme in the first training data is a first target phoneme in the first target language;
a second determination unit, configured to input a second phoneme in the second training data of the second target language into the fully connected layer, and obtain a second probability, output by the fully connected layer, that the second phoneme in the second training data is a second target phoneme in the second target language;
a first acquisition unit, configured to acquire the first feature shared by the first phoneme and the second phoneme when the first target phoneme is similar to the second target phoneme, the first probability is greater than a first threshold, and the second probability is greater than a second threshold;
a third determination unit, configured to determine the initial acoustic model as the target acoustic model when the similarity between the first feature and a second feature is greater than a third threshold, where the second feature is the feature shared by the first target phoneme and the second target phoneme.
Optionally, the third acquisition module includes:
Second acquisition unit, configured to acquire first real speech information uttered by a first object in the first target language, wherein the first real training data includes the first real speech information;
Third acquisition unit, configured to acquire second real speech information uttered by a second object in the first target language, wherein the vocal tract length of the second real speech information is greater than the vocal tract length of the first real speech information;
Fourth determination unit, configured to perform vocal tract conversion on the speech features of the second real speech information using a vocal tract length normalization (VTLN) algorithm to obtain the first simulated training data, wherein the vocal tract length of the speech information in the first simulated training data is equal to the vocal tract length of the first real speech information;
Fourth acquisition unit, configured to acquire third real speech information uttered by a third object in the second target language, wherein the second real training data includes the third real speech information;
Fifth acquisition unit, configured to acquire fourth real speech information uttered by a fourth object in the second target language, wherein the vocal tract length of the fourth real speech information is greater than the vocal tract length of the third real speech information;
Fifth determination unit, configured to perform vocal tract conversion on the speech features of the fourth real speech information using the VTLN algorithm to obtain the second simulated training data, wherein the vocal tract length of the speech information in the second simulated training data is equal to the vocal tract length of the third real speech information.
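For illustration, the VTLN-based simulation described by the above units warps the frequency axis of the speech features so that, for example, a longer-vocal-tract recording approximates a speaker with a shorter vocal tract. Below is a minimal sketch of the common piecewise-linear VTLN warp; the warp factor, the knee point, and the librosa STFT front end are illustrative assumptions, since the patent does not specify which VTLN variant is used.

```python
import numpy as np
import librosa  # assumed front end; any STFT implementation would do

def vtln_warp(freqs: np.ndarray, alpha: float, f_max: float) -> np.ndarray:
    """Piecewise-linear VTLN warp: scale frequencies by alpha below a knee
    point, then interpolate linearly so the axis still ends at f_max.
    Monotonic as long as alpha * knee < f_max (so keep alpha below ~1.17)."""
    knee = 0.85 * f_max
    upper = alpha * knee + (f_max - alpha * knee) * (freqs - knee) / (f_max - knee)
    return np.where(freqs <= knee, alpha * freqs, upper)

def simulate_shorter_vocal_tract(wav: np.ndarray, sr: int,
                                 alpha: float = 1.1) -> np.ndarray:
    """Return a magnitude spectrogram whose formants are shifted upward,
    mimicking a shorter vocal tract (alpha > 1)."""
    spec = np.abs(librosa.stft(wav, n_fft=512, hop_length=160))
    freqs = np.linspace(0.0, sr / 2, spec.shape[0])
    warped = vtln_warp(freqs, alpha, sr / 2)
    # Re-sample every frame from the warped axis back onto the original grid.
    return np.stack([np.interp(freqs, warped, frame) for frame in spec.T], axis=1)
```

The warped spectrograms would then be converted into the same frame features as the real recordings and pooled with them as the first and second simulated training data.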
Optionally, the second acquisition module includes:
Sixth determination unit, configured to perform feature extraction on the first voice signal to obtain frame signal feature information of the first voice signal;
Seventh determination unit, configured to input the frame signal feature information into the target acoustic model and obtain, as output by the target acoustic model, a posterior probability for each frame signal in the first voice signal, wherein the posterior probability indicates the probability that each frame signal corresponds to the target phoneme in the first target language;
Eighth determination unit, configured to determine, using a goodness of pronunciation (GOP) algorithm and the posterior probability of each frame signal in the first voice signal, whether the phoneme corresponding to each frame signal in the first voice signal deviates from the target phoneme, so as to obtain the recognition result.
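For illustration, the standard GOP formulation scores each canonical phoneme by the average log posterior of that phoneme over its aligned frames, and flags a deviation when the score falls below a threshold. The sketch below follows that common definition; the posterior-matrix layout, the alignment format, and the threshold value are illustrative assumptions, as the patent does not spell out its GOP variant.

```python
import numpy as np

def gop_scores(posteriors: np.ndarray,
               alignment: list[tuple[int, int, int]]) -> list[float]:
    """posteriors: (n_frames, n_phonemes) frame-level posteriors from the
    target acoustic model; alignment: (phoneme_id, start_frame, end_frame)
    for each canonical phoneme. GOP(p) = mean over frames of log P(p | o_t)."""
    scores = []
    for phoneme_id, start, end in alignment:
        frame_post = posteriors[start:end, phoneme_id]
        scores.append(float(np.mean(np.log(frame_post + 1e-10))))
    return scores

def flag_mispronunciations(scores: list[float],
                           threshold: float = -4.0) -> list[bool]:
    # A phoneme whose GOP score falls below the threshold is taken to
    # deviate from the target phoneme (threshold value is illustrative).
    return [s < threshold for s in scores]
```

The per-phoneme flags constitute the recognition result that the marking module consumes.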
Optionally, the sixth determination unit includes:
First determination subunit, configured to perform signal enhancement on the acquired current voice signal according to a preset algorithm to obtain a first enhanced voice signal;
Second determination subunit, configured to perform a windowing operation on the first enhanced voice signal to obtain a first windowed voice signal;
Third determination subunit, configured to perform a fast Fourier transform (FFT) on each frame of the first windowed voice signal to obtain a frequency-domain signal corresponding to the first windowed voice signal;
Fourth determination subunit, configured to perform frame-by-frame filtering and extraction on the frequency-domain signal to obtain the frame signal feature information of the first voice signal.
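For illustration, these four subunits describe a standard filterbank front end: enhancement, windowing, per-frame FFT, and filtering. A minimal numpy sketch follows; the pre-emphasis "enhancement", the Hamming window, the frame sizes, and the mel filterbank are illustrative stand-ins for the unspecified preset algorithm and filters.

```python
import numpy as np
import librosa

def frame_features(wav: np.ndarray, sr: int = 16000, frame_len: int = 400,
                   hop: int = 160, n_mels: int = 40) -> np.ndarray:
    """Pre-emphasis -> Hamming windowing -> per-frame FFT -> mel filtering."""
    # Signal enhancement: simple pre-emphasis (illustrative choice).
    wav = np.append(wav[0], wav[1:] - 0.97 * wav[:-1])
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(wav) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wav[idx] * np.hamming(frame_len)
    # Fast Fourier transform per frame -> power spectrum (frequency-domain signal).
    power = np.abs(np.fft.rfft(frames, n=512)) ** 2
    # Frame-by-frame filtering and extraction: mel filterbank plus log compression.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=512, n_mels=n_mels)
    return np.log(power @ mel_fb.T + 1e-10)
```

The resulting (n_frames, n_mels) matrix is the frame signal feature information fed to the target acoustic model.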
Optionally, the above device further includes:
Alignment module, configured to align, after the GOP algorithm and the posterior probability of each frame signal in the first voice signal have been used to determine whether the phoneme corresponding to each frame signal in the first voice signal deviates from the target phoneme and the recognition result has been obtained, the phoneme corresponding to each frame signal in the first voice signal in the recognition result with the target phoneme.
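The patent does not fix the alignment procedure; one common realization is dynamic-programming (Levenshtein) alignment between the recognized phoneme sequence and the canonical target sequence, pairing each recognized phoneme with a target phoneme or a gap. The sketch below is that generic technique, not necessarily the patented method.

```python
def align_phonemes(recognized: list[str], target: list[str]) -> list[tuple[str, str]]:
    """Levenshtein alignment; returns (recognized, target) pairs, '-' marks a gap."""
    n, m = len(recognized), len(target)
    # dp[i][j] = edit distance between recognized[:i] and target[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if recognized[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    # Trace back through the table to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1]
                + (0 if recognized[i - 1] == target[j - 1] else 1)):
            pairs.append((recognized[i - 1], target[j - 1])); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((recognized[i - 1], "-")); i -= 1
        else:
            pairs.append(("-", target[j - 1])); j -= 1
    return pairs[::-1]
```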
According to another aspect of the embodiments of the present invention, an electronic device for implementing the above voice signal recognition method is further provided. As shown in Fig. 9, the electronic device includes a memory 902 and a processor 904; a computer program is stored in the memory 902, and the processor 904 is configured to execute the steps in any of the above method embodiments by means of the computer program.
Optionally, in this embodiment, the above electronic device may be located on at least one of multiple network devices in a computer network.
Optionally, in this embodiment, the above processor may be configured to execute the following steps by means of the computer program:
S1: acquiring, in a target application, a first voice signal of a first target language corresponding to a target text of the first target language;
S2: acquiring, in the target application, a recognition result obtained by a target recognition model identifying the first voice signal, wherein the target acoustic model in the target recognition model is a model obtained by training an initial acoustic model with the first training data of the first target language and the second training data of the second target language, and the target acoustic model is configured to output the probability that each frame signal in the first voice signal corresponds to a target phoneme in the first target language;
S3: in a case where the recognition result indicates that the first voice signal contains a mispronounced phoneme, marking, in the target application, the character in the target text corresponding to the mispronounced phoneme.
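For illustration, steps S1 to S3 can be tied together as below, reusing the frame_features, gop_scores, and flag_mispronunciations sketches above. The lexicon, phoneme-index map, aligner, and acoustic_model callables are hypothetical glue; the patent leaves these components unspecified.

```python
def mark_mispronounced_characters(wav, sr, target_text, lexicon, phoneme_ids,
                                  acoustic_model, aligner):
    """S1-S3 sketch. lexicon: character -> canonical phonemes (hypothetical);
    phoneme_ids: phoneme symbol -> model output index (hypothetical);
    aligner: returns a (start_frame, end_frame) span per phoneme (hypothetical).
    Returns the indices of target-text characters to mark."""
    feats = frame_features(wav, sr)        # S1: featurize the first voice signal
    posteriors = acoustic_model(feats)     # S2: frame-level phoneme posteriors
    phonemes, owner = [], []
    for i, ch in enumerate(target_text):
        for p in lexicon[ch]:
            phonemes.append(p)
            owner.append(i)                # remember which character owns p
    spans = aligner(posteriors, phonemes)
    alignment = [(phoneme_ids[p], s, e) for p, (s, e) in zip(phonemes, spans)]
    bad = flag_mispronunciations(gop_scores(posteriors, alignment))
    # S3: mark every character that owns at least one deviating phoneme.
    return sorted({owner[k] for k, flagged in enumerate(bad) if flagged})
```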
Optionally, as can be appreciated by those skilled in the art, the structure shown in Fig. 9 is merely illustrative. The electronic device may also be a terminal device such as a smart phone (for example, an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (Mobile Internet Devices, MID), or a PAD. Fig. 9 does not limit the structure of the above electronic device. For example, the electronic device may further include more or fewer components than shown in Fig. 9 (such as a network interface), or have a configuration different from that shown in Fig. 9.
The memory 902 may be used to store software programs and modules, such as the program instructions/modules corresponding to the voice signal recognition method and device in the embodiments of the present invention. By running the software programs and modules stored in the memory 902, the processor 904 executes various functional applications and data processing, thereby implementing the above voice signal recognition method. The memory 902 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 902 may further include memory located remotely from the processor 904, and such remote memory may be connected to the terminal through a network. Examples of the network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof. The memory 902 may specifically be used, but is not limited, to storing information such as the first voice signal. As an example, as shown in Fig. 9, the memory 902 may include, but is not limited to, the first acquisition module 82, the second acquisition module 84, and the marking module 86 of the above voice signal recognition device. In addition, it may further include, but is not limited to, other module units of the above voice signal recognition device, which are not described again in this example.
Optionally, the above transmitting device 906 is configured to receive or send data via a network. Specific examples of the network may include wired and wireless networks. In one example, the transmitting device 906 includes a network interface controller (Network Interface Controller, NIC), which can be connected to other network devices and a router through a cable so as to communicate with the Internet or a local area network. In one example, the transmitting device 906 is a radio frequency (Radio Frequency, RF) module, which is used to communicate with the Internet wirelessly.
In addition, the above electronic device further includes: a display 908, configured to display the recognition result; and a connection bus 910, configured to connect the module components in the above electronic device.
According to yet another aspect of the embodiments of the present invention, a storage medium is further provided. A computer program is stored in the storage medium, wherein the computer program is configured to execute, when run, the steps in any of the above method embodiments.
Optionally, in this embodiment, the above storage medium may be configured to store a computer program for executing the following steps:
S1: acquiring, in a target application, a first voice signal of a first target language corresponding to a target text of the first target language;
S2: acquiring, in the target application, a recognition result obtained by a target recognition model identifying the first voice signal, wherein the target acoustic model in the target recognition model is a model obtained by training an initial acoustic model with the first training data of the first target language and the second training data of the second target language, and the target acoustic model is configured to output the probability that each frame signal in the first voice signal corresponds to a target phoneme in the first target language;
S3: in a case where the recognition result indicates that the first voice signal contains a mispronounced phoneme, marking, in the target application, the character in the target text corresponding to the mispronounced phoneme.
Optionally, in this embodiment, those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments may be completed by a program instructing the hardware related to a terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, and the like.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the superiority or inferiority of the embodiments.
If the integrated unit in the above embodiments is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in the above computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place, or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software functional unit.
The above are only preferred embodiments of the present invention. It should be noted that, for those of ordinary skill in the art, various improvements and modifications may be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A voice signal recognition method, comprising:
acquiring, in a target application, a first voice signal of a first target language corresponding to a target text of the first target language;
acquiring, in the target application, a recognition result obtained by a target recognition model identifying the first voice signal, wherein a target acoustic model in the target recognition model is a model obtained by training an initial acoustic model with first training data of the first target language and second training data of a second target language, and the target acoustic model is configured to output the probability that each frame signal in the first voice signal corresponds to a target phoneme in the first target language;
in a case where the recognition result indicates that the first voice signal contains a mispronounced phoneme, marking, in the target application, the character in the target text corresponding to the mispronounced phoneme.
2. The method according to claim 1, wherein before acquiring, in the target application, the first voice signal of the first target language corresponding to the target text of the first target language, the method further comprises:
acquiring the first training data of the first target language and the second training data of the second target language, wherein the first training data comprises first real training data of the first target language and first simulated training data of the first target language, and the second training data comprises second real training data of the second target language and second simulated training data of the second target language;
training the initial acoustic model with the first training data of the first target language and the second training data of the second target language to obtain the target acoustic model.
3. The method according to claim 2, wherein training the initial acoustic model with the first training data of the first target language and the second training data of the second target language to obtain the target acoustic model comprises:
inputting a first phoneme in the first training data of the first target language into a fully connected layer in the initial acoustic model, and obtaining, as output by the fully connected layer, a first probability that the first phoneme in the first training data is a first target phoneme in the first target language;
inputting a second phoneme in the second training data of the second target language into the fully connected layer, and obtaining, as output by the fully connected layer, a second probability that the second phoneme in the second training data is a second target phoneme in the second target language;
in a case where the first target phoneme is similar to the second target phoneme, the first probability is greater than a first threshold, and the second probability is greater than a second threshold, acquiring a first feature shared by the first phoneme and the second phoneme;
in a case where the similarity between the first feature and a second feature is greater than a third threshold, determining the initial acoustic model as the target acoustic model, wherein the second feature is the feature shared by the first target phoneme and the second target phoneme.
4. The method according to claim 2, wherein acquiring the first training data of the first target language and the second training data of the second target language comprises:
acquiring first real speech information uttered by a first object in the first target language, wherein the first real training data comprises the first real speech information;
acquiring second real speech information uttered by a second object in the first target language, wherein the vocal tract length of the second real speech information is greater than the vocal tract length of the first real speech information;
performing vocal tract conversion on the speech features of the second real speech information using a vocal tract length normalization (VTLN) algorithm to obtain the first simulated training data, wherein the vocal tract length of the speech information in the first simulated training data is equal to the vocal tract length of the first real speech information;
acquiring third real speech information uttered by a third object in the second target language, wherein the second real training data comprises the third real speech information;
acquiring fourth real speech information uttered by a fourth object in the second target language, wherein the vocal tract length of the fourth real speech information is greater than the vocal tract length of the third real speech information;
performing vocal tract conversion on the speech features of the fourth real speech information using the VTLN algorithm to obtain the second simulated training data, wherein the vocal tract length of the speech information in the second simulated training data is equal to the vocal tract length of the third real speech information.
5. The method according to claim 1, wherein acquiring, in the target application, the recognition result obtained by the target recognition model identifying the first voice signal comprises:
performing feature extraction on the first voice signal to obtain frame signal feature information of the first voice signal;
inputting the frame signal feature information into the target acoustic model, and obtaining a posterior probability of each frame signal in the first voice signal output by the target acoustic model, wherein the posterior probability indicates the probability that each frame signal corresponds to the target phoneme in the first target language;
determining, using a goodness of pronunciation (GOP) algorithm and the posterior probability of each frame signal in the first voice signal, whether the phoneme corresponding to each frame signal in the first voice signal deviates from the target phoneme, so as to obtain the recognition result.
6. The method according to claim 5, wherein performing feature extraction on the first voice signal to obtain the frame signal feature information of the first voice signal comprises:
performing signal enhancement on the acquired current voice signal according to a preset algorithm to obtain a first enhanced voice signal;
performing a windowing operation on the first enhanced voice signal to obtain a first windowed voice signal;
performing a fast Fourier transform (FFT) on each frame of the first windowed voice signal to obtain a frequency-domain signal corresponding to the first windowed voice signal;
performing frame-by-frame filtering and extraction on the frequency-domain signal to obtain the frame signal feature information of the first voice signal.
7. The method according to claim 5, wherein after determining, using the GOP algorithm and the posterior probability of each frame signal in the first voice signal, whether the phoneme corresponding to each frame signal in the first voice signal deviates from the target phoneme, so as to obtain the recognition result, the method further comprises:
aligning the phoneme corresponding to each frame signal in the first voice signal in the recognition result with the target phoneme.
8. A voice signal recognition device, comprising:
a first acquisition module, configured to acquire, in a target application, a first voice signal of a first target language corresponding to a target text of the first target language;
a second acquisition module, configured to acquire, in the target application, a recognition result obtained by a target recognition model identifying the first voice signal, wherein a target acoustic model in the target recognition model is a model obtained by training an initial acoustic model with first training data of the first target language and second training data of a second target language, and the target acoustic model is configured to output the probability that each frame signal in the first voice signal corresponds to a target phoneme in the first target language;
a marking module, configured to mark, in the target application, in a case where the recognition result indicates that the first voice signal contains a mispronounced phoneme, the character in the target text corresponding to the mispronounced phoneme.
9. A storage medium comprising a stored program, wherein the program, when run, executes the method according to any one of claims 1 to 7.
10. An electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to execute the method according to any one of claims 1 to 7 by means of the computer program.
CN201910741238.2A 2019-08-12 2019-08-12 Speech signal recognition method and device, storage medium and electronic device Active CN110349567B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910741238.2A CN110349567B (en) 2019-08-12 2019-08-12 Speech signal recognition method and device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN110349567A true CN110349567A (en) 2019-10-18
CN110349567B CN110349567B (en) 2022-09-13

Family

ID=68184687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910741238.2A Active CN110349567B (en) 2019-08-12 2019-08-12 Speech signal recognition method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN110349567B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101785048A (en) * 2007-08-20 2010-07-21 微软公司 HMM-based bilingual (Mandarin-English) TTS techniques
CN104217713A (en) * 2014-07-15 2014-12-17 西北师范大学 Tibetan-Chinese speech synthesis method and device
CN106782603A (en) * 2016-12-22 2017-05-31 上海语知义信息技术有限公司 Intelligent sound evaluating method and system
CN107731228A (en) * 2017-09-20 2018-02-23 百度在线网络技术(北京)有限公司 The text conversion method and device of English voice messaging
CN109036464A (en) * 2018-09-17 2018-12-18 腾讯科技(深圳)有限公司 Pronounce error-detecting method, device, equipment and storage medium
CN109545244A (en) * 2019-01-29 2019-03-29 北京猎户星空科技有限公司 Speech evaluating method, device, electronic equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312219A (en) * 2020-01-16 2020-06-19 上海携程国际旅行社有限公司 Telephone recording marking method, system, storage medium and electronic equipment
CN111312219B (en) * 2020-01-16 2023-11-28 上海携程国际旅行社有限公司 Telephone recording labeling method, system, storage medium and electronic equipment
CN111724769A (en) * 2020-04-22 2020-09-29 深圳市伟文无线通讯技术有限公司 Production method of intelligent household voice recognition model
CN111986653A (en) * 2020-08-06 2020-11-24 杭州海康威视数字技术股份有限公司 Voice intention recognition method, device and equipment

Also Published As

Publication number Publication date
CN110349567B (en) 2022-09-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant