CN110223679A - A voice recognition input device - Google Patents

A voice recognition input device

Info

Publication number
CN110223679A
CN110223679A
Authority
CN
China
Prior art keywords
word
unit
signal frame
signal
input devices
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910517750.9A
Other languages
Chinese (zh)
Inventor
杨阳
王国珍
黄克瑶
陈星海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Institute of Mechatronic Technology
Original Assignee
Nanjing Institute of Mechatronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Institute of Mechatronic Technology filed Critical Nanjing Institute of Mechatronic Technology
Priority to CN201910517750.9A priority Critical patent/CN110223679A/en
Publication of CN110223679A publication Critical patent/CN110223679A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

A voice recognition input device. The present invention receives a voice signal through a microphone and passes it in turn through a pretreatment unit, a temporal segmentation unit, a language identification unit and a sentence abstraction unit, which recognize the received voice signal and obtain its corresponding syntactic structure and feature words, from which the corresponding instruction is confirmed. While interacting with the user, the invention simulates the language environment of oral communication: the user does not need to memorize specific instructions; instead, the voice recognition input device actively recognizes and extracts the instruction contained in the dialogue and outputs the corresponding instruction for control. The invention thus makes various smart devices convenient to use and improves the readability of instructions and the accuracy of recognition.

Description

A voice recognition input device
Technical field
The present invention relates to the technical field of voice recognition, and in particular to a voice recognition input device.
Background technique
Smart devices are now commonly equipped with a voice interaction function: the device is controlled by spoken instructions issued by the user, so that no manual operation of the hardware is needed, which is more convenient.
However, existing voice interaction functions still recognize only specific voice signals. An expression outside the device's instruction list cannot be recognized by the device, so the operation cannot be completed. Existing machine-learning techniques address only articulation features of the user's instruction itself, such as tone, speaking rate, stress and pauses; there is no recognition mechanism aimed at the semantics themselves. Requiring the user to memorize the numerous instructions of each device in this way makes the devices difficult to use.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides a voice recognition input device that can communicate orally with the user and point out pronunciation and grammatical mistakes during the oral communication, providing a good language-learning environment. The invention specifically adopts the following technical scheme.
First, to achieve the above object, a voice recognition input device is proposed, comprising: a microphone, whose rear end is connected with an analog-to-digital converter, for receiving a voice signal, converting it to digital form at a sampling frequency of not less than 6 kHz, and outputting the voice signal as e(t); a pretreatment unit, comprising a filter and a feedback gain adjuster and connected after the output of the analog-to-digital converter, for filtering out the noise in the voice signal e(t) and adjusting the amplitude of e(t) into a preset range, obtaining the useful signal E(t); a temporal segmentation unit, whose input is connected to the output of the pretreatment unit, for segmenting the useful signal E(t) in time order, placing the end point of each segment after the starting point of the following segment, so that each pair of adjacent signal frames among the resulting E_T1(t), E_T2(t), ..., E_TN(t), N ≥ 2, overlaps by at least 1/2; a language identification unit, whose input is connected to the output of the temporal segmentation unit, for performing language identification on each signal frame E_Tn(t), n ∈ [1, N]: it calculates the error amount ER_Tn(t) of each frame relative to the corresponding word in the training database on which the identification is based, outputs for each frame the language recognition result w_Tn(t) given by the word in the training database closest to that frame, and splices the per-frame results w_Tn(t) in the time order of the frames, keeping the parts on which overlapping frames agree, to obtain the input sentence s(m), where m ∈ [1, M], M is the total number of words in the input sentence and m is the index of a word within it; a sentence abstraction unit, whose input is connected to the output of the language identification unit, for performing feature identification on each word of the input sentence s(m), marking feature words and word-order features, and arranging the marked word-order features in sequence as the abstract word-order combination S(m); and an output unit, which splits the abstract word-order combination S(m), performs grammar training on each split part, obtains the instruction type corresponding to S(m), determines the object of the instruction from the feature words in the input sentence s(m), and outputs the instruction according to the instruction type and object.
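For orientation, the data flow through these units can be summarized as a minimal Python sketch; every function name below is an illustrative placeholder for one unit of the device, not an identifier from the patent, and possible bodies for the helpers are sketched in the detailed description:

```python
# Minimal end-to-end skeleton of the described device, under the assumption
# that each unit is available as a helper function (all names illustrative).
def recognize_command(e, fs):
    E = pretreat(e, fs)                    # pretreatment: denoising + amplitude into preset range
    frames = segment_with_overlap(E, fs)   # temporal segmentation, adjacent frames overlap >= 1/2
    s = identify_language(frames)          # per-frame recognition spliced into the sentence s(m)
    S = abstract_word_order(s)             # abstract word-order combination S(m)
    return output_instruction(S, s)        # instruction type + object -> output instruction
```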
Optionally, in the above voice recognition input device, in the pretreatment unit the filter is a sliding-window filter, the feedback gain adjuster is a negative-feedback gain network, and its output is further connected in sequence with a frequency-domain equalizer and a pre-emphasis network.
Optionally, in the above voice recognition input device, the sliding-window filter comprises, connected in sequence: a sliding-window acquisition unit, for reading the data of successive windows of the voice signal e(t) according to a window duration of at least 16 ms and a set step size, the step size not exceeding 1/2 of the window duration; and a window processing unit, which for the data of each window calculates the difference between the voice signal values at the two ends of the sliding window and, if the difference exceeds a preset difference range, replaces the data whose difference exceeds the preset range with the at least two data of smallest difference within the window, outputting the replaced voice signal as the sliding-window filtering result.
Optionally, in the above voice recognition input device, the temporal segmentation unit specifically comprises the following sequentially connected modules: an envelope calculation unit, for calculating the temporal envelope of the useful signal E(t) and taking the average of its amplitude; a pause point detection unit, for searching the useful signal E(t) for pause points where the amplitude stays below the envelope average for longer than a preset time interval; and an overlapped partitioning unit, for splitting the useful signal E(t) at the pause points and extending each of the resulting N signal frames E_T1(t), E_T2(t), ..., E_TN(t), N ≥ 2, so that it overlaps its previous frame by at least 1/2.
Optionally, in the above voice recognition input device, the language identification unit comprises the following sequentially connected modules to perform language identification on each signal frame E_Tn(t), n ∈ [1, N]: a DFFT module, for applying the DFFT to the signal frame E_Tn(t) and converting it into the frequency-domain frame E'_Tn(k), k ∈ [1, K], where K is the number of frequency points after the DFFT; a frequency-domain filtering module, for filtering E'_Tn(k) to obtain the filtered frequency-domain frame, the frequency-domain filter response satisfying f(i+1) = N - 2/f(i), f(i) = 1; a power spectrum calculation module, for calculating the power spectrum P_Tn(k) of the filtered frequency-domain frame; a characteristic coefficient calculation module, for applying a discrete cosine transform to the power spectrum P_Tn(k) to obtain the characteristic coefficients C(t); a neural network calculation module, for performing convolutional-neural-network back-propagation on the characteristic coefficients C(t) and selecting, by comparison against the pre-trained training database, the word with the smallest error amount ER_Tn(t) as the language recognition result w_Tn(t); and a sentence output module, for computing the recognition result w_Tn(t) of each frame E_Tn(t), judging the consistency of the words in the overlapping parts of adjacent frames, and outputting the consistent words in time order to obtain the input sentence s(m).
Optionally, in the above voice recognition input device, the feature identification performed on each word of the input sentence s(m) in the sentence abstraction unit includes identifying the part of speech of each word and arranging the parts of speech in word order to constitute the word-order feature, obtaining the abstract word-order combination S(m).
Optionally, in the above voice recognition input device, in the output unit, splitting the abstract word-order combination S(m) and performing grammar training on each split part comprises the following steps.
First, the qualifiers in the abstract word-order combination S(m) are screened out; the abstract word-order combination after screening is denoted S'(m).
Then, the Hamming distance between S'(m) and the word-order combinations corresponding to the different clause orders in the syntactic model obtained by prior training is calculated; the clause order with the smallest Hamming distance is searched for in the syntactic model; according to that clause order and the feature words, the best-matching instruction type is looked up in the syntactic model; the object of the instruction is determined from the feature words in the input sentence s(m); and the instruction is output according to the instruction type and object.
Optionally, in the above voice recognition input device, the qualifiers include words whose part of speech is adjective, adverb, numeral, article or conjunction.
Beneficial effect
The present invention receives a voice signal through a microphone and then passes it in turn through a pretreatment unit, a temporal segmentation unit, a language identification unit and a sentence abstraction unit, which recognize the received voice signal and obtain its corresponding syntactic structure and feature words, from which the corresponding instruction is confirmed. While interacting with the user, the invention simulates the language environment of oral communication: the user does not need to memorize specific instructions; instead, the voice recognition input device actively recognizes and extracts the instruction contained in the dialogue and outputs the corresponding instruction for control. The invention thus makes various smart devices convenient to use and improves the readability of instructions and the accuracy of recognition.
In particular, filtering by a sliding window in the above process keeps the computational cost low, saves a frequency-domain conversion step and introduces little delay. The segmentation applied to the time-domain useful signal E(t) retains information from neighbouring signal frames during splitting, so that at least two mutually overlapping signals can each be identified individually in the subsequent language identification stage, and a more accurate recognition result is obtained by comparing at least two successive computation results.
Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the description or be understood by implementing the invention.
Detailed description of the invention
The accompanying drawings provide a further understanding of the present invention and constitute part of the specification; together with the embodiments of the invention they serve to explain the invention and are not to be construed as limiting it. In the drawings:
Fig. 1 is a block diagram of the voice recognition input device of the invention.
Specific embodiment
To make the purpose and technical solution of the embodiments of the present invention clearer, the technical solution of the embodiments is described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are a part of the embodiments of the invention rather than all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the described embodiments without creative work fall within the protection scope of the present invention.
Those skilled in the art will appreciate that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by those of ordinary skill in the field to which the invention belongs. It should also be understood that terms such as those defined in common dictionaries should be understood to have a meaning consistent with their meaning in the context of the prior art and, unless defined as here, are not to be interpreted in an idealized or overly formal sense.
Fig. 1 is a block diagram of a voice recognition input device according to the present invention, which comprises:
a microphone 1, e.g. a BS9-MPA416, whose rear end is connected with an analog-to-digital converter, for receiving the voice signal, converting it to digital form at a sampling frequency of not less than 6 kHz, and outputting the voice signal as e(t);
a pretreatment unit 2, comprising a filter and a feedback gain adjuster and connected after the output of the analog-to-digital converter, for filtering out the noise in the voice signal e(t) and adjusting the amplitude of e(t) into a preset range, obtaining the useful signal E(t).
In a more specific mode, the pretreatment applied to the voice signal e(t) in this unit includes sliding-window filtering and may further include frequency-domain equalization, automatic gain adjustment and pre-emphasis. The automatic gain adjustment is realized by the feedback gain adjuster, for example a negative-feedback gain network, whose output is further connected in sequence with a frequency-domain equalizer and a pre-emphasis network. The sliding-window filtering proceeds as follows: first, the duration of the sliding window is set to at least 16 ms; then the voice signal e(t) is traversed with the sliding window in time order, the difference between the voice signal values at the two ends of the window is calculated at each position, and if the difference exceeds a preset difference range, the data whose difference exceeds the preset range are replaced with the at least two data of smallest difference within the window; the replaced voice signal is output as the sliding-window filtering result. During the traversal, the step size of the sliding window does not exceed 1/2 of the window duration. Frequency-domain equalization is realized by an equalizing network that scales the different frequency bands of the signal up or down, which avoids hum and sharp bursts within specific frequency bands. Automatic gain adjustment can be obtained directly through positive/negative feedback in the circuit design, or through a gain calculation procedure of an electronic circuit. The pre-emphasis network can be designed as a high-pass filter, for example a filter with transfer function H(z) = 1 - μz⁻¹.
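As an illustrative sketch of this pretreatment, assuming NumPy: the 16 ms window bound and the half-window step come from the text, while diff_limit, the exact replacement rule and μ = 0.97 are assumptions filled in here:

```python
import numpy as np

def sliding_window_filter(e, fs, win_ms=16.0, diff_limit=0.1):
    """Sliding-window outlier suppression; win_ms >= 16 ms per the text,
    diff_limit is the 'preset difference range' left open by the patent."""
    win = max(2, int(fs * win_ms / 1000.0))
    step = win // 2                      # step must not exceed half the window
    y = e.astype(float).copy()
    for start in range(0, len(y) - win + 1, step):
        seg = y[start:start + win]       # a view: edits write through to y
        # difference between the voice signal at the two ends of the window
        if abs(seg[-1] - seg[0]) > diff_limit:
            # one reading of "replace with the at least two data of smallest
            # difference": overwrite the outliers with the mean of the two
            # samples closest to the window mean
            dev = np.abs(seg - seg.mean())
            keep = np.argsort(dev)[:2]
            seg[dev > diff_limit] = seg[keep].mean()
    return y

def preemphasis(x, mu=0.97):
    """The pre-emphasis network H(z) = 1 - mu * z^(-1) named in the text;
    mu = 0.97 is a conventional choice, not a value from the patent."""
    return np.append(x[0], x[1:] - mu * x[:-1])
```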
The remainder of the block diagram of the voice recognition input device is realized mainly by digital signal processing (DSP). The DSP processor may be chosen from the ADSP21x series, and the temporal segmentation unit 3, the language identification unit 4, the sentence abstraction unit 5 and the output unit 6 are implemented inside it.
The temporal segmentation unit 3, whose input is connected to the output of the pretreatment unit, segments the useful signal E(t) in time order, placing the end point of each segment after the starting point of the following segment, so that each pair of adjacent signal frames among E_T1(t), E_T2(t), ..., E_TN(t), N ≥ 2, overlaps by at least 1/2. Specifically, the temporal segmentation can be carried out in this unit according to the following steps:
Step 301: calculate the temporal envelope of the useful signal E(t) and take the average of the amplitude of the envelope;
Step 302: search the useful signal E(t) for pause points, i.e. points where the amplitude stays below the average for longer than a preset time interval, e.g. 1 ms;
Step 303: split the useful signal E(t) at the pause points, and extend each of the resulting N signal frames E_T1(t), E_T2(t), ..., E_TN(t), N ≥ 2, forwards and/or backwards so that it overlaps its previous frame by at least 1/2.
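One possible reading of steps 301-303 in Python, using SciPy's Hilbert transform as the envelope estimator (the patent does not fix the envelope method) and the 1 ms pause interval from the example above:

```python
import numpy as np
from scipy.signal import hilbert

def segment_with_overlap(E, fs, pause_ms=1.0):
    env = np.abs(hilbert(E))                     # step 301: temporal envelope
    thr = env.mean()                             # average amplitude of the envelope
    min_len = max(1, int(fs * pause_ms / 1000))  # preset interval, e.g. 1 ms
    below = env < thr                            # step 302: sub-average samples
    pauses, run = [], 0
    for i, b in enumerate(below):
        run = run + 1 if b else 0
        if run == min_len:                       # one pause point per long-enough run
            pauses.append(i - min_len // 2)
    # step 303: split at the pause points, then widen every frame so that
    # adjacent frames share at least half of their samples
    bounds = [0] + pauses + [len(E)]
    frames = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        half = (b - a) // 2
        frames.append(E[max(0, a - half):min(len(E), b + half)])
    return frames
```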
Next, the language identification unit 4, whose input is connected to the output of the temporal segmentation unit, performs language identification on each signal frame E_Tn(t), n ∈ [1, N]: it calculates the error amount ER_Tn(t) of each frame relative to the corresponding word in the training database on which the identification is based, outputs for each frame the language recognition result w_Tn(t) given by the word in the training database closest to that frame, and splices the per-frame results w_Tn(t) in the time order of the frames, keeping the parts on which overlapping frames agree, to obtain the input sentence s(m), where m ∈ [1, M], M is the total number of words in the input sentence and m is the index of a word within it.
In a more specific implementation, the language identification performed on each signal frame E_Tn(t), n ∈ [1, N], in the language identification unit 4 comprises the following steps:
Step 401: apply the DFFT to the signal frame E_Tn(t), converting it into the frequency-domain frame E'_Tn(k), k ∈ [1, K], where K is the number of frequency points after the DFFT and can usually be chosen as 1024;
Step 402: apply frequency-domain filtering to E'_Tn(k), obtaining the filtered frequency-domain frame, the frequency-domain filter response satisfying f(i+1) = N - 2/f(i), f(i) = 1;
Step 403: calculate the power spectrum P_Tn(k) of the filtered frequency-domain frame;
Step 404: apply a discrete cosine transform to the power spectrum P_Tn(k) to obtain the characteristic coefficients C(t);
Step 405: perform convolutional-neural-network back-propagation on the characteristic coefficients C(t) to obtain their error amount against each word, and select by comparison, from the pre-trained training database, the word with the smallest error amount ER_Tn(t) as the language recognition result w_Tn(t);
Step 406: carry out the above calculation for each signal frame E_Tn(t) in turn, judge the consistency of the words in the overlapping parts of adjacent frames, and output the consistent words in time order to obtain the input sentence s(m). For inconsistent words, the word with the smallest error amount can be selected by comparing the respective error amounts and inserted at the corresponding position of the input sentence s(m).
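Steps 401-404 mirror a cepstral feature extraction. Below is a sketch under that reading; because the filter response of step 402 survives only as the fragment f(i+1) = N - 2/f(i), f(i) = 1, a standard triangular mel filterbank stands in for it, and the sample rate, filter count and coefficient count are conventional assumptions rather than values from the patent:

```python
import numpy as np
from scipy.fftpack import dct

def feature_coefficients(frame, fs=16000, K=1024, n_filters=26, n_coeffs=13):
    spec = np.fft.rfft(frame, n=K)             # step 401: DFFT, K = 1024
    power = (np.abs(spec) ** 2) / K            # step 403: power spectrum
    # step 402 stand-in: triangular filters spaced on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor((K + 1) * pts / fs).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    energies = np.log(fbank @ power + 1e-10)   # filtered log energies
    # step 404: discrete cosine transform -> characteristic coefficients C(t)
    return dct(energies, norm='ortho')[:n_coeffs]
```

Step 405 then feeds these coefficients to the trained network; any CNN framework can play that role, so it is not sketched here.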
The subsequent sentence abstraction unit 5, whose input is connected to the output of the language identification unit, performs feature identification on each word of the input sentence s(m), marks the feature words and word-order features, and arranges the marked word-order features in sequence as the abstract word-order combination S(m).
Specifically, the feature identification of the input sentence s(m) includes marking the nouns, pronouns, adjectives, adverbs, verbs, numerals, articles, prepositions and conjunctions in it, marking their order, and arranging the marked features in sequence as the abstract word-order combination S(m). The adjectives, adverbs, numerals, articles and conjunctions among them can further be marked as qualifiers.
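As a sketch of this abstraction step, NLTK's part-of-speech tagger can stand in for the unspecified tagger (this assumes the nltk package with its 'punkt' and tagger data installed; the coarse tag set mirrors the classes listed above):

```python
import nltk  # assumes nltk plus its 'punkt' and 'averaged_perceptron_tagger' data

COARSE = {'NN': 'noun', 'PR': 'pronoun', 'JJ': 'adjective', 'RB': 'adverb',
          'VB': 'verb', 'CD': 'number', 'DT': 'article', 'IN': 'preposition',
          'CC': 'conjunction'}

def abstract_word_order(sentence):
    words = nltk.word_tokenize(sentence)
    # keep each word's part of speech in sentence order -> S(m)
    return [COARSE.get(tag[:2], 'other') for _, tag in nltk.pos_tag(words)]

# abstract_word_order("turn on the light") might yield, tagger permitting,
# ['verb', 'preposition', 'article', 'noun']
```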
Finally, the above output unit 6 splits the abstract word-order combination S(m), performs grammar training on each split part, obtains the instruction type corresponding to the abstract word-order combination S(m), determines the object of the instruction from the feature words in the input sentence s(m), and outputs the instruction according to the instruction type and object.
Splitting the abstract word-order combination S(m) and performing grammar training on each split part comprises the following steps:
First, the qualifiers in the abstract word-order combination S(m), including adjectives, adverbs, numerals, articles, conjunctions and the like, are screened out; the abstract word-order combination after screening is denoted S'(m).
Then, the Hamming distance between S'(m) and the word-order combinations corresponding to the different clause orders in the syntactic model obtained by prior training is calculated; the clause order with the smallest Hamming distance is searched for in the syntactic model; according to that clause order and the feature words, the best-matching instruction type is looked up in the syntactic model; the object of the instruction is determined from the feature words in the input sentence s(m); and the instruction is output according to the instruction type and object.
In the calculation of the Hamming distance, the different word-order positions are taken as different coordinate dimensions, and the substitution relations between the different parts of speech in the screened abstract word-order combination S'(m) are taken as the coordinate points. For example, for an abstract word-order combination S'(m) of "pronoun + verb + noun", a Hamming distance to the combination "verb + pronoun" can be expressed in this space. The difference between mutually substitutable parts of speech, e.g. between pronoun and noun, is designed to be smaller, while the difference between parts of speech that cannot generally substitute for each other in word order is designed to be larger.
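A sketch of this modified Hamming distance, with assumed costs (the patent states only that substitutable parts of speech differ less, not by how much):

```python
# Positions are compared one by one; substitutable classes (e.g. pronoun and
# noun) incur a reduced cost. The cost values 0.5 and 1.0 are assumptions.
SUBSTITUTABLE = {frozenset(('pronoun', 'noun'))}

def word_order_distance(a, b, sub_cost=0.5, miss_cost=1.0):
    d = 0.0
    for i in range(max(len(a), len(b))):
        x = a[i] if i < len(a) else None
        y = b[i] if i < len(b) else None
        if x == y:
            continue
        if x and y and frozenset((x, y)) in SUBSTITUTABLE:
            d += sub_cost    # mutually substitutable parts of speech
        else:
            d += miss_cost   # incompatible or missing position
    return d

# word_order_distance(['pronoun', 'verb', 'noun'], ['verb', 'pronoun']) -> 3.0
```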
In a more specific mode of the above device, the feature identification performed on each word of the input sentence s(m) in the sentence abstraction unit includes identifying the part of speech of each word and arranging the parts of speech in word order to constitute the word-order feature, obtaining the abstract word-order combination S(m).
The present invention can thus communicate orally with the user, obtain the instruction requirement contained in the language information during the oral communication, recognize the instruction under different expressions, and output the instruction for the equipment to execute.
The above are only embodiments of the present invention; although the description is specific and detailed, it is not to be understood as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention.

Claims (8)

1. A voice recognition input device, characterized by comprising:
a microphone (1), whose rear end is connected with an analog-to-digital converter, for receiving a voice signal, converting it to digital form at a sampling frequency of not less than 6 kHz, and outputting the voice signal as e(t);
a pretreatment unit (2), comprising a filter and a feedback gain adjuster and connected after the output of the analog-to-digital converter, for filtering out the noise in the voice signal e(t) and adjusting the amplitude of e(t) into a preset range, obtaining the useful signal E(t);
a temporal segmentation unit (3), whose input is connected to the output of the pretreatment unit, for segmenting the useful signal E(t) in time order, placing the end point of each segment after the starting point of the following segment, so that each pair of adjacent signal frames among the resulting E_T1(t), E_T2(t), ..., E_TN(t), N ≥ 2, overlaps by at least 1/2;
a language identification unit (4), whose input is connected to the output of the temporal segmentation unit, for performing language identification on each signal frame E_Tn(t), n ∈ [1, N]: it calculates the error amount ER_Tn(t) of each frame relative to the corresponding word in the training database on which the identification is based, outputs for each frame the language recognition result w_Tn(t) given by the word in the training database closest to that frame, and splices the per-frame results w_Tn(t) in the time order of the frames, keeping the parts on which overlapping frames agree, to obtain the input sentence s(m), where m ∈ [1, M], M is the total number of words in the input sentence and m is the index of a word within it;
a sentence abstraction unit (5), whose input is connected to the output of the language identification unit, for performing feature identification on each word of the input sentence s(m), marking feature words and word-order features, and arranging the marked word-order features in sequence as the abstract word-order combination S(m); and
an output unit (6), which splits the abstract word-order combination S(m), performs grammar training on each split part, obtains the instruction type corresponding to S(m), determines the object of the instruction from the feature words in the input sentence s(m), and outputs the instruction according to the instruction type and object.
2. The voice recognition input device of claim 1, characterized in that in the pretreatment unit (2) the filter is a sliding-window filter, the feedback gain adjuster is a negative-feedback gain network, and its output is further connected in sequence with a frequency-domain equalizer and a pre-emphasis network.
3. The voice recognition input device of claim 2, characterized in that the sliding-window filter comprises, connected in sequence:
a sliding-window acquisition unit, for reading the data of successive windows of the voice signal e(t) according to a window duration of at least 16 ms and a set step size, the step size not exceeding 1/2 of the window duration; and
a window processing unit, which for the data of each window calculates the difference between the voice signal values at the two ends of the sliding window and, if the difference exceeds a preset difference range, replaces the data whose difference exceeds the preset range with the at least two data of smallest difference within the window, outputting the replaced voice signal as the sliding-window filtering result.
4. The voice recognition input device of claim 1, characterized in that the temporal segmentation unit (3) specifically comprises the following sequentially connected modules:
an envelope calculation unit, for calculating the temporal envelope of the useful signal E(t) and taking the average of its amplitude;
a pause point detection unit, for searching the useful signal E(t) for pause points where the amplitude stays below the envelope average for longer than a preset time interval; and
an overlapped partitioning unit, for splitting the useful signal E(t) at the pause points and extending each of the resulting N signal frames E_T1(t), E_T2(t), ..., E_TN(t), N ≥ 2, so that it overlaps its previous frame by at least 1/2.
5. The voice recognition input device of any one of claims 1 to 3, characterized in that the language identification unit (4) comprises the following sequentially connected modules to perform language identification on each signal frame E_Tn(t), n ∈ [1, N]:
a DFFT module, for applying the DFFT to the signal frame E_Tn(t) and converting it into the frequency-domain frame E'_Tn(k), k ∈ [1, K], where K is the number of frequency points after the DFFT;
a frequency-domain filtering module, for filtering E'_Tn(k) to obtain the filtered frequency-domain frame, the frequency-domain filter response satisfying f(i+1) = N - 2/f(i), f(i) = 1;
a power spectrum calculation module, for calculating the power spectrum P_Tn(k) of the filtered frequency-domain frame;
a characteristic coefficient calculation module, for applying a discrete cosine transform to the power spectrum P_Tn(k) to obtain the characteristic coefficients C(t);
a neural network calculation module, for performing convolutional-neural-network back-propagation on the characteristic coefficients C(t) and selecting, by comparison against the pre-trained training database, the word with the smallest error amount ER_Tn(t) as the language recognition result w_Tn(t); and
a sentence output module, for computing the recognition result w_Tn(t) of each frame E_Tn(t), judging the consistency of the words in the overlapping parts of adjacent frames, and outputting the consistent words in time order to obtain the input sentence s(m).
6. The voice recognition input device of any one of claims 1 to 5, characterized in that the feature identification performed on each word of the input sentence s(m) in the sentence abstraction unit (5) includes identifying the part of speech of each word and arranging the parts of speech in word order to constitute the word-order feature, obtaining the abstract word-order combination S(m).
7. The voice recognition input device of claim 6, characterized in that the qualifiers include words whose part of speech is adjective, adverb, numeral, article or conjunction.
8. The voice recognition input device of any one of claims 1 to 6, characterized in that in the output unit (6) splitting the abstract word-order combination S(m) and performing grammar training on each split part comprises the steps of:
first, screening out the qualifiers in the abstract word-order combination S(m), the abstract word-order combination after screening being denoted S'(m);
then, calculating the Hamming distance between S'(m) and the word-order combinations corresponding to the different clause orders in the syntactic model obtained by prior training; searching the syntactic model for the clause order with the smallest Hamming distance; looking up in the syntactic model, according to that clause order and the feature words, the best-matching instruction type; determining the object of the instruction from the feature words in the input sentence s(m); and outputting the instruction according to the instruction type and object.
CN201910517750.9A 2019-06-14 2019-06-14 A voice recognition input device Pending CN110223679A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910517750.9A CN110223679A (en) 2019-06-14 2019-06-14 A voice recognition input device


Publications (1)

Publication Number Publication Date
CN110223679A true CN110223679A (en) 2019-09-10

Family

ID=67817220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910517750.9A Pending CN110223679A (en) 2019-06-14 2019-06-14 A voice recognition input device

Country Status (1)

Country Link
CN (1) CN110223679A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112862096A (en) * 2021-02-04 2021-05-28 百果园技术(新加坡)有限公司 Model training and data processing method, device, equipment and medium


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1325528A (en) * 1998-09-09 2001-12-05 单一声音技术公司 Network interactive user interface using speech recognition and natural language processing
US8706491B2 (en) * 2002-05-20 2014-04-22 Microsoft Corporation Applying a structured language model to information extraction
CN102298928A (en) * 2004-09-27 2011-12-28 罗伯特·博世有限公司 Interactive conversational dialogue for cognitively overloaded device users
CN105930452A (en) * 2016-04-21 2016-09-07 北京紫平方信息技术股份有限公司 Smart answering method capable of identifying natural language
CN106844343A (en) * 2017-01-20 2017-06-13 上海傲硕信息科技有限公司 Instruction results screening plant
CN108108094A (en) * 2017-12-12 2018-06-01 深圳和而泰数据资源与云技术有限公司 A kind of information processing method, terminal and computer-readable medium
CN109273000A (en) * 2018-10-11 2019-01-25 河南工学院 A kind of audio recognition method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李玉鉴: "Machine translation based on a general index-template matching and replacement algorithm" (基于索引模板匹配替换通用算法的机器翻译), 《计算机应用研究》 (Application Research of Computers) *
韩志艳 et al.: "Research on robust feature extraction and visualization of speech signals" (《语音信号鲁棒特征提取及可视化技术研究》), 28 February 2012 *


Similar Documents

Publication Publication Date Title
Bell et al. Reduction of speech spectra by analysis‐by‐synthesis techniques
CN108847215B (en) Method and device for voice synthesis based on user timbre
CN102543073B (en) Shanghai dialect phonetic recognition information processing method
EP1005021B1 (en) Method and apparatus to extract formant-based source-filter data for coding and synthesis employing cost function and inverse filtering
JPH0816187A (en) Speech recognition method in speech analysis
CN109658918A (en) A kind of intelligence Oral English Practice repetition topic methods of marking and system
CN105845139A (en) Off-line speech control method and device
CN110276073A (en) A kind of interactive mode Oral English Practice bearing calibration
CN105845126A (en) Method for automatic English subtitle filling of English audio image data
CN111798846A (en) Voice command word recognition method and device, conference terminal and conference terminal system
CN110223679A (en) A kind of voice recognition input devices
Beckmann et al. Word-level embeddings for cross-task transfer learning in speech processing
Talesara et al. A novel Gaussian filter-based automatic labeling of speech data for TTS system in Gujarati language
CN112700520A (en) Mouth shape expression animation generation method and device based on formants and storage medium
CN1835077B (en) Automatic speech recognizing input method and system for Chinese names
Morikawa et al. System identification of the speech production process based on a state-space representation
CN114333828A (en) Quick voice recognition system for digital product
CN107825433A (en) A kind of card machine people of children speech instruction identification
CN113486208A (en) Voice search equipment based on artificial intelligence and search method thereof
Lane et al. Local word discovery for interactive transcription
CN115312029B (en) Voice translation method and system based on voice depth characterization mapping
Nandyala et al. Real time isolated word recognition using adaptive algorithm
Sriranjani et al. Experiments on front-end techniques and segmentation model for robust Indian Language speech recognizer
Teja et al. A Novel Approach in the Automatic Generation of Regional Language Subtitles for Videos in English
CN112967538B (en) English pronunciation information acquisition system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190910