CN108335693A - Language identification method and language identification device - Google Patents
Language identification method and language identification device
- Publication number
- CN108335693A CN108335693A CN201710035625.5A CN201710035625A CN108335693A CN 108335693 A CN108335693 A CN 108335693A CN 201710035625 A CN201710035625 A CN 201710035625A CN 108335693 A CN108335693 A CN 108335693A
- Authority
- CN
- China
- Prior art keywords
- audio
- target
- video data
- training
- languages
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
Abstract
An embodiment of the invention discloses a language identification method and a language identification device. The method includes: performing feature extraction on target audio/video data used for offline training, to obtain feature data corresponding to the target audio/video data; and iteratively training on the feature data, layer by layer, through N hierarchically ordered long short-term memory (LSTM) network layers included in a training network, to obtain a target training model used for language identification. The method shown in this embodiment can be applied on large data sets; when the target training model performs language identification, the identification accuracy is high and the speed is fast, which satisfies the current speed requirements for language identification.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a language identification method and a language identification device.
Background technology
As international exchange becomes ever closer, the demands on the speed of language identification keep rising in many fields, such as information inquiry services, alarm systems, banking, stock trading, and emergency hotline services. Taking information inquiry services as an example, many information query systems can provide service in multiple languages, but the service in the appropriate language can be provided only after the system has determined the user's language category. Typical examples of such services include travel information, emergency services, and shopping.
Most language identification schemes currently on the market use traditional shallow models such as the Gaussian Mixture Model (GMM) or the Support Vector Machine (SVM).
However, the language identification schemes used in the prior art are impractical on large data sets, their accuracy is low, and their speed is slow, so they cannot meet the current speed requirements for language identification.
Summary of the invention
Embodiments of the present invention provide a language identification method and a language identification device that can be applied to language identification on large data sets with high identification accuracy and high speed.
A first aspect of an embodiment of the present invention provides a language identification method, including:
obtaining target audio/video data for offline training;
performing feature extraction on the target audio/video data to obtain feature data corresponding to the target audio/video data;
iteratively training on the feature data, layer by layer, through N hierarchically ordered long short-term memory (LSTM) network layers included in a training network, to obtain a target training model, the target training model being used for language identification.
A second aspect of an embodiment of the present invention provides a language identification method, including:
obtaining first target audio/video data for online identification;
performing feature extraction on the first target audio/video data to obtain first feature data corresponding to the first target audio/video data;
determining a target training model, the target training model being obtained by training a training network on second target audio/video data, the training network including N hierarchically ordered long short-term memory (LSTM) network layers, N being a positive integer greater than or equal to 2;
obtaining a target score according to the target training model and the first feature data;
determining language identification result information corresponding to the target score, the language identification result information indicating the language to which the first target audio/video data belongs.
A third aspect of an embodiment of the present invention provides a language identification device, including:
a first acquisition unit, configured to obtain target audio/video data for offline training;
a second acquisition unit, configured to perform feature extraction on the target audio/video data to obtain feature data corresponding to the target audio/video data;
a training unit, configured to iteratively train on the feature data, layer by layer, through N hierarchically ordered long short-term memory (LSTM) network layers included in a training network, to obtain a target training model, the target training model being used for language identification.
A fourth aspect of an embodiment of the present invention provides a language identification device, including:
a first acquisition unit, configured to obtain first target audio/video data for online identification;
a first identification unit, configured to perform feature extraction on the first target audio/video data to obtain first feature data corresponding to the first target audio/video data;
a first determination unit, configured to determine a target training model, the target training model being obtained by training a training network on second target audio/video data, the training network including N hierarchically ordered long short-term memory (LSTM) network layers, N being a positive integer greater than or equal to 2;
a second acquisition unit, configured to obtain a target score according to the target training model and the first feature data;
a second determination unit, configured to determine language identification result information corresponding to the target score, the language identification result information indicating the language to which the first target audio/video data belongs.
The present embodiments provide a language identification method and a language identification device. The method shown in this embodiment performs feature extraction on target audio/video data used for offline training to obtain feature data corresponding to the target audio/video data, and iteratively trains on the feature data, layer by layer, through N hierarchically ordered LSTM network layers included in a training network, to obtain a target training model used for language identification. The method shown in this embodiment can be applied on large data sets; when the target training model performs language identification, the identification accuracy is high and the speed is fast, which satisfies the current speed requirements for language identification.
Description of the drawings
Fig. 1 is a schematic structural diagram of an embodiment of a language identification device provided by the present invention;
Fig. 2 is a step flowchart of an embodiment of a language identification method provided by the present invention;
Fig. 3 is a schematic diagram of the loop of a recurrent neural network provided by the present invention;
Fig. 4 is a schematic structural diagram of an LSTM network provided by the present invention;
Fig. 5 is a schematic structural diagram of a training network provided by the present invention;
Fig. 6 is a step flowchart of another embodiment of a language identification method provided by the present invention;
Fig. 7 is a schematic structural diagram of another embodiment of a language identification device provided by the present invention;
Fig. 8 is a schematic structural diagram of another embodiment of a language identification device provided by the present invention.
Detailed description
The language identification method provided by the embodiments of the present invention can be applied to a language identification device having a computing function. For a better understanding of the language identification method provided by the embodiments of the present invention, the physical structure of the language identification device provided by the embodiments of the present invention is first described below with reference to Fig. 1.
It should be clear that the following description of the physical structure of the language identification device is an optional example and is not limiting, as long as the language identification method provided by the embodiments of the present invention can be implemented.
As shown in Fig. 1, which is a schematic structural diagram of a language identification device provided by an embodiment of the present invention, the language identification device 100 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPU) 122 (for example, one or more processors), a memory 132, and one or more storage media 130 (such as one or more mass storage devices) storing application programs 142 or data 144. The memory 132 and the storage medium 130 may provide transient or persistent storage. The program stored in the storage medium 130 may include one or more modules (not marked in the figure), and each module may include a series of instruction operations in the language identification device. Further, the central processing unit 122 may be configured to communicate with the storage medium 130 and to execute, in the language identification device 100, the series of instruction operations stored in the storage medium 130.
The language identification device 100 may also include one or more power supplies 126, one or more wired or wireless network interfaces 150, one or more input/output interfaces 158, and/or one or more operating systems 141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The language identification device 100 shown in Fig. 1 can implement automatic spoken language identification (Language IDentification, LID).
LID refers to the process by which the language identification device 100 automatically identifies the language to which speech belongs.
Automatic language identification technology has very important applications in information retrieval, criminal investigation, and military fields. With the development of Internet technology, language identification will play an increasingly important role; as technology progresses, the barriers to human communication may one day be broken, and language identification will surely play a very important part in that. Some day in the future, people from different countries, of different skin colors, and speaking different languages will be able to communicate freely by technological means, and language identification technology is an important front-end processor in this. Future information query systems can provide multilingual services; for example, in information services, many information query systems can provide service in multiple languages, and after the system determines the user's language category, it provides the service in the corresponding language. Examples of such services include travel information, emergency services, shopping, and banking and stock trading.
Automatic language identification technology can also be used for the front-end processing of multilingual machine translation systems, and in communication systems that directly convert one language into another.
In addition, it can be used in military applications to monitor or identify a speaker's identity and nationality. With the arrival of the information age and the development of the Internet, language identification increasingly shows its application value.
Based on the language identification device shown in Fig. 1, the specific execution steps of the language identification method provided by an embodiment of the present invention are described in detail below with reference to Fig. 2, where Fig. 2 is a step flowchart of an embodiment of the language identification method provided by the present invention.
First, steps 201 to 207 of the embodiment are the specific execution steps of the offline training part:
Step 201: obtain a second audio/video file.
In executing the offline training part, the language identification device may first obtain the second audio/video file used for offline training.
This embodiment does not limit the amount of audio/video data included in the second audio/video file.
Step 202: decode the second audio/video file by a decoder to generate second audio/video data.
The decoder shown in this embodiment may be a decoder of the multimedia video processing tool FFmpeg (Fast Forward MPEG).
It should be clear that this description of the decoder is an optional example and is not limiting, as long as the decoder can decode the second audio/video file to generate second audio/video data on which language identification can be performed.
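As a concrete illustration of this step, decoding can be delegated to the ffmpeg command-line tool named above. The following Python sketch is a minimal example under stated assumptions: the 16 kHz mono WAV output format and the wrapper function are illustrative choices, not requirements of the embodiment.

```python
import subprocess

def decode_to_wav(av_path: str, wav_path: str, sample_rate: int = 16000) -> None:
    """Decode an audio/video file to mono PCM WAV using the ffmpeg CLI."""
    subprocess.run(
        ["ffmpeg", "-y",           # overwrite the output file if it exists
         "-i", av_path,            # input audio/video file
         "-vn",                    # drop the video stream, keep audio only
         "-ac", "1",               # downmix to a single channel
         "-ar", str(sample_rate),  # resample to a fixed rate
         wav_path],
        check=True)                # raise if ffmpeg reports an error
```

Any decoder with equivalent behavior could be substituted, consistent with the note above.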
Step 203: filter the second audio/video data to generate second target audio/video data.
To reduce the time spent in the offline training part, improve the efficiency of language identification, and improve the accuracy of language identification, the language identification device shown in this embodiment may filter the second audio/video data.
Specifically, the language identification device shown in this embodiment performs detection by VAD (Voice Activity Detection) to filter out the invalid silent segments in the second audio/video data, so as to generate the second target audio/video data.
It can be seen that the data included in the second target audio/video data obtained in step 203 are all valid data, which prevents the language identification device from wasting processing time and system resources on useless data.
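By way of illustration, the sketch below drops silent frames with a simple energy threshold. This is only a minimal stand-in for the VAD named in the text, assuming samples normalized to [-1, 1]; the frame length and threshold are illustrative, and a production VAD would use a more robust decision rule.

```python
import numpy as np

def drop_silence(samples: np.ndarray, sample_rate: int = 16000,
                 frame_ms: int = 30, threshold: float = 1e-4) -> np.ndarray:
    """Keep only frames whose mean energy exceeds a fixed threshold."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)            # per-frame mean energy
    return frames[energy > threshold].reshape(-1)  # concatenate voiced frames
```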
Step 204: perform feature extraction on the second target audio/video data to obtain second feature data.
Specifically, in this embodiment, the language identification device may perform feature extraction on the second target audio/video data to obtain second feature data corresponding to the second target audio/video data.
The feature extraction method used on the second target audio/video data in this embodiment may be the spectral envelope method, the cepstrum method, LPC interpolation, LPC root-finding, the Hilbert transform method, a formant tracking algorithm, or the like.
This embodiment does not limit the feature extraction method, as long as the second feature data of the second target audio data can be extracted.
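As one concrete instance of the cepstral option listed above, the following sketch extracts MFCC features with librosa; the 16 kHz sampling rate and 20 coefficients are assumptions made for illustration, and any of the listed methods could be used instead.

```python
import librosa
import numpy as np

def extract_features(wav_path: str, n_mfcc: int = 20) -> np.ndarray:
    """Return a (frames, n_mfcc) matrix of cepstral features for one file."""
    samples, sr = librosa.load(wav_path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # one feature vector per frame, in time order
```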
Step 205: set a target classification label in the second feature data.
The target classification label is a label indicating the language of the target audio data.
The target classification label shown in this embodiment is a label corresponding to the second feature data.
By setting the target classification label in the second feature data, this embodiment classifies the second feature data according to their different languages.
The classification in this embodiment, simply put, divides the second feature data into existing classes according to their language features or attributes.
For example, in natural language processing (NLP), text classification is a classification problem, and general pattern classification methods can all be applied to the study of text classification.
Common classification algorithms include: decision tree classification, the naive Bayesian classifier, support vector machines (SVM), neural network classifiers, the k-nearest neighbor method (kNN), fuzzy classification methods, and so on.
For example, taking a Tibetan identification scenario as an illustration, the target classification label of Tibetan may be predefined as 1, so that the target classification label distinguishes Tibetan from other languages; the label set in the second feature data in step 205 of this embodiment is then the target classification label set to 1.
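A minimal sketch of this labelling step follows, mirroring the Tibetan example above; the dictionary and function names are hypothetical.

```python
# Hypothetical label map mirroring the example: Tibetan is predefined as 1.
LABELS = {"tibetan": 1, "other": 0}

def make_training_pairs(feature_seqs, languages):
    """Attach the target classification label to every feature sequence."""
    return [(seq, LABELS[lang]) for seq, lang in zip(feature_seqs, languages)]
```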
Step 206: input the second feature data in which the target classification label has been set into the training network.
Step 207: iteratively train, by the training network, on the second feature data in which the target classification label has been set, to obtain the target training model.
Specifically, in this embodiment, the N hierarchically ordered long short-term memory (LSTM) network layers included in the training network iteratively train, layer by layer, on the second feature data in which the target classification label has been set, to obtain the target training model.
More specifically, because in step 206 of this embodiment the language identification device sends the second feature data carrying the target classification label to the training network, the training network can iteratively train on those second feature data, layer by layer, through its N LSTM layers, to obtain the target training model.
The N hierarchically ordered LSTM network layers included in the training network of this embodiment are explained below:
Humans do not start thinking about a problem from scratch at every moment; humans understand each word based on the words before it, rather than discarding everything and understanding from the beginning. Human thinking has persistence. Traditional neural networks cannot achieve this, which is one of their main shortcomings. For example, when classifying what is happening at each point in time in a film, a traditional neural network cannot apply its reasoning about earlier events to later events.
Recurrent neural networks (Recurrent Neural Networks, RNN) solve this problem. They are networks with loops and have the ability to retain information. An RNN can be seen as multiple copies of the same neural network, each neural network module passing information to the next.
The loop of a recurrent neural network (RNN) is illustrated below with reference to Fig. 3. The neural network 301 in Fig. 3 is a schematic diagram of the neural network before the loop is unrolled, and the neural network 302 in Fig. 3 is a schematic diagram of the neural network after the loop is unrolled.
It can be seen that the unrolled neural network 302 includes multiple sequentially connected neural network modules A.
Specifically, in neural network 301 and neural network 302, the input of a neural network module A is x_t and its output is h_t.
In neural network 302, the unrolled loop structure of each neural network module A passes information from one step of the network to the next. A recurrent neural network can be considered multiple replicas of the same network, each network passing its message to its successor.
An RNN can learn to use past information by connecting earlier information to the present task; for example, in video, the information of the previous frame can be used to understand the information of the current frame of video data.
Imagine a language model that tries to predict the next word based on the current words. If we try to predict the last word of "the clouds are in the sky", we obviously do not need any additional information: the next word is clearly "sky". In this case, the gap between the point where the prediction is made and the point holding the relevant information is small, and the earlier context can be forgotten.
But sometimes we need more context. Imagine predicting the last word of the sentence: "I grew up in France, I speak fluent French." The most recent information suggests that the next word is probably the name of a language, but to narrow down which language, we need the earlier context of "France". The gap between the point of prediction and the point holding the relevant information can become very large, and at that point we need to remember and rely on the contextual information.
That is, depending on the specific situation, the contextual information sometimes needs to be forgotten and sometimes needs to be remembered. Traditional RNN methods cannot solve the long-term dependency problem, where long-term dependency in this embodiment refers to remembering over the long term and relying on contextual information. But the LSTM shown in this embodiment can solve the problem of relying on contextual information over the long term.
The concrete structure of the LSTM network is explained below with reference to Fig. 4:
An LSTM network is a special kind of RNN that is capable of learning long-term dependencies. LSTM is specifically designed to avoid the long-term dependency problem: remembering long-term information is the default behavior of an LSTM, not something it struggles to learn. LSTM models temporal dependencies strongly and can make good use of contextual relationships; for tasks involving sequential input, such as speech and language, LSTM networks can achieve better results.
The LSTM network has a forget gate 401: when the contextual relationship needs to be remembered over the long term, the network chooses to depend on it, and when the contextual relationship needs to be forgotten, the network chooses to forget it. In this way, the long-term dependency problem can be well solved.
Specifically, the LSTM network is set up as a block structure with three memory gates: an input gate 402, an output gate 403, and a forget gate 401.
The input gate 402 can filter the input, which is then stored in the memory cell 404, so that the memory cell 404 retains the state of the previous moment and adds the state of the current moment to it.
The cooperation of the three gates allows the LSTM network to store long-term information; for example, as long as the input gate 402 remains closed, the information stored in the memory cell 404 will not be overwritten by later inputs.
In an LSTM network, when the error is backpropagated from the output layer, it can be captured by the memory cell 404. LSTM can therefore remember information over a comparatively long time.
More specifically, the input gate 402 controls the input information; the gate's input is the output of the hidden node at the previous time point together with the current input, and multiplying the output of the input gate 402 by the output of the input node controls the amount of information passed.
The forget gate 401 controls the internal state information; the gate's input is the output of the hidden node at the previous time point together with the current input.
The output gate 403 controls the output information; the gate's input is the output of the hidden node at the previous time point together with the current input. Because the output of the activation function sigmoid lies between 0 and 1, multiplying the output of the output gate 403 by the output of the internal state node controls the amount of information passed.
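To restate the gate arithmetic above compactly, here is a minimal single-step LSTM cell sketch in numpy. The packed weight layout and shapes are assumptions; the patent describes the gates but does not prescribe an implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step; W maps [x; h_prev] to the four gate pre-activations."""
    z = W @ np.concatenate([x, h_prev]) + b
    f, i, o, g = np.split(z, 4)
    f = sigmoid(f)                   # forget gate: how much old state to keep
    i = sigmoid(i)                   # input gate: how much new input to store
    o = sigmoid(o)                   # output gate: how much state to expose
    c = f * c_prev + i * np.tanh(g)  # memory cell mixes old state and new input
    h = o * np.tanh(c)               # hidden output, scaled by the output gate
    return h, c
```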
The concrete structure of the training network shown in this embodiment is described in detail below with reference to Fig. 5:
As shown in Fig. 5, the training network of this embodiment includes N hierarchically ordered long short-term memory (LSTM) network layers. This embodiment does not limit the specific value of N, as long as N is a positive integer greater than or equal to 2.
This embodiment gives an optional, exemplary explanation with N equal to 2, i.e., with the training network including two LSTM layers.
Specifically, in the two LSTM layers, the output of the preceding layer serves as the input of the following layer; it can be seen that data can cycle between the multiple LSTM layers.
Compared with a single-layer LSTM, the double-layer LSTM shown in this embodiment has more optimized performance and can make more efficient use of the parameters of the LSTM.
Because the training network shown in this embodiment includes multiple LSTM layers, the lower-layer LSTM can correct the iterative parameters input by the upper-layer LSTM; it can be seen that using multiple LSTM layers can effectively improve the accuracy of language identification.
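A sketch of such a two-layer stacked LSTM (N = 2, as in the example above) follows, written in PyTorch with a sigmoid score head. The hidden size, feature dimension, and the use of the last frame's output are illustrative assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class LanguageScorer(nn.Module):
    """Two stacked LSTM layers; the lower layer feeds the upper one."""
    def __init__(self, n_features: int = 20, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # one score in (0, 1) after sigmoid

    def forward(self, x):  # x: (batch, frames, n_features)
        out, _ = self.lstm(x)
        return torch.sigmoid(self.head(out[:, -1]))  # score from the last frame
```

During training, such a score output would be fit to the target classification labels, for example with a binary cross-entropy loss.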
Optionally, in this embodiment, the second feature data may be iterated in the training network for M rounds, and the training model generated by each round of iteration may be set as a candidate training model.
The language identification device shown in this embodiment can select the target training model from the M rounds of candidate training models.
This embodiment does not limit the specific manner of determining the target training model; for example, the language identification device shown in this embodiment may select the target training model from the M rounds of candidate training models according to coverage rate, false-kill rate, average recognition speed, accuracy, and the like.
Steps 208 to 214 below are the specific execution steps of the online identification part of the language identification method shown in the embodiment of the present invention:
Step 208: obtain a first audio/video file.
In this embodiment, the first audio/video file on which language identification needs to be performed can be input into the language identification device shown in this embodiment.
For example, the first audio/video file shown in this embodiment may include 4654 videos, and the 4654 videos are input into the language identification device.
Step 209: decode the first audio/video file by a decoder to generate first audio/video data.
The decoder shown in this embodiment may be a decoder of the multimedia video processing tool FFmpeg (Fast Forward MPEG).
It should be clear that this description of the decoder is an optional example and is not limiting, as long as the decoder can decode the first audio/video file to generate first audio/video data on which language identification can be performed.
Step 210: filter the first audio/video data to generate first target audio/video data.
To reduce the time spent in the online identification part, improve the efficiency of language identification, and improve the accuracy of language identification, the language identification device shown in this embodiment may filter the first audio/video data.
Specifically, the language identification device shown in this embodiment performs detection by VAD (Voice Activity Detection) to filter out the invalid silent segments in the first audio/video data, so as to generate the first target audio/video data.
It can be seen that the data included in the first target audio/video data obtained in step 210 are all valid data, which prevents the language identification device from wasting processing time and system resources on useless data.
Step 211: perform feature extraction on the first target audio/video data to obtain first feature data.
Specifically, in this embodiment, the language identification device may perform feature extraction on the first target audio/video data to obtain first feature data corresponding to the first target audio/video data.
The feature extraction method used on the first target audio/video data in this embodiment may be the spectral envelope method, the cepstrum method, LPC interpolation, LPC root-finding, the Hilbert transform method, a formant tracking algorithm, or the like.
This embodiment does not limit the feature extraction method, as long as the first feature data of the first target audio data can be extracted.
Step 212: determine the target training model.
In executing step 212, the language identification device shown in this embodiment first needs to obtain the target training model obtained in step 207.
Step 213: obtain a target score according to the target training model and the first feature data.
The language identification device shown in this embodiment can perform the corresponding calculation according to the obtained target training model and the first feature data to obtain the target score.
Specifically, the language identification device shown in this embodiment can compute the target score from the parameters of the target training model and the first feature data.
Step 214: determine language identification result information corresponding to the target score.
Specifically, the language identification result information shown in this embodiment indicates the language to which the first target audio/video data belongs.
More specifically, the language identification device shown in this embodiment is preconfigured with correspondences between different score ranges and different languages. In executing step 214 of this embodiment, the language identification device can first determine the target score range to which the target score belongs, and can then determine the language identification result information corresponding to that target score range.
For example, taking the case where the language corresponding to the first feature data in this embodiment is Tibetan, the language identification device shown in this embodiment may prestore a score range corresponding to Tibetan, such as between 0 and 1. When the target score identified by the language identification device falls within this score range, the language identification device can recognize that the file corresponding to the first feature data is a Tibetan audio/video file. For example, if the language identification device determines the target score to be 0.999, it can recognize that the target score 0.999 lies within the score range between 0 and 1, and can therefore recognize that the file corresponding to the first feature data is a Tibetan audio/video file.
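The range lookup described above can be sketched as follows; the concrete boundaries are hypothetical, since the patent defines a mapping from score ranges to languages but fixes no thresholds beyond the 0-to-1 example.

```python
# Hypothetical score ranges; the patent prestores range-to-language
# correspondences but does not give concrete boundaries.
SCORE_RANGES = [
    ((0.5, 1.0), "Tibetan"),
    ((0.0, 0.5), "other"),
]

def identify(score: float) -> str:
    """Map a model score to the language whose preset range contains it."""
    for (lo, hi), language in SCORE_RANGES:
        if lo <= score <= hi:
            return language
    return "unknown"
```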
The advantage of the method shown in this embodiment is that the language identification device does not need to analyze the content of the audio/video file; it only needs to build a target training model, trained on audio/video files, that can identify the language to which an audio/video file belongs. Because the target training model is obtained by training a training network, which includes N hierarchically ordered LSTM network layers, on the second target audio/video data, the language identification process is highly efficient and fast, and its accuracy and coverage are far better than traditional shallow model methods and common DNN networks; the language to which an audio/video file belongs can be determined quickly and accurately.
To better illustrate the advantage of the method shown in this embodiment, the method is tested as follows:
In this test, the first audio/video files include 79 Tibetan videos and 9604 non-Tibetan videos, where the maximum length of each video is 180 seconds.
In determining the target training model in the offline training part, the target training model may be the training model of the 4600th iteration round of the training network.
When the first audio/video files are identified according to the target training model to output the language identification result information, this test obtains: coverage rate = 67/79 = 84.8%, false-kill rate = 1/9604 ≈ 0.01%, an average recognition speed of 1.6 s per Tibetan video, and an average recognition speed of 3.4 s per normal video.
For another example, in another test, the first audio/video files include 100 Uyghur videos and 9608 non-Uyghur videos, where the maximum length of each video is 180 seconds.
In determining the target training model in the offline training part, the target training model may be the training model of the 3400th iteration round of the training network.
When the first audio/video files are identified according to the target training model to output the language identification result information, this test obtains: coverage rate = 30/100 = 30.0%, false-kill rate = 10/9608 ≈ 0.1%, an average recognition speed of 1.66 s per Uyghur video, and an average recognition speed of 3.51 s per normal video.
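The figures above reduce to simple ratios; a minimal sketch of their computation follows, with function and variable names chosen here for illustration.

```python
def coverage_rate(detected: int, total_target: int) -> float:
    """Share of target-language videos correctly flagged, e.g. 67/79."""
    return detected / total_target

def false_kill_rate(false_hits: int, total_other: int) -> float:
    """Share of non-target videos wrongly flagged, e.g. 1/9604."""
    return false_hits / total_other

print(f"{coverage_rate(67, 79):.1%}")     # 84.8%
print(f"{false_kill_rate(1, 9604):.2%}")  # 0.01%
```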
For a better understanding of the method shown in the embodiment of the present invention, application scenarios to which the method can be applied are described below:
It should be clear that the following descriptions of the scenarios to which the method of the embodiment of the present invention is applied are optional examples and are not limiting.
Scenario one: the field of speech recognition
With the arrival of the mobile Internet era, voice assistants similar to Siri have become popular, and users need to download a voice assistant in a different language according to their own language. The market also offers various speech-to-text tools, which must be chosen according to the language spoken, which is inconvenient. Using the language identification method shown in this embodiment, the voice assistant of the corresponding language can be quickly located according to the language spoken by the user, which is convenient and efficient.
Scenario two: information services in banks and stock exchanges
In places such as banks and stock exchanges, when a customer from an ethnic minority who cannot speak Mandarin is encountered, it is difficult to handle the related business, and a staff member who specifically understands the minority language must be found to receive the customer. Until then, the language spoken by the customer cannot be determined, which wastes a lot of time. Using the language identification method shown in this embodiment, Tibetan and Uyghur audio and the content spoken by the user can be identified quickly; the machine is taught to recognize the speech of ethnic minority compatriots and quickly identify the corresponding language category, so that the relevant staff can be found to receive the customer.
Scenario three: emergency hotline services
When handling emergency services such as 120 distress calls and 110 alarm calls from ethnic minority compatriots, time is short; if the caller's language cannot be confirmed, precious emergency time is lost and the caller's life is endangered. With the language identification method shown in this embodiment, the corresponding language category is quickly identified from the audio spoken by the user, a staff member who understands that language is found to take the record, and precious time is saved, saving lives.
Scenario four: identifying violent and terrorist videos
With the development of the mobile Internet, many people like to post videos on social software such as WeChat and Qzone, and hundreds of millions of videos are uploaded every day. Among them are a large number of malicious videos involving politics and violent terrorism; videos of the kind promoting Tibetan or Xinjiang independence are even more high-risk malicious videos. Such videos are not very numerous, and the daily review volume of the review staff is fixed, so such videos are not always found effectively, and a lot of time is wasted. Using the language identification method shown in this embodiment, suspected political and violent-terrorist videos in massive video collections can be quickly located; for example, videos whose language is Tibetan or Uyghur can be handed to the review staff for auditing, improving work efficiency and accurately removing malicious videos.
Scenario five: monitoring criminal suspects
When the army and police monitor criminal suspects, they need to determine the suspects' identity, nationality, and speech content, which requires a large amount of manpower and material resources and leads to low efficiency. Using the language identification method of this embodiment, the language information of the monitored person can be accurately determined, so as to judge information such as their identity, nationality, and ethnicity.
The language identification device shown in this embodiment can be used to execute the language identification method shown in Fig. 2, and can also execute the language identification method shown in Fig. 6 of this embodiment. In Fig. 6, the language identification device only needs to execute the offline training part of the language identification method.
Step 601: obtain an audio/video file.
Step 602: decode the audio/video file by a decoder to generate audio/video data.
Step 603: filter the audio/video data to generate target audio/video data.
Step 604: perform feature extraction on the target audio/video data to obtain feature data.
Step 605: set a target classification label in the feature data.
Step 606: input the feature data in which the target classification label has been set into the training network.
Step 607: iteratively train, by the training network, on the feature data to obtain the target training model.
For the description of the audio/video file shown in this embodiment, refer to the description of the second audio/video file shown in Fig. 2; for the description of the target audio/video data shown in this embodiment, refer to the description of the second target audio/video data shown in Fig. 2; for the description of the feature data shown in this embodiment, refer to the description of the second feature data shown in Fig. 2. Details are not repeated in this embodiment.
For the process shown in steps 601 to 607 of this embodiment, refer to steps 201 to 207 shown in Fig. 2; details are not repeated in this embodiment.
The concrete structure of the language identification device shown in this embodiment is explained below from the perspective of functional modules with reference to Fig. 7:
The language identification device includes:
a third acquisition unit 701, configured to obtain second target audio/video data;
specifically, the third acquisition unit 701 includes:
a second acquisition module 7011, configured to obtain a second audio/video file for offline training;
a second decoding module 7012, configured to decode the second audio/video file by a decoder to generate second audio/video data;
a second filtering module 7013, configured to filter out, by voice activity detection (VAD), the invalid silent segments in the second audio/video data to generate the second target audio/video data;
a second identification unit 702, configured to perform feature extraction on the second target audio/video data to obtain second feature data corresponding to the second target audio/video data;
a setting unit 703, configured to set a target classification label in the second feature data, the target classification label being a label indicating the language of the target audio data;
a training unit 704, configured to iteratively train, layer by layer, through the N LSTM network layers included in the training network, on the second feature data, to obtain the target training model;
the training unit 704 is further configured to iteratively train, layer by layer, through the N LSTM network layers included in the training network, on the second feature data in which the target classification label has been set, to obtain the target training model;
a first acquisition unit 705, configured to obtain first target audio/video data for online identification;
specifically, the first acquisition unit 705 includes:
a first acquisition module 7051, configured to obtain a first audio/video file for online identification;
a first decoding module 7052, configured to decode the first audio/video file by a decoder to generate first audio/video data;
a first filtering module 7053, configured to filter out, by voice activity detection (VAD), the invalid silent segments in the first audio/video data to generate the first target audio/video data;
a first identification unit 706, configured to perform feature extraction on the first target audio/video data to obtain first feature data corresponding to the first target audio/video data;
a first determination unit 707, configured to determine a target training model, the target training model being obtained by training a training network on second target audio/video data, the training network including N hierarchically ordered LSTM network layers, N being a positive integer greater than or equal to 2;
a second acquisition unit 708, configured to obtain a target score according to the target training model and the first feature data;
a second determination unit 709, configured to determine language identification result information corresponding to the target score, the language identification result information indicating the language to which the first target audio/video data belongs.
For the detailed process by which the language identification device shown in this embodiment executes the language identification method, refer to Fig. 2; details are not repeated in this embodiment.
For the beneficial effects of the language identification device shown in this embodiment in executing the language identification method, refer to the embodiment shown in Fig. 2; details are not repeated in this embodiment.
The concrete structure of the language identification device shown in this embodiment is explained below from the perspective of functional modules with reference to Fig. 8; the language identification device shown in Fig. 8 can implement the offline training part of the language identification method.
Specifically, the language identification device includes:
a first acquisition unit 801, configured to obtain target audio/video data for offline training;
specifically, the acquisition unit 801 includes:
an acquisition module 8011, configured to obtain an audio/video file for offline training;
a decoding module 8012, configured to decode the audio/video file by a decoder to generate audio/video data;
a filtering module 8013, configured to filter out, by voice activity detection (VAD), the invalid silent segments in the audio/video data to generate the target audio/video data;
a second acquisition unit 802, configured to perform feature extraction on the target audio/video data to obtain feature data corresponding to the target audio/video data;
a setting unit 803, configured to set a target classification label in the feature data, the target classification label being a label indicating the language of the target audio data;
a training unit 804, configured to iteratively train, layer by layer, through N hierarchically ordered LSTM network layers included in a training network, on the feature data, to obtain a target training model, the target training model being used for language identification;
the training unit 804 is further configured to iteratively train, layer by layer, through the N LSTM network layers included in the training network, on the feature data in which the target classification label has been set, to obtain the target training model.
For the detailed process by which the language identification device shown in this embodiment executes the language identification method, refer to Fig. 6; details are not repeated in this embodiment.
For the beneficial effects of the language identification device shown in this embodiment in executing the language identification method, refer to the embodiment shown in Fig. 6; details are not repeated in this embodiment.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely exemplary. For example, the division into units is merely a division by logical function, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a language identification device, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.
The above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features therein, and that these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (16)
1. A language identification method, characterized by comprising:
obtaining target audio/video data for offline training;
performing feature extraction on the target audio/video data to obtain feature data corresponding to the target audio/video data;
iteratively training on the feature data, layer by layer, through N hierarchically ordered long short-term memory (LSTM) network layers included in a training network, to obtain a target training model, the target training model being used for language identification.
2. The method according to claim 1, characterized in that after performing feature extraction on the target audio/video data, the method further comprises:
setting a target classification label in the feature data, the target classification label being a label indicating the language of the target audio data;
wherein iteratively training on the feature data, layer by layer, through the N hierarchically ordered LSTM network layers included in the training network comprises:
iteratively training, layer by layer, through the N LSTM network layers included in the training network, on the feature data in which the target classification label has been set, to obtain the target training model.
3. The method according to claim 1 or 2, characterized in that obtaining the target audio/video data comprises:
obtaining an audio/video file for offline training;
decoding the audio/video file by a decoder to generate audio/video data;
filtering out, by voice activity detection (VAD), the invalid silent segments in the audio/video data to generate the target audio/video data.
4. A language identification method, characterized in that it comprises:
obtaining first target audio/video data for online identification;
performing feature extraction on the first target audio/video data to obtain first feature data corresponding to the first target audio/video data;
determining a target training model, the target training model being obtained by training second target audio/video data with a training network, the training network comprising N long short-term memory (LSTM) networks ordered by level, N being a positive integer greater than or equal to 2;
obtaining a target score according to the target training model and the first feature data; and
determining language identification result information corresponding to the target score, the language identification result information being used to indicate the language to which the first target audio/video data belongs.
5. The method according to claim 4, characterized in that obtaining the first target audio/video data for online identification comprises:
obtaining a first audio/video file for online identification;
decoding the first audio/video file by a decoder to generate first audio/video data; and
filtering invalid silent segments out of the first audio/video data by voice activity detection (VAD) to generate the first target audio/video data.
6. The method according to claim 4, characterized in that, before the first target audio/video data for online identification is obtained, the method further comprises:
obtaining the second target audio/video data;
performing feature extraction on the second target audio/video data to obtain second feature data corresponding to the second target audio/video data; and
iteratively training the second feature data, layer by layer, through the N LSTM networks included in the training network, to obtain the target training model.
7. The method according to claim 6, characterized in that, after the feature extraction is performed on the second target audio/video data, the method further comprises:
setting a target class label in the second feature data, the target class label being a label used to indicate the language of the target audio data;
wherein iteratively training the second feature data, layer by layer, through the N LSTM networks included in the training network comprises:
iteratively training, layer by layer through the N LSTM networks included in the training network, the second feature data provided with the target class label, to obtain the target training model.
8. The method according to claim 6 or 7, characterized in that obtaining the second target audio/video data comprises:
obtaining a second audio/video file for offline training;
decoding the second audio/video file by a decoder to generate second audio/video data; and
filtering invalid silent segments out of the second audio/video data by voice activity detection (VAD) to generate the second target audio/video data.
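Claims 4 to 8 describe the online identification path. The sketch below reuses the helpers and model from the sketches above and assumes MFCC features extracted with librosa and a hypothetical label set; the claims fix neither the feature type nor the languages.

```python
import numpy as np
import librosa
import torch

LANGUAGES = ["Mandarin", "English", "French", "Spanish"]  # hypothetical label set

def identify_language(model, path, sample_rate=16000):
    """Decode, VAD-filter, extract features, then score with the trained model."""
    pcm = filter_silence(decode_to_pcm(path, sample_rate), sample_rate)
    audio = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
    feats = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=39).T  # (frames, 39)
    with torch.no_grad():
        scores = model(torch.from_numpy(feats).unsqueeze(0))[0]  # target scores
    # The highest-scoring index stands in for the "language identification
    # result information" mapped from the target score.
    return LANGUAGES[int(scores.argmax())]
```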
9. A language identification device, characterized in that it comprises:
a first acquiring unit, configured to obtain target audio/video data for offline training;
a second acquiring unit, configured to perform feature extraction on the target audio/video data to obtain feature data corresponding to the target audio/video data; and
a training unit, configured to iteratively train the feature data, layer by layer, through N long short-term memory (LSTM) networks ordered by level and included in a training network, to obtain a target training model, the target training model being used for language identification.
10. The language identification device according to claim 9, characterized in that it further comprises:
a setting unit, configured to set a target class label in the feature data, the target class label being a label used to indicate the language of the target audio data;
wherein the training unit is further configured to iteratively train, layer by layer through the N LSTM networks included in the training network, the feature data provided with the target class label, to obtain the target training model.
11. The language identification device according to claim 9 or 10, characterized in that the first acquiring unit comprises:
an acquiring module, configured to obtain an audio/video file for offline training;
a decoding module, configured to decode the audio/video file by a decoder to generate audio/video data; and
a filtering module, configured to filter invalid silent segments out of the audio/video data by voice activity detection (VAD) to generate the target audio/video data.
12. A language identification device, characterized in that it comprises:
a first acquiring unit, configured to obtain first target audio/video data for online identification;
a first recognition unit, configured to perform feature extraction on the first target audio/video data to obtain first feature data corresponding to the first target audio/video data;
a first determining unit, configured to determine a target training model, the target training model being obtained by training second target audio/video data with a training network, the training network comprising N long short-term memory (LSTM) networks ordered by level, N being a positive integer greater than or equal to 2;
a second acquiring unit, configured to obtain a target score according to the target training model and the first feature data; and
a second determining unit, configured to determine language identification result information corresponding to the target score, the language identification result information being used to indicate the language to which the first target audio/video data belongs.
13. The language identification device according to claim 12, characterized in that the first acquiring unit comprises:
a first acquiring module, configured to obtain a first audio/video file for online identification;
a first decoding module, configured to decode the first audio/video file by a decoder to generate first audio/video data; and
a first filtering module, configured to filter invalid silent segments out of the first audio/video data by voice activity detection (VAD) to generate the first target audio/video data.
14. The language identification device according to claim 12, characterized in that it further comprises:
a third acquiring unit, configured to obtain the second target audio/video data;
a second recognition unit, configured to perform feature extraction on the second target audio/video data to obtain second feature data corresponding to the second target audio/video data; and
a training unit, configured to iteratively train the second feature data, layer by layer, through the N LSTM networks included in the training network, to obtain the target training model.
15. The language identification device according to claim 14, characterized in that it further comprises:
a setting unit, configured to set a target class label in the second feature data, the target class label being a label used to indicate the language of the target audio data;
wherein the training unit is further configured to iteratively train, layer by layer through the N LSTM networks included in the training network, the second feature data provided with the target class label, to obtain the target training model.
16. The language identification device according to claim 14 or 15, characterized in that the third acquiring unit comprises:
a second acquiring module, configured to obtain a second audio/video file for offline training;
a second decoding module, configured to decode the second audio/video file by a decoder to generate second audio/video data; and
a second filtering module, configured to filter invalid silent segments out of the second audio/video data by voice activity detection (VAD) to generate the second target audio/video data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710035625.5A CN108335693B (en) | 2017-01-17 | 2017-01-17 | Language identification method and language identification equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108335693A (en) | 2018-07-27 |
CN108335693B (en) | 2022-02-25 |
Family
ID=62921583
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710035625.5A Expired - Fee Related CN108335693B (en) | 2017-01-17 | 2017-01-17 | Language identification method and language identification equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108335693B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104427292A (en) * | 2013-08-22 | 2015-03-18 | 中兴通讯股份有限公司 | Method and device for extracting a conference summary |
US20160035344A1 (en) * | 2014-08-04 | 2016-02-04 | Google Inc. | Identifying the language of a spoken utterance |
CN205647778U (en) * | 2016-04-01 | 2016-10-12 | 安徽听见科技有限公司 | Intelligent conference system |
CN105957531A (en) * | 2016-04-25 | 2016-09-21 | 上海交通大学 | Speech content extracting method and speech content extracting device based on cloud platform |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109192192A (en) * | 2018-08-10 | 2019-01-11 | 北京猎户星空科技有限公司 | Language identification method, apparatus, translator, medium and device |
US11393476B2 (en) | 2018-08-23 | 2022-07-19 | Google Llc | Automatically determining language for speech recognition of spoken utterance received via an automated assistant interface |
WO2020039247A1 (en) * | 2018-08-23 | 2020-02-27 | Google Llc | Automatically determining language for speech recognition of spoken utterance received via an automated assistant interface |
CN112262430A (en) * | 2018-08-23 | 2021-01-22 | 谷歌有限责任公司 | Automatically determining language for speech recognition of a spoken utterance received via an automated assistant interface |
CN109346103A (en) * | 2018-10-30 | 2019-02-15 | 交通运输部公路科学研究所 | Audio detection method for highway tunnel traffic events |
CN109346103B (en) * | 2018-10-30 | 2023-03-28 | 交通运输部公路科学研究所 | Audio detection method for road tunnel traffic incident |
CN111429924A (en) * | 2018-12-24 | 2020-07-17 | 同方威视技术股份有限公司 | Voice interaction method and device, robot and computer readable storage medium |
CN110033756A (en) * | 2019-04-15 | 2019-07-19 | 北京达佳互联信息技术有限公司 | Language identification method and device, electronic equipment and storage medium |
CN110033756B (en) * | 2019-04-15 | 2021-03-16 | 北京达佳互联信息技术有限公司 | Language identification method and device, electronic equipment and storage medium |
CN110148399A (en) * | 2019-05-06 | 2019-08-20 | 北京猎户星空科技有限公司 | Control method, apparatus, device and medium for a smart device |
CN111179910A (en) * | 2019-12-17 | 2020-05-19 | 深圳追一科技有限公司 | Speech rate recognition method and apparatus, server, and computer-readable storage medium |
WO2021208455A1 (en) * | 2020-04-15 | 2021-10-21 | 南京邮电大学 | Neural network speech recognition method and system oriented to home spoken environment |
CN112669816A (en) * | 2020-12-24 | 2021-04-16 | 北京有竹居网络技术有限公司 | Model training method, speech recognition method, device, medium and equipment |
CN112669816B (en) * | 2020-12-24 | 2023-06-02 | 北京有竹居网络技术有限公司 | Model training method, voice recognition method, device, medium and equipment |
CN113761885A (en) * | 2021-03-17 | 2021-12-07 | 中科天玑数据科技股份有限公司 | Bayesian LSTM-based language identification method |
Also Published As
Publication number | Publication date |
---|---|
CN108335693B (en) | 2022-02-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108335693A (en) | Language identification method and language identification equipment | |
Ren et al. | Deep scalogram representations for acoustic scene classification | |
Pouyanfar et al. | Multimodal deep learning based on multiple correspondence analysis for disaster management | |
Demertzis et al. | Extreme deep learning in biosecurity: the case of machine hearing for marine species identification | |
US9278255B2 (en) | System and method for activity recognition | |
CN111161715B (en) | Specific sound event retrieval and positioning method based on sequence classification | |
CN112183107B (en) | Audio processing method and device | |
Gaurav et al. | Performance of deer hunting optimization based deep learning algorithm for speech emotion recognition | |
Parekh et al. | Weakly supervised representation learning for audio-visual scene analysis | |
Yan et al. | A region based attention method for weakly supervised sound event detection and classification | |
KR20230175258A (en) | End-to-end speaker separation through iterative speaker embedding | |
CN110136726A (en) | A kind of estimation method, device, system and the storage medium of voice gender | |
Zhang et al. | Automatic detection and classification of marmoset vocalizations using deep and recurrent neural networks | |
Colonna et al. | A comparison of hierarchical multi-output recognition approaches for anuran classification | |
Yousefi et al. | Real-time speaker counting in a cocktail party scenario using attention-guided convolutional neural network | |
Shen et al. | Learning mobile application usage-a deep learning approach | |
CN112466284B (en) | Mask voice identification method | |
Meng et al. | A capsule network with pixel-based attention and BGRU for sound event detection | |
Kalinli et al. | Saliency-driven unstructured acoustic scene classification using latent perceptual indexing | |
CN113870863A (en) | Voiceprint recognition method and device, storage medium and electronic equipment | |
Yang et al. | LCSED: A low complexity CNN based SED model for IoT devices | |
Gowda et al. | Affective computing using speech processing for call centre applications | |
CN116705034A (en) | Voiceprint feature extraction method, speaker recognition method, model training method and device | |
CN116186255A (en) | Method for training unknown intention detection model, unknown intention detection method and device | |
Mandal et al. | Is attention always needed? a case study on language identification from speech |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20220225 |