CN108335693A - Language identification method and language identification device - Google Patents

Language identification method and language identification device

Info

Publication number
CN108335693A
CN108335693A (application CN201710035625.5A, also published as CN108335693B)
Authority
CN
China
Prior art keywords
audio
target
video data
training
languages
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710035625.5A
Other languages
Chinese (zh)
Other versions
CN108335693B (en)
Inventor
张大威
贲国生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201710035625.5A
Publication of CN108335693A
Application granted
Publication of CN108335693B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The embodiments of the present invention disclose a language identification method and a language identification device. The method includes: performing feature extraction on target audio/video data used for offline training to obtain feature data corresponding to the target audio/video data; and iteratively training the feature data, layer by layer, through N hierarchically ordered layers of long short-term memory (LSTM) networks included in a training network, to obtain a target training model used for language identification. The method shown in the embodiments can be applied to large data sets; when the target training model performs language identification, the identification accuracy is high and the speed is fast, which can meet current speed requirements for language identification.

Description

Language identification method and language identification device
Technical field
The present invention relates to the field of computer technology, and in particular to a language identification method and a language identification device.
Background art
As international exchange becomes increasingly close, in many fields, such as information inquiry services, alarm systems, banking, stock trading, and emergency hotline services, the speed requirements for language identification keep rising. Taking information inquiry services as an example, many information inquiry systems can provide multilingual service, but only after the information inquiry system has determined the user's language category can it provide the service in the corresponding language in a targeted manner. Examples of such services include travel information, emergency services, and shopping.
Most language identification schemes currently on the market use traditional shallow-model methods such as the Gaussian Mixture Model (GMM) or the Support Vector Machine (SVM).
However, the language identification schemes used in the prior art cannot practically be applied to large data sets; their accuracy is low and their speed is slow, so they cannot meet current speed requirements for language identification.
Summary of the invention
The embodiments of the present invention provide a language identification method and a language identification device, which can be applied to language identification on large data sets with high identification accuracy and fast speed.
A first aspect of the embodiments of the present invention provides a language identification method, including:
obtaining target audio/video data used for offline training;
performing feature extraction on the target audio/video data to obtain feature data corresponding to the target audio/video data;
iteratively training the feature data, layer by layer, through N hierarchically ordered layers of long short-term memory (LSTM) networks included in a training network, to obtain a target training model, where the target training model is used for language identification.
A second aspect of the embodiments of the present invention provides a language identification method, including:
obtaining first target audio/video data used for online identification;
performing feature extraction on the first target audio/video data to obtain first feature data corresponding to the first target audio/video data;
determining a target training model, where the target training model is obtained by training second target audio/video data with a training network, the training network includes N hierarchically ordered layers of long short-term memory (LSTM) networks, and N is a positive integer greater than or equal to 2;
obtaining a target score according to the target training model and the first feature data;
determining language identification result information corresponding to the target score, where the language identification result information indicates the language to which the first target audio/video data belongs.
A third aspect of the embodiments of the present invention provides a language identification device, including:
a first obtaining unit, configured to obtain target audio/video data used for offline training;
a second obtaining unit, configured to perform feature extraction on the target audio/video data to obtain feature data corresponding to the target audio/video data;
a training unit, configured to iteratively train the feature data, layer by layer, through N hierarchically ordered layers of long short-term memory (LSTM) networks included in a training network, to obtain a target training model, where the target training model is used for language identification.
A fourth aspect of the embodiments of the present invention provides a language identification device, including:
a first obtaining unit, configured to obtain first target audio/video data used for online identification;
a first identification unit, configured to perform feature extraction on the first target audio/video data to obtain first feature data corresponding to the first target audio/video data;
a first determination unit, configured to determine a target training model, where the target training model is obtained by training second target audio/video data with a training network, the training network includes N hierarchically ordered layers of long short-term memory (LSTM) networks, and N is a positive integer greater than or equal to 2;
a second obtaining unit, configured to obtain a target score according to the target training model and the first feature data;
a second determination unit, configured to determine language identification result information corresponding to the target score, where the language identification result information indicates the language to which the first target audio/video data belongs.
The present embodiments provide a language identification method and a language identification device. The method shown in the embodiments can perform feature extraction on target audio/video data used for offline training to obtain feature data corresponding to the target audio/video data, and iteratively train the feature data, layer by layer, through N hierarchically ordered layers of long short-term memory (LSTM) networks included in a training network, to obtain a target training model used for language identification. The method shown in the embodiments can be applied to large data sets; when the target training model performs language identification, the identification accuracy is high and the speed is fast, which can meet current speed requirements for language identification.
Description of the drawings
Fig. 1 is a schematic structural diagram of an embodiment of a language identification device provided by the present invention;
Fig. 2 is a flowchart of the steps of an embodiment of a language identification method provided by the present invention;
Fig. 3 is a schematic diagram of the recurrence of a recurrent neural network provided by the present invention;
Fig. 4 is a schematic structural diagram of an LSTM network provided by the present invention;
Fig. 5 is a schematic structural diagram of a training network provided by the present invention;
Fig. 6 is a flowchart of the steps of another embodiment of a language identification method provided by the present invention;
Fig. 7 is a schematic structural diagram of another embodiment of a language identification device provided by the present invention;
Fig. 8 is a schematic structural diagram of another embodiment of a language identification device provided by the present invention.
Detailed description of the embodiments
The language identification method provided by the embodiments of the present invention can be applied to a language identification device with computing capability. For a better understanding of the language identification method provided by the embodiments of the present invention, the physical structure of the language identification device provided by the embodiments of the present invention is first described below with reference to Fig. 1.
It should be noted that the following description of the physical structure of the language identification device provided by the embodiments of the present invention is an optional example and is not limiting, as long as the language identification method provided by the embodiments of the present invention can be implemented.
As shown in Fig. 1, which is a schematic structural diagram of a language identification device provided by an embodiment of the present invention, the language identification device 100 may vary considerably in configuration or performance, and may include one or more central processing units (CPU) 122 (for example, one or more processors), a memory 132, and one or more storage media 130 (for example, one or more mass storage devices) storing application programs 142 or data 144. The memory 132 and the storage medium 130 may provide temporary or persistent storage. The program stored in the storage medium 130 may include one or more modules (not shown), and each module may include a series of instruction operations on the language identification device. Further, the central processing unit 122 may be configured to communicate with the storage medium 130 and execute, on the language identification device 100, the series of instruction operations in the storage medium 130.
The language identification device 100 may also include one or more power supplies 126, one or more wired or wireless network interfaces 150, one or more input/output interfaces 158, and/or one or more operating systems 141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The language identification device 100 shown in Fig. 1 can implement automatic language identification (LID) of speech.
LID refers to the process by which the language identification device 100 automatically identifies the language to which speech belongs.
Automatic language identification technology has very important applications in information retrieval, criminal investigation, and military fields. With the development of Internet technology, language identification will play an increasingly important role, and as technology advances it may one day break down the barriers to human communication, in which language identification will surely play a very important role. Some day, people from different countries, of different skin colors, speaking different languages will be able to communicate freely by technical means, and language identification technology is an important front-end processor in that process. Future information inquiry systems can provide multilingual services; for example, in information services, many information inquiry systems can provide multilingual service, and after the information inquiry system determines the user's language category, it provides the service in the corresponding language. Examples of such services include travel information, emergency services, shopping, and banking and stock trading.
Automatic language identification technology can also be used for the front-end processing of multilingual machine translation systems, and for communication systems that directly convert one language into another.
In addition, in military applications it can be used to monitor or discriminate a speaker's identity and nationality. With the arrival of the information age and the development of the Internet, language identification increasingly shows its application value.
Based on the language identification device shown in Fig. 1, the specific execution steps of the language identification method provided by the embodiments of the present invention are described in detail below with reference to Fig. 2, where Fig. 2 is a flowchart of the steps of an embodiment of the language identification method provided by the present invention.
First, steps 201 to 207 shown in this embodiment are the specific execution steps of the offline training part:
Step 201: obtain a second audio/video file.
When executing the offline training part, the language identification device may first obtain the second audio/video file used for offline training.
This embodiment does not limit the number of audio/video data items included in the second audio/video file.
Step 202: decode the second audio/video file by a decoder to generate second audio/video data.
The decoder shown in this embodiment may be a multimedia video processing tool (Fast Forward Mpeg, FFmpeg) decoder.
It should be noted that this description of the decoder is an optional example and is not limiting, as long as the decoder can decode the second audio/video file to generate the second audio/video data on which language identification can be performed.
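As an illustration of this decoding step, here is a minimal sketch that invokes FFmpeg from Python to keep only the audio track as 16 kHz mono PCM; the patent names FFmpeg only as an optional decoder, and the specific options below are assumptions made for the sketch:

```python
import subprocess

def decode_to_wav(av_file: str, wav_file: str) -> None:
    """Decode an audio/video file, keeping only the audio track as 16 kHz mono WAV."""
    subprocess.run(
        ["ffmpeg", "-y",
         "-i", av_file,            # input audio/video file
         "-vn",                    # drop the video stream
         "-ac", "1",               # downmix to mono
         "-ar", "16000",           # resample to 16 kHz
         "-acodec", "pcm_s16le",   # 16-bit PCM, convenient for VAD and features
         wav_file],
        check=True)

decode_to_wav("sample_video.mp4", "sample_audio.wav")
```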
Step 203: filter the second audio/video data to generate second target audio/video data.
To reduce the time taken by the offline training part, improve the efficiency of language identification, and improve the accuracy of language identification, the language identification device shown in this embodiment may filter the second audio/video data.
Specifically, the language identification device shown in this embodiment performs voice activity detection (VAD) to filter out the invalid silent segments in the second audio/video data, so as to generate the second target audio/video data.
It can be seen that the data included in the second target audio/video data obtained in step 203 is valid data, which prevents the language identification device from wasting processing time and system resources on useless data.
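A minimal sketch of this VAD filtering step follows; the patent specifies VAD but no particular implementation, so the open-source webrtcvad package and 30 ms frames of 16-bit, 16 kHz mono PCM are assumptions here:

```python
import webrtcvad

def drop_silence(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30) -> bytes:
    """Keep only the frames that VAD judges to contain speech."""
    vad = webrtcvad.Vad(2)  # aggressiveness from 0 (least) to 3 (most)
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 16-bit samples: 2 bytes each
    voiced = []
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        frame = pcm[start:start + frame_bytes]
        if vad.is_speech(frame, sample_rate):  # invalid silent frames are dropped
            voiced.append(frame)
    return b"".join(voiced)
```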
Step 204: perform feature extraction on the second target audio/video data to obtain second feature data.
Specifically, in this embodiment, the language identification device may perform feature extraction on the second target audio/video data to obtain second feature data corresponding to the second target audio/video data.
The feature extraction method applied in this embodiment to the second target audio/video data may be the spectral envelope method, the cepstrum method, the LPC interpolation method, the LPC root-finding method, the Hilbert transform method, a formant tracking algorithm, etc.
This embodiment does not limit the feature extraction method, as long as the second feature data of the second target audio data can be extracted.
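As one concrete instance of the cepstrum-family methods listed above, here is a minimal sketch assuming MFCC features extracted with librosa; the patent does not fix a particular feature type or library:

```python
import librosa

def extract_features(wav_file: str, n_mfcc: int = 13):
    """Return a (frames, n_mfcc) matrix of MFCC features for one filtered audio file."""
    y, sr = librosa.load(wav_file, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
    return mfcc.T  # one feature vector per frame
```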
Step 205: set a target classification label in the second feature data.
The target classification label is a label indicating the language of the target audio data.
The target classification label shown in this embodiment is a label corresponding to the second feature data.
In this embodiment, by setting the target classification label in the second feature data, the second feature data is classified according to language.
Classification shown in this embodiment means, simply put, assigning the second feature data to existing categories according to its language features or attributes.
For example, in natural language processing (NLP), text classification is a classification problem, and general pattern classification methods can be applied to the study of text classification.
Common classification algorithms include: decision tree classification, the naive Bayesian classification algorithm (naive Bayesian classifier), support vector machine (SVM) classifiers, neural networks, the k-nearest neighbor method (kNN), fuzzy classification methods, etc.
For example, taking a Tibetan identification scenario, the target classification label of Tibetan can be predefined as 1, so that the target classification label distinguishes Tibetan from other languages; then, in step 205 shown in this embodiment, a target classification label set to 1 is set in the second feature data.
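A minimal sketch of this labeling step; the Tibetan = 1 assignment comes from the example above, while the remaining entries of the label table are assumptions for the sketch:

```python
# Illustrative label table: Tibetan = 1 as in the example above;
# the other entries are assumptions, not fixed by the patent.
LANGUAGE_LABELS = {"tibetan": 1, "uyghur": 2, "other": 0}

def label_features(feature_matrix, language: str):
    """Pair the feature data of one file with its target classification label."""
    return feature_matrix, LANGUAGE_LABELS[language]
```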
Step 206: input the second feature data in which the target classification label has been set into the training network.
Step 207: iteratively train, through the training network, the second feature data in which the target classification label has been set, to obtain the target training model.
Specifically, in this embodiment, the N layers of long short-term memory (LSTM) networks included in the training network iteratively train, layer by layer, the second feature data in which the target classification label has been set, to obtain the target training model.
More specifically, because in step 206 shown in this embodiment the language identification device sends the second feature data in which the target classification label has been set to the training network, the training network can iteratively train that second feature data, layer by layer, through the N layers of LSTM networks included in the training network, to obtain the target training model.
The N hierarchically ordered layers of LSTM networks included in the training network shown in this embodiment are described below:
Humans do not start thinking from scratch every second; humans understand each word based on the preceding words rather than discarding everything and starting over, so human thinking has persistence. A traditional neural network cannot do this, which is a major shortcoming. For example, when classifying what is happening at each time point in a film, a traditional neural network cannot apply its reasoning about earlier events to later events.
Recurrent neural networks (RNN) solve this problem. They are networks with loops and have the ability to retain information. An RNN can be viewed as multiple copies of the same neural network, with each neural network module passing information to the next.
The recurrence of an RNN is described below with reference to Fig. 3: neural network 301 in Fig. 3 is a schematic diagram of the network whose loop has not yet been unrolled, and neural network 302 in Fig. 3 is a schematic diagram of the network with the loop unrolled.
It can be seen that the unrolled neural network 302 includes multiple sequentially connected neural network modules A.
Specifically, in neural network 301 and neural network 302, the input of a neural network module A is x_t and its output is h_t.
In neural network 302, the unrolled loop structure of each neural network module A passes information from one step of the network to the next. A recurrent neural network can be regarded as multiple replicas of the same network, each passing a message to its successor.
An RNN can learn to use past information by connecting earlier information to the present task; for example, with video data, the information of the previous frame can be used to understand the information of the current frame.
Imagine a language model trying to predict the next word based on the current one. If we are trying to predict the last word of "the clouds are in the sky", we obviously do not need any additional information: the next word is "sky". In this case, the gap between the point of the target prediction and the relevant information is small, and the earlier contextual information can be forgotten.
But sometimes more contextual information is needed. Imagine predicting the last word of the sentence "I grew up in France, I speak fluent French." The most recent information suggests that the next word is probably the name of a language, but to narrow down which language, we need the earlier context of "France", and the gap between the point of prediction and the relevant information becomes very large. In this case, the contextual information must be remembered and relied upon.
That is, depending on the situation, contextual information sometimes needs to be forgotten and sometimes needs to be remembered. Traditional RNN methods cannot solve the long-term dependence problem, where long-term dependence in this embodiment refers to remembering and relying on contextual information over long spans. The LSTM shown in this embodiment, however, can solve the problem of long-term dependence on contextual information.
An LSTM network is a special kind of RNN that can learn long-term dependences. LSTM is specifically designed to avoid the long-term dependence problem: remembering long-term information is the default behavior of an LSTM rather than something it struggles to learn. LSTM has strong temporal modeling ability and can make good use of contextual relations; for tasks involving sequence input, such as speech and language, LSTM networks achieve better results.
The specific structure of an LSTM network is described below with reference to Fig. 4:
An LSTM network has a forget gate 401; when the contextual relation needs to be remembered over a long time, long-term dependence is selected, and when the contextual relation needs to be forgotten, forgetting is selected. In this way the long-term dependence problem can be solved well.
Specifically, the LSTM network is a block structure provided with three memory gates: the input gate 402, the output gate 403, and the forget gate 401.
The input gate 402 can filter the input, which is then stored in the memory cell 404, so that the memory cell 404 retains the state of the previous moment and adds the state of the current moment to it.
The cooperation of the three gates allows the LSTM network to store long-term information; for example, as long as the input gate 402 remains closed, the information stored in the memory cell 404 will not be overwritten by later inputs.
In an LSTM network, when the error backpropagates from the output layer, it can be recorded by the memory cell 404, so the LSTM can remember information over a relatively long time.
More specifically, the input gate 402 controls the input information; the gate's input is the output of the hidden node at the previous time point together with the current input, and multiplying the output of the input gate 402 by the output of the input node controls the amount of information.
The forget gate 401 controls the internal state information; the gate's input is the output of the hidden node at the previous time point together with the current input.
The output gate 403 controls the output information; the gate's input is the output of the hidden node at the previous time point together with the current input. The activation function is the sigmoid; because the sigmoid's output lies between 0 and 1, multiplying the output of the output gate 403 by the output of the internal state node controls the amount of information.
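The patent describes the three gates qualitatively; for reference, a standard textbook formulation of the LSTM gate equations (not quoted from the patent) is:

```latex
\begin{aligned}
f_t &= \sigma\big(W_f\,[h_{t-1}, x_t] + b_f\big) && \text{(forget gate 401)}\\
i_t &= \sigma\big(W_i\,[h_{t-1}, x_t] + b_i\big) && \text{(input gate 402)}\\
\tilde{c}_t &= \tanh\big(W_c\,[h_{t-1}, x_t] + b_c\big) && \text{(candidate state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory cell 404)}\\
o_t &= \sigma\big(W_o\,[h_{t-1}, x_t] + b_o\big) && \text{(output gate 403)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden output)}
\end{aligned}
```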
The specific structure of the training network shown in this embodiment is described in detail below with reference to Fig. 5:
As shown in Fig. 5, the training network shown in this embodiment includes N hierarchically ordered layers of LSTM networks; this embodiment does not limit the specific value of N, as long as N is a positive integer greater than or equal to 2.
This embodiment gives an optional example with N equal to 2, i.e., the training network includes two layers of LSTM.
Specifically, in the two-layer LSTM, the output of the LSTM of the previous layer serves as the input of the next layer; it can be seen that data can circulate among the multiple LSTM layers.
Compared with a single-layer LSTM, the two-layer LSTM shown in this embodiment has better performance and can make more efficient use of the parameters of the LSTM.
Because the training network shown in this embodiment includes multiple LSTMs, the LSTM at the lower layer can correct the iteration parameters input by the upper-layer LSTM; it can be seen that using multiple LSTM layers can effectively improve the accuracy of language identification.
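A minimal sketch of such a stacked two-layer LSTM scoring model, written in PyTorch; the framework, the layer sizes, and the final sigmoid score head are assumptions, since the patent only fixes the N = 2 stacked-LSTM structure and a score output:

```python
import torch
import torch.nn as nn

class LanguageIdModel(nn.Module):
    """Two stacked LSTM layers followed by a per-utterance score, as in Fig. 5 (N = 2)."""
    def __init__(self, feat_dim: int = 13, hidden_dim: int = 128):
        super().__init__()
        # num_layers=2 stacks two LSTMs: the hidden sequence of the first layer
        # is consumed as the input of the second layer.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):                # x: (batch, frames, feat_dim)
        seq, _ = self.lstm(x)            # seq: (batch, frames, hidden_dim)
        score = torch.sigmoid(self.out(seq[:, -1]))  # score in (0, 1)
        return score.squeeze(-1)
```

In this sketch, the wiring of num_layers=2 matches the description above: the output of the previous LSTM layer serves as the input of the next layer.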
Optionally, in this embodiment, the second feature data may be iterated in the training network for M rounds, and the training model generated by each round of iteration may be set as a candidate training model.
The language identification device shown in this embodiment may select the target training model among the M rounds of candidate training models.
This embodiment does not limit the specific way in which the target training model is determined; for example, the language identification device shown in this embodiment may select the target training model among the M rounds of candidate training models according to the coverage rate, false-alarm rate, average recognition speed, accuracy rate, etc.
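A sketch of this selection step; evaluate_candidate is a hypothetical helper standing in for whichever combination of coverage rate, false-alarm rate, average recognition speed, and accuracy rate the device actually uses:

```python
def select_target_model(candidates, validation_data):
    """Pick the target training model among the M per-round candidate models."""
    best_model, best_score = None, float("-inf")
    for model in candidates:
        # evaluate_candidate is a hypothetical helper that combines the
        # selection criteria above into a single number (higher is better).
        score = evaluate_candidate(model, validation_data)
        if score > best_score:
            best_model, best_score = model, score
    return best_model
```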
Steps 208 to 214 below are the specific execution steps of the online identification part of the language identification method shown in this embodiment of the present invention:
Step 208: obtain a first audio/video file.
In this embodiment, the first audio/video file on which language identification needs to be performed may be input into the language identification device shown in this embodiment.
For example, the first audio/video file shown in this embodiment may include 4654 videos, and the 4654 videos are input into the language identification device.
Step 209: decode the first audio/video file by a decoder to generate first audio/video data.
The decoder shown in this embodiment may be a multimedia video processing tool (Fast Forward Mpeg, FFmpeg) decoder.
It should be noted that this description of the decoder is an optional example and is not limiting, as long as the decoder can decode the first audio/video file to generate the first audio/video data on which language identification can be performed.
Step 210: filter the first audio/video data to generate first target audio/video data.
To reduce the time taken by the online identification part, improve the efficiency of language identification, and improve the accuracy of language identification, the language identification device shown in this embodiment may filter the first audio/video data.
Specifically, the language identification device shown in this embodiment performs voice activity detection (VAD) to filter out the invalid silent segments in the first audio/video data, so as to generate the first target audio/video data.
It can be seen that the data included in the first target audio/video data obtained in step 210 is valid data, which prevents the language identification device from wasting processing time and system resources on useless data.
Step 211: perform feature extraction on the first target audio/video data to obtain first feature data.
Specifically, in this embodiment, the language identification device may perform feature extraction on the first target audio/video data to obtain first feature data corresponding to the first target audio/video data.
The feature extraction method applied in this embodiment to the first target audio/video data may be the spectral envelope method, the cepstrum method, the LPC interpolation method, the LPC root-finding method, the Hilbert transform method, a formant tracking algorithm, etc.
This embodiment does not limit the feature extraction method, as long as the first feature data of the first target audio data can be extracted.
Step 212: determine the target training model.
When executing step 212, the language identification device shown in this embodiment first needs to obtain the target training model obtained in step 207.
Step 213: obtain a target score according to the target training model and the first feature data.
The language identification device shown in this embodiment may perform the corresponding computation according to the obtained target training model and the first feature data, so as to obtain the target score.
Specifically, the language identification device shown in this embodiment may compute the target score from the parameters of the target training model and the first feature data.
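Continuing the PyTorch sketch from the training part (an assumption, not the patent's prescribed implementation), the target score for one file could be computed as:

```python
import torch

def target_score(model, feature_matrix) -> float:
    """Run the first feature data through the target training model to get the target score."""
    model.eval()
    with torch.no_grad():
        # feature_matrix: (frames, feat_dim) -> (1, frames, feat_dim)
        x = torch.as_tensor(feature_matrix, dtype=torch.float32).unsqueeze(0)
        return float(model(x))  # scalar score in (0, 1)
```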
Step 214: determine language identification result information corresponding to the target score.
Specifically, the language identification result information shown in this embodiment indicates the language to which the first target audio/video data belongs.
More specifically, the language identification device shown in this embodiment is preconfigured with correspondences between different score ranges and different languages. When executing step 214 shown in this embodiment, the language identification device may first determine the target score range to which the target score belongs, and then determine the language identification result information corresponding to that target score range.
For example, taking the case where the language corresponding to the first feature data is Tibetan, the language identification device shown in this embodiment may prestore a score range corresponding to Tibetan, for example between 0 and 1. When the target score identified by the language identification device falls within this score range, the device can recognize that the file corresponding to the first feature data is a Tibetan audio/video file. For example, if the language identification device identifies the target score as 0.999, it recognizes that the target score 0.999 falls between 0 and 1, and can then recognize that the file corresponding to the first feature data is a Tibetan audio/video file.
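A sketch of the preconfigured score-range lookup described above; the concrete ranges below are placeholders, since the patent's example only states that the Tibetan range lies between 0 and 1:

```python
# Placeholder ranges; a real device would be preconfigured per language.
SCORE_RANGES = [((0.5, 1.0), "Tibetan"), ((0.0, 0.5), "non-Tibetan")]

def language_of(score: float) -> str:
    """Map a target score onto language identification result information."""
    for (low, high), language in SCORE_RANGES:
        if low <= score <= high:
            return language
    return "unknown"
```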
The advantage of the method shown in this embodiment is that the language identification device does not need to analyze the content of the audio/video file; it only needs to build the target training model, which, once trained on audio/video files, can determine the language to which an audio/video file belongs. Because the target training model is obtained by training second target audio/video data with a training network that includes N hierarchically ordered layers of LSTM networks, the language identification process is highly efficient and fast, and its accuracy and coverage are far better than those of traditional shallow-model methods and ordinary DNN networks, so the language to which an audio/video file belongs can be determined quickly and accurately.
To better illustrate the advantages of the method shown in this embodiment, the method was tested as follows:
In this test, the first audio/video file includes 79 Tibetan videos and 9604 non-Tibetan videos, where the maximum length of each video is 180 seconds.
In determining the target training model in the offline training part, the target training model may be the training model of the 4600th iteration round of the training network;
when the first audio/video file is processed according to the target training model to output language identification result information, the following results are obtained in this test: coverage rate = 67/79 = 84.8%, false-alarm rate = 1/9604 ≈ 0.01%, average recognition speed of Tibetan videos = 1.6 s per video, and average recognition speed of normal videos = 3.4 s per video.
As another example, in another test the first audio/video file includes 100 Uyghur videos and 9608 non-Uyghur videos, where the maximum length of each video is 180 seconds.
In determining the target training model in the offline training part, the target training model may be the training model of the 3400th iteration round of the training network;
when the first audio/video file is processed according to the target training model to output language identification result information, the following results are obtained in this test: coverage rate = 30/100 = 30.0%, false-alarm rate = 10/9608 ≈ 0.1%, average recognition speed of Uyghur videos = 1.66 s per video, and average recognition speed of normal videos = 3.51 s per video.
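The reported figures follow from simple ratios; a small sketch reproducing the arithmetic of the two tests above:

```python
def coverage_rate(detected: int, total_target: int) -> float:
    return detected / total_target

def false_alarm_rate(false_positives: int, total_non_target: int) -> float:
    return false_positives / total_non_target

print(coverage_rate(67, 79))       # ~0.848 -> 84.8% (Tibetan test)
print(false_alarm_rate(1, 9604))   # ~0.0001 -> 0.01%
print(coverage_rate(30, 100))      # 0.30 -> 30.0% (Uyghur test)
print(false_alarm_rate(10, 9608))  # ~0.001 -> 0.1%
```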
For a better understanding of the method shown in the embodiments of the present invention, application scenarios to which the method may be applied are described below:
It should be noted that the following descriptions of scenarios to which the method shown in the embodiments of the present invention is applied are optional examples and are not limiting.
Scenario 1: the field of speech recognition
With the arrival of the mobile Internet era, voice assistants similar to Siri have become popular, and users need to download voice assistants of different languages according to their own language. There are also various speech-to-text tools on the market, and users must select the corresponding tool according to the language spoken, which is inconvenient. Using the language identification method shown in this embodiment, the voice assistant of the corresponding language can be quickly located according to the language spoken by the user, which is convenient and fast.
Scenario 2: bank and stock exchange information services
In places such as banks and stock exchanges, when an ethnic-minority customer who cannot speak Mandarin is encountered, it is difficult to handle the relevant business, and a staff member who understands the minority language must be found to receive the customer; before that, the customer's language cannot be determined, which wastes a lot of time. Using the language identification method shown in this embodiment, Tibetan or Uyghur audio can be quickly identified from the content spoken by the customer: the machine is taught to recognize the speech of ethnic-minority compatriots and quickly recognizes the corresponding language category, so that the relevant staff can be found for reception.
Scenario 3: emergency hotline services
When handling emergency services such as 120 distress calls and 110 police calls from ethnic-minority compatriots, time is short; if the caller's language cannot be confirmed, precious emergency time may be lost, endangering the caller's life. With the language identification method shown in this embodiment, the corresponding language category can be quickly recognized from the caller's audio, and a staff member who understands the corresponding language can be found to take the record, saving precious time and saving lives.
Scenario 4: identification of violent and terrorist videos
With the development of the mobile Internet, many people like to post videos on social software such as WeChat and QQ Zone, and hundreds of millions of videos are uploaded every day. These may include a large number of malicious videos involving politics and violent terrorism; videos of the "Tibetan independence" or "Xinjiang independence" kind are especially high-risk malicious videos. The number of such videos is not large, and the daily audit volume of customer service staff is fixed, so such videos may not be found effectively and a great deal of time may be wasted. Using the language identification method shown in this embodiment, suspected political and violent-terrorist videos in massive video collections can be quickly located; for example, videos whose language is Tibetan or Uyghur can be supplied to customer service for audit, improving working efficiency and accurately screening malicious videos.
Scenario 5: monitoring suspects
When the army and police monitor a suspect, they need to discriminate the identity, nationality, and speech content of the speaker, which requires a great deal of manpower and material resources and leads to inefficiency. Using the language identification method of this embodiment, the language information of the monitored person can be accurately determined, so as to infer information such as the person's identity, ethnicity, and nationality.
The language identification device shown in this embodiment can be used to execute the language identification method shown in Fig. 2; the language identification device shown in this embodiment can also execute the language identification method shown in Fig. 6. In Fig. 6, the language identification device only needs to execute the offline training part of the language identification method.
Step 601: obtain an audio/video file.
Step 602: decode the audio/video file by a decoder to generate audio/video data.
Step 603: filter the audio/video data to generate target audio/video data.
Step 604: perform feature extraction on the target audio/video data to obtain feature data.
Step 605: set a target classification label in the feature data.
Step 606: input the feature data in which the target classification label has been set into the training network.
Step 607: iteratively train the feature data through the training network to obtain the target training model.
For the description of the audio/video file shown in this embodiment, refer to the description of the second audio/video file shown in Fig. 2; for the description of the target audio/video data shown in this embodiment, refer to the description of the second target audio/video data shown in Fig. 2; for the description of the feature data shown in this embodiment, refer to the description of the second feature data shown in Fig. 2; details are not repeated in this embodiment.
For the process of steps 601 to 607 shown in this embodiment, refer to steps 201 to 207 shown in Fig. 2; details are not repeated in this embodiment.
The specific structure of the language identification device shown in this embodiment is described below from the perspective of functional modules with reference to Fig. 7:
The language identification device includes:
a third obtaining unit 701, configured to obtain second target audio/video data;
specifically, the third obtaining unit 701 includes:
a second obtaining module 7011, configured to obtain a second audio/video file used for offline training;
a second decoding module 7012, configured to decode the second audio/video file by a decoder to generate second audio/video data;
a second filtering module 7013, configured to filter the invalid silent segments in the second audio/video data by voice activity detection (VAD) to generate the second target audio/video data;
a second identification unit 702, configured to perform feature extraction on the second target audio/video data to obtain second feature data corresponding to the second target audio/video data;
a setting unit 703, configured to set a target classification label in the second feature data, where the target classification label is a label indicating the language of the target audio data;
a training unit 704, configured to iteratively train the second feature data, layer by layer, through the N layers of long short-term memory (LSTM) networks included in the training network, to obtain the target training model;
the training unit 704 is further configured to iteratively train, layer by layer, through the N layers of LSTM networks included in the training network, the second feature data in which the target classification label has been set, to obtain the target training model;
a first obtaining unit 705, configured to obtain first target audio/video data used for online identification;
specifically, the first obtaining unit 705 includes:
a first obtaining module 7051, configured to obtain a first audio/video file used for online identification;
a first decoding module 7052, configured to decode the first audio/video file by a decoder to generate first audio/video data;
a first filtering module 7053, configured to filter the invalid silent segments in the first audio/video data by voice activity detection (VAD) to generate the first target audio/video data;
a first identification unit 706, configured to perform feature extraction on the first target audio/video data to obtain first feature data corresponding to the first target audio/video data;
a first determination unit 707, configured to determine a target training model, where the target training model is obtained by training second target audio/video data with a training network, the training network includes N hierarchically ordered layers of long short-term memory (LSTM) networks, and N is a positive integer greater than or equal to 2;
a second obtaining unit 708, configured to obtain a target score according to the target training model and the first feature data;
a second determination unit 709, configured to determine language identification result information corresponding to the target score, where the language identification result information indicates the language to which the first target audio/video data belongs.
For the specific process by which the language identification device shown in this embodiment executes the language identification method, refer to Fig. 2; details are not repeated in this embodiment.
For the advantageous effects of the language identification device shown in this embodiment during execution of the language identification method, refer to the embodiment shown in Fig. 2; details are not repeated in this embodiment.
The specific structure of the language identification device shown in this embodiment is described below from the perspective of functional modules with reference to Fig. 8; the language identification device shown in Fig. 8 can implement the offline training part of the language identification method.
Specifically, the language identification device includes:
a first obtaining unit 801, configured to obtain target audio/video data used for offline training;
specifically, the obtaining unit 801 includes:
an obtaining module 8011, configured to obtain an audio/video file used for offline training;
a decoding module 8012, configured to decode the audio/video file by a decoder to generate audio/video data;
a filtering module 8013, configured to filter the invalid silent segments in the audio/video data by voice activity detection (VAD) to generate the target audio/video data;
a second obtaining unit 802, configured to perform feature extraction on the target audio/video data to obtain feature data corresponding to the target audio/video data;
a setting unit 803, configured to set a target classification label in the feature data, where the target classification label is a label indicating the language of the target audio data;
a training unit 804, configured to iteratively train the feature data, layer by layer, through N hierarchically ordered layers of long short-term memory (LSTM) networks included in a training network, to obtain a target training model, where the target training model is used for language identification;
the training unit 804 is further configured to iteratively train, layer by layer, through the N layers of LSTM networks included in the training network, the feature data in which the target classification label has been set, to obtain the target training model.
For the specific process by which the language identification device shown in this embodiment executes the language identification method, refer to Fig. 6; details are not repeated in this embodiment.
For the advantageous effects of the language identification device shown in this embodiment during execution of the language identification method, refer to the embodiment shown in Fig. 6; details are not repeated in this embodiment.
It may be clearly understood by a person skilled in the art that, for convenience and brevity of description, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not described herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely exemplary. For example, the division of the units is merely a logical function division; in actual implementation, there may be another division manner. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the displayed or discussed mutual couplings, direct couplings, or communication connections may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
When the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a language identification device, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing embodiments are merely intended to describe the technical solutions of the present invention rather than to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions recorded in the foregoing embodiments, or equivalent replacements may be made to some of the technical features thereof; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (16)

1. A language identification method, characterized by comprising:
obtaining target audio/video data used for offline training;
performing feature extraction on the target audio/video data to obtain feature data corresponding to the target audio/video data;
iteratively training the feature data, layer by layer, through N hierarchically ordered layers of long short-term memory (LSTM) networks included in a training network, to obtain a target training model, wherein the target training model is used for language identification.
2. The method according to claim 1, characterized in that after the feature extraction is performed on the target audio/video data, the method further comprises:
setting a target classification label in the feature data, wherein the target classification label is a label indicating the language of the target audio data;
and the iteratively training the feature data, layer by layer, through the N hierarchically ordered layers of LSTM networks included in the training network comprises:
iteratively training, layer by layer, through the N layers of LSTM networks included in the training network, the feature data in which the target classification label has been set, to obtain the target training model.
3. The method according to claim 1 or 2, characterized in that the obtaining the target audio/video data comprises:
obtaining an audio/video file used for offline training;
decoding the audio/video file by a decoder to generate audio/video data;
filtering invalid silent segments in the audio/video data by voice activity detection (VAD) to generate the target audio/video data.
4. a kind of Language Identification, which is characterized in that including:
Obtain the first object audio, video data for being identified on line;
Feature extraction is carried out to the first object audio, video data, it is corresponding with the first object audio, video data to obtain Fisrt feature data;
Determine that target training pattern, the target training pattern are to be instructed using training network pair the second target audio, video data It gets, the trained network includes that N the layers long memory network LSTM in short-term, the N to sort by level is just more than or equal to 2 Integer;
According to the target training pattern and the fisrt feature data acquisition target fractional;
Determine that languages recognition result information corresponding with the target fractional, the languages recognition result information are used to indicate described Languages belonging to first object audio, video data.
5. according to the method described in claim 4, it is characterized in that, the first object audio and video obtained for being identified on line Data include:
Obtain the first audio-video document for being identified on line;
First audio-video document is decoded to generate the first audio, video data by decoder;
Invalid mute section in first audio, video data is filtered to generate the first object by voiced activity detection VAD Audio, video data.
6. according to the method described in claim 4, it is characterized in that, the first object audio and video obtained for being identified on line Before data, the method further includes:
Obtain the second target audio, video data;
Feature extraction is carried out to the second target audio, video data, it is corresponding with the second target audio, video data to obtain Second feature data;
By the LSTM of the memory network in short-term N layers long included by the trained network successively to the second feature data into Row iteration is trained, to obtain the target training pattern.
7. The method according to claim 6, characterized in that after the performing feature extraction on the second target audio/video data, the method further comprises:
setting a target classification label in the second feature data, the target classification label being a label used to indicate the language of the second target audio/video data;
and the iteratively training the second feature data layer by layer through the N LSTM layers comprised in the training network comprises:
iteratively training, layer by layer through the N LSTM layers comprised in the training network, the second feature data provided with the target classification label, to obtain the target training model.
8. The method according to claim 6 or 7, characterized in that the obtaining the second target audio/video data comprises:
obtaining a second audio/video file for offline training;
decoding, by a decoder, the second audio/video file to generate second audio/video data;
filtering, by voice activity detection (VAD), invalid silent segments in the second audio/video data to generate the second target audio/video data.
9. A language identification device, characterized by comprising:
a first obtaining unit, configured to obtain target audio/video data for offline training;
a second obtaining unit, configured to perform feature extraction on the target audio/video data to obtain feature data corresponding to the target audio/video data;
a training unit, configured to iteratively train the feature data, layer by layer, through N hierarchically ordered long short-term memory (LSTM) networks comprised in a training network, to obtain a target training model, the target training model being used for language identification.
10. The language identification device according to claim 9, characterized by further comprising:
a setting unit, configured to set a target classification label in the feature data, the target classification label being a label used to indicate the language of the target audio/video data;
wherein the training unit is further configured to iteratively train, layer by layer through the N LSTM layers comprised in the training network, the feature data provided with the target classification label, to obtain the target training model.
11. The language identification device according to claim 9 or 10, characterized in that the first obtaining unit comprises:
an obtaining module, configured to obtain an audio/video file for offline training;
a decoding module, configured to decode, by a decoder, the audio/video file to generate audio/video data;
a filtering module, configured to filter, by voice activity detection (VAD), invalid silent segments in the audio/video data to generate the target audio/video data.
12. A language identification device, characterized by comprising:
a first obtaining unit, configured to obtain first target audio/video data for online identification;
a first identification unit, configured to perform feature extraction on the first target audio/video data to obtain first feature data corresponding to the first target audio/video data;
a first determining unit, configured to determine a target training model, the target training model being obtained by training second target audio/video data with a training network, the training network comprising N hierarchically ordered long short-term memory (LSTM) networks, N being a positive integer greater than or equal to 2;
a second obtaining unit, configured to obtain a target score according to the target training model and the first feature data;
a second determining unit, configured to determine language identification result information corresponding to the target score, the language identification result information indicating the language to which the first target audio/video data belongs.
13. The language identification device according to claim 12, characterized in that the first obtaining unit comprises:
a first obtaining module, configured to obtain a first audio/video file for online identification;
a first decoding module, configured to decode, by a decoder, the first audio/video file to generate first audio/video data;
a first filtering module, configured to filter, by voice activity detection (VAD), invalid silent segments in the first audio/video data to generate the first target audio/video data.
14. The language identification device according to claim 12, characterized by further comprising:
a third obtaining unit, configured to obtain the second target audio/video data;
a second identification unit, configured to perform feature extraction on the second target audio/video data to obtain second feature data corresponding to the second target audio/video data;
a training unit, configured to iteratively train, layer by layer through the N LSTM layers comprised in the training network, the second feature data, to obtain the target training model.
15. The language identification device according to claim 14, characterized by further comprising:
a setting unit, configured to set a target classification label in the second feature data, the target classification label being a label used to indicate the language of the second target audio/video data;
wherein the training unit is further configured to iteratively train, layer by layer through the N LSTM layers comprised in the training network, the second feature data provided with the target classification label, to obtain the target training model.
16. The language identification device according to claim 14 or 15, characterized in that the third obtaining unit comprises:
a second obtaining module, configured to obtain a second audio/video file for offline training;
a second decoding module, configured to decode, by a decoder, the second audio/video file to generate second audio/video data;
a second filtering module, configured to filter, by voice activity detection (VAD), invalid silent segments in the second audio/video data to generate the second target audio/video data.
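Finally, a hypothetical sketch of how the units of the device claims (12-16) might be wired together, reusing the helpers sketched above; all class and method names are invented for illustration.

```python
class LanguageIdentificationDevice:
    """Bundles the obtaining, identification, and determining units of claim 12."""

    def __init__(self, model):
        self.model = model  # the target training model produced by the training units

    def identify(self, av_file):
        # first obtaining unit: obtain, decode, and VAD-filter the file (claim 13);
        # identification/determining units: features, score, result info (claim 12)
        return identify_language(self.model, av_file)

# Example usage with the sketches above:
# device = LanguageIdentificationDevice(train_model(LanguageLSTM(), loader))
# language, score = device.identify("query_clip.mp4")
```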
CN201710035625.5A 2017-01-17 2017-01-17 Language identification method and language identification equipment Expired - Fee Related CN108335693B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710035625.5A CN108335693B (en) 2017-01-17 2017-01-17 Language identification method and language identification equipment


Publications (2)

Publication Number Publication Date
CN108335693A 2018-07-27
CN108335693B CN108335693B (en) 2022-02-25

Family ID=62921583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710035625.5A Expired - Fee Related CN108335693B (en) 2017-01-17 2017-01-17 Language identification method and language identification equipment

Country Status (1)

Country Link
CN (1) CN108335693B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN104427292A * | 2013-08-22 | 2015-03-18 | ZTE Corporation | Method and device for extracting a conference summary
US20160035344A1 * | 2014-08-04 | 2016-02-04 | Google Inc. | Identifying the language of a spoken utterance
CN205647778U * | 2016-04-01 | 2016-10-12 | Anhui Tingjian Technology Co., Ltd. | Intelligent conference system
CN105957531A * | 2016-04-25 | 2016-09-21 | Shanghai Jiao Tong University | Speech content extraction method and device based on a cloud platform

Cited By (15)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN109192192A * | 2018-08-10 | 2019-01-11 | Beijing Orion Star Technology Co., Ltd. | Language identification method, device, translator, medium and equipment
US11393476B2 | 2018-08-23 | 2022-07-19 | Google LLC | Automatically determining language for speech recognition of spoken utterance received via an automated assistant interface
WO2020039247A1 * | 2018-08-23 | 2020-02-27 | Google LLC | Automatically determining language for speech recognition of spoken utterance received via an automated assistant interface
CN112262430A * | 2018-08-23 | 2021-01-22 | Google LLC | Automatically determining language for speech recognition of a spoken utterance received via an automated assistant interface
CN109346103A * | 2018-10-30 | 2019-02-15 | Research Institute of Highway, Ministry of Transport | Audio detection method for road tunnel traffic incidents
CN109346103B | 2018-10-30 | 2023-03-28 | Research Institute of Highway, Ministry of Transport | Audio detection method for road tunnel traffic incidents
CN111429924A * | 2018-12-24 | 2020-07-17 | Nuctech Company Limited | Voice interaction method and device, robot and computer-readable storage medium
CN110033756A * | 2019-04-15 | 2019-07-19 | Beijing Dajia Internet Information Technology Co., Ltd. | Language identification method and device, electronic equipment and storage medium
CN110033756B | 2019-04-15 | 2021-03-16 | Beijing Dajia Internet Information Technology Co., Ltd. | Language identification method and device, electronic equipment and storage medium
CN110148399A * | 2019-05-06 | 2019-08-20 | Beijing Orion Star Technology Co., Ltd. | Control method, device, equipment and medium for a smart device
CN111179910A * | 2019-12-17 | 2020-05-19 | Shenzhen Zhuiyi Technology Co., Ltd. | Speech rate recognition method and apparatus, server, and computer-readable storage medium
WO2021208455A1 * | 2020-04-15 | 2021-10-21 | Nanjing University of Posts and Telecommunications | Neural network speech recognition method and system oriented to home spoken environment
CN112669816A * | 2020-12-24 | 2021-04-16 | Beijing Youzhuju Network Technology Co., Ltd. | Model training method, speech recognition method, device, medium and equipment
CN112669816B | 2020-12-24 | 2023-06-02 | Beijing Youzhuju Network Technology Co., Ltd. | Model training method, speech recognition method, device, medium and equipment
CN113761885A * | 2021-03-17 | 2021-12-07 | Zhongke Tianji Data Technology Co., Ltd. | Bayesian LSTM-based language identification method

Also Published As

Publication number Publication date
CN108335693B (en) 2022-02-25

Similar Documents

Publication Publication Date Title
CN108335693A Language identification method and language identification equipment
Cruciani et al. Feature learning for human activity recognition using convolutional neural networks: A case study for inertial measurement unit and audio data
Pouyanfar et al. Multimodal deep learning based on multiple correspondence analysis for disaster management
US9278255B2 (en) System and method for activity recognition
Demertzis et al. Extreme deep learning in biosecurity: the case of machine hearing for marine species identification
Agarwal et al. Performance of deer hunting optimization based deep learning algorithm for speech emotion recognition
Babaee et al. An overview of audio event detection methods from feature extraction to classification
CN111161715B (en) Specific sound event retrieval and positioning method based on sequence classification
WO2022048239A1 (en) Audio processing method and device
Parekh et al. Weakly supervised representation learning for audio-visual scene analysis
Yan et al. A region based attention method for weakly supervised sound event detection and classification
Leonid et al. Classification of Elephant Sounds Using Parallel Convolutional Neural Network.
CN113643723A Speech emotion recognition method based on attention CNN and Bi-GRU fusing visual information
CN109086265A Semantic training method and multi-sense word disambiguation method for short texts
CN110136726A Method, device, system and storage medium for estimating voice gender
KR20230175258A (en) End-to-end speaker separation through iterative speaker embedding
Colonna et al. A comparison of hierarchical multi-output recognition approaches for anuran classification
Zhang et al. Automatic detection and classification of marmoset vocalizations using deep and recurrent neural networks
Tian et al. Sequential deep learning for disaster-related video classification
Yousefi et al. Real-time speaker counting in a cocktail party scenario using attention-guided convolutional neural network
Shen et al. Learning mobile application usage-a deep learning approach
CN112466284B (en) Mask voice identification method
Kalinli et al. Saliency-driven unstructured acoustic scene classification using latent perceptual indexing
CN113870863A (en) Voiceprint recognition method and device, storage medium and electronic equipment
Yang et al. LCSED: A low complexity CNN based SED model for IoT devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220225