CN111210805A - Language identification model training method and device and language identification method and device - Google Patents

Language identification model training method and device and language identification method and device

Info

Publication number
CN111210805A
Authority
CN
China
Prior art keywords
feature vector
sample
feature
voice
vector
Prior art date
Legal status
Pending
Application number
CN201811308721.3A
Other languages
Chinese (zh)
Inventor
梁鸣心
郭庭炜
赵帅江
Current Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201811308721.3A
Publication of CN111210805A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Abstract

The application provides a language identification model training method and device and a language identification method and device, wherein the language identification method comprises the following steps: acquiring a voice to be identified; determining a first feature vector representing the acoustic features of the voice to be identified and second feature vectors respectively corresponding to at least one pronunciation feature of the voice to be identified; and obtaining language information of the voice to be identified based on the first feature vector, the second feature vectors and a pre-trained language identification model. In the embodiment of the application, the first sample feature vector can represent the acoustic features of the first voice sample, and each second sample feature vector can represent one pronunciation feature of the first voice sample, so that the acoustic features and pronunciation features of the input voice can be effectively utilized, and the accuracy of the final language identification result can be improved.

Description

Language identification model training method and device and language identification method and device
Technical Field
The application relates to the technical field of machine learning, in particular to a language identification model training method and device and a language identification method and device.
Background
In recent years, with the continuous popularization of voice products, voice input has been accepted by more and more people as an important man-machine interaction means. However, since languages vary widely from region to region, it is difficult to find an effective universal speech recognition model that handles all the different kinds of speech input. An effective way to solve this problem is to establish a separate speech recognition model for the characteristics of each language, so that different languages are processed in a targeted manner; this requires that, after the input speech is received, the type of language to which the speech belongs be determined first, and the speech then be processed using the speech recognition model corresponding to that language. As an important component of speech processing, language identification is therefore of great significance in practical applications.
Currently, in the field of language identification, Mel-frequency cepstral coefficients (MFCCs) are generally adopted to characterize languages, and the obtained MFCC features are used as the input of a neural network to train the neural network, so as to obtain a language identification model. Although the MFCC method can effectively extract frame-level acoustic features from the input speech, extracting features from each frame independently ignores the relevance between adjacent frames in the input speech, so the extracted features characterize the input speech insufficiently, which greatly limits the accuracy of the final identification result.
Disclosure of Invention
In view of this, an object of the embodiments of the present application is to provide a language identification model training method and apparatus, and a language identification method and apparatus, which can more effectively utilize acoustic features and pronunciation features of an input speech, thereby improving accuracy of a language identification result.
In a first aspect, a language identification method is provided, which includes:
acquiring a voice to be identified;
determining a first feature vector representing the acoustic features of the voice to be identified and a second feature vector corresponding to at least one pronunciation feature of the voice to be identified respectively;
and obtaining language information of the voice to be identified based on the first feature vector, the second feature vector and a pre-trained language identification model.
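For intuition, the flow of the first aspect can be sketched in a few lines of Python. This is a minimal, hypothetical illustration only: the inputs (an MFCC vector for the voice to be identified, one pre-trained extraction network per pronunciation feature, and a pre-trained language identification model) and the names used are assumptions for this sketch, not part of this application.

```python
import numpy as np

def identify_language(mfcc_vector, extraction_nets, language_id_model):
    """Sketch of the identification flow: first feature vector in, language information out."""
    # Second feature vectors: one per pronunciation feature (phoneme, syllable, word, ...),
    # each produced by its own feature vector extraction network.
    second_vectors = [net(mfcc_vector) for net in extraction_nets]
    # Fuse the first (acoustic) and second (pronunciation) feature vectors into a target vector.
    target_vector = np.concatenate([mfcc_vector, *second_vectors])
    # Query the pre-trained language identification model (e.g. a PLDA or neural classifier).
    return language_id_model(target_vector)
```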
In one possible embodiment, the acoustic features include: Mel-frequency cepstral coefficient (MFCC) features; the pronunciation features include: at least one of phoneme features, syllable features and word features.
In a possible implementation manner, the obtaining language information of the speech to be identified based on the first feature vector, the second feature vector, and a pre-trained language identification model includes:
fusing the first feature vector and the second feature vector to generate a target feature vector;
and inputting the target feature vector to the language identification model trained in advance to obtain the language information of the voice to be identified.
In one possible embodiment, fusing the first feature vector and the second feature vector to generate a target feature vector includes:
splicing the first feature vector and the second feature vector to generate the target feature vector; or,
splicing the first feature vector and the second feature vector to form a spliced vector; and extracting low-dimensional transformation vector features of the spliced vector, and generating the target feature vector based on the extracted low-dimensional transformation vector features.
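The two fusion options just listed can be illustrated as follows; treating the low-dimensional transformation as a fixed projection matrix (for example one learned beforehand by PCA or LDA) is an assumption made only for this sketch, the application does not prescribe a particular transform.

```python
import numpy as np

def fuse_by_splicing(first_vec, second_vecs):
    # Option 1: splice (concatenate) the first and second feature vectors directly.
    return np.concatenate([first_vec, *second_vecs])

def fuse_with_low_dim_transform(first_vec, second_vecs, projection):
    # Option 2: splice first, then extract low-dimensional transformation vector features
    # of the spliced vector. `projection` is an assumed (d_out, d_in) matrix; how the
    # transformation is obtained is not fixed by the application.
    spliced = np.concatenate([first_vec, *second_vecs])
    return projection @ spliced
```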
In one possible embodiment, the language identification model is obtained by:
acquiring a plurality of first voice samples and language information of each first voice sample;
determining a first sample feature vector representing acoustic features of each acquired first voice sample and a second sample feature vector corresponding to at least one pronunciation feature of the first voice sample;
and training a language identification model based on the first sample feature vector, the second sample feature vector and language information corresponding to the first voice sample.
In a possible implementation manner, determining second feature vectors respectively corresponding to at least one pronunciation feature of the voice to be identified includes:
and aiming at each pronunciation feature, inputting the first feature vector into a feature vector extraction network corresponding to the pronunciation feature to obtain a second feature vector of the pronunciation feature.
In one possible embodiment, the feature vector extraction network is generated by:
acquiring a plurality of second voice samples and feature labeling information of each second voice sample under the at least one pronunciation feature;
for each obtained second voice sample, determining a third sample feature vector for characterizing the acoustic features of the second voice sample;
and training the feature vector extraction network based on the third sample feature vector and the feature labeling information.
In a possible implementation, the training of the feature vector extraction network based on the third sample feature vector and the feature labeling information includes:
calculating the similarity between the third sample feature vector and the feature labeling information, and comparing the similarity with a preset similarity threshold;
when the similarity is smaller than the preset similarity threshold, adjusting the feature vector extraction network parameters, and based on the adjusted feature vector extraction network, obtaining the third sample feature vector again;
and returning to the operation of calculating the similarity between the third sample feature vector and the feature labeling information until the similarity between the third sample feature vector and the feature labeling information is not less than a preset similarity threshold value.
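A minimal sketch of this threshold-driven loop is given below. Using cosine similarity as the similarity measure and a caller-supplied adjustment step are assumptions made for illustration; the application does not fix either choice.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def train_until_similar(extract, adjust, acoustic_vec, label_vec, threshold=0.9, max_iters=1000):
    """extract: current feature vector extraction network; adjust: one parameter update step."""
    for _ in range(max_iters):
        third_vec = extract(acoustic_vec)              # obtain the third sample feature vector
        if cosine_similarity(third_vec, label_vec) >= threshold:
            break                                      # similarity no longer below the threshold
        adjust(acoustic_vec, label_vec)                # adjust the extraction network parameters
```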
In a possible implementation, the training of the feature vector extraction network based on the third sample feature vector and the feature labeling information includes:
taking any one of the second voice samples which are not trained in the current round as a target second voice sample;
adjusting the parameters of the feature vector extraction network based on the feature labeling information corresponding to the target second voice sample and the third sample feature vector;
taking the target second voice sample as a second voice sample which has completed training in the current round, taking any one of the second voice samples which have not completed training in the current round as a new target second voice sample, extracting a third sample feature vector of the new target second voice sample by using the feature vector extraction network with the adjusted parameters, and returning to the step of adjusting the parameters of the feature vector extraction network based on the feature labeling information and the third sample feature vector corresponding to the target second voice sample;
and repeating the steps until all the second voice samples finish the training of the round, and entering the next training round until the preset model training cutoff condition is met.
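The round-based variant amounts to nested loops over training rounds and over the second voice samples, with one parameter adjustment per sample per round. In the sketch below, `should_stop` stands in for the preset model training cutoff condition and is an assumed callable.

```python
def train_by_rounds(samples, extract, adjust, should_stop):
    """samples: list of (acoustic_vec, labeling_info) pairs, one per second voice sample."""
    while not should_stop():
        # One training round: every second voice sample is used exactly once as the target.
        for acoustic_vec, labeling_info in samples:
            third_vec = extract(acoustic_vec)          # third sample feature vector
            adjust(third_vec, labeling_info)           # adjust the extraction network parameters
        # All samples have finished this round; continue with the next round unless
        # the preset model training cutoff condition is met.
```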
In a possible implementation manner, the adjusting parameters of the feature vector extraction network based on the feature labeling information corresponding to the target second speech sample and the feature vector of the third sample includes:
calculating the similarity between the third sample feature vector of the target second voice sample and the feature labeling information corresponding to the target second voice sample;
comparing the similarity with a preset similarity threshold;
and when the similarity is smaller than the preset similarity threshold, adjusting the parameters of the feature vector extraction network.
In one possible embodiment, the feature vector extraction network comprises a bottleneck feature extraction layer;
determining the second feature vectors respectively corresponding to the at least one pronunciation feature by:
and inputting the first feature vector into a feature vector extraction network, and acquiring the second feature vector from a bottleneck feature extraction layer in the feature vector extraction network.
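One common way to realize such a network is a DNN trained to predict a pronunciation feature (for example, phoneme labels) from MFCC input, with one deliberately narrow hidden layer serving as the bottleneck feature extraction layer; the second (BNF) feature vector is read from that layer. The PyTorch sketch below is illustrative only; the layer sizes and the phoneme-classification head are assumptions, not values specified by the application.

```python
import torch
import torch.nn as nn

class BottleneckExtractor(nn.Module):
    """DNN trained to predict a pronunciation feature (e.g. phoneme labels) from MFCC input;
    the narrow bottleneck layer supplies the second (BNF) feature vector."""
    def __init__(self, mfcc_dim=39, bottleneck_dim=64, num_targets=40):
        super().__init__()
        self.front = nn.Sequential(nn.Linear(mfcc_dim, 512), nn.ReLU(),
                                   nn.Linear(512, 512), nn.ReLU())
        self.bottleneck = nn.Linear(512, bottleneck_dim)   # bottleneck feature extraction layer
        self.back = nn.Sequential(nn.ReLU(), nn.Linear(bottleneck_dim, num_targets))

    def forward(self, mfcc):
        bnf = self.bottleneck(self.front(mfcc))
        return self.back(bnf), bnf                          # (training targets, BNF vector)

# Obtaining the second feature vector: feed the first (MFCC) feature vector in and
# take the bottleneck layer's output, discarding the classification head.
net = BottleneckExtractor()
_, second_vector = net(torch.randn(1, 39))
```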
In a possible embodiment, the first feature vector is a mel-frequency cepstral coefficient MFCC vector, and the second feature vector is a bottleneck feature BNF vector.
In one possible embodiment, the language identification model includes: a Probabilistic Linear Discriminant Analysis (PLDA) model, or a neural network model.
In a second aspect, a language identification model training method is provided, which includes:
acquiring a plurality of first voice samples and language information of each first voice sample;
determining a first sample feature vector representing acoustic features of each acquired first voice sample and a second sample feature vector corresponding to at least one pronunciation feature of the first voice sample;
and training a language identification model based on the first sample feature vector, the second sample feature vector and language information corresponding to the first voice sample.
In a possible embodiment, the training of the language identification model based on the first sample feature vector, the second sample feature vector, and the language information corresponding to the first speech sample includes:
fusing the first sample feature vector and the second sample feature vector to generate a target sample feature vector;
and training a language identification model based on the target sample feature vector and the language information corresponding to the first voice sample.
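As a sketch of this training step, the snippet below fits a plain softmax (linear) classifier on the fused target sample feature vectors. The choice of a linear classifier and of stochastic gradient descent is an assumption made for brevity; the application allows, for example, a PLDA model or a neural network.

```python
import torch
import torch.nn as nn

def train_language_id_model(target_vectors, language_labels, num_languages, epochs=10):
    """target_vectors: (N, D) fused sample feature vectors; language_labels: (N,) class ids."""
    model = nn.Linear(target_vectors.shape[1], num_languages)   # softmax classifier over languages
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(target_vectors), language_labels)  # fit language info on fused vectors
        loss.backward()
        opt.step()
    return model
```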
In one possible embodiment, the fusing the first sample feature vector and the second sample feature vector includes:
splicing the first sample feature vector and the second sample feature vector to generate the target sample feature vector, or,
splicing the first sample feature vector and the second sample feature vector to form a spliced vector; and extracting low-dimensional transformation vector features of the spliced vector, and generating the target sample feature vector based on the extracted low-dimensional transformation vector features.
In a possible embodiment, determining a second sample feature vector corresponding to at least one pronunciation feature of the first speech sample respectively includes:
and aiming at each pronunciation feature, inputting the first sample feature vector into a feature vector extraction network corresponding to the pronunciation feature to obtain a second sample feature vector of the pronunciation feature.
In one possible embodiment, the feature vector extraction network is generated by:
acquiring a plurality of second voice samples and feature labeling information of each second voice sample under the at least one pronunciation feature;
for each obtained second voice sample, determining a third sample feature vector for characterizing the acoustic features of the second voice sample;
and training the feature vector extraction network based on the third sample feature vector and the feature labeling information.
In a possible implementation, the training of the feature vector extraction network based on the third sample feature vector and the feature labeling information includes:
calculating the similarity between the third sample feature vector and the feature labeling information, and comparing the similarity with a preset similarity threshold;
when the similarity is smaller than the preset similarity threshold, adjusting the parameters of the feature vector extraction network, and obtaining the third sample feature vector again based on the adjusted feature vector extraction network;
and returning to the operation of calculating the similarity between the third sample feature vector and the feature labeling information until the similarity between the third sample feature vector and the feature labeling information is not less than a preset similarity threshold value.
In a possible implementation, the training of the feature vector extraction network based on the third sample feature vector and the feature labeling information includes:
taking any one of the second voice samples which are not trained in the current round as a target second voice sample;
adjusting the parameters of the feature vector extraction network based on the feature labeling information corresponding to the target second voice sample and the third sample feature vector;
taking the target second voice sample as a second voice sample which has completed training in the current round, taking any one of the second voice samples which have not completed training in the current round as a new target second voice sample, extracting a third sample feature vector of the new target second voice sample by using the feature vector extraction network with the adjusted parameters, and returning to the step of adjusting the parameters of the feature vector extraction network based on the feature labeling information and the third sample feature vector corresponding to the target second voice sample;
and repeating the steps until all the second voice samples finish the training of the round, and entering the next training round until the preset model training cutoff condition is met.
In one possible embodiment, the feature vector extraction network comprises a bottleneck feature extraction layer;
determining the second sample feature vectors respectively corresponding to the at least one pronunciation feature by:
and inputting the first sample feature vector into the feature vector extraction network, and acquiring the second sample feature vector from a bottleneck feature extraction layer in the feature vector extraction network.
In a third aspect, there is provided a language identification apparatus, including:
the voice to be identified acquisition module is used for acquiring the voice to be identified;
the feature vector determination module is used for determining a first feature vector representing the acoustic features of the voice to be identified and a second feature vector corresponding to at least one pronunciation feature of the voice to be identified;
and the language information acquisition module is used for acquiring the language information of the voice to be identified based on the first feature vector, the second feature vector and a pre-trained language identification model.
In one possible embodiment, the acoustic features include: Mel-frequency cepstral coefficient (MFCC) features; the pronunciation features include: at least one of phoneme features, syllable features and word features.
In a possible implementation manner, the language information obtaining module is configured to obtain the language information of the speech to be identified based on the first feature vector, the second feature vector, and a pre-trained language identification model in the following manner:
fusing the first feature vector and the second feature vector to generate a target feature vector;
and inputting the target feature vector to the language identification model trained in advance to obtain the language information of the voice to be identified.
In a possible implementation manner, the language information obtaining module is configured to fuse the first feature vector and the second feature vector to generate a target feature vector by using the following method:
splicing the first feature vector and the second feature vector to generate the target feature vector; or,
splicing the first feature vector and the second feature vector to form a spliced vector; and extracting low-dimensional transformation vector features of the spliced vector, and generating the target feature vector based on the extracted low-dimensional transformation vector features.
In a possible embodiment, the method further comprises: the first model training module is used for obtaining the language identification model by adopting the following modes:
acquiring a plurality of first voice samples and language information of each first voice sample;
determining a first sample feature vector representing acoustic features of each acquired first voice sample and a second sample feature vector corresponding to at least one pronunciation feature of the first voice sample;
and training a language identification model based on the first sample feature vector, the second sample feature vector and language information corresponding to the first voice sample.
In a possible implementation manner, the feature vector determining module is configured to determine second feature vectors respectively corresponding to at least one pronunciation feature of the voice to be identified, by using the following manners:
and aiming at each pronunciation feature, inputting the first feature vector into a feature vector extraction network corresponding to the pronunciation feature to obtain a second feature vector of the pronunciation feature.
In a possible embodiment, the method further comprises: a second model training module, configured to generate the feature vector extraction network in the following manner:
acquiring a plurality of second voice samples and feature labeling information of each second voice sample under the at least one pronunciation feature;
for each obtained second voice sample, determining a third sample feature vector for characterizing the acoustic features of the second voice sample;
and training the feature vector extraction network based on the third sample feature vector and the feature labeling information.
In one possible implementation, the second model training module is configured to perform training of the feature vector extraction network based on the third sample feature vector and the feature labeling information in the following manner:
calculating the similarity between the third sample feature vector and the feature labeling information, and comparing the similarity with a preset similarity threshold;
when the similarity is smaller than the preset similarity threshold, adjusting the feature vector extraction network parameters, and based on the adjusted feature vector extraction network, obtaining the third sample feature vector again;
and returning to the operation of calculating the similarity between the third sample feature vector and the feature labeling information until the similarity between the third sample feature vector and the feature labeling information is not less than a preset similarity threshold value.
In a possible implementation manner, the second model training module is configured to perform training of the feature vector extraction network based on the third sample feature vector and the feature labeling information in the following manner:
taking any one of the second voice samples which are not trained in the current round as a target second voice sample;
adjusting the parameters of the feature vector extraction network based on the feature labeling information corresponding to the target second voice sample and the third sample feature vector;
taking the target second voice sample as a second voice sample which has completed training in the current round, taking any one of the second voice samples which have not completed training in the current round as a new target second voice sample, extracting a third sample feature vector of the new target second voice sample by using the feature vector extraction network with the adjusted parameters, and returning to the step of adjusting the parameters of the feature vector extraction network based on the feature labeling information and the third sample feature vector corresponding to the target second voice sample;
and repeating the steps until all the second voice samples finish the training of the round, and entering the next training round until the preset model training cutoff condition is met.
In a possible implementation manner, the second model training module is configured to adjust parameters of a feature vector extraction network based on feature labeling information corresponding to the target second speech sample and a third sample feature vector by:
calculating the similarity between the third sample feature vector of the target second voice sample and the feature labeling information corresponding to the target second voice sample;
comparing the similarity with a preset similarity threshold;
and when the similarity is smaller than the preset similarity threshold, adjusting the parameters of the feature vector extraction network.
In one possible embodiment, the feature vector extraction network comprises a bottleneck feature extraction layer;
the feature vector determining module is configured to determine the second feature vectors respectively corresponding to the at least one pronunciation feature by:
and inputting the first feature vector into a feature vector extraction network, and acquiring the second feature vector from a bottleneck feature extraction layer in the feature vector extraction network.
In a possible embodiment, the first feature vector is a mel-frequency cepstral coefficient MFCC vector, and the second feature vector is a bottleneck feature BNF vector.
In one possible embodiment, the language identification model includes: a Probabilistic Linear Discriminant Analysis (PLDA) model, or a neural network model.
In a fourth aspect, a language identification model training device is provided, which includes:
the system comprises a sample acquisition module, a voice recognition module and a voice recognition module, wherein the sample acquisition module is used for acquiring a plurality of first voice samples and language information of each first voice sample;
the sample feature vector determination module is used for determining a first sample feature vector representing the acoustic features of each acquired first voice sample and a second sample feature vector corresponding to at least one pronunciation feature of the first voice sample;
and the first training module is used for training a language identification model based on the first sample feature vector, the second sample feature vector and language information corresponding to the first voice sample.
In a possible implementation manner, the first training module is configured to perform language identification model training based on the first sample feature vector, the second sample feature vector, and language information corresponding to the first speech sample in the following manner:
fusing the first sample feature vector and the second sample feature vector to generate a target sample feature vector;
and training a language identification model based on the target sample feature vector and the language information corresponding to the first voice sample.
In a possible implementation, the first training module is configured to fuse the first sample feature vector and the second sample feature vector by:
splicing the first sample feature vector and the second sample feature vector to generate the target sample feature vector, or,
splicing the first sample feature vector and the second sample feature vector to form a spliced vector; and extracting low-dimensional transformation vector features of the spliced vector, and generating the target sample feature vector based on the extracted low-dimensional transformation vector features.
In a possible implementation manner, the sample feature vector determining module is configured to determine a second sample feature vector corresponding to each of at least one pronunciation feature of the first speech sample by:
and aiming at each pronunciation feature, inputting the first sample feature vector into a feature vector extraction network corresponding to the pronunciation feature to obtain a second sample feature vector of the pronunciation feature.
In one possible embodiment, the method comprises the following steps: a second training module, configured to generate the feature vector extraction network in the following manner:
acquiring a plurality of second voice samples and feature labeling information of each second voice sample under the at least one pronunciation feature;
for each obtained second voice sample, determining a third sample feature vector for characterizing the acoustic features of the second voice sample;
and training the feature vector extraction network based on the third sample feature vector and the feature labeling information.
In a possible implementation manner, the second training module is configured to perform training of the feature vector extraction network based on the third sample feature vector and the feature labeling information in the following manner:
calculating the similarity between the third sample feature vector and the feature labeling information, and comparing the similarity with a preset similarity threshold;
when the similarity is smaller than the preset similarity threshold, adjusting the parameters of the feature vector extraction network, and obtaining the third sample feature vector again based on the adjusted feature vector extraction network;
and returning to the operation of calculating the similarity between the third sample feature vector and the feature labeling information until the similarity between the third sample feature vector and the feature labeling information is not less than a preset similarity threshold value.
In a possible implementation manner, the second training module is configured to perform training of the feature vector extraction network based on the third sample feature vector and the feature labeling information in the following manner:
taking any one of the second voice samples which are not trained in the current round as a target second voice sample;
adjusting the parameters of the feature vector extraction network based on the feature labeling information corresponding to the target second voice sample and the third sample feature vector;
taking the target second voice sample as a second voice sample which has completed training in the current round, taking any one of the second voice samples which have not completed training in the current round as a new target second voice sample, extracting a third sample feature vector of the new target second voice sample by using the feature vector extraction network with the adjusted parameters, and returning to the step of adjusting the parameters of the feature vector extraction network based on the feature labeling information and the third sample feature vector corresponding to the target second voice sample;
and repeating the steps until all the second voice samples finish the training of the round, and entering the next training round until the preset model training cutoff condition is met.
In one possible embodiment, the feature vector extraction network comprises a bottleneck feature extraction layer;
the sample feature vector determining module is configured to determine the second sample feature vectors respectively corresponding to the at least one pronunciation feature by:
and inputting the first sample feature vector into the feature vector extraction network, and acquiring the second sample feature vector from a bottleneck feature extraction layer in the feature vector extraction network.
In a fifth aspect, an electronic device is provided, comprising: a processor, a storage medium and a bus, wherein the storage medium stores machine-readable instructions executable by the processor, when an electronic device runs, the processor and the storage medium communicate with each other through the bus, and the processor executes the machine-readable instructions to perform the steps of the language identification method according to any one of the first aspect.
In a sixth aspect, a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, performs the steps of the language identification method according to any one of the first aspect.
In a seventh aspect, an electronic device is provided, which includes: a processor, a storage medium and a bus, wherein the storage medium stores machine-readable instructions executable by the processor, when an electronic device runs, the processor and the storage medium communicate with each other through the bus, and the processor executes the machine-readable instructions to execute the steps of the language identification model training method according to any one of the second aspect.
In an eighth aspect, a computer-readable storage medium is provided, wherein the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to perform the steps of the language identification model training method according to any one of the second aspect.
According to the embodiments of the application, the first sample feature vector and the second sample feature vector of each first voice sample are determined, and the language identification model is trained based on the first sample feature vector, the second sample feature vector and the language information corresponding to the first voice sample. In this process, the first sample feature vector can represent the acoustic features of the first voice sample, and each second sample feature vector can represent one pronunciation feature of the first voice sample, so the acoustic features and pronunciation features of the input voice are effectively utilized and the accuracy of the final language identification result is improved.
In addition, in some embodiments of the present application, the first sample feature vector is an MFCC vector and the second sample feature vector is a BNF sample feature vector: a plurality of BNF features of the input speech are extracted based on different feature vector extraction networks, and the feature vector of the input first speech sample is generated by fusing the complementary BNF and MFCC features, so that the extracted features can comprehensively reflect the characteristics of the first speech sample. On one hand, by combining the ability of BNF to represent the associated information between adjacent frames of the input speech with the ability of MFCC to describe the independent acoustic features of each frame, the ability of the fused target sample feature vector to describe the first speech sample is greatly improved. On the other hand, the various BNF features can comprehensively depict the pronunciation characteristics of the first speech sample from multiple angles (such as phonemes, syllables and the like), which further enhances the ability of the fused features to identify the language.
In addition, in other embodiments, the feature vector extraction network is a DNN. Thanks to the advantage of DNN in incremental learning, these embodiments have strong iterative capability and room for further improvement. When facing rapidly growing data in an online environment, the method can quickly make use of newly added data, so it can be better applied to real scenarios.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a block diagram of a language identification system 100 provided in an embodiment of the present application;
FIG. 2 is a diagram illustrating exemplary hardware and software components of the language identification system 100 provided by an embodiment of the present application;
FIG. 3 is a flowchart illustrating a language identification model training method according to an embodiment of the present application;
fig. 4 is a flowchart illustrating a specific method for generating a feature vector extraction network in the language identification model training method according to the embodiment of the present application;
fig. 5 is a flowchart illustrating a specific method for training a feature vector extraction network in the language identification model training method provided in the embodiment of the present application;
fig. 6 is a flowchart illustrating another specific method for training a feature vector extraction network in the language identification model training method provided in the embodiment of the present application;
fig. 7 is a flowchart illustrating a specific method for performing language identification model training in the language identification model training method according to the embodiment of the present application;
FIG. 8 is a flowchart illustrating a language identification method provided in the third embodiment of the present application;
FIG. 9 is a schematic diagram illustrating a language identification model training apparatus 900 according to a fourth embodiment of the present application;
fig. 10 is a schematic diagram of a computer device 1000 according to a fifth embodiment of the present application;
fig. 11 is a schematic diagram illustrating a language identification model training apparatus 1100 according to a sixth embodiment of the present application;
fig. 12 shows a schematic diagram of a computer device 2000 according to a seventh embodiment of the present application.
Detailed Description
To make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Additionally, it should be understood that the schematic drawings are not necessarily drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. One skilled in the art, under the guidance of this application, may add one or more other operations to, or remove one or more operations from, the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
To enable those skilled in the art to use the present disclosure, the following embodiments are presented in conjunction with a specific application scenario, hailing a taxi by voice. It will be apparent to those skilled in the art that the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the application. Although the present application is described primarily in the context of language identification of speech input to taxi-hailing software, it should be understood that this is but one exemplary embodiment. The application can be applied to any other traffic type. For example, the present application may be applied to different transportation system environments, including terrestrial, marine, or airborne, among others, or any combination thereof. The vehicle of the transportation system may include a taxi, a private car, a ride-sharing vehicle, a bus, a train, a bullet train, a high-speed rail, a subway, a ship, an airplane, a spacecraft, a hot air balloon, or an unmanned vehicle, etc., or any combination thereof. The present application may also include any service system that requires speech language identification, for example a chat system or a shopping system, for uses such as speech translation, speech-to-text, and speech recognition. Applications of the system or method of the present application may include web pages, plug-ins for browsers, client terminals, customization systems, internal analysis systems, or artificial intelligence robots, among others, or any combination thereof.
It should be noted that in the embodiments of the present application, the term "comprising" is used to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.
One aspect of the present application relates to a language identification model training method. The method can train the language identification model based on a first feature vector representing the voice to be identified and second feature vectors corresponding to at least one pronunciation feature of the voice to be identified; the extracted first and second feature vectors characterize the voice to be identified more strongly, which improves the identification accuracy of the language identification model.
It is noted that, before the filing of the present application, speech was characterized only by MFCC features; this characterization of the speech was insufficient, resulting in low accuracy when discriminating the language of the input speech.
FIG. 1 is a block diagram of a system 100 for an application scenario in accordance with some embodiments of the present application. For example, the system 100 may be an online transportation service platform for transportation services such as taxi cab, designated drive service, express, carpool, bus service, driver rental, or regular service, or any combination thereof. The system 100 may include one or more of a server 110, a network 120, a service requester terminal 130, a service provider terminal 140, and a database 150, and the server 110 may include a processor therein that performs operations of instructions.
The language identification method of the embodiment of the present application may be applied to any one or more of the server 110, the service requester terminal 130, and the service provider terminal 140 of the system 100 described above.
In some embodiments, the server 110 implementing the language identification method may be a single server or a server group. The set of servers can be centralized or distributed (e.g., the servers 110 can be a distributed system). In some embodiments, the server 110 may be local or remote to the terminal. For example, the server 110 may access information and/or data stored in the service requester terminal 130, the service provider terminal 140, or the database 150, or any combination thereof, via the network 120. As another example, the server 110 may be directly connected to at least one of the service requester terminal 130, the service provider terminal 140, and the database 150 to access stored information and/or data. In some embodiments, the server 110 may be implemented on a cloud platform; by way of example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud (community cloud), a distributed cloud, an inter-cloud, a multi-cloud, and the like, or any combination thereof. In some embodiments, the server 110 may be implemented on an electronic device 200 having one or more of the components shown in FIG. 2 in the present application.
In some embodiments, the server 110 may include a processor. The processor may process information and/or data related to the service request to perform one or more of the functions described herein. For example, the processor may determine the target vehicle based on a service request obtained from the service requester terminal 130. In some embodiments, a processor may include one or more processing cores (e.g., a single-core processor (S) or a multi-core processor (S)). Merely by way of example, a processor may include a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), an Application Specific Instruction Set Processor (ASIP), a Graphics Processing Unit (GPU), a Physical Processing Unit (PPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a microcontroller unit, a Reduced Instruction Set Computer (RISC), a microprocessor, or the like, or any combination thereof.
The network 120 described above may be used for the exchange of information and/or data. In some embodiments, one or more components (e.g., server 110, service requester terminal 130, service provider terminal 140, and database 150) in the system 100 described above may send information and/or data to other components. For example, the server 110 may obtain a service request from the service requester terminal 130 via the network 120. In some embodiments, the network 120 may be any type of wired or wireless network, or a combination thereof. Merely by way of example, the network 120 may include a wired network, a wireless network, a fiber optic network, a telecommunications network, an intranet, the internet, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a Public Switched Telephone Network (PSTN), a Bluetooth network, a ZigBee network, a Near Field Communication (NFC) network, or the like, or any combination thereof. In some embodiments, the network 120 may include one or more network access points. For example, the network 120 may include wired or wireless network access points, such as base stations and/or network switching nodes, through which one or more components of the system 100 described above may connect to the network 120 to exchange data and/or information.
In some embodiments, the language identification method described above may be implemented by the service requester terminal 130, and the user of the service requester terminal 130 may be a person other than the actual demander of the service. For example, the user a of the service requester terminal 130 may use the service requester terminal 130 to initiate a service request for the service actual demander B (for example, the user a may call a car for his friend B), or receive service information or instructions from the server 110. In some embodiments, the user of the service provider terminal 140 may be the actual provider of the service or may be another person than the actual provider of the service. For example, user C of the service provider terminal 140 may use the service provider terminal 140 to receive a service request serviced by the service provider entity D (e.g., user C may pick up an order for driver D employed by user C), and/or information or instructions from the server 110.
In some embodiments, the service requester terminal 130 may comprise a mobile device, a tablet computer, a laptop computer, or a built-in device in a motor vehicle, etc., or any combination thereof. In some embodiments, the mobile device may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof. In some embodiments, the smart home devices may include smart lighting devices, control devices for smart electrical devices, smart monitoring devices, smart televisions, smart cameras, or walkie-talkies, or the like, or any combination thereof. In some embodiments, the wearable device may include a smart bracelet, smart footwear, smart glasses, a smart helmet, a smart watch, a smart garment, a smart backpack, a smart accessory, and the like, or any combination thereof. In some embodiments, the smart mobile device may include a smartphone, a Personal Digital Assistant (PDA), a gaming device, a navigation device, or a point of sale (POS) device, or the like, or any combination thereof. In some embodiments, the virtual reality device and/or the augmented reality device may include a virtual reality helmet, virtual reality glasses, a virtual reality patch, an augmented reality helmet, augmented reality glasses, an augmented reality patch, or the like, or any combination thereof. For example, the virtual reality device and/or augmented reality device may include various virtual reality products and the like. In some embodiments, the built-in devices in the motor vehicle may include an on-board computer, an on-board television, and the like. In some embodiments, the service requester terminal 130 may be a device having a location technology for locating the location of the service requester and/or service requester terminal.
In some embodiments, the above-described language identification method may be implemented by the service provider terminal 140, and the service provider terminal 140 may be a similar or identical device to the service requester terminal 130. In some embodiments, the service provider terminal 140 may be a device with location technology for locating the location of the service provider and/or the service provider terminal. In some embodiments, the service requester terminal 130 and/or the service provider terminal 140 may communicate with other locating devices to determine the location of the service requester, service requester terminal 130, service provider, or service provider terminal 140, or any combination thereof. In some embodiments, the service requester terminal 130 and/or the service provider terminal 140 may transmit the location information to the server 110.
Database 150 may store data and/or instructions. In some embodiments, the database 150 may store data obtained from the service requester terminal 130 and/or the service provider terminal 140. In some embodiments, database 150 may store data and/or instructions for the exemplary methods described herein. In some embodiments, database 150 may include mass storage, removable storage, volatile read-write memory, or Read-Only Memory (ROM), among others, or any combination thereof. By way of example, mass storage may include magnetic disks, optical disks, solid state drives, and the like; removable memory may include flash drives, floppy disks, optical disks, memory cards, zip disks, tapes, and the like; volatile read-write memory may include Random Access Memory (RAM); the RAM may include Dynamic RAM (DRAM), Double Data Rate Synchronous Dynamic RAM (DDR SDRAM), Static RAM (SRAM), Thyristor-Based Random Access Memory (T-RAM), Zero-capacitor RAM (Z-RAM), and the like. By way of example, ROMs may include Mask Read-Only Memories (MROMs), Programmable ROMs (PROMs), Erasable Programmable ROMs (EPROMs), Electrically Erasable Programmable ROMs (EEPROMs), Compact Disc ROMs (CD-ROMs), Digital Versatile Disc ROMs (DVD-ROMs), and the like. In some embodiments, database 150 may be implemented on a cloud platform. By way of example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, across clouds, multiple clouds, or the like, or any combination thereof.
In some embodiments, a database 150 may be connected to the network 120 to communicate with one or more components of the system 100 described above (e.g., the server 110, the service requester terminal 130, the service provider terminal 140, etc.). One or more components in system 100 may access data or instructions stored in database 150 via network 120. In some embodiments, the database 150 may be directly connected to one or more components in the language identification system 100 (e.g., the server 110, the service requestor terminal 130, the service provider terminal 140, etc.); alternatively, in some embodiments, database 150 may also be part of server 110.
In some embodiments, one or more components in the system 100 (e.g., the server 110, the service requestor terminal 130, the service provider terminal 140, etc.) may have access to the database 150. In some embodiments, one or more components in system 100 may read and/or modify information related to a service requestor, a service provider, or the public, or any combination thereof, when certain conditions are met. For example, server 110 may read and/or modify information for one or more users after receiving a service request. As another example, the service provider terminal 140 may access information related to the service requester when receiving the service request from the service requester terminal 130, but the service provider terminal 140 may not modify the related information of the service requester.
Fig. 2 illustrates a schematic diagram of exemplary hardware and software components of an electronic device 200 of a server 110, a service requester terminal 130, a service provider terminal 140, which may implement the concepts of the present application, according to some embodiments of the present application. For example, a processor may be used on the electronic device 200 and to perform the functions herein.
Electronic device 200 may be a general purpose computer or a special purpose computer, both of which may be used to implement the language identification methods of the present application. Although only a single computer is shown, for convenience, the functions described herein may be implemented in a distributed fashion across multiple similar platforms to balance processing loads.
For example, the electronic device 200 may include a network port 210 connected to a network, one or more processors 220 for executing program instructions, a communication bus 230, and a different form of storage medium 240, such as a disk, ROM, or RAM, or any combination thereof. Illustratively, the computer platform may also include program instructions stored in ROM, RAM, or other types of non-transitory storage media, or any combination thereof. The method of the present application may be implemented in accordance with these program instructions. The electronic device 200 also includes an Input/Output (I/O) interface 250 between the computer and other Input/Output devices (e.g., keyboard, display screen).
For ease of illustration, only one processor is depicted in the electronic device 200. However, it should be noted that the electronic device 200 in the present application may also comprise a plurality of processors, and thus the steps performed by one processor described in the present application may also be performed by a plurality of processors in combination or individually. For example, if the processor of the electronic device 200 executes steps A and B, it should be understood that steps A and B may also be executed by two different processors together or separately in one processor. For example, a first processor performs step A and a second processor performs step B, or the first processor and the second processor perform steps A and B together.
Embodiment one
Referring to fig. 3, an embodiment of the present application provides a language identification model training method, which includes S301 to S303. The following describes S301 to S303:
S301: Obtain a plurality of first voice samples and the language information of each first voice sample.
In a specific implementation, when the language identification model is obtained based on the language identification model training method provided in the embodiment of the present application, the first speech samples include first speech samples of all languages that the language identification model is able to recognize. For example, if the language identification model is used for identifying multiple languages such as Chinese, English and French, the first speech samples include a plurality of first speech samples corresponding to Chinese, English, French and so on, respectively; if the language identification model is used for distinguishing dialects such as Mandarin, Minnan, Cantonese, Shandong, Sichuan and Tibetan, the first speech samples include a plurality of first speech samples corresponding to each of these dialects; if the language identification model is used for distinguishing varieties such as Guangdong Minnan, Hong Kong Minnan and Fujian Minnan, the first speech samples include a plurality of first speech samples corresponding to each of these varieties.
The language information of the first voice sample is the related information of the specific language to which each first voice sample belongs. In an alternative embodiment, the specific language of the first speech sample may be obtained by manual tagging.
In another alternative embodiment, each first voice sample may be obtained by collecting voice of a user speaking in a specific language, and obtaining language information of each first voice sample according to the specific language used by the user.
S302: and for each acquired first voice sample, determining a first sample feature vector for characterizing acoustic features of the first voice sample and a second sample feature vector respectively corresponding to at least one pronunciation feature of the first voice sample.
In a specific implementation:
A: The first sample feature vector is a vector capable of characterizing the first speech sample, obtained after feature extraction and a dimensionality reduction operation are performed on the first speech sample. In an alternative embodiment, the first sample feature vector characterizing the acoustic features of the first speech sample is an MFCC (Mel-frequency cepstral coefficient) vector.
Here, the MFCC vector for the first speech sample may be obtained by:
(1) Pre-emphasis: pass the sampled digital voice signal through a high-pass filter to obtain a pre-emphasized voice signal. Pre-emphasis boosts the high-frequency part of the speech signal and flattens its spectrum, so that the spectrum can be obtained with a comparable signal-to-noise ratio over the whole band from low to high frequencies. It also compensates for the suppression of the high-frequency part of the speech signal caused by the vocal cords and lips during phonation, highlighting the high-frequency formants.
(2) Framing and windowing: frame division means dividing the speech into a plurality of frames, and firstly, collecting N sampling points in the speech signal into an observation unit, which is called a frame. In order to avoid the excessive variation of two adjacent frames, there is an overlapping region between two adjacent frames.
Windowing: because a speech signal is only approximately stationary over short durations, each frame is multiplied by a window function and the values outside the window are set to 0, which reduces the signal discontinuity that may occur at the two ends of each frame. Commonly used window functions include the rectangular window, the Hamming window and the Hanning window; the Hamming window is often chosen because of its frequency-domain characteristics.
(3) Fast Fourier transform: the characteristics of a signal are usually hard to observe in the time domain, so the signal is transformed into an energy distribution in the frequency domain, where different energy distributions can represent the characteristics of different voices. After framing and windowing, a fast Fourier transform is applied to each frame to obtain its spectrum, and the power spectrum of the voice signal is obtained by taking the squared modulus of the spectrum.
(4) Mel filtering: the spectrum obtained by the fast Fourier transform is filtered with a Mel filter bank.
(5) Logarithm: take the logarithm of the energies output by the Mel filter bank.
(6) Discrete cosine transform: apply a discrete cosine transform to the log energies output in step (5) to obtain the MFCC vector of the first voice sample. A code sketch of this pipeline is given after this list.
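The sketch below implements steps (1) to (6) in Python with NumPy/SciPy. The sampling rate, frame length, hop size, pre-emphasis coefficient and the numbers of Mel filters and cepstral coefficients are illustrative assumptions, not values prescribed by this application.

    import numpy as np
    from scipy.fftpack import dct

    def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_mfcc=13):
        # (1) Pre-emphasis: first-order high-pass filter
        emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
        # (2) Framing with overlap, then Hamming windowing
        n_frames = 1 + (len(emphasized) - frame_len) // hop
        idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
        frames = emphasized[idx] * np.hamming(frame_len)
        # (3) Fast Fourier transform and power spectrum
        power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
        # (4) Mel filter bank (triangular filters spaced evenly on the Mel scale)
        mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
        hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
        bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
        fbank = np.zeros((n_mels, n_fft // 2 + 1))
        for m in range(1, n_mels + 1):
            l, c, r = bins[m - 1], bins[m], bins[m + 1]
            fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
            fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
        # (5) Logarithm of the filter-bank energies
        log_energy = np.log(power @ fbank.T + 1e-10)
        # (6) Discrete cosine transform -> one MFCC vector per frame
        return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_mfcc]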
B: the second sample feature vector can characterize a pronunciation feature of the first speech sample in a certain aspect, such as a phone feature, a syllable feature, a word feature, and the like. The second sample feature vector is not an independent frame feature, but a feature that can characterize the correlation relationship between adjacent frames.
In an alternative embodiment, the second sample feature vector of the first speech sample may be generated as follows: for each pronunciation feature, input the first sample feature vector into the feature vector extraction network corresponding to that pronunciation feature to obtain the second sample feature vector for that pronunciation feature.
Here, the pronunciation features and the feature vector extraction network have a one-to-one correspondence; if the pronunciation characteristics include several types, a corresponding number of feature vector extraction networks need to be trained in advance.
Specifically, referring to fig. 4, an embodiment of the present application provides a specific method for generating the feature vector extraction network, where the method includes S401 to S403:
S401: Obtain a plurality of second voice samples and the feature labeling information of each second voice sample under the at least one pronunciation feature.
Here, the second speech samples cover the same number of languages, and the same languages, as the first speech samples; for example, if the first speech samples include Mandarin, Minnan, Cantonese, Shandong, Sichuan and Tibetan, the second speech samples also include Mandarin, Minnan, Cantonese, Shandong, Sichuan and Tibetan. The second speech samples may be identical to the first speech samples, may partially overlap with them, or may be completely different from them. The second speech samples used for training the feature vector extraction networks corresponding to different pronunciation features are the same.
The feature labeling information of the second voice sample under each pronunciation feature can be labeled manually or automatically labeled based on a machine learning mode.
S402: for each second speech sample acquired, a third sample feature vector characterizing acoustic features of the second speech sample is determined.
Here, the third sample feature vector of the second speech sample can characterize the second speech sample. In an alternative embodiment, the third sample feature vector may also be a MFCC vector. The extraction method of the feature vector of the third sample may refer to the extraction method of the feature vector of the first sample, and is not described herein again.
S403: and training the feature vector extraction network based on the third sample feature vector and the feature labeling information.
Here, training the feature vector extraction network based on the third sample feature vector and the feature labeling information means inputting the third sample feature vector into a basic feature vector extraction network, performing feature learning on the third sample feature vector with that network, and outputting the feature information extracted from the third sample feature vector that represents the corresponding pronunciation feature. The feature information extracted by the basic feature vector extraction network should be consistent with the corresponding feature labeling information. Therefore, when the feature information is inconsistent with the corresponding feature labeling information, the parameters of the feature vector extraction network are adjusted, so that after the adjustment the feature information re-extracted from the third sample feature vector moves toward the feature labeling information. Finally, the basic feature vector extraction network is trained on the third sample feature vectors corresponding to the second voice samples and the corresponding feature labeling information to generate the feature vector extraction network.
In an alternative embodiment, the feature vector extraction network comprises a Deep Neural Network (DNN). The deep neural network extracts, based on the third sample feature vector, the feature vector corresponding to the pronunciation feature for the second voice sample.
It should be noted that, when extracting the second sample feature vector for the first speech sample based on a DNN, after the first sample feature vector is input to the DNN, the output of the last network output layer of the DNN may be taken as the second sample feature vector; alternatively, a Bottleneck layer may be included in the DNN, and after the first sample feature vector is input into the DNN, the Bottleneck layer outputs a Bottleneck Feature (BNF) that is taken as the second sample feature vector. Here, the bottleneck layer is not the last layer of the DNN, but is located toward the middle or rear of the DNN. The bottleneck layer reduces the dimension of the features extracted by the DNN, and using the BNF as the second sample feature vector reduces the training difficulty of the language identification model.
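A minimal sketch of such a DNN with a bottleneck layer, written with PyTorch. The layer sizes, the frame-level MFCC input and the choice of training targets (e.g. phoneme classes) are illustrative assumptions, not part of this application.

    import torch
    import torch.nn as nn

    class BottleneckDNN(nn.Module):
        # DNN whose middle-rear hidden layer is a low-dimensional bottleneck;
        # its activation is used as the BNF instead of the final output layer.
        def __init__(self, in_dim=13, n_targets=40, bnf_dim=40):
            super().__init__()
            self.front = nn.Sequential(
                nn.Linear(in_dim, 512), nn.ReLU(),
                nn.Linear(512, 512), nn.ReLU(),
            )
            self.bottleneck = nn.Linear(512, bnf_dim)   # bottleneck layer
            self.back = nn.Sequential(
                nn.ReLU(), nn.Linear(bnf_dim, 512), nn.ReLU(),
                nn.Linear(512, n_targets),              # e.g. phoneme posteriors
            )

        def forward(self, x):
            return self.back(self.bottleneck(self.front(x)))

        def extract_bnf(self, x):
            # second (sample) feature vector: output of the bottleneck layer
            with torch.no_grad():
                return self.bottleneck(self.front(x))

    net = BottleneckDNN()
    mfcc_frames = torch.randn(100, 13)     # first (sample) feature vectors, per frame
    bnf = net.extract_bnf(mfcc_frames)     # 100 x 40 bottleneck features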
When training the feature vector extraction network based on the third sample feature vector and the feature labeling information, any one of the following manners may be adopted:
The first manner is as follows: referring to fig. 5, a first specific method for training the feature vector extraction network based on the third sample feature vector and the feature labeling information provided in the embodiment of the present application includes:
S501: Calculate the similarity between the third sample feature vector and the feature labeling information, and compare the similarity with a preset similarity threshold.
Here, the similarity includes any one of: Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, normalized Euclidean distance, Mahalanobis distance, cosine similarity, Hamming distance, Jaccard distance or Jaccard similarity coefficient, correlation coefficient or correlation distance, and information entropy.
S502: and detecting whether the similarity is greater than or equal to a preset similarity threshold value. If not, skipping to S503; if so, the process is ended.
S503: adjusting the feature vector extraction network parameters, and based on the adjusted feature vector extraction network, re-obtaining the feature vector of the third voice sample; jumping to S501.
Referring to fig. 6, the embodiment of the present application further provides a second specific method for training the feature vector extraction network based on the third sample feature vector and the feature labeling information, including:
S601: Take any one of the second voice samples that have not completed training in the current round as the target second voice sample;
S602: Adjust the parameters of the feature vector extraction network based on the feature labeling information and the third sample feature vector corresponding to the target second voice sample.
Here, the parameters of the feature vector extraction network may be adjusted in the following manner:
calculating the similarity between the third sample feature vector of the target second voice sample and the corresponding feature marking information;
comparing the similarity with a preset similarity threshold;
and when the similarity is smaller than the preset similarity threshold, adjusting the parameters of the feature vector extraction network.
S603: detecting whether a second voice sample which does not complete the training in the current round exists; if yes, jumping to S604; if not, then jump to S606.
S604: and taking the target second voice sample as a second voice sample of which the training is finished in the current round, and taking any one of the second voice samples of which the training is not finished in the current round as a new target second voice sample.
S605: extracting a third sample feature vector of the new target second voice sample by using the feature vector extraction network after the parameters are adjusted; returning to S602.
S606: and entering the next round of training.
Until a preset model training cutoff condition is met.
Here, the model training cutoff condition may be any one of the following conditions:
(1) the number of training rounds reaches the preset number of rounds. At this time, the feature vector extraction network obtained in the last round of training is used as the trained feature vector extraction network.
(2) Test, with a test sample set, the feature vector extraction network obtained after the current round of training. If the percentage of test samples in the test sample set for which the similarity between the third sample feature vector extracted by the feature vector extraction network and the feature labeling information of the test sample meets the preset similarity requirement reaches a first preset percentage, training of the feature vector extraction network is stopped, and the feature vector extraction network obtained in the last round is taken as the trained feature vector extraction network.
(3) In the current round of training, the number of the second voice samples with the similarity greater than or equal to the preset similarity threshold reaches a second preset percentage.
Based on the above process, training of the feature vector extraction network is completed.
When the second sample feature vectors corresponding to at least one pronunciation feature of the first voice samples are determined, the first sample feature vectors can be input into a pre-trained feature vector extraction network, and the second sample feature vectors of each first voice sample under various pronunciation features are obtained.
After determining, for each first speech sample, a first sample feature vector characterizing the first speech sample and a second sample feature vector corresponding to at least one pronunciation feature of the first speech sample, respectively, the method further includes:
s303: and training a language identification model based on the first sample feature vector, the second sample feature vector and language information corresponding to the first voice sample.
In a specific implementation, referring to fig. 7, the following method may be used to implement the training of the language identification model:
S701: Fuse the first sample feature vector and the second sample feature vector to generate a target sample feature vector.
Here, the fusing the first sample feature vector and the second sample feature vector may be performed in any one of the following manners:
(1) and splicing the first sample feature vector and the second sample feature vector to generate the target sample feature vector.
When the first sample feature vector and the second sample feature vector are spliced, the splicing sequence can be specifically set according to actual needs.
For example, the first sample feature vector includes: a, the second sample feature vector includes B and C, then when splicing the three, possible concatenation mode includes: any one of ABC, ACB, BAC, BCA, CAB and CBA.
In addition, when the first sample feature vector and the second sample feature vector are spliced, one of them may be placed inside the other. For example, if the first sample feature vector includes A and the second sample feature vector includes B, the first sample feature vector A may be cut at any position to form two sub-vectors A1 and A2, and A and B may then be spliced as A1BA2.
It should be noted here that the splicing manner of the first sample feature vector and the second sample feature vector corresponding to different first speech samples should be consistent.
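The following sketch illustrates splicing manner (1) above with NumPy; the vector dimensions and the cut position are arbitrary assumptions used only for illustration.

    import numpy as np

    A = np.random.randn(13)     # first sample feature vector (e.g. an MFCC-derived vector)
    B = np.random.randn(40)     # second sample feature vector for pronunciation feature 1
    C = np.random.randn(40)     # second sample feature vector for pronunciation feature 2

    target_abc = np.concatenate([A, B, C])        # one possible order: ABC

    cut = 6                                       # cut A at an arbitrary position
    A1, A2 = A[:cut], A[cut:]
    target_a1ba2 = np.concatenate([A1, B, A2])    # interleaved order: A1 B A2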
(2) Splicing the first sample characteristic vector and the second sample characteristic vector to form a spliced vector; and extracting low-dimensional transformation vector features of the spliced vector, and generating the target sample feature vector based on the extracted low-dimensional transformation vector features.
Here, the manner of stitching the first sample feature vector and the second sample feature vector is similar to that in the above stitching, and is not described herein again.
After the first sample feature vector and the second sample feature vector are spliced to form a spliced vector, the low-dimensional transformation vector features of the spliced vector are extracted based on the identity vector (i-vector) method. The i-vector method extracts the low-dimensional total variability factor vector of the input voice as its feature vector through total variability space analysis, and thereby realizes the dimension reduction of the spliced vector.
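Estimating a full total-variability (i-vector) space is beyond a short example; purely to illustrate the dimension-reduction step applied to the spliced vectors, a linear projection such as PCA can stand in for it. This substitution is an assumption made only for illustration and is not the i-vector method itself.

    import numpy as np
    from sklearn.decomposition import PCA

    spliced = np.random.randn(1000, 93)            # spliced vectors for 1000 first speech samples
    pca = PCA(n_components=32)                     # low-dimensional transformation (stand-in)
    target_vectors = pca.fit_transform(spliced)    # 1000 x 32 target sample feature vectors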
S702: and training a language identification model based on the target sample feature vector and the language information corresponding to the first voice sample.
Here, the training of the language identification model based on the target sample feature vector and the language information corresponding to the first voice sample is to input the target sample feature vector into the basic language identification model, and perform feature learning on the target sample feature vector by using the basic language identification model. The basic language identification model can output the language prediction result corresponding to the feature vector of the target sample, and then the parameters of the basic language identification model are adjusted according to the language prediction result and the corresponding language information.
In an alternative embodiment, the language identification model includes: a Probabilistic Linear Discriminant Analysis (PLDA) model, or a neural network model.
Specifically, when the language identification model includes the neural network model, the process of training the language identification model based on the feature vector of the target sample and the language information corresponding to the first voice sample specifically includes:
inputting the target sample feature vector into a language identification model to obtain a language prediction result corresponding to the target sample feature vector;
determining the cross entropy loss corresponding to the first voice sample based on the language prediction result and the corresponding language information;
and adjusting parameters of the language identification model according to the cross entropy loss of the first voice sample.
Here, the cross entropy loss may be obtained using a cross entropy loss function.
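When the language identification model is a neural network, a single training step following the three items above might be sketched as follows; the layer sizes, the optimizer and the random batch are illustrative assumptions.

    import torch
    import torch.nn as nn

    n_languages = 6
    classifier = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, n_languages))
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

    target_vecs = torch.randn(16, 32)                     # target sample feature vectors (a batch)
    language_ids = torch.randint(0, n_languages, (16,))   # language information (labels)

    logits = classifier(target_vecs)                      # language prediction result
    loss = criterion(logits, language_ids)                # cross entropy loss
    optimizer.zero_grad()
    loss.backward()                                       # adjust the model parameters
    optimizer.step()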
When the language identification model comprises a PLDA model, by analogy with the field of voiceprint recognition, it is assumed that the training data consist of speech in I languages, where each language has J segments of first speech samples, and the j-th segment of the i-th language is the first speech sample x_ij. According to factor analysis, the generative model of x_ij is:

x_ij = μ + F h_i + G w_ij + ε_ij

where μ + F h_i depends only on the language and is called the signal part, describing the differences between languages; G w_ij + ε_ij describes the differences between first speech samples of the same language and is the noise part. Here μ denotes the overall mean, F and G are subspace feature matrices, and ε_ij represents the residual noise term.

The two matrices F and G contain the basis factors of their respective latent variable spaces and can be regarded as the eigenvectors of those spaces: each column of F corresponds to an eigenvector of the between-class (between-language) space, and each column of G corresponds to an eigenvector of the within-class space. The vectors h_i and w_ij are feature representations in the respective spaces; for example, h_i can be regarded as the representation of x_ij in the language space, and w_ij as its representation within the language. In the identification and scoring stage, the greater the likelihood that the h_i features of two utterances are the same, the more confident one can be that the two utterances belong to the same language.

The training process of the model is the process of solving for the parameter μ, the matrices F and G, and the noise term ε_ij on the basis of the target sample feature vectors.
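A minimal numerical illustration of the generative equation above; the dimensions and random values are arbitrary assumptions, and actual PLDA training estimates μ, F, G and the noise statistics, typically by an EM procedure.

    import numpy as np

    D, dF, dG = 32, 8, 8               # feature dimension and subspace dimensions (assumed)
    mu = np.zeros(D)
    F = np.random.randn(D, dF)         # between-language (inter-class) subspace
    G = np.random.randn(D, dG)         # within-language (intra-class) subspace

    h_i = np.random.randn(dF)          # latent factor shared by all samples of language i
    w_ij = np.random.randn(dG)         # per-sample latent factor
    eps_ij = 0.1 * np.random.randn(D)  # residual noise

    x_ij = mu + F @ h_i + G @ w_ij + eps_ij   # generative model of the j-th sample of language i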
According to the embodiment of the application, the first sample characteristic vector and the second sample characteristic vector of each first voice sample are determined, the language identification model is trained based on the first sample characteristic vector and the second sample characteristic vector and the language information corresponding to the first voice sample, in the process, the first sample characteristic vector can represent the acoustic characteristics of the first voice sample, each second sample characteristic vector can represent one pronunciation characteristic of the first voice sample, the acoustic characteristics and the pronunciation characteristics of input voice are effectively utilized, and the accuracy rate of the final language identification result is improved.
In addition, in some embodiments of the present application, the first sample feature vector is an MFCC vector, and the second sample feature vector is a BNF sample feature vector, which extracts a plurality of BNF features of the input speech based on different feature vector extraction networks, and generates a feature vector of the input first speech sample by fusing the BNF features and the MFCC features having complementarity, so that the extracted features can comprehensively reflect the characteristics of the first speech sample. On one hand, by combining the representation capability of BNF on the associated information between adjacent frames of the input voice and the description capability of MFCC on the independent acoustic feature of each frame, the description capability of the fused target sample feature vector on the first voice sample is greatly improved. On the other hand, the pronunciation characteristics of the first speech sample can be comprehensively depicted from multiple angles (such as phonemes, syllables and the like) by the aid of various different BNF characteristics, and the capability of language identification by the aid of the fusion characteristics can be further enhanced.
In addition, in other embodiments, the feature vector extraction network is a DNN; by virtue of the advantage of DNN in incremental learning, these embodiments have strong iterative capability and room for further improvement. When facing rapidly growing data in an online environment, newly added data can be utilized quickly, so the method is better suited to real-world scenarios.
Example two
The second embodiment of the present application further provides another language identification model training method, and on the basis of the first embodiment, the language identification model training method provided in the second embodiment of the present application further includes:
(1) Perform a de-muting (silence removal) operation on the first voice sample based on a preset non-silence energy threshold.
Specifically, based on a preset non-silence energy threshold, performing a de-silence operation on a first voice sample, including:
intercepting voice sections with preset lengths from the first voice sample for multiple times according to preset step lengths;
calculating the energy value of each sampling point in the intercepted voice section;
comparing the energy value of each sampling point with a preset non-silent energy threshold value;
if the number of the sampling points with the energy value smaller than the non-silent energy threshold value in the voice section reaches the preset percentage of the total number of the sampling points in the intercepted voice section, taking the intercepted voice section as a silent voice section;
de-muting the first speech sample based on the muted speech segments.
Here, the de-muting operation may be performed on the first speech sample based on the muted speech segments in the following manner: removing the silent speech segments from the first speech sample.
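A rough sketch of the energy-threshold silence removal described above; the segment length, step length, energy definition, threshold and percentage are assumed values.

    import numpy as np

    def remove_silence(samples, seg_len=400, step=160, energy_thr=1e-4, pct=0.9):
        keep = np.ones(len(samples), dtype=bool)
        for start in range(0, len(samples) - seg_len + 1, step):
            seg = samples[start:start + seg_len]          # intercepted voice section
            energy = seg ** 2                             # energy value of each sampling point
            low = np.count_nonzero(energy < energy_thr)
            if low >= pct * seg_len:                      # mostly below the non-silence threshold
                keep[start:start + seg_len] = False       # mark as a silent segment
        return samples[keep]                              # de-muted first voice sample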
Alternatively, the method further comprises:
(2) Performing a de-muting operation on the first sample feature vector based on a preset non-silence energy threshold.
Specifically, based on a preset non-mute energy threshold, performing a de-mute operation on a first sample feature vector, including:
intercepting a preset number of elements from the first sample feature vector for multiple times according to a preset step length;
calculating the energy value of each element in the intercepted preset number of elements;
comparing the energy value of each element with a preset non-silent energy threshold value;
if the quantity of the elements with the energy value smaller than the non-mute energy threshold value in the preset quantity of elements reaches a preset percentage of the total quantity of the elements in the intercepted preset quantity of elements, taking the intercepted preset quantity of elements as elements corresponding to the mute section;
and de-muting the first voice sample based on the element corresponding to the mute section.
The de-muting operation is realized by deleting, from the first sample feature vector, the elements corresponding to the silent segments.
With this embodiment it is possible to remove the silence segments of the first speech sample, i.e. to remove the invalid parts of the first sample feature vector.
In addition, the operation of de-muting the second voice sample or the third sample feature vector corresponding to the second voice sample can be performed in the same manner as described above. The specific operation manner is similar to the above manner, and is not described herein again.
Example three
Referring to fig. 8, a third embodiment of the present application further provides a language identification method, where the method includes S801 to S803:
S801: Acquire the voice to be identified.
Here, the way of acquiring the voice to be identified may differ according to the application scenario of the language identification method. For example, when the language identification method is applied to a voice taxi-hailing scenario, the acquiring device may be the service requester terminal 130 shown in fig. 1. A voice input key is provided in the service requester terminal 130; the voice input key can be triggered by a user, and the service requester terminal 130 acquires the voice input by the user after the voice input key is triggered. The voice input by the user is the voice to be identified.
S802: and determining a first feature vector representing the acoustic features of the voice to be identified and a second feature vector corresponding to at least one pronunciation feature of the voice to be identified respectively.
Here, the acoustic features include: mel-frequency cepstrum coefficients MFCC features; the pronunciation features include: at least one of phoneme characteristics, syllable characteristics and word characteristics.
The specific obtaining manner of the first feature vector of the speech to be identified is similar to the obtaining manner of the first sample feature vector of the first speech sample in the first embodiment, and is not described herein again.
The manner of obtaining the second feature vectors corresponding to the at least one pronunciation feature of the voice to be identified is similar to the manner of obtaining the second sample feature vectors corresponding to the at least one pronunciation feature of the first voice sample in the first embodiment, and is not described herein again.
S803: and obtaining language information of the voice to be identified based on the first feature vector, the second feature vector and a pre-trained language identification model.
Here, the language identification model trained in advance may be obtained by the language identification model training method provided in any one of the first embodiment and the second embodiment of the present application, and details thereof are not repeated herein.
When the language information of the voice to be identified is obtained based on the first feature vector, the second feature vector and a pre-trained language identification model, the first feature vector and the second feature vector are fused to generate a target feature vector, and the target feature vector is input into the language identification model obtained by the language identification model training method provided by the embodiment of the application to obtain the language information of the voice to be identified.
Specifically, the manner of fusing the first feature vector and the second feature vector is similar to the manner of fusing the first sample feature vector and the second sample feature vector, and is not repeated again.
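Putting S801 to S803 together, an inference pass might be sketched as follows, reusing the mfcc and BottleneckDNN sketches given earlier in this description; mean-pooling frame-level features into utterance-level vectors and the interface of the classifier are assumptions made only for illustration.

    import numpy as np
    import torch

    def identify_language(waveform, extraction_nets, classifier):
        # S801/S802: acoustic feature and pronunciation features of the voice to be identified
        feats = mfcc(waveform)                                      # frame-level MFCCs (sketch above)
        first_vec = feats.mean(axis=0)                              # utterance-level first feature vector
        frames = torch.tensor(feats, dtype=torch.float32)
        second_vecs = [net.extract_bnf(frames).mean(dim=0).numpy()  # one BNF vector per pronunciation feature
                       for net in extraction_nets]
        # S803: fuse the vectors and query the pre-trained language identification model
        target_vec = np.concatenate([first_vec] + second_vecs)
        logits = classifier(torch.tensor(target_vec, dtype=torch.float32))
        return int(torch.argmax(logits))                            # index of the predicted language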
In the embodiment of the application, the language information of the voice to be identified is obtained by determining the first feature vector and the second feature vector of the voice to be identified and based on the first feature vector and the second feature vector and the pre-trained language identification model.
In addition, in some embodiments of the present application, the first feature vector is an MFCC vector, and the second feature vector is a BNF sample feature vector, which extracts a plurality of BNF features of the input speech based on different feature vector extraction networks, and generates a feature vector of the input speech to be identified by fusing the BNF features and the MFCC features having complementarity, so that the extracted features can comprehensively reflect the characteristics of the speech to be identified. On one hand, by combining the representation capability of BNF on the associated information between adjacent frames of the input voice and the description capability of MFCC on the independent acoustic feature of each frame, the description capability of the fused features on the voice to be identified is greatly improved. On the other hand, the pronunciation characteristics of the speech to be identified can be comprehensively depicted from multiple angles (such as phonemes, syllables and the like) by using multiple different BNF characteristics, and the language identification capability of the fused target characteristic vector can be further enhanced.
In addition, in other embodiments, the feature vector extraction network is a DNN; by virtue of the advantage of DNN in incremental learning, these embodiments have strong iterative capability and room for further improvement. When facing rapidly growing data in an online environment, newly added data can be utilized quickly, so the method is better suited to real-world scenarios.
Based on the same inventive concept, the embodiment of the present application further provides a language identification model training device corresponding to the language identification model training method, and as the principle of solving the problem of the device in the embodiment of the present application is similar to the language identification model training method in the embodiment of the present application, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated.
Example four
As shown in fig. 9, a language identification model training apparatus 900 according to a fourth embodiment of the present application includes: a sample acquisition module 910, a sample feature vector determination module 920, and a first training module 930.
The sample obtaining module 910 is configured to obtain a plurality of first voice samples and language information of each of the first voice samples.
The sample feature vector determining module 920 is configured to determine, for each obtained first voice sample, a first sample feature vector characterizing acoustic features of the first voice sample, and a second sample feature vector corresponding to at least one pronunciation feature of the first voice sample.
A first training module 930, configured to train a language identification model based on the first sample feature vector, the second sample feature vector, and language information corresponding to the first voice sample.
In a possible implementation manner, the first training module 930 is specifically configured to perform a language identification model training based on the first sample feature vector, the second sample feature vector, and language information corresponding to the first speech sample, in the following manner:
fusing the first sample feature vector and the second sample feature vector to generate a target sample feature vector;
and training a language identification model based on the target sample feature vector and the language information corresponding to the first voice sample.
In one possible implementation, the first training module 930 is specifically configured to fuse the first sample feature vector and the second sample feature vector by:
splicing the first sample feature vector and the second sample feature vector to generate the target sample feature vector, or,
splicing the first sample characteristic vector and the second sample characteristic vector to form a spliced vector; and extracting low-dimensional transformation vector features of the spliced vector, and generating the target sample feature vector based on the extracted low-dimensional transformation vector features.
In a possible implementation manner, the sample feature vector determining module 920 is specifically configured to determine the second sample feature vectors corresponding to the at least one pronunciation feature of the first speech sample respectively by using the following manners:
and aiming at each pronunciation feature, inputting the first sample feature vector into a feature vector extraction network corresponding to the pronunciation feature to obtain a second sample feature vector of the pronunciation feature.
In a possible embodiment, the method further comprises: the second training module 940 is specifically configured to generate the feature vector extraction network by:
acquiring a plurality of second voice samples and feature labeling information of each second voice sample under the at least one pronunciation feature;
for each obtained second voice sample, determining a third sample feature vector for characterizing the acoustic features of the second voice sample;
and training the feature vector extraction network based on the third sample feature vector and the feature labeling information.
In a possible implementation manner, the second training module 940 is specifically configured to perform training of the feature vector extraction network based on the third sample feature vector and the feature labeling information in the following manner:
calculating the similarity between the feature vector of the third sample and the feature marking information, and comparing the similarity with a preset similarity threshold;
when the similarity is smaller than the preset similarity threshold, adjusting the parameters of the feature vector extraction network, and re-obtaining the third sample feature vector based on the adjusted feature vector extraction network;
and returning to the operation of calculating the similarity between the third sample feature vector and the feature labeling information until the similarity between the third sample feature vector and the feature labeling information is not less than a preset similarity threshold value.
In a possible implementation manner, the second training module 940 is specifically configured to perform training of the feature vector extraction network based on the third sample feature vector and the feature labeling information in the following manner:
taking any one of the second voice samples which are not trained in the current round as a target second voice sample;
adjusting the parameters of a feature vector extraction network based on the feature labeling information corresponding to the target second voice sample and the feature vector of the third sample;
taking the target second voice sample as a second voice sample that has completed training in the current round, taking any one of the second voice samples that have not completed training in the current round as a new target second voice sample, extracting a third sample feature vector of the new target second voice sample by using the feature vector extraction network with adjusted parameters, and returning to the step of adjusting the parameters of the feature vector extraction network based on the feature labeling information and the third sample feature vector corresponding to the target second voice sample;
and repeating the steps until all the second voice samples finish the training of the round, and entering the next training round until the preset model training cutoff condition is met.
In one possible embodiment, the feature vector extraction network comprises a bottleneck feature extraction layer;
the sample feature vector determining module 920 is specifically configured to determine the second sample feature vectors corresponding to the at least one pronunciation feature respectively by using the following methods:
and inputting the first sample feature vector into the feature vector extraction network, and acquiring the second sample feature vector from a bottleneck feature extraction layer in the feature vector extraction network.
In one possible embodiment, the language identification model includes: a Probabilistic Linear Discriminant Analysis (PLDA) model, or a neural network model.
In a possible embodiment, the first sample feature vector is a mel-frequency cepstral coefficient MFCC vector, and the second sample feature vector is a bottleneck BNF sample feature vector.
In one possible embodiment, the pronunciation features include: at least one of phoneme characteristics, syllable characteristics and word characteristics.
According to the embodiment of the application, the first sample characteristic vector and the second sample characteristic vector of each first voice sample are determined, the language identification model is trained based on the first sample characteristic vector and the second sample characteristic vector and the language information corresponding to the first voice sample, in the process, the first sample characteristic vector can represent the acoustic characteristics of the first voice sample, each second sample characteristic vector can represent one pronunciation characteristic of the first voice sample, the acoustic characteristics and the pronunciation characteristics of input voice are effectively utilized, and the accuracy rate of the final language identification result is improved.
In addition, in some embodiments of the present application, the first sample feature vector is an MFCC vector, and the second sample feature vector is a BNF sample feature vector, which extracts a plurality of BNF features of the input speech based on different feature vector extraction networks, and generates a feature vector of the input first speech sample by fusing the BNF features and the MFCC features having complementarity, so that the extracted features can comprehensively reflect the characteristics of the first speech sample. On one hand, by combining the representation capability of BNF on the associated information between adjacent frames of the input voice and the description capability of MFCC on the independent acoustic feature of each frame, the description capability of the fused target sample feature vector on the first voice sample is greatly improved. On the other hand, the pronunciation characteristics of the first speech sample can be comprehensively depicted from multiple angles (such as phonemes, syllables and the like) by the aid of various different BNF characteristics, and the capability of language identification by the aid of the fusion characteristics can be further enhanced.
In addition, in other embodiments, the feature vector extraction network is a DNN; by virtue of the advantage of DNN in incremental learning, these embodiments have strong iterative capability and room for further improvement. When facing rapidly growing data in an online environment, newly added data can be utilized quickly, so the method is better suited to real-world scenarios.
Example five
Corresponding to the language identification model training method in fig. 3, an embodiment of the present application further provides a computer device 1000, and as shown in fig. 10, a schematic structural diagram of the computer device 1000 provided in the embodiment of the present application includes:
the apparatus includes a memory 1010, a processor 1020, and a computer program stored in the memory 1010 and executable on the processor 1020, wherein the processor 1020 implements the steps of the language identification model training method when executing the computer program.
Specifically, the memory 1010 and the processor 1020 can be general memories and processors, which are not limited herein, and when the processor 1020 runs a computer program stored in the memory 1010, the language identification model training method can be executed, so as to solve the problem that the speech features are only characterized by MFCC, and the speech characterization capability is insufficient, which results in a low accuracy rate of identifying the language of the input speech, thereby achieving a more effective utilization of the acoustic features and pronunciation features of the input speech, and improving the accuracy rate of the language identification result.
Corresponding to the language identification model training method in fig. 3, an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by the processor 1020 to perform the steps of the language identification model training method.
Specifically, the storage medium can be a general storage medium, such as a mobile disk, a hard disk, and the like, and when a computer program on the storage medium is run, the language identification model training method can be executed, so that the problem that the accuracy of identifying the language of the input speech is low due to insufficient voice description capability because the characteristics of the speech are only described by MFCC is solved, and the acoustic characteristics and the pronunciation characteristics of the input speech are more effectively utilized, thereby improving the accuracy of the language identification result.
Based on the same inventive concept, the embodiment of the present application further provides a language identification device corresponding to the language identification method, and since the principle of solving the problem of the device in the embodiment of the present application is similar to that of the language identification method in the embodiment of the present application, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated.
Example six
As shown in fig. 11, a language identification apparatus 1100 according to a sixth embodiment of the present application includes: a to-be-identified speech obtaining module 1110, a feature vector determining module 1120, and a language information obtaining module 1130.
A to-be-identified voice obtaining module 1110 configured to obtain a to-be-identified voice;
a feature vector determining module 1120, configured to determine a first feature vector characterizing acoustic features of the speech to be identified, and a second feature vector corresponding to at least one pronunciation feature of the speech to be identified;
a language information obtaining module 1130, configured to obtain language information of the speech to be identified based on the first feature vector, the second feature vector, and a pre-trained language identification model.
In an alternative embodiment, the acoustic features include: mel-frequency cepstrum coefficients MFCC features; the pronunciation features include: at least one of phoneme characteristics, syllable characteristics and word characteristics.
In a possible implementation manner, the language information obtaining module 1130 is configured to obtain the language information of the speech to be identified based on the first feature vector, the second feature vector, and a pre-trained language identification model in the following manner:
fusing the first feature vector and the second feature vector to generate a target feature vector;
and inputting the target feature vector to the language identification model trained in advance to obtain the language information of the voice to be identified.
In a possible implementation manner, the language information obtaining module 1130 is configured to fuse the first feature vector and the second feature vector to generate a target feature vector by using the following method:
splicing the first feature vector and the second feature vector to generate the target feature vector; alternatively, the first and second electrodes may be,
splicing the first feature vector and the second feature vector to form a spliced vector; and extracting low-dimensional transformation vector features of the spliced vector, and generating the target feature vector based on the extracted low-dimensional transformation vector features.
In a possible embodiment, the method further comprises: a first model training module 1140, configured to obtain the language identification model in the following manner:
acquiring a plurality of first voice samples and language information of each first voice sample;
determining a first sample feature vector representing acoustic features of each acquired first voice sample and a second sample feature vector corresponding to at least one pronunciation feature of the first voice sample;
and training a language identification model based on the first sample feature vector, the second sample feature vector and language information corresponding to the first voice sample.
In a possible embodiment, the feature vector determining module 1120 is configured to determine the second feature vectors respectively corresponding to at least one pronunciation feature of the voice to be identified, by using the following manners:
and aiming at each pronunciation feature, inputting the first feature vector into a feature vector extraction network corresponding to the pronunciation feature to obtain a second feature vector of the pronunciation feature.
In a possible implementation, the system further includes a second model training module 1150, configured to generate the feature vector extraction network by:
acquiring a plurality of second voice samples and feature labeling information of each second voice sample under the at least one pronunciation feature;
for each obtained second voice sample, determining a third sample feature vector for characterizing the acoustic features of the second voice sample;
and training the feature vector extraction network based on the third sample feature vector and the feature labeling information.
In a possible implementation, the second model training module 1150 is configured to perform training of the feature vector extraction network based on the third sample feature vector and the feature labeling information in the following manner:
calculating the similarity between the feature vector of the third sample and the feature marking information, and comparing the similarity with a preset similarity threshold;
when the similarity is smaller than the preset similarity threshold, adjusting the feature vector extraction network parameters, and based on the adjusted feature vector extraction network, obtaining the third sample feature vector again;
and returning to the operation of calculating the similarity between the third sample feature vector and the feature labeling information until the similarity between the third sample feature vector and the feature labeling information is not less than a preset similarity threshold value.
In a possible implementation, the second model training module 1150 is configured to perform training of the feature vector extraction network based on the third sample feature vector and the feature labeling information in the following manner:
taking any one of the second voice samples which are not trained in the current round as a target second voice sample;
adjusting the parameters of a feature vector extraction network based on the feature labeling information corresponding to the target second voice sample and the feature vector of the third sample;
taking the target second voice sample as a second voice sample which completes training in the current round, taking any one of the second voice samples which do not complete training in the current round as a new target second voice sample, extracting a third sample feature vector of the new target second voice sample by using the feature vector extraction network after the parameters are adjusted, returning the feature labeling information and the third sample feature vector corresponding to the target second voice sample, and adjusting the feature vector to extract network parameters;
and repeating the steps until all the second voice samples finish the training of the round, and entering the next training round until the preset model training cutoff condition is met.
In a possible implementation manner, the second model training module 1150 is configured to adjust parameters of a feature vector extraction network based on feature labeling information corresponding to the target second speech sample and a feature vector of a third sample in the following manner:
calculating the similarity between a third sample feature vector of a target second voice sample and feature marking information corresponding to the target second voice sample;
comparing the similarity with a preset similarity threshold;
and when the similarity is smaller than the preset similarity threshold, adjusting the parameters of the feature vector extraction network.
In one possible embodiment, the feature vector extraction network comprises a bottleneck feature extraction layer;
a feature vector determining module 1120, configured to determine that the at least one pronunciation feature respectively corresponds to the second feature vector by:
and inputting the first feature vector into a feature vector extraction network, and acquiring the second feature vector from a bottleneck feature extraction layer in the feature vector extraction network.
In a possible embodiment, the first feature vector is a mel-frequency cepstral coefficient MFCC vector, and the second feature vector is a bottleneck feature BNF vector.
In one possible embodiment, the language identification model includes: a Probabilistic Linear Discriminant Analysis (PLDA) model, or a neural network model.
In the embodiment of the application, the language information of the voice to be identified is obtained by determining the first feature vector and the second feature vector of the voice to be identified and based on the first feature vector and the second feature vector and the pre-trained language identification model.
In addition, in some embodiments of the present application, the first feature vector is an MFCC vector, and the second feature vector is a BNF sample feature vector, which extracts a plurality of BNF features of the input speech based on different feature vector extraction networks, and generates a feature vector of the input speech to be identified by fusing the BNF features and the MFCC features having complementarity, so that the extracted features can comprehensively reflect the characteristics of the speech to be identified. On one hand, by combining the representation capability of BNF on the associated information between adjacent frames of the input voice and the description capability of MFCC on the independent acoustic feature of each frame, the description capability of the fused features on the voice to be identified is greatly improved. On the other hand, the pronunciation characteristics of the speech to be identified can be comprehensively depicted from multiple angles (such as phonemes, syllables and the like) by using multiple different BNF characteristics, and the language identification capability of the fused target characteristic vector can be further enhanced.
In addition, in other embodiments, the feature vector extraction network is a DNN; by virtue of the advantage of DNN in incremental learning, these embodiments have strong iterative capability and room for further improvement. When facing rapidly growing data in an online environment, newly added data can be utilized quickly, so the method is better suited to real-world scenarios.
Example seven
Corresponding to the language identification method in fig. 8, an embodiment of the present application further provides a computer device 2000, and as shown in fig. 12, a schematic structural diagram of the computer device 2000 provided in the embodiment of the present application includes:
the apparatus includes a memory 2010, a processor 2020, and a computer program stored in the memory 2010 and executable on the processor 2020, wherein the processor 2020 implements the steps of the language identification method when executing the computer program.
Specifically, the memory 2010 and the processor 2020 can be general memories and processors, which are not specifically limited herein, and when the processor 2020 runs a computer program stored in the memory 2010, the language identification method can be executed, so as to solve the problem that the speech features are only characterized by the MFCC, and the speech characterization capability is insufficient, which results in a low accuracy rate for identifying the language of the input speech, and further achieve the effect of more effectively utilizing the acoustic features and pronunciation features of the input speech, thereby improving the accuracy rate of the language identification result.
Corresponding to the language identification method in fig. 8, an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by the processor 2020 to perform the steps of the language identification method.
Specifically, the storage medium can be a general storage medium, such as a mobile disk, a hard disk, and the like, and when a computer program on the storage medium is run, the language identification method can be executed, so that the problem that the accuracy of identifying the language of the input voice is low due to the fact that the voice feature is only characterized by MFCC and the characterization capability of the voice is insufficient is solved, and the acoustic feature and the pronunciation feature of the input voice are more effectively utilized, so that the accuracy of the language identification result is improved.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the method embodiments and are not described in detail in this application.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division into modules is merely a logical division, and other divisions are possible in actual implementation; for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection of devices or modules through some communication interfaces, and may be electrical, mechanical or in other forms.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application, or the portion thereof that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only of specific embodiments of the present application, but the protection scope of the present application is not limited thereto; any changes or substitutions that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (46)

1. A language identification method, comprising:
acquiring a voice to be identified;
determining a first feature vector representing the acoustic features of the voice to be identified and a second feature vector corresponding to at least one pronunciation feature of the voice to be identified respectively;
and obtaining language information of the voice to be identified based on the first feature vector, the second feature vector and a pre-trained language identification model.
2. The method of claim 1, wherein the acoustic features comprise: Mel-frequency cepstral coefficient (MFCC) features; and the pronunciation features comprise: at least one of phoneme features, syllable features and word features.
3. The method according to claim 1, wherein obtaining language information of the speech to be identified based on the first feature vector and the second feature vector and a pre-trained language identification model comprises:
fusing the first feature vector and the second feature vector to generate a target feature vector;
and inputting the target feature vector to the language identification model trained in advance to obtain the language information of the voice to be identified.
4. The method of claim 3, wherein fusing the first feature vector and the second feature vector to generate a target feature vector comprises:
splicing the first feature vector and the second feature vector to generate the target feature vector; or,
splicing the first feature vector and the second feature vector to form a spliced vector; and extracting low-dimensional transformation vector features of the spliced vector, and generating the target feature vector based on the extracted low-dimensional transformation vector features.
5. The method of claim 1, wherein said language identification model is obtained by:
acquiring a plurality of first voice samples and language information of each first voice sample;
determining a first sample feature vector representing acoustic features of each acquired first voice sample and a second sample feature vector corresponding to at least one pronunciation feature of the first voice sample;
and training a language identification model based on the first sample feature vector, the second sample feature vector and language information corresponding to the first voice sample.
6. The method according to claim 1, wherein determining the second feature vectors respectively corresponding to at least one pronunciation feature of the voice to be identified comprises:
and aiming at each pronunciation feature, inputting the first feature vector into a feature vector extraction network corresponding to the pronunciation feature to obtain a second feature vector of the pronunciation feature.
7. The method of claim 6, wherein the feature vector extraction network is generated by:
acquiring a plurality of second voice samples and feature labeling information of each second voice sample under the at least one pronunciation feature;
for each obtained second voice sample, determining a third sample feature vector for characterizing the acoustic features of the second voice sample;
and training the feature vector extraction network based on the third sample feature vector and the feature labeling information.
8. The method of claim 7, wherein the training of the feature vector extraction network based on the third sample feature vector and the feature labeling information comprises:
calculating the similarity between the third sample feature vector and the feature labeling information, and comparing the similarity with a preset similarity threshold;
when the similarity is smaller than the preset similarity threshold, adjusting parameters of the feature vector extraction network, and obtaining the third sample feature vector again based on the adjusted feature vector extraction network;
and returning to the operation of calculating the similarity between the third sample feature vector and the feature labeling information until the similarity between the third sample feature vector and the feature labeling information is not less than the preset similarity threshold.
9. The method of claim 7, wherein training the feature vector extraction network based on the third sample feature vector and the feature labeling information comprises:
taking any one of the second voice samples which are not trained in the current round as a target second voice sample;
adjusting parameters of the feature vector extraction network based on the feature labeling information and the third sample feature vector corresponding to the target second voice sample;
taking the target second voice sample as a second voice sample which has completed training in the current round, taking any one of the second voice samples which have not completed training in the current round as a new target second voice sample, extracting a third sample feature vector of the new target second voice sample by using the feature vector extraction network with the adjusted parameters, and returning to the step of adjusting parameters of the feature vector extraction network based on the feature labeling information and the third sample feature vector corresponding to the target second voice sample;
and repeating the above until all the second voice samples have completed training in the current round, and then entering the next training round, until a preset model training cutoff condition is met.
10. The method of claim 9, wherein the adjusting parameters of a feature vector extraction network based on the feature labeling information corresponding to the target second speech sample and a third sample feature vector comprises:
calculating the similarity between the third sample feature vector of the target second voice sample and the feature labeling information corresponding to the target second voice sample;
comparing the similarity with a preset similarity threshold;
and when the similarity is smaller than the preset similarity threshold, adjusting the parameters of the feature vector extraction network.
11. The method of claim 6, wherein the feature vector extraction network comprises a bottleneck feature extraction layer;
determining the second feature vectors respectively corresponding to the at least one pronunciation feature by:
and inputting the first feature vector into a feature vector extraction network, and acquiring the second feature vector from a bottleneck feature extraction layer in the feature vector extraction network.
12. The method according to any of claims 1-11, wherein the first feature vector is a mel-frequency cepstral coefficient (MFCC) vector and the second feature vector is a bottleneck feature (BNF) vector.
13. The method according to any one of claims 1-11, wherein said language identification model comprises: a Probabilistic Linear Discriminant Analysis (PLDA) model, or a neural network model.
14. A language identification model training method is characterized by comprising the following steps:
acquiring a plurality of first voice samples and language information of each first voice sample;
determining a first sample feature vector representing acoustic features of each acquired first voice sample and a second sample feature vector corresponding to at least one pronunciation feature of the first voice sample;
and training a language identification model based on the first sample feature vector, the second sample feature vector and language information corresponding to the first voice sample.
15. The method according to claim 14, wherein training a language identification model based on the first and second sample feature vectors and language information corresponding to the first speech sample comprises:
fusing the first sample feature vector and the second sample feature vector to generate a target sample feature vector;
and training a language identification model based on the target sample feature vector and the language information corresponding to the first voice sample.
16. The method of claim 15, wherein fusing the first sample feature vector and the second sample feature vector comprises:
splicing the first sample feature vector and the second sample feature vector to generate the target sample feature vector, or,
splicing the first sample feature vector and the second sample feature vector to form a spliced vector; and extracting low-dimensional transformation vector features of the spliced vector, and generating the target sample feature vector based on the extracted low-dimensional transformation vector features.
17. The method of claim 14, wherein determining a second sample feature vector corresponding to each of at least one pronunciation feature of the first speech sample comprises:
and aiming at each pronunciation feature, inputting the first sample feature vector into a feature vector extraction network corresponding to the pronunciation feature to obtain a second sample feature vector of the pronunciation feature.
18. The method of claim 17, wherein the feature vector extraction network is generated by:
acquiring a plurality of second voice samples and feature labeling information of each second voice sample under the at least one pronunciation feature;
for each obtained second voice sample, determining a third sample feature vector for characterizing the acoustic features of the second voice sample;
and training the feature vector extraction network based on the third sample feature vector and the feature labeling information.
19. The method of claim 18, wherein training the feature vector extraction network based on the third sample feature vector and the feature labeling information comprises:
calculating the similarity between the third sample feature vector and the feature labeling information, and comparing the similarity with a preset similarity threshold;
when the similarity is smaller than the preset similarity threshold, adjusting parameters of the feature vector extraction network, and obtaining the third sample feature vector again based on the adjusted feature vector extraction network;
and returning to the operation of calculating the similarity between the third sample feature vector and the feature labeling information until the similarity between the third sample feature vector and the feature labeling information is not less than the preset similarity threshold.
20. The method of claim 18, wherein training the feature vector extraction network based on the third sample feature vector and the feature labeling information comprises:
taking any one of the second voice samples which are not trained in the current round as a target second voice sample;
adjusting parameters of the feature vector extraction network based on the feature labeling information and the third sample feature vector corresponding to the target second voice sample;
taking the target second voice sample as a second voice sample which has completed training in the current round, taking any one of the second voice samples which have not completed training in the current round as a new target second voice sample, extracting a third sample feature vector of the new target second voice sample by using the feature vector extraction network with the adjusted parameters, and returning to the step of adjusting parameters of the feature vector extraction network based on the feature labeling information and the third sample feature vector corresponding to the target second voice sample;
and repeating the above until all the second voice samples have completed training in the current round, and then entering the next training round, until a preset model training cutoff condition is met.
21. The method of claim 17, wherein the feature vector extraction network comprises a bottleneck feature extraction layer;
determining the second sample feature vectors respectively corresponding to the at least one pronunciation feature by:
and inputting the first sample feature vector into the feature vector extraction network, and acquiring the second sample feature vector from a bottleneck feature extraction layer in the feature vector extraction network.
22. A language identification apparatus, comprising:
the voice to be identified acquisition module is used for acquiring the voice to be identified;
the feature vector determination module is used for determining a first feature vector representing the acoustic features of the voice to be identified and a second feature vector corresponding to at least one pronunciation feature of the voice to be identified;
and the language information acquisition module is used for acquiring the language information of the voice to be identified based on the first feature vector, the second feature vector and a pre-trained language identification model.
23. The apparatus of claim 22, wherein the acoustic features comprise: Mel-frequency cepstral coefficient (MFCC) features; and the pronunciation features comprise: at least one of phoneme features, syllable features and word features.
24. The apparatus according to claim 22, wherein the language information obtaining module is configured to obtain the language information of the speech to be identified based on the first feature vector, the second feature vector, and a pre-trained language identification model in the following manner:
fusing the first feature vector and the second feature vector to generate a target feature vector;
and inputting the target feature vector to the language identification model trained in advance to obtain the language information of the voice to be identified.
25. The apparatus according to claim 24, wherein the language information obtaining module is configured to fuse the first feature vector and the second feature vector to generate a target feature vector by:
splicing the first feature vector and the second feature vector to generate the target feature vector; or,
splicing the first feature vector and the second feature vector to form a spliced vector; and extracting low-dimensional transformation vector features of the spliced vector, and generating the target feature vector based on the extracted low-dimensional transformation vector features.
26. The apparatus of claim 22, further comprising: the first model training module is used for obtaining the language identification model by adopting the following modes:
acquiring a plurality of first voice samples and language information of each first voice sample;
determining a first sample feature vector representing acoustic features of each acquired first voice sample and a second sample feature vector corresponding to at least one pronunciation feature of the first voice sample;
and training a language identification model based on the first sample feature vector, the second sample feature vector and language information corresponding to the first voice sample.
27. The apparatus according to claim 22, wherein the feature vector determination module is configured to determine the second feature vectors respectively corresponding to at least one pronunciation feature of the voice to be identified by:
and aiming at each pronunciation feature, inputting the first feature vector into a feature vector extraction network corresponding to the pronunciation feature to obtain a second feature vector of the pronunciation feature.
28. The apparatus of claim 27, further comprising: a second model training module, configured to generate the feature vector extraction network in the following manner:
acquiring a plurality of second voice samples and feature labeling information of each second voice sample under the at least one pronunciation feature;
for each obtained second voice sample, determining a third sample feature vector for characterizing the acoustic features of the second voice sample;
and training the feature vector extraction network based on the third sample feature vector and the feature labeling information.
29. The apparatus of claim 28, wherein the second model training module is configured to perform training of the feature vector extraction network based on the third sample feature vector and the feature labeling information in the following manner:
calculating the similarity between the third sample feature vector and the feature labeling information, and comparing the similarity with a preset similarity threshold;
when the similarity is smaller than the preset similarity threshold, adjusting parameters of the feature vector extraction network, and obtaining the third sample feature vector again based on the adjusted feature vector extraction network;
and returning to the operation of calculating the similarity between the third sample feature vector and the feature labeling information until the similarity between the third sample feature vector and the feature labeling information is not less than the preset similarity threshold.
30. The apparatus of claim 28, wherein the second model training module is configured to perform training of the feature vector extraction network based on the third sample feature vector and the feature labeling information in the following manner:
taking any one of the second voice samples which are not trained in the current round as a target second voice sample;
adjusting parameters of the feature vector extraction network based on the feature labeling information and the third sample feature vector corresponding to the target second voice sample;
taking the target second voice sample as a second voice sample which has completed training in the current round, taking any one of the second voice samples which have not completed training in the current round as a new target second voice sample, extracting a third sample feature vector of the new target second voice sample by using the feature vector extraction network with the adjusted parameters, and returning to the step of adjusting parameters of the feature vector extraction network based on the feature labeling information and the third sample feature vector corresponding to the target second voice sample;
and repeating the above until all the second voice samples have completed training in the current round, and then entering the next training round, until a preset model training cutoff condition is met.
31. The apparatus of claim 30, wherein the second model training module is configured to adjust parameters of a feature vector extraction network based on feature labeling information corresponding to the target second speech sample and a third sample feature vector by:
calculating the similarity between the third sample feature vector of the target second voice sample and the feature labeling information corresponding to the target second voice sample;
comparing the similarity with a preset similarity threshold;
and when the similarity is smaller than the preset similarity threshold, adjusting the parameters of the feature vector extraction network.
32. The apparatus of claim 27, wherein the feature vector extraction network comprises a bottleneck feature extraction layer;
the feature vector determining module is configured to determine the second feature vectors corresponding to the at least one pronunciation feature respectively by:
and inputting the first feature vector into a feature vector extraction network, and acquiring the second feature vector from a bottleneck feature extraction layer in the feature vector extraction network.
33. The apparatus according to any of the claims 22-32, wherein the first feature vector is a mel-frequency cepstral coefficient (MFCC) vector and the second feature vector is a bottleneck feature (BNF) vector.
34. The apparatus according to any one of claims 22-32, wherein said language identification model comprises: a Probabilistic Linear Discriminant Analysis (PLDA) model, or a neural network model.
35. A language identification model training device is characterized by comprising:
the system comprises a sample acquisition module, a voice recognition module and a voice recognition module, wherein the sample acquisition module is used for acquiring a plurality of first voice samples and language information of each first voice sample;
the sample feature vector determination module is used for determining a first sample feature vector representing the acoustic features of each acquired first voice sample and a second sample feature vector corresponding to at least one pronunciation feature of the first voice sample;
and the first training module is used for training a language identification model based on the first sample feature vector, the second sample feature vector and language information corresponding to the first voice sample.
36. The apparatus of claim 35, wherein the first training module is configured to train a language identification model based on the first and second sample feature vectors and language information corresponding to the first speech sample in the following manner:
fusing the first sample feature vector and the second sample feature vector to generate a target sample feature vector;
and training a language identification model based on the target sample feature vector and the language information corresponding to the first voice sample.
37. The apparatus of claim 36, wherein the first training module is configured to fuse the first sample feature vector and the second sample feature vector by:
splicing the first sample feature vector and the second sample feature vector to generate the target sample feature vector, or,
splicing the first sample feature vector and the second sample feature vector to form a spliced vector; and extracting low-dimensional transformation vector features of the spliced vector, and generating the target sample feature vector based on the extracted low-dimensional transformation vector features.
38. The apparatus of claim 35, wherein the sample feature vector determining module is configured to determine the second sample feature vectors corresponding to the at least one pronunciation feature of the first speech sample respectively by:
and aiming at each pronunciation feature, inputting the first sample feature vector into a feature vector extraction network corresponding to the pronunciation feature to obtain a second sample feature vector of the pronunciation feature.
39. The apparatus of claim 38, further comprising: a second training module, configured to generate the feature vector extraction network in the following manner:
acquiring a plurality of second voice samples and feature labeling information of each second voice sample under the at least one pronunciation feature;
for each obtained second voice sample, determining a third sample feature vector for characterizing the acoustic features of the second voice sample;
and training the feature vector extraction network based on the third sample feature vector and the feature labeling information.
40. The apparatus of claim 39, wherein the second training module is configured to perform the training of the feature vector extraction network based on the third sample feature vector and the feature labeling information in the following manner:
calculating the similarity between the third sample feature vector and the feature labeling information, and comparing the similarity with a preset similarity threshold;
when the similarity is smaller than the preset similarity threshold, adjusting parameters of the feature vector extraction network, and obtaining the third sample feature vector again based on the adjusted feature vector extraction network;
and returning to the operation of calculating the similarity between the third sample feature vector and the feature labeling information until the similarity between the third sample feature vector and the feature labeling information is not less than the preset similarity threshold.
41. The apparatus of claim 39, wherein the second training module is configured to perform the training of the feature vector extraction network based on the third sample feature vector and the feature labeling information in the following manner:
taking any one of the second voice samples which are not trained in the current round as a target second voice sample;
adjusting parameters of the feature vector extraction network based on the feature labeling information and the third sample feature vector corresponding to the target second voice sample;
taking the target second voice sample as a second voice sample which has completed training in the current round, taking any one of the second voice samples which have not completed training in the current round as a new target second voice sample, extracting a third sample feature vector of the new target second voice sample by using the feature vector extraction network with the adjusted parameters, and returning to the step of adjusting parameters of the feature vector extraction network based on the feature labeling information and the third sample feature vector corresponding to the target second voice sample;
and repeating the above until all the second voice samples have completed training in the current round, and then entering the next training round, until a preset model training cutoff condition is met.
42. The apparatus of claim 38, wherein the feature vector extraction network comprises a bottleneck feature extraction layer;
the sample feature vector determining module is configured to determine the second sample feature vectors respectively corresponding to the at least one pronunciation feature by:
and inputting the first sample feature vector into the feature vector extraction network, and acquiring the second sample feature vector from a bottleneck feature extraction layer in the feature vector extraction network.
43. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the language identification method according to any one of claims 1 to 13.
44. A computer-readable storage medium, having stored thereon a computer program for performing, when executed by a processor, the steps of the language identification method according to any one of claims 1 to 13.
45. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the electronic device is running, the processor executing the machine-readable instructions to perform the steps of the language identification model training method according to any one of claims 14 to 21.
46. A computer-readable storage medium, having stored thereon a computer program for performing, when executed by a processor, the steps of the language identification model training method according to any one of claims 14 to 21.
CN201811308721.3A 2018-11-05 2018-11-05 Language identification model training method and device and language identification method and device Pending CN111210805A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811308721.3A CN111210805A (en) 2018-11-05 2018-11-05 Language identification model training method and device and language identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811308721.3A CN111210805A (en) 2018-11-05 2018-11-05 Language identification model training method and device and language identification method and device

Publications (1)

Publication Number Publication Date
CN111210805A true CN111210805A (en) 2020-05-29

Family

ID=70789172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811308721.3A Pending CN111210805A (en) 2018-11-05 2018-11-05 Language identification model training method and device and language identification method and device

Country Status (1)

Country Link
CN (1) CN111210805A (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103310788A (en) * 2013-05-23 2013-09-18 北京云知声信息技术有限公司 Voice information identification method and system
CN103559879A (en) * 2013-11-08 2014-02-05 安徽科大讯飞信息科技股份有限公司 Method and device for extracting acoustic features in language identification system
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
CN106297769A (en) * 2015-05-27 2017-01-04 国家计算机网络与信息安全管理中心 A kind of distinctive feature extracting method being applied to languages identification
CN105810191A (en) * 2016-03-08 2016-07-27 江苏信息职业技术学院 Prosodic information-combined Chinese dialect identification method
CN108564940A (en) * 2018-03-20 2018-09-21 平安科技(深圳)有限公司 Audio recognition method, server and computer readable storage medium
CN108682417A (en) * 2018-05-14 2018-10-19 中国科学院自动化研究所 Small data Speech acoustics modeling method in speech recognition

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
FRED RICHARDSON: "Deep Bottleneck Features for Spoken Language Identification", IEEE Signal Processing Letters, 31 October 2015 (2015-10-31), pages 1671-1675 *
SHIVAKUMAR, PG: "Multimodal Fusion of Multirate Acoustic, Prosodic, and Lexical Speaker Characteristics for Native Language Identification", 17th Annual Conference of the International Speech Communication Association (Interspeech 2016), 30 September 2016 (2016-09-30), pages 2408-2412 *
吕赫: "Research and Implementation of a DNN-Based Language Identification System", China Masters' Theses Full-text Database, Information Science and Technology, 28 February 2018 (2018-02-28), pages 136-387 *
崔瑞莲: "Language Identification Based on Deep Neural Networks", Pattern Recognition and Artificial Intelligence, 31 December 2015 (2015-12-31), pages 1093-1099 *
李德毅: "Introduction to Artificial Intelligence (China Association for Science and Technology New-Generation Information Technology Series)", Beijing: China Science and Technology Press, pages 196-198 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816159A (en) * 2020-07-24 2020-10-23 腾讯科技(深圳)有限公司 Language identification method and related device
CN111816159B (en) * 2020-07-24 2022-03-01 腾讯科技(深圳)有限公司 Language identification method and related device
CN112272361A (en) * 2020-10-29 2021-01-26 哈尔滨海能达科技有限公司 Voice processing method and system
CN112669841A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Training method and device for multilingual speech generation model and computer equipment
CN113160795A (en) * 2021-04-28 2021-07-23 平安科技(深圳)有限公司 Language feature extraction model training method, device, equipment and storage medium
CN113160795B (en) * 2021-04-28 2024-03-05 平安科技(深圳)有限公司 Language feature extraction model training method, device, equipment and storage medium
CN113192487A (en) * 2021-04-30 2021-07-30 平安科技(深圳)有限公司 Voice recognition method, device, equipment and storage medium supporting multi-language mixing

Similar Documents

Publication Publication Date Title
CN111261141A (en) Voice recognition method and voice recognition device
CN111210805A (en) Language identification model training method and device and language identification method and device
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
Ferrer et al. Study of senone-based deep neural network approaches for spoken language recognition
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
EP3469582B1 (en) Neural network-based voiceprint information extraction method and apparatus
CN108305617B (en) Method and device for recognizing voice keywords
CN109741732B (en) Named entity recognition method, named entity recognition device, equipment and medium
CN108346436B (en) Voice emotion detection method and device, computer equipment and storage medium
KR102292546B1 (en) Method and device for performing voice recognition using context information
CN110797016B (en) Voice recognition method and device, electronic equipment and storage medium
Das et al. A deep dive into deep learning techniques for solving spoken language identification problems
US10726326B2 (en) Learning of neural network
US20140088964A1 (en) Exemplar-Based Latent Perceptual Modeling for Automatic Speech Recognition
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
Tüske et al. Multilingual MRASTA features for low-resource keyword search and speech recognition systems
WO2017162053A1 (en) Identity authentication method and device
WO2015171646A1 (en) Method and system for speech input
Shaikh Naziya et al. Speech recognition system—a review
Sarkar et al. Time-contrastive learning based deep bottleneck features for text-dependent speaker verification
Lee et al. The 2015 NIST language recognition evaluation: the shared view of I2R, Fantastic4 and SingaMS
Bharali et al. Speech recognition with reference to Assamese language using novel fusion technique
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
Krobba et al. Maximum entropy PLDA for robust speaker recognition under speech coding distortion
CN115132196A (en) Voice instruction recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200529