CN114038450A - Dialect identification method, dialect identification device, dialect identification equipment and storage medium - Google Patents

Dialect identification method, dialect identification device, dialect identification equipment and storage medium Download PDF

Info

Publication number
CN114038450A
CN114038450A (application CN202111478141.0A)
Authority
CN
China
Prior art keywords
dialect
voice
data
recognition
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111478141.0A
Other languages
Chinese (zh)
Inventor
汪雪
程刚
蒋志燕
陈诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Raisound Technology Co ltd
Original Assignee
Shenzhen Raisound Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Raisound Technology Co ltd filed Critical Shenzhen Raisound Technology Co ltd
Priority to CN202111478141.0A
Publication of CN114038450A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0212 Speech or audio signals analysis-synthesis techniques using spectral analysis, using orthogonal transformation
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to artificial intelligence technology and discloses a dialect identification method comprising the following steps: receiving dialect voice data input by a user and extracting voice features of the dialect voice data; performing similarity detection between the voice features and the training data corresponding to each dialect recognition model in a pre-constructed dialect model library, one by one, to obtain a similarity score between the voice features and each type of training data; taking the dialect recognition model corresponding to the training data with the highest similarity score as the target dialect recognition model; and converting the dialect voice data by using the target dialect recognition model to obtain the speech recognition result corresponding to the dialect voice data. The invention also provides a dialect identification apparatus, an electronic device and a storage medium. The invention can solve the problem of low dialect identification precision.

Description

Dialect identification method, dialect identification device, dialect identification equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a dialect identifying method and apparatus, an electronic device, and a computer-readable storage medium.
Background
As society develops, more and more software, such as input methods, navigation software and intelligent question-answering systems, uses speech recognition technology, which is gradually becoming a key human-computer interaction technology in the information field. At present, speech recognition is mostly applied to Mandarin. However, dialects, as languages with local characteristics, are still used by a large number of people; in particular, many elderly people cannot speak Mandarin and can only speak their dialect. Speech recognition for dialects is therefore an important research subject.
At present, dialect recognition mostly trains a model on a dialect and then uses the trained model to recognize that dialect. However, there are many different kinds of dialects, and each dialect recognition model can only recognize one or a few of them, so multiple dialect recognition models need to be trained and assembled into a multi-dialect recognition model.
Disclosure of Invention
The invention provides a dialect identification method, a dialect identification apparatus, and a computer-readable storage medium, and mainly aims to solve the problem of low dialect identification precision.
In order to achieve the above object, the present invention provides a dialect identifying method, including:
receiving dialect voice data input by a user, and extracting voice characteristics of the dialect voice data;
performing similarity detection between the speech features and the training data corresponding to each dialect recognition model in a pre-constructed dialect model library, one by one, to obtain a similarity score between the speech features and each type of training data;
taking the dialect recognition model corresponding to the training data with the highest similarity score as a target dialect recognition model;
and converting the dialect voice data by using the target dialect recognition model to obtain a voice recognition result corresponding to the dialect voice data.
Optionally, the extracting the voice feature of the dialect voice data includes:
converting the sound signals in the voice data into digital signals;
and calculating the digital signal by using a preset triangular band-pass filter to obtain the voice characteristics corresponding to the voice data.
Optionally, the calculating the digital signal by using a preset triangular band-pass filter to obtain a voice feature corresponding to the voice data includes:
pre-emphasis, framing and windowing are carried out on the digital signal to obtain frequency domain energy;
performing fast Fourier transform on the frequency domain energy to obtain a frequency spectrum;
calculating the frequency spectrum by using the triangular band-pass filter to obtain logarithmic energy;
discrete cosine transform is carried out on the logarithmic energy to obtain a Mel frequency cepstrum coefficient;
and carrying out differential calculation according to the mel frequency cepstrum coefficient to obtain a dynamic differential parameter, and determining the dynamic differential parameter as a voice characteristic.
Optionally, the performing similarity detection between the speech features and the training data corresponding to each dialect recognition model in the pre-constructed dialect model library, one by one, to obtain the similarity scores of the speech features and each type of the training data includes:
extracting the voice characteristics of the training data of each dialect recognition model one by one;
calculating a distance value between the voice feature of the training data and the voice feature of the dialect voice data input by the user;
and calculating the similarity score of the voice feature of the dialect voice data input by the user and the voice feature of each type of training data according to the distance value.
Optionally, the converting the dialect speech data by using the target dialect recognition model to obtain a speech recognition result corresponding to the dialect speech data includes:
carrying out convolution, pooling and full-connection operation on the dialect voice data for preset times by using the target dialect identification model to obtain a coding vector;
and decoding the coding vector by using a preset activation function to obtain a voice recognition result.
Optionally, before the dialect recognition model corresponding to the training data with the highest similarity score is used as the target dialect recognition model, the method further includes:
acquiring a random voice training set and a multi-dialect voice training set;
training a pre-constructed universal background model according to the random voice training set to obtain a universal voice model;
and respectively carrying out self-adaptive training according to the general voice model and the multi-dialect voice training set to obtain a plurality of dialect recognition models.
In order to solve the above problem, the present invention also provides a dialect identifying apparatus, including:
the voice feature extraction module is used for receiving dialect voice data input by a user and extracting voice features of the dialect voice data;
the target dialect recognition model determining module is used for performing similarity detection between the training data corresponding to each dialect recognition model in the pre-constructed dialect model library and the voice features, one by one, to obtain similarity scores of the voice features and each type of the training data; and taking the dialect recognition model corresponding to the training data with the highest similarity score as the target dialect recognition model;
and the voice recognition result generation module is used for converting the dialect voice data by using the target dialect recognition model to obtain a voice recognition result corresponding to the dialect voice data.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the dialect identification method described above.
In order to solve the above problem, the present invention also provides a computer-readable storage medium, in which at least one computer program is stored, the at least one computer program being executed by a processor in an electronic device to implement the dialect identifying method described above.
According to the embodiments of the invention, the corresponding dialect recognition model is selected by matching the voice features of the dialect voice data against the training data of the dialect recognition models in the dialect model library; each dialect recognition model in the library corresponds to one dialect, and the voice data is recognized by the dialect recognition model corresponding to its dialect, which improves the recognition accuracy. The speech recognition result corresponding to the dialect voice data is then obtained through that dialect recognition model, which improves the accuracy of the result obtained by dialect recognition. Therefore, the dialect identification method, the dialect identification apparatus, the electronic device and the computer-readable storage medium provided by the invention can solve the problem of low dialect identification precision.
Drawings
Fig. 1 is a schematic flow chart of a dialect identification method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of obtaining a similarity score according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of obtaining a speech recognition result according to an embodiment of the present invention;
FIG. 4 is a functional block diagram of a dialect recognition apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device implementing the dialect identification method according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a dialect identification method. The execution subject of the dialect identification method includes, but is not limited to, at least one of the electronic devices, such as a server or a terminal, that can be configured to execute the method provided by the embodiment of the application. In other words, the dialect identification method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes, but is not limited to, a single server, a server cluster, a cloud server, a cloud server cluster, and the like. The server may be an independent server, or may be a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Fig. 1 is a schematic flow chart of a dialect identification method according to an embodiment of the present invention.
In this embodiment, the dialect identification method includes:
s1, receiving dialect voice data input by a user, and extracting voice characteristics of the dialect voice data;
In the embodiment of the present invention, the dialect voice data may be of any dialect. A dialect, commonly called a local tongue, is used only within a certain region; it is not a separate language independent of Chinese, but a variety of the language used in a local region, such as Cantonese, Minnan or Hakka.
In an embodiment of the present invention, the extracting the voice feature of the dialect voice data includes:
converting the sound signals in the voice data into digital signals;
and calculating the digital signal by using a preset triangular band-pass filter to obtain the voice characteristics corresponding to the voice data.
In the embodiment of the invention, the sound signal in the voice data can be converted into a digital signal through the steps of sampling, quantizing, encoding and the like.
In detail, the sampling refers to acquiring the amplitude of the sound signal at specific moments; through sampling, the time-continuous analog signal (the voice data) can be converted into a signal that is discrete in time and continuous in amplitude.
In one embodiment of the present invention, the voice information may be sampled at regular time intervals, the sampling period. In the quantization step, each sample whose amplitude takes continuous values is converted into a discrete value representation.
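As an illustration of these two steps, the following Python sketch converts a continuous "analog" signal into a digital one. The 16 kHz sampling rate, 16-bit depth and function names are illustrative assumptions, not values fixed by the description.

```python
import numpy as np

def sample_and_quantize(analog, duration_s, sample_rate=16000, bits=16):
    """Sample and quantize an 'analog' signal given as a callable t -> amplitude in [-1, 1]."""
    # Sampling: read the amplitude at regular time intervals (the sampling period).
    t = np.arange(0, duration_s, 1.0 / sample_rate)
    samples = np.array([analog(ti) for ti in t])
    # Quantization: map each continuous amplitude onto one of 2**bits discrete levels.
    levels = 2 ** (bits - 1)
    return np.clip(np.round(samples * levels), -levels, levels - 1).astype(np.int16)

# Usage: a 440 Hz tone standing in for the recorded sound signal.
digital_signal = sample_and_quantize(lambda t: np.sin(2 * np.pi * 440 * t), duration_s=0.01)
```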
Further, the calculating the digital signal by using a preset triangular band-pass filter to obtain the voice feature corresponding to the voice data includes:
pre-emphasis, framing and windowing are carried out on the digital signal to obtain frequency domain energy;
performing fast Fourier transform on the frequency domain energy to obtain a frequency spectrum;
calculating the frequency spectrum by using the triangular band-pass filter to obtain logarithmic energy;
discrete cosine transform is carried out on the logarithmic energy to obtain a Mel frequency cepstrum coefficient;
and carrying out differential calculation according to the mel frequency cepstrum coefficient to obtain a dynamic differential parameter, and determining the dynamic differential parameter as a voice characteristic.
In the embodiment of the invention, pre-emphasis refers to passing the digital signal through a high-pass filter to boost its high-frequency part, so that the spectrum of the signal becomes flatter and stays flat over the whole band from low to high frequency. Framing divides the digital signal into frames, each frame being the data collected in a preset unit of time. Windowing multiplies each frame of the digital signal by a Hamming window, which increases the continuity between the left and right ends of a frame. Pre-emphasizing, framing and windowing the digital signal can eliminate vocal-cord and lip effects introduced during sound production and compensate the high-frequency portions of the speech signal that are suppressed by the articulatory system; the windowed digital signal is then converted into an energy distribution in the frequency domain, where different energy distributions represent the characteristics of different voices.
The triangular band-pass filter reduces the amount of computation and, by smoothing the frequency spectrum, eliminates harmonics and highlights the formants of the voice. As a result, the tone or pitch of a piece of speech is not present in the Mel frequency cepstrum coefficients, so these coefficients are unaffected by pitch differences in the input speech. Standard Mel frequency cepstrum coefficients reflect only the static characteristics of speech, while its dynamic characteristics can be described by the differential spectrum of those static features. The dynamic differential parameters combine the dynamic and static characteristics and can effectively improve the recognition performance of the system.
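As a concrete sketch of this feature-extraction pipeline, the following Python code computes Mel frequency cepstrum coefficients and their dynamic differential parameters. It uses the librosa library as a stand-in for the implementation described above; the file name, 16 kHz rate, 25 ms/10 ms frame settings and pre-emphasis coefficient 0.97 are assumptions for illustration.

```python
import numpy as np
import librosa

# Load the digitized dialect voice data (file name and rate are assumed).
y, sr = librosa.load("dialect_utterance.wav", sr=16000)

# Pre-emphasis: a simple high-pass step that boosts the high-frequency part.
y = np.append(y[0], y[1:] - 0.97 * y[:-1])

# Framing, Hamming windowing, FFT, triangular (mel) filtering, log energy
# and discrete cosine transform are bundled inside librosa's MFCC routine.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160,  # 25 ms frames, 10 ms hop
                            window="hamming")

# Dynamic differential parameters: first- and second-order deltas.
delta1 = librosa.feature.delta(mfcc, order=1)
delta2 = librosa.feature.delta(mfcc, order=2)

# Combine static and dynamic characteristics into the final voice feature.
voice_feature = np.concatenate([mfcc, delta1, delta2], axis=0)
```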
S2, performing similarity detection between the speech features and the training data corresponding to each dialect recognition model in a pre-constructed dialect model library, one by one, to obtain a similarity score between the speech features and each type of training data;
In the embodiment of the invention, each dialect recognition model in the pre-constructed dialect model library may be constructed based on a convolutional neural network; after being trained on a different dialect voice training set, each model can convert the corresponding dialect voice into text data. For example, dialect recognition model A is obtained by training on Cantonese, and the trained model A can convert Cantonese speech into text data. Each dialect voice training set includes a dialect label, and the dialect label is used to identify the dialect type.
In the embodiment of the present invention, referring to fig. 2, the performing similarity detection between the speech features and the training data corresponding to each dialect recognition model in the pre-constructed dialect model library, one by one, to obtain the similarity scores between the speech features and each type of the training data includes:
s21, extracting the voice characteristics of the training data of each dialect recognition model one by one;
s22, calculating a distance value between the voice feature of the training data and the voice feature of the dialect voice data input by the user;
and S23, calculating the similarity score between the voice feature of the dialect voice data input by the user and the voice feature of each training data according to the distance value.
Further, the embodiment of the present invention may calculate the distance value between the voice feature of the training data and the voice feature of the dialect voice data input by the user according to a formula of the following form:
D = θ · √( Σᵢ (Rᵢ - Tᵢ)² )
wherein D is the distance value, R is the voice feature of the training data and T is the voice feature of the dialect voice data input by the user (Rᵢ and Tᵢ denoting their i-th components), and θ is a preset coefficient.
For example, suppose the voice feature of the training data of a dialect recognition model is A, the voice feature of the dialect voice data input by the user is B, and the distance value between A and B calculated by the above formula is 40. The embodiment of the present invention then computes the score according to a preset rule, for example: similarity score = 1 - distance value/100, so that the similarity score for the training data voice feature of this dialect recognition model is 0.6.
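A minimal Python sketch of this scoring step is given below. It assumes the reconstructed θ-weighted Euclidean distance above, treats each voice feature as a fixed-length vector rather than a frame sequence, and uses invented model names; the 1 - D/100 rule is the example rule from the description.

```python
import numpy as np

def distance(R, T, theta=1.0):
    # Weighted Euclidean distance between two feature vectors; theta is the preset coefficient.
    return theta * np.sqrt(np.sum((R - T) ** 2))

def similarity_score(R, T, theta=1.0):
    # Preset rule from the example: similarity score = 1 - distance value / 100.
    return 1.0 - distance(R, T, theta) / 100.0

def select_target_model(user_feature, model_library):
    # model_library maps a dialect recognition model to its training-data voice feature.
    scores = {name: similarity_score(feat, user_feature)
              for name, feat in model_library.items()}
    return max(scores, key=scores.get), scores

# Usage with toy 4-dimensional features (names and values are illustrative).
library = {"cantonese_model": np.array([1.0, 2.0, 3.0, 4.0]),
           "minnan_model": np.array([4.0, 3.0, 2.0, 1.0])}
target, scores = select_target_model(np.array([1.1, 2.0, 2.9, 4.2]), library)
```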
S3, taking the dialect recognition model corresponding to the training data with the highest similarity score as a target dialect recognition model;
In the embodiment of the present invention, a higher similarity score indicates that the training data is closer to the voice features of the input dialect voice data, so the embodiment of the present invention takes the dialect recognition model with the highest score as the target dialect recognition model for the dialect voice data.
In one embodiment of the present invention, the dialect label of the training data with the highest similarity score may be extracted as the dialect label of the dialect voice data, so that the user can learn, for example through a front-end display, which dialect type was recognized from the voice data.
And S4, converting the dialect voice data by using the target dialect recognition model to obtain a voice recognition result corresponding to the dialect voice data.
Referring to fig. 3, in the embodiment of the present invention, the S4 includes:
s41, carrying out convolution, pooling and full connection operation on the dialect voice data for preset times by using the target dialect recognition model to obtain a coding vector;
and S42, decoding the coding vector by using a preset activation function to obtain a voice recognition result.
In the embodiment of the present invention, the speech recognition result may be text data, and the embodiment of the present invention may perform the convolution, pooling and full-connection operations on the dialect speech data by using a CNN network or an RNN network. For example, convolution, pooling and full connection are performed on the dialect speech data by using the RNN network to obtain a coding vector, and the coding vector is decoded by a single-layer neural network with a classifier as the decoding layer. The decoding layer may adopt an activation function such as softmax or sigmoid.
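The following PyTorch sketch shows one plausible shape for such a model. The description fixes only the operation types (convolution, pooling, full connection, softmax decoding), so the layer sizes, the two convolution-pooling rounds and the toy output vocabulary are assumptions.

```python
import torch
import torch.nn as nn

class DialectRecognizer(nn.Module):
    """Encode dialect features by convolution/pooling/full connection, decode by softmax."""
    def __init__(self, n_tokens=100):
        super().__init__()
        # Preset number of convolution and pooling rounds (two, as an example).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        # Full-connection operation producing the coding vector.
        self.fc = nn.LazyLinear(128)
        # Decoding layer: a single-layer network acting as a classifier.
        self.decoder = nn.Linear(128, n_tokens)

    def forward(self, x):
        coding_vector = self.fc(self.encoder(x))
        # Softmax activation decodes the coding vector into token probabilities.
        return torch.softmax(self.decoder(coding_vector), dim=-1)

# Usage: one utterance of 39-dimensional features over 100 frames.
probs = DialectRecognizer()(torch.randn(1, 1, 39, 100))
```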
In another embodiment of the present invention, before the dialect recognition model corresponding to the training data with the highest similarity score is used as the target dialect recognition model, the method further includes:
acquiring a random voice training set and a multi-dialect voice training set;
training a pre-constructed universal background model according to the random voice training set to obtain a universal voice model;
and respectively carrying out self-adaptive training according to the general voice model and the multi-dialect voice training set to obtain a plurality of dialect recognition models.
In the embodiment of the invention, the pre-constructed Universal Background Model is first trained on a large amount of collected random speech to obtain a universal voice model; the dialect recognition model is then obtained by using part of the dialect voice data and adjusting the parameters of the universal voice model through an adaptive algorithm.
In the embodiment of the invention, the voice features of dialect voice A are adapted through the universal voice model, so the dialect recognition model for dialect voice A can be obtained more quickly, without requiring an excessive amount of dialect-A voice data as the source for feature extraction.
In embodiments of the present invention, the speech features may be scattered around some Gaussian distributions of the universal speech model. The adaptive process shifts each Gaussian distribution of the universal speech model toward the speech features, and specifically includes: calculating updated parameters (Gaussian weights, means and variances) of the universal speech model using the speech features; and fusing the updated parameters with the original parameters of the universal voice model to obtain a dialect recognition model suited to the voice features. The adaptive algorithms include, but are not limited to, maximum a posteriori estimation (MAP) and maximum likelihood linear regression (MLLR).
In the embodiment of the invention, the universal speech model is fine-tuned by the adaptive algorithm into a dialect recognition model matching the voice features. By reducing the number of parameters that must be trained, this approach greatly reduces the amount of samples and the training time required.
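A sketch of the MAP variant of this adaptation is shown below, with scikit-learn's GaussianMixture standing in for the universal background model. Adapting only the Gaussian means and the relevance factor of 16 are simplifying assumptions; the weights and variances can be adapted analogously.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, features, relevance=16.0):
    """MAP-adapt the means of a trained GMM toward new dialect features."""
    # Posterior probability of each Gaussian for each feature frame.
    resp = ubm.predict_proba(features)                 # (n_frames, n_components)
    n_k = resp.sum(axis=0)                             # soft count per Gaussian
    x_bar = resp.T @ features / np.maximum(n_k[:, None], 1e-8)
    # Fuse the updated statistics with the model's original parameters.
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * x_bar + (1.0 - alpha) * ubm.means_

# Usage: train the universal model on random speech features, then adapt to dialect A.
rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=8, random_state=0).fit(rng.normal(size=(2000, 39)))
adapted_means = map_adapt_means(ubm, rng.normal(loc=0.5, size=(300, 39)))
```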
In another embodiment of the present invention, after the converting the dialect speech data by using the target dialect recognition model to obtain a speech recognition result corresponding to the dialect speech data, the method may further include: and performing text conversion on the voice recognition result to obtain a mandarin text corresponding to the dialect voice data.
By performing text conversion on the speech recognition result obtained from the dialect voice data, the embodiment of the invention enables the user to view the text content of the dialect speech more intuitively as Mandarin text.
According to the embodiment of the invention, the corresponding dialect recognition model is selected through the voice features of the dialect voice data and the training data of the dialect recognition models in the dialect model library; each dialect recognition model in the library corresponds to one dialect, and the dialect voice data is recognized by the dialect recognition model corresponding to its dialect, so the recognition accuracy is improved. Therefore, the dialect identification method provided by the invention can solve the problem of low dialect identification precision.
Fig. 4 is a functional block diagram of a dialect recognition apparatus according to an embodiment of the present invention.
The dialect recognition apparatus 100 of the present invention may be installed in an electronic device. According to the realized functions, the dialect recognition device 100 may include a speech feature extraction module 101, a target dialect recognition model determination module 102, and a speech recognition result generation module 103. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the voice feature extraction module 101 is configured to receive dialect voice data input by a user, and extract a voice feature of the dialect voice data;
the target dialect recognition model determining module 102 is configured to perform similarity detection between the training data corresponding to each dialect recognition model in the pre-constructed dialect model library and the speech features, one by one, to obtain similarity scores of the speech features and each type of the training data; and to take the dialect recognition model corresponding to the training data with the highest similarity score as the target dialect recognition model;
the speech recognition result generation module 103 is configured to convert the dialect speech data by using the target dialect recognition model to obtain a speech recognition result corresponding to the dialect speech data.
In detail, when the dialect identifying apparatus 100 according to the embodiment of the present invention is used, the same technical means as the dialect identifying method described in fig. 1 to 3 are adopted, and the same technical effect can be produced, which is not described herein again.
Fig. 5 is a schematic structural diagram of an electronic device implementing a dialect identification method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program, such as a dialect recognition program, stored in the memory 11 and executable on the processor 10.
In some embodiments, the processor 10 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same function or different functions, and includes one or more Central Processing Units (CPUs), a microprocessor, a digital Processing chip, a graphics processor, a combination of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device by running or executing programs or modules (e.g., dialect recognition programs, etc.) stored in the memory 11 and calling data stored in the memory 11.
The memory 11 includes at least one type of readable storage medium including flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a removable hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various types of data, such as codes of dialect identifying programs, etc., but also to temporarily store data that has been output or is to be output.
The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
The communication interface 13 is used for communication between the electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), which are typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
Fig. 5 only shows an electronic device with components, and it will be understood by a person skilled in the art that the structure shown in fig. 5 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management and the like are realized through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The dialect recognition program stored in the memory 11 of the electronic device 1 is a combination of instructions, which when executed in the processor 10, can implement:
receiving dialect voice data input by a user, and extracting voice characteristics of the dialect voice data;
performing similarity detection between the speech features and the training data corresponding to each dialect recognition model in a pre-constructed dialect model library, one by one, to obtain a similarity score between the speech features and each type of training data;
taking the dialect recognition model corresponding to the training data with the highest similarity score as a target dialect recognition model;
and converting the dialect voice data by using the target dialect recognition model to obtain a voice recognition result corresponding to the dialect voice data.
Specifically, the specific implementation method of the instruction by the processor 10 may refer to the description of the relevant steps in the embodiment corresponding to the drawings, which is not described herein again.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:
receiving dialect voice data input by a user, and extracting voice characteristics of the dialect voice data;
performing similarity detection between the speech features and the training data corresponding to each dialect recognition model in a pre-constructed dialect model library, one by one, to obtain a similarity score between the speech features and each type of training data;
taking the dialect recognition model corresponding to the training data with the highest similarity score as a target dialect recognition model;
and converting the dialect voice data by using the target dialect recognition model to obtain a voice recognition result corresponding to the dialect voice data.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The embodiment of the application can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (9)

1. A dialect identification method, the method comprising:
receiving dialect voice data input by a user, and extracting voice characteristics of the dialect voice data;
performing similarity detection between the speech features and the training data corresponding to each dialect recognition model in a pre-constructed dialect model library, one by one, to obtain a similarity score between the speech features and each type of training data;
taking the dialect recognition model corresponding to the training data with the highest similarity score as a target dialect recognition model;
and converting the dialect voice data by using the target dialect recognition model to obtain a voice recognition result corresponding to the dialect voice data.
2. The dialect recognition method of claim 1, wherein said extracting speech features of the dialect speech data comprises:
converting the sound signals in the voice data into digital signals;
and calculating the digital signal by using a preset triangular band-pass filter to obtain the voice characteristics corresponding to the voice data.
3. The dialect recognition method of claim 2, wherein the calculating the digital signal by using a preset triangular band-pass filter to obtain the voice feature corresponding to the voice data comprises:
pre-emphasis, framing and windowing are carried out on the digital signal to obtain frequency domain energy;
performing fast Fourier transform on the frequency domain energy to obtain a frequency spectrum;
calculating the frequency spectrum by using the triangular band-pass filter to obtain logarithmic energy;
discrete cosine transform is carried out on the logarithmic energy to obtain a Mel frequency cepstrum coefficient;
and carrying out differential calculation according to the mel frequency cepstrum coefficient to obtain a dynamic differential parameter, and determining the dynamic differential parameter as a voice characteristic.
4. The dialect recognition method of claim 1, wherein the performing similarity detection between the speech features and the training data corresponding to each dialect recognition model in the pre-constructed dialect model library, one by one, to obtain similarity scores of the speech features and each type of the training data comprises:
extracting the voice characteristics of the training data of each dialect recognition model one by one;
calculating a distance value between the voice feature of the training data and the voice feature of the dialect voice data input by the user;
and calculating the similarity score of the voice feature of the dialect voice data input by the user and the voice feature of each type of training data according to the distance value.
5. The dialect recognition method of claim 1, wherein the converting the dialect speech data using the target dialect recognition model to obtain the speech recognition result corresponding to the dialect speech data comprises:
carrying out convolution, pooling and full-connection operation on the dialect voice data for preset times by using the target dialect identification model to obtain a coding vector;
and decoding the coding vector by using a preset activation function to obtain a voice recognition result.
6. The dialect recognition method of any one of claims 1 to 5, wherein before the dialect recognition model corresponding to the training data with the highest similarity score is taken as the target dialect recognition model, the method further comprises:
acquiring a random voice training set and a multi-dialect voice training set;
training a pre-constructed universal background model according to the random voice training set to obtain a universal voice model;
and respectively carrying out self-adaptive training according to the general voice model and the multi-dialect voice training set to obtain a plurality of dialect recognition models.
7. A dialect recognition apparatus, the apparatus comprising:
the voice feature extraction module is used for receiving dialect voice data input by a user and extracting voice features of the dialect voice data;
the target dialect recognition model determining module is used for performing similarity detection between the training data corresponding to each dialect recognition model in the pre-constructed dialect model library and the voice features, one by one, to obtain similarity scores of the voice features and each type of the training data; and taking the dialect recognition model corresponding to the training data with the highest similarity score as the target dialect recognition model;
and the voice recognition result generation module is used for converting the dialect voice data by using the target dialect recognition model to obtain a voice recognition result corresponding to the dialect voice data.
8. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the dialect identification method of any one of claims 1 to 7.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a dialect identification method according to any one of claims 1 to 7.
CN202111478141.0A 2021-12-06 2021-12-06 Dialect identification method, dialect identification device, dialect identification equipment and storage medium Pending CN114038450A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111478141.0A CN114038450A (en) 2021-12-06 2021-12-06 Dialect identification method, dialect identification device, dialect identification equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114038450A 2022-02-11

Family ID: 80139879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111478141.0A Pending CN114038450A (en) 2021-12-06 2021-12-06 Dialect identification method, dialect identification device, dialect identification equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114038450A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination