WO2024093578A1 - Speech recognition method and apparatus, electronic device, storage medium, and computer program product

Speech recognition method and apparatus, electronic device, storage medium, and computer program product

Info

Publication number
WO2024093578A1
WO2024093578A1 PCT/CN2023/121239 CN2023121239W WO2024093578A1 WO 2024093578 A1 WO2024093578 A1 WO 2024093578A1 CN 2023121239 W CN2023121239 W CN 2023121239W WO 2024093578 A1 WO2024093578 A1 WO 2024093578A1
Authority
WO
WIPO (PCT)
Prior art keywords: speech, sub, feature extraction, feature, level
Application number
PCT/CN2023/121239
Other languages: English (en), French (fr)
Inventor
刘名乐
杨栋
俞一鹏
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2024093578A1

Classifications

    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/063: Training (Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/183: Speech classification or search using natural language modelling, using context dependencies, e.g. language models
    • G10L 15/26: Speech to text systems

Definitions

  • the embodiments of the present application relate to the field of Internet technology, and relate to but are not limited to a speech recognition method, device, electronic device, storage medium, and computer program product.
  • Speech keyword matching technology aims to identify specific words in a speech based on reference speech. Speech keyword matching technology has always been a hot research topic in the field of speech recognition. At present, speech keyword matching technology is mainly divided into traditional methods and deep learning methods.
  • MFCC: Mel-frequency cepstral coefficient
  • the embodiments of the present application provide a speech recognition method, device, electronic device, storage medium and computer program product, which are at least applied in the fields of artificial intelligence and games, and can accurately extract sub-speech embedding representation features of sub-speech signals, and then accurately recognize the speech signal to be recognized based on the sub-speech embedding representation features.
  • the embodiment of the present application provides a speech recognition method, which is executed by an electronic device, and comprises: performing sliding window interception on a speech signal to be recognized to obtain at least two sub-speech signals; performing speech feature extraction on each sub-speech signal through a pre-trained embedded feature representation system to obtain a sub-speech embedded representation feature of the corresponding sub-speech signal; wherein the embedded feature representation system comprises a first-level feature extraction network and a second-level feature extraction network; the first-level feature extraction network is used to perform first-level speech feature extraction on the sub-speech signal to obtain a first-level speech feature; the second-level feature extraction network is used to perform second-level speech feature extraction on the sub-speech signal based on the first-level speech feature, and the feature extraction accuracy of the second-level speech feature extraction is greater than the feature extraction accuracy of the first-level speech feature extraction; acquiring the embedded representation feature of each comparison word in a preset comparison word library; performing speech recognition on each of the sub-speech signals according to the sub-speech embedded representation features and the embedded representation features of the comparison words, to obtain sub-speech recognition results; and determining, according to the sub-speech recognition results of the at least two sub-speech signals, a speech recognition result corresponding to the speech signal to be recognized.
  • the embodiment of the present application provides a speech recognition device, the device comprising: a frame interception module, configured to perform sliding window interception on a speech signal to be recognized, and obtain at least two sub-speech signals; a feature extraction module, configured to perform speech feature extraction on each sub-speech signal through a pre-trained embedded feature representation system, and obtain sub-speech embedded representation features of the corresponding sub-speech signal; wherein the embedded feature representation system comprises a first-level feature extraction network and a second-level feature extraction network; the first-level feature extraction network is used to perform first-level speech feature extraction on the sub-speech signal to obtain first-level speech features; the second-level feature extraction network is used to perform second-level speech feature extraction on the sub-speech signal based on the first-level speech features, and the feature extraction accuracy of the second-level speech feature extraction is greater than the feature extraction accuracy of the first-level speech feature extraction; an acquisition module, configured to obtain the embedded representation features of each comparison word in a preset comparison word library; a speech recognition module, configured to perform speech recognition on each sub-speech signal according to the sub-speech embedded representation features and the embedded representation features of each comparison word, to obtain a sub-speech recognition result; and a determination module, configured to determine the speech recognition result corresponding to the speech signal to be recognized according to the sub-speech recognition results of the at least two sub-speech signals.
  • An embodiment of the present application provides a speech recognition device, comprising: a memory for storing executable instructions; and a processor for implementing the above-mentioned speech recognition method when executing the executable instructions stored in the memory.
  • An embodiment of the present application provides a computer program product or a computer program, which includes executable instructions stored in a computer-readable storage medium; when an electronic device reads the executable instructions from the computer-readable storage medium and executes the executable instructions, the above-mentioned speech recognition method is implemented.
  • An embodiment of the present application provides a computer-readable storage medium storing executable instructions for causing a processor to execute the executable instructions to implement the above-mentioned speech recognition method.
  • the embodiment of the present application has the following beneficial effects: through the embedded feature representation system composed of the first-level feature extraction network and the second-level feature extraction network, the speech feature extraction is performed on each sub-speech signal obtained after the sliding window is intercepted, and the sub-speech embedded representation feature is obtained; and according to the sub-speech embedded representation feature and the embedded representation feature of each comparison word in the preset comparison word library, the speech recognition is performed on each sub-speech signal to obtain the sub-speech recognition result; thereby, according to the sub-speech recognition results of at least two sub-speech signals, the speech recognition result corresponding to the speech signal to be recognized is determined.
  • the sub-speech embedded representation feature of each sub-speech signal can be accurately extracted through the embedded feature representation system, so that the speech signal to be recognized can be accurately recognized based on the sub-speech embedded representation feature.
  • FIG1 is a flow chart of a voice keyword matching method in the related art.
  • FIG2 is a flow chart of another voice keyword matching method in the related art.
  • FIG3 is a schematic diagram of an optional architecture of a speech recognition system provided in an embodiment of the present application.
  • FIG4 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of an optional flow chart of a speech recognition method provided in an embodiment of the present application.
  • FIG6 is another optional flowchart of the speech recognition method provided in an embodiment of the present application.
  • FIG7 is a flow chart of a training method for an embedded feature representation system provided in an embodiment of the present application.
  • FIG8 is a flow chart of a training method for a first-level feature extraction network provided in an embodiment of the present application.
  • FIG9 is a flow chart of a training method for a second-level feature extraction network provided in an embodiment of the present application.
  • FIG10 is a schematic diagram of a voice keyword matching system provided in an embodiment of the present application.
  • FIG11 is a schematic diagram of a process of training a wav2vec model provided in an embodiment of the present application.
  • FIG12 is a schematic diagram of a process for training an ecapa-tdnn model provided in an embodiment of the present application.
  • FIG13 is a schematic diagram of the structure of a wav2vec model provided in an embodiment of the present application.
  • FIG14 is a schematic diagram of the structure of the ecapa-tdnn model provided in an embodiment of the present application.
  • Figure 15 is a structural diagram of the SE-ResBlock part in the ecapa-tdnn model provided in an embodiment of the present application.
  • FIG. 1 is a flow chart of a voice keyword matching method in the related art.
  • the traditional method is mainly based on dynamic time warping (DTW).
  • the keyword voice template sample and the voice to be retrieved are first preprocessed, including Mel feature extraction in step S101 and voice activity detection (VAD, Voice Activity Detection) in step S102; then, the DTW score between the template sample and the sample to be detected is obtained, that is, the template average of the keyword voice template samples is calculated through step S103, dynamic time warping is performed through step S104, and confidence score normalization is performed through step S105; the scores of the voice to be retrieved against all keyword voice template samples are compared, so as to obtain the final keyword retrieval result according to a threshold.
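  • For illustration only, the sketch below shows the kind of dynamic time warping (DTW) scoring used in step S104 of this related-art pipeline; the function name and the per-frame Euclidean distance are assumptions made for the example, not details taken from the related art.

```python
import numpy as np

def dtw_distance(template: np.ndarray, query: np.ndarray) -> float:
    """Classic dynamic time warping between two feature sequences.

    template, query: arrays of shape (T, D) holding per-frame features
    (e.g. Mel/MFCC vectors). Returns the accumulated alignment cost.
    """
    n, m = len(template), len(query)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(template[i - 1] - query[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return float(cost[n, m])
```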
  • FIG2 is a flow chart of another speech keyword matching method in the related art.
  • In step S201, the input speech to be recognized is framed to obtain multiple speech frames; then, in step S202, feature extraction is performed on each speech frame to obtain the Mel-frequency cepstral coefficient (MFCC) sequence of each speech frame; in step S203, the MFCC sequence of each speech frame is input in parallel into a preset deep neural network model, and the posterior probability of the MFCC sequence of each speech frame under each neural unit of the output layer of the preset deep neural network model is calculated respectively, the posterior probabilities under each neural unit of the output layer over the multiple speech frames forming a posterior probability sequence, wherein each neural unit of the output layer corresponds to a keyword; then, in step S204, the posterior probability sequence under each neural unit of the output layer is monitored; finally, in step S205, the keyword of the input speech to be recognized is determined according to the comparison result between the posterior probability sequence and the preset threshold probability sequence. That is to say, in the related-art deep learning method, the keywords are recognized on the basis of the MFCC features of each speech frame and the posterior probabilities output by the preset deep neural network model.
  • the traditional methods and deep learning methods in the related art extract the embedded features.
  • the defects of DTW are its large amount of computation and its susceptibility to the external environment; the defects of the deep learning technology are its limited expressive ability and low accuracy.
  • the methods in the related art have the problem of low robustness when facing complex game voices.
  • In addition, the methods in the related art all perform feature extraction based on Mel features, so the feature extraction accuracy is not high. It can be seen that the methods in the related art have the problem of low speech recognition accuracy.
  • the embodiment of the present application provides a speech recognition method, which is a game speech keyword matching method based on a pre-trained model.
  • the method of the embodiment of the present application mainly includes two submodules: an unsupervised pre-trained model and a supervised embedding feature extractor.
  • the role of the unsupervised pre-trained model is to enable the model to perform comparative learning on a large-scale corpus based on sufficient data.
  • the supervised embedding feature extractor is used to concretize the speech matching subtask: the Chinese corpus is divided into individual words, and the network further learns the embedded representation of each individual word on the basis of the sentence-level features from the previous stage.
  • the embedded expression features extracted by the embodiment of the present application have excellent recognition rate and generalization ability, and can quickly complete the speech keyword verification and recognition tasks.
  • Sliding window interception is performed on the speech signal to be recognized to obtain at least two sub-speech signals; then, speech feature extraction is performed on each sub-speech signal through a pre-trained embedded feature representation system to obtain a sub-speech embedded representation feature of the corresponding sub-speech signal; wherein the embedded feature representation system includes a first-level feature extraction network and a second-level feature extraction network; the first-level feature extraction network is used to perform first-level speech feature extraction on the sub-speech signal to obtain a first-level speech feature; the second-level feature extraction network is used to perform second-level speech feature extraction on the sub-speech signal based on the first-level speech feature, and the feature extraction accuracy of the second-level speech feature extraction is greater than the feature extraction accuracy of the first-level speech feature extraction; and the embedded representation feature of each comparison word in the preset comparison word library is obtained; then, according to the sub-speech embedded representation feature and the embedded representation feature of each comparison word, speech recognition is performed on each sub-speech signal to obtain a sub-speech recognition result; finally, the speech recognition result corresponding to the speech signal to be recognized is determined according to the sub-speech recognition results of the at least two sub-speech signals.
  • the speech features of each sub-speech signal are extracted through the embedded feature representation system composed of the first-level feature extraction network and the second-level feature extraction network, so that the sub-speech embedding representation features of the sub-speech signal can be accurately extracted, and then the speech signal to be recognized can be accurately recognized based on the sub-speech embedding representation features.
  • the electronic device provided in the embodiment of the present application may be a voice recognition device, which may be implemented as a terminal or a server.
  • the voice recognition device provided in the embodiment of the present application may be implemented as a laptop, a tablet computer, a desktop computer, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable gaming device), an intelligent robot, an intelligent home appliance, and an intelligent vehicle-mounted device, and any terminal with a voice data processing function and a game application running function;
  • the voice recognition device provided in the embodiment of the present application may also be implemented as a server, wherein the server may be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
  • Fig. 3 is an optional architecture diagram of a speech recognition system provided in an embodiment of the present application.
  • the embodiment of the present application is described by taking the application of a speech recognition method to a game application as an example.
  • at least a game application is installed on the terminal in the embodiment of the present application.
  • the speech recognition system 10 includes at least a terminal 100, a network 200 and a server 300, wherein the server 300 is an application server for the game application.
  • the server 300 can constitute an electronic device in the embodiment of the present application.
  • the terminal 100 is connected to the server 300 via the network 200, and the network 200 can be a wide area network or a local area network, or a combination of the two.
  • the terminal 100 runs the game application and generates game voice data, wherein the game voice data includes the game running voice and the voice of talking and communicating between players.
  • After acquiring the game voice data, the terminal 100 encapsulates the game voice data, as the voice signal to be recognized, into a voice recognition request, and sends the voice recognition request to the server 300 through the network 200, requesting the server 300 to perform voice recognition on the game voice data and determine whether the game voice data contains dirty words or uncivilized language.
  • After receiving the voice recognition request, the server 300 responds to the voice recognition request by performing sliding window interception on the voice signal to be recognized to obtain at least two sub-voice signals; extracting voice features from each sub-voice signal through a pre-trained embedding feature representation system to obtain the sub-voice embedding representation features of the corresponding sub-voice signal; obtaining the embedding representation features of each comparison word in the preset comparison word library; performing speech recognition on each sub-voice signal according to the sub-voice embedding representation features and the embedding representation features of each comparison word to obtain a sub-voice recognition result; and finally, determining the speech recognition result corresponding to the voice signal to be recognized according to the sub-voice recognition results of the at least two sub-voice signals. After obtaining the speech recognition result, the server 300 sends the speech recognition result to the terminal 100.
  • the terminal 100 can generate corresponding reminder information based on the speech recognition result and display the reminder information.
  • the above-mentioned speech recognition process can also be implemented by the terminal 100, that is, after collecting the game voice data, the terminal uses the game voice data as the voice signal to be recognized for speech recognition, that is, the terminal performs a sliding window interception on the voice signal to be recognized to obtain at least two sub-voice signals; and, the terminal implements speech feature extraction for each sub-voice signal through a pre-trained embedding feature representation system to obtain a sub-voice embedding representation feature; then, the terminal obtains the embedding representation feature of each comparison word in a preset comparison vocabulary; and based on the sub-voice embedding representation feature and the embedding representation feature of each comparison word, performs speech recognition on each sub-voice signal to obtain a sub-voice recognition result; finally, the terminal determines the speech recognition result corresponding to the voice signal to be recognized based on the sub-voice recognition results of at least two sub-voice signals.
  • the speech recognition method provided in the embodiment of the present application can also be implemented based on a cloud platform and through cloud technology.
  • the server 300 can be a cloud server.
  • the cloud server performs sliding window interception on the speech signal to be recognized, or extracts speech features of each sub-speech signal through the cloud server to obtain the sub-speech embedding representation feature, or obtains the embedding representation feature of each comparison word in the preset comparison word library through the cloud server, or performs speech recognition on each sub-speech signal through the cloud server according to the sub-speech embedding representation feature and the embedding representation feature of each comparison word, or determines the speech recognition result corresponding to the speech signal to be recognized through the cloud server according to the sub-speech recognition results of at least two sub-speech signals.
  • a cloud storage may be provided, and the voice signal to be recognized may be stored in the cloud storage, or the pre-trained embedded feature representation system, the parameters of the embedded feature representation system, and the preset comparison word library may be stored in the cloud storage, or the sub-speech recognition results and the speech recognition results may be stored in the cloud storage.
  • the pre-trained embedded feature representation system, the parameters of the embedded feature representation system, and the preset comparison word library may be directly obtained from the cloud storage, and the voice signal to be recognized may be subjected to voice recognition, which can greatly improve the data reading efficiency and the speech recognition efficiency.
  • cloud technology refers to a hosting technology that unifies hardware, software, network and other resources in a wide area network or local area network to realize data computing, storage, processing and sharing.
  • Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology, etc. based on the cloud computing business model. It can form a resource pool, which is used on demand and flexible and convenient. Cloud computing technology will become an important support.
  • the backend services of technical network systems require a large amount of computing and storage resources, such as video websites, picture websites, and other portal websites.
  • In the future, each item may have its own identification mark, which needs to be transmitted to a backend system for logical processing; data of different levels will be processed separately, and all kinds of industry data require strong system support, which can only be achieved through cloud computing.
  • FIG4 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • the electronic device shown in FIG4 may be a speech recognition device, wherein the electronic device includes: at least one processor 310, a memory 350, at least one network interface 320 and a user interface 330.
  • the various components in the electronic device are coupled together through a bus system 340. It is understandable that the bus system 340 is used to achieve connection and communication between these components.
  • the bus system 340 also includes a power bus, a control bus and a status signal bus. However, for the sake of clarity, the various buses are all marked as the bus system 340 in FIG4.
  • the processor 310 can be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., wherein the general-purpose processor can be a microprocessor or any conventional processor, etc.
  • the user interface 330 includes one or more output devices 331 that enable presentation of media content, and one or more input devices 332.
  • the memory 350 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state memory, hard disk drives, optical disk drives, and the like. The memory 350 may optionally include one or more storage devices physically located away from the processor 310.
  • the memory 350 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory.
  • the non-volatile memory may be a read-only memory (ROM), and the volatile memory may be a random access memory (RAM).
  • the memory 350 described in the embodiments of the present application is intended to include any suitable type of memory. In some embodiments, the memory 350 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
  • An operating system 351 includes system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic businesses and processing hardware-based tasks; a network communication module 352 is used to reach other computing devices via one or more (wired or wireless) network interfaces 320.
  • Exemplary network interfaces 320 include: Bluetooth, wireless compatibility certification (WiFi), and universal serial bus (USB, Universal Serial Bus), etc.; an input processing module 353 is used to detect one or more user inputs or interactions from one of the one or more input devices 332 and translate the detected inputs or interactions.
  • FIG. 4 shows a speech recognition device 354 stored in a memory 350.
  • the speech recognition device 354 can be a speech recognition device in an electronic device. It can be software in the form of a program or a plug-in, including the following software modules: a frame capture module 3541, a feature extraction module 3542, an acquisition module 3543, a speech recognition module 3544, and a determination module 3545. These modules are logical, and therefore can be arbitrarily combined or further split according to the functions implemented. The functions of each module will be described below.
  • the device provided in the embodiments of the present application can be implemented in hardware.
  • the device provided in the embodiments of the present application can be a processor in the form of a hardware decoding processor, which is programmed to execute the speech recognition method provided in the embodiments of the present application.
  • the processor in the form of a hardware decoding processor can adopt one or more application specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field programmable gate arrays (FPGAs), or other electronic components.
  • the speech recognition method provided in each embodiment of the present application can be executed by an electronic device, wherein the electronic device can be any terminal with a speech data processing function, or it can also be a server, that is, the speech recognition method of each embodiment of the present application can be executed by a terminal, or it can be executed by a server, or it can also be executed by interaction between a terminal and a server.
  • FIG. 5 is an optional flow chart of a speech recognition method provided in an embodiment of the present application. The following will be described in conjunction with the steps shown in FIG. 5 . It should be noted that the speech recognition method in FIG. 5 is described by taking a server as an execution subject as an example, and includes the following steps S501 to S505:
  • Step S501 performing sliding window interception on the speech signal to be recognized to obtain at least two sub-speech signals.
  • the voice signal to be recognized may be a voice signal corresponding to a game voice in a game scene.
  • the game voice may be collected during the running of a game application, and a voice signal of the game voice may be extracted to obtain a voice signal to be recognized.
  • the method of the embodiment of the present application can be applied to the following specific types of speech recognition scenarios in game voice, wherein the specific type of speech recognition scenario can be determined based on the actual speech recognition task, that is, it can be any type of speech recognition scenario, such as a swear word recognition scenario, an uncivilized language recognition scenario, a game terminology recognition scenario, a game intensity recognition scenario, etc.
  • the application scenario of the embodiment of the present application is explained by taking the swear word recognition scenario as an example:
  • Since players can make voice calls with each other during the game, in order to ensure that the game runs in a benign and healthy environment, it is possible to determine in real time whether there are swear words or uncivilized language in the players' voice during the game, so as to discover players' uncivilized language behavior in time and give timely reminders to the players, thereby ensuring the benign operation of the game.
  • the speech recognition method provided in the embodiment of the present application can be used to implement it, that is, the speech between the players is used as the speech to be recognized, and the speech recognition method provided in the embodiment of the present application is used to identify swear words or uncivilized language in the speech to be recognized, and determine whether there are swear words or uncivilized language in the speech between the players.
  • the voice signal to be recognized may include not only the conversation voice of the player, but also the game running voice in the game running scenario.
  • the game running voice includes but is not limited to: the voice when the skill is released, the special effect voice, the voice emitted by the virtual hero, the voice generated when any props are used, etc.
  • the game running voice in the game running environment of the player can be obtained through the game engine, and the conversation voice of the player can be collected through the voice collection device on the terminal, and then the game running voice and the conversation voice are superimposed to form the voice to be recognized.
  • Sliding window interception refers to traversing the speech signal to be recognized through a sliding window with a preset step length, and each time intercepting a sub-speech signal with the same step length as the sliding window.
  • the sub-speech signal can be subjected to speech recognition using the subsequent steps of the embodiment of the present application to obtain a sub-speech recognition result.
  • another sub-speech signal is intercepted through the sliding window, and speech recognition is continued for the sub-speech signal, and this cycle is repeated until the speech recognition process for each sub-speech signal in the speech signal to be recognized is completed.
  • the speech signal to be recognized may be subjected to multiple sliding window interception processes, and multiple sub-speech signals may be obtained accordingly, and an identification mark may be added to each sub-speech signal according to the order of the sub-speech signals in the speech signal to be recognized.
  • the identification mark is used to distinguish the sub-speech signal from other sub-speech signals, and the identification mark may also identify the relative order of the sub-speech signal and other sub-speech signals in the speech signal to be recognized.
  • each sub-speech signal After obtaining multiple sub-speech signals, based on the identification mark of each sub-speech signal, speech recognition is performed on each sub-speech signal in turn according to the relative order of the sub-speech signal in the speech signal to be recognized, and multiple sub-speech recognition results may be obtained accordingly.
  • the two sub-voice signals obtained in the two adjacent interception processes are two adjacent signal segments in the voice signal to be recognized. That is to say, when the sliding window is used to intercept the sub-voice signal, the signals are intercepted sequentially from the starting position of the voice signal to be recognized, and any segment of the voice signal to be recognized will not be lost during the interception process.
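  • As a minimal sketch of the sliding window interception described above, the following assumes a one-dimensional waveform array and a window length equal to the preset step, so that adjacent sub-speech signals are contiguous and no segment is lost; the function name and the use of the running index as the identification mark are illustrative assumptions.

```python
import numpy as np

def sliding_window_intercept(signal: np.ndarray, step: int):
    """Traverse the speech signal to be recognized with a sliding window whose
    length equals the preset step, returning contiguous sub-speech signals in
    order; the running index serves as the identification mark."""
    sub_signals = []
    for idx, start in enumerate(range(0, len(signal), step)):
        sub_signals.append((idx, signal[start:start + step]))
    return sub_signals

# Example: a 1-second signal at 16 kHz cut into 25 ms sub-signals (400 samples each).
waveform = np.random.randn(16000).astype(np.float32)
sub_speech_signals = sliding_window_intercept(waveform, step=400)
```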
  • Step S502 extract speech features from each sub-speech signal using a pre-trained embedding feature representation system to obtain sub-speech embedding representation features of the corresponding sub-speech signal.
  • the embedded feature representation system includes a first-level feature extraction network and a second-level feature extraction network; the first-level feature extraction network is used to perform first-level speech feature extraction on the sub-speech signal; the second-level feature extraction network is used to perform second-level speech feature extraction on the sub-speech signal based on the first-level speech features obtained during the first-level speech feature extraction, and the feature extraction accuracy of the second-level speech feature extraction is greater than the feature extraction accuracy of the first-level speech feature extraction.
  • each sub-speech signal can be input into an embedded feature representation system, and the first-level feature extraction network and the second-level feature extraction network in the embedded feature representation system can be used to perform first-level speech feature extraction and second-level speech feature extraction on the sub-speech signal in sequence. That is to say, coarse-precision speech feature extraction and fine-precision speech feature extraction are performed on the sub-speech signal in sequence to obtain sub-speech embedded representation features.
  • the sub-speech embedding representation feature refers to a feature representation with a fixed size (usually in vector form) obtained after data conversion of the sub-speech signal.
  • the sub-speech embedding representation feature can facilitate subsequent processing and calculation.
  • the sub-speech embedding representation feature can be obtained by feature embedding, which is to convert the input data (for example, it can be a dimensionality reduction process) into a fixed-size feature representation (vector form) for easy processing and calculation (for example, for calculating distance, etc.).
  • This type of feature allows a speech segment to be converted into a numeric vector such that another speech segment from the same speaker has a small distance (e.g., Euclidean distance) from the converted vector. For example, the distance between the vector of another speech segment from the same speaker and the converted vector is less than a preset distance threshold.
  • the main purpose of feature embedding is to reduce the dimensionality of the input features.
  • the dimensionality reduction method can be to use a fully connected layer for full connection processing and then use the embedding layer to calculate the weight matrix to achieve the process of reducing the dimensionality.
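  • The following is a minimal sketch of the feature-embedding idea described above, namely full-connection processing followed by a linear projection (the "embedding layer") that reduces the input to a fixed-size vector; the layer sizes and module name are illustrative assumptions, not the architecture used in the embodiments.

```python
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    """Fully connected layer plus a projection that maps a high-dimensional
    input to a fixed-size embedding vector (dimensions are placeholders)."""

    def __init__(self, in_dim: int = 1024, hidden_dim: int = 512, embed_dim: int = 192):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden_dim)        # full-connection processing
        self.embed = nn.Linear(hidden_dim, embed_dim)  # weight matrix reducing the dimensionality

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.embed(torch.relu(self.fc(x)))      # fixed-size embedding vector

# Two segments from the same speaker should map to nearby vectors, e.g. a small
# Euclidean distance: torch.dist(emb_a, emb_b) below some preset threshold.
segment_feature = torch.randn(1, 1024)                 # pooled features of one segment
embedding = FeatureEmbedding()(segment_feature)        # shape: (1, 192)
```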
  • The first-level feature extraction network can be an unsupervised pre-training model; it can be pre-trained in advance in a self-supervised manner on large-scale unlabeled speech to obtain the trained first-level feature extraction network.
  • The second-level feature extraction network is a model obtained by performing feature extraction based on the trained first-level feature extraction network and then performing model training.
  • During this training, the trained first-level feature extraction network performs the above coarse-precision speech feature extraction (i.e., at the feature extraction accuracy of the first-level speech feature extraction) on the single-word speech in a single-word speech data set to obtain the embedding representation features of the single-word speech; these embedding representation features are then used as the input features of the second-level feature extraction network, which performs fine-precision speech feature extraction (i.e., at the feature extraction accuracy of the second-level speech feature extraction) on the single-word speech.
  • the training process about the first-level feature extraction network, the second-level feature extraction network and the embedding feature representation system will be described in detail below.
  • When extracting speech features from a sub-speech signal, the sub-speech signal can be directly input into the embedded feature representation system for feature extraction, so what is extracted is the embedded representation features of the sub-speech signal, without extracting the Mel features of the sub-speech signal. In this way, the amount of model calculation can be greatly reduced, and the extracted embedded representation features can express the speech information in the sub-speech signal more accurately, so that accurate speech feature extraction can be performed on the sub-speech signal.
  • each of the at least two sub-speech signals can be input into a pre-trained embedding feature representation system in turn, and the pre-trained embedding feature representation system can be used to extract speech features from each sub-speech signal to obtain multiple sub-speech embedding representation features.
  • the feature extraction accuracy is used to reflect the accuracy of the corresponding sub-speech signal that the extracted embedded representation feature can reflect during the speech feature extraction process.
  • With a lower feature extraction accuracy, the amount of information of the corresponding sub-speech signal that the extracted embedded representation feature can reflect is smaller (for example, less than an information threshold), so the accuracy with which the extracted embedded representation feature reflects the corresponding sub-speech signal is lower than the accuracy threshold;
  • with a higher feature extraction accuracy, the amount of information of the corresponding sub-speech signal that the extracted embedded representation feature can reflect is larger (for example, greater than or equal to the information threshold), so the accuracy with which the extracted embedded representation feature reflects the corresponding sub-speech signal is higher than the accuracy threshold.
  • Step S503 Obtain the embedding representation feature of each comparison word in the preset comparison word library.
  • the preset comparison word library includes a plurality of comparison words, and the comparison words in the preset comparison word library have specific attribute information, that is, the comparison words in the preset comparison word library belong to a specific type of words.
  • the comparison words in the preset comparison word library are swear words collected and stored in advance, that is, the preset comparison word library can be a swear word library
  • the comparison words in the preset comparison word library are praise words collected and stored in advance, that is, the preset comparison word library can be a praise word library
  • the comparison words in the preset comparison word library can be words related to game commands collected and stored in advance, that is, the preset comparison word library can be a game command word library.
  • the comparison word speech or comparison word speech signal of each comparison word may be stored in a preset comparison word library, and speech signal recognition may be performed on the comparison word speech to obtain a comparison word speech signal corresponding to the comparison word speech, and then speech feature extraction may be performed on the comparison word speech signal to obtain an embedded representation feature of the comparison word.
  • The above-mentioned pre-trained embedding feature representation system can be used to perform speech feature extraction on the speech signal of each comparison word in the preset comparison word library, so as to obtain the embedding representation feature of each comparison word, that is, the embedding representation feature of the speech signal of each comparison word.
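  • A minimal sketch of precomputing the embedding representation features of the comparison words is shown below; `embedding_system` stands in for the pre-trained embedded feature representation system and the dictionary layout is an assumption made for illustration.

```python
import numpy as np

def build_comparison_library(embedding_system, word_signals: dict) -> dict:
    """Precompute one embedding representation feature per comparison word.

    embedding_system: any callable mapping a raw comparison-word waveform to a
    fixed-size embedding (assumed here to wrap the pre-trained system);
    word_signals: {comparison_word: waveform}. Both names are illustrative.
    """
    return {word: np.asarray(embedding_system(signal))
            for word, signal in word_signals.items()}
```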
  • Step S504 performing speech recognition on each sub-speech signal according to the sub-speech embedding representation feature and the embedding representation feature of each comparison word to obtain a sub-speech recognition result.
  • the sub-speech embedding representation feature can be compared with the embedding representation feature of the comparison word to obtain the sub-speech recognition result.
  • the cosine similarity between the sub-speech embedding representation feature and the embedding representation feature of the comparison word can be calculated, and the sub-speech recognition result is determined based on the cosine similarity.
  • the cosine similarity between the sub-speech embedding representation feature of each sub-speech signal and the embedding representation feature of each comparison word can be calculated.
  • The comparison words can also be sorted based on the cosine similarity to form a comparison word sequence; then, the first N comparison words in the comparison word sequence are extracted, N being an integer greater than 1; finally, the cosine similarities between the sub-speech embedding representation feature of the sub-speech signal and the embedding representation features of the first N comparison words are compared with a similarity threshold; if the N cosine similarities are all greater than the similarity threshold, it indicates that the sub-speech corresponding to the sub-speech signal contains a speech word with the same attribute as the comparison words in the preset comparison word library.
  • N is much smaller than the total number of all comparison words in the preset comparison word library, therefore, when comparing with the similarity threshold, it is only necessary to compare whether the N cosine similarities are greater than the similarity threshold, which will obviously greatly reduce the amount of data calculation for data comparison and improve the efficiency of speech recognition.
  • N is greater than 1, when there are multiple comparison words whose cosine similarities are greater than the similarity threshold, it is determined that the sub-speech signal contains speech words with the same attributes as the comparison words in the preset comparison word library.
  • recognition and verification based on the results of the cosine similarity of multiple comparison words can ensure the accuracy of speech recognition and avoid the impact on the accuracy of the speech recognition results of the embodiment of the present application when there is an error in the calculation of the cosine similarity with individual comparison words.
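  • The sketch below illustrates the cosine similarity comparison and the top-N check described above; the values of N and the similarity threshold are placeholders, not values specified by the embodiments.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding representation features."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def top_n_match(sub_embedding: np.ndarray, word_embeddings: dict,
                n: int = 3, sim_threshold: float = 0.8) -> bool:
    """Sort the comparison words by cosine similarity, take the first N, and
    report a match only if all N similarities exceed the threshold."""
    sims = sorted((cosine_similarity(sub_embedding, emb)
                   for emb in word_embeddings.values()), reverse=True)
    top = sims[:n]
    return len(top) == n and all(s > sim_threshold for s in top)
```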
  • For each sub-speech signal, a preset similarity threshold can also be obtained; then, all comparison words whose cosine similarity is greater than the similarity threshold are screened out, and the number of such comparison words is counted.
  • If the number of screened-out comparison words is greater than a number threshold, it indicates that the sub-speech corresponding to the sub-speech signal contains speech words with the same attributes as the comparison words in the preset comparison word library.
  • Through this two-fold judgment of the similarity threshold and the number threshold, it is possible, while ensuring high cosine similarity, to identify the case where there are many similar comparison words, that is, the case where the preset comparison word library contains a large number of comparison words whose embedding representation features have a high cosine similarity with the sub-speech embedding representation feature of the sub-speech signal.
  • Based on this dual judgment with the two thresholds, it is possible to accurately judge whether the sub-speech corresponding to the sub-speech signal contains speech words with the same attributes as the comparison words in the preset comparison word library, thereby improving the accuracy of speech recognition.
  • The cosine similarity between the sub-speech embedding representation feature of the sub-speech signal and the embedding representation feature of each comparison word can also be calculated in turn, and after each cosine similarity is calculated, it is judged whether the cosine similarity is greater than the similarity threshold; as soon as the cosine similarity between the sub-speech embedding representation feature of the sub-speech signal and the embedding representation feature of any comparison word is determined to be greater than the similarity threshold, the calculation of the cosine similarities with the remaining comparison words is stopped, and it is determined that the sub-speech corresponding to the sub-speech signal contains speech words with the same attributes as the comparison words in the preset comparison vocabulary.
  • That is, as long as the cosine similarity between the embedding representation feature of one comparison word and the embedding representation feature of the sub-speech is detected to be greater than the similarity threshold, it can be considered that the sub-speech corresponding to the sub-speech signal contains a speech word with the same attribute as the comparison words in the preset comparison word library.
  • In the fourth implementation, for each sub-speech signal, a counter is first initialized to 0; then, the cosine similarity between the sub-speech embedding representation feature of the sub-speech signal and the embedding representation feature of each comparison word is calculated in turn, and after each cosine similarity is calculated, it is judged whether the cosine similarity is greater than the similarity threshold; whenever the cosine similarity between the sub-speech embedding representation feature of the sub-speech signal and the embedding representation feature of a comparison word is greater than the similarity threshold, the counter is incremented by one.
  • This cycle is repeated; once the count value of the counter is greater than or equal to the numerical threshold, the calculation of the cosine similarities between the sub-speech embedding representation feature of the sub-speech signal and the embedding representation features of the remaining comparison words is stopped, and it is determined that the sub-speech corresponding to the sub-speech signal contains a speech word with the same attributes as the comparison words in the preset comparison word library.
  • the numerical threshold is an integer greater than 1.
  • In this implementation, the judgment results are counted using a counter, that is, after each cosine similarity is calculated and judged against the similarity threshold, the counter is updated based on the judgment result (that is, when the condition that the cosine similarity is greater than the similarity threshold is satisfied, the counter is incremented by one; when the condition is not satisfied, the counter value remains unchanged).
  • In this way, on the one hand, the double judgment based on the similarity threshold and the numerical threshold is realized, and the case with many similar comparison words can be identified while ensuring high cosine similarity, so that the case where the preset comparison word library contains a large number of comparison words with a high cosine similarity to the sub-speech embedded representation feature of the sub-speech signal can be accurately identified; on the other hand, since a judgment and a counter update are performed each time a cosine similarity is calculated, once the count value of the counter is greater than or equal to the numerical threshold the calculation of cosine similarities is stopped, that is, there is no need to calculate the cosine similarity between the sub-speech embedded representation feature and the embedded representation feature of every comparison word in the preset comparison vocabulary, thereby greatly reducing the amount of calculation and improving the efficiency of speech recognition.
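  • A minimal sketch of this fourth implementation (counter with early stopping) is given below, reusing the `cosine_similarity` helper from the earlier sketch; the threshold values are illustrative assumptions.

```python
def counted_match(sub_embedding, word_embeddings,
                  sim_threshold: float = 0.8, count_threshold: int = 3) -> bool:
    """Count similarities above the threshold one comparison word at a time and
    stop as soon as the counter reaches the numerical threshold, skipping the
    remaining comparison words."""
    counter = 0
    for word_embedding in word_embeddings.values():
        if cosine_similarity(sub_embedding, word_embedding) > sim_threshold:
            counter += 1
            if counter >= count_threshold:
                return True   # early stop: the remaining comparison words are not scored
    return False
```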
  • Step S505 Determine a speech recognition result corresponding to the speech signal to be recognized according to the sub-speech recognition results of at least two sub-speech signals.
  • the sub-voice recognition results of at least two sub-voice signals are processed comprehensively to obtain the speech recognition result corresponding to the speech signal to be recognized.
  • For example, the sub-speech recognition result of a sub-speech signal may be determined to be a specific recognition result, that is, it is determined that the sub-speech corresponding to the sub-speech signal contains a speech word with the same attribute as the comparison words in the preset comparison word library.
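  • One possible reading of how the sub-speech recognition results are combined in step S505 is sketched below, assuming the speech signal to be recognized is flagged as soon as any sub-speech signal contains a target word; this aggregation rule is an assumption made for illustration, not a rule stated explicitly here.

```python
def aggregate_sub_results(sub_results: list) -> bool:
    """Assumed rule: the speech signal to be recognized is flagged if any
    sub-speech signal was recognized as containing a target word."""
    return any(sub_results)

# Example: results for three sub-speech signals.
overall_result = aggregate_sub_results([False, True, False])   # -> True
```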
  • the speech recognition method uses a pre-trained embedded feature representation system to extract speech features from each sub-speech signal obtained after the sliding window is intercepted to obtain a sub-speech embedded representation feature; and based on the sub-speech embedded representation feature and the embedded representation feature of each comparison word in the preset comparison word library, speech recognition is performed on each sub-speech signal to obtain a sub-speech recognition result; thereby, based on the sub-speech recognition results of at least two sub-speech signals, the speech recognition result corresponding to the speech signal to be recognized is determined.
  • the sub-speech embedded representation feature of the sub-speech signal can be accurately extracted, and then based on the sub-speech embedded representation feature, the speech signal to be recognized can be accurately recognized.
  • the speech recognition system includes at least a terminal and a server, wherein the speech recognition method can be used to perform speech recognition on game voice data generated during the operation of a game application to determine whether specific types of terms (such as dirty words and uncivilized terms) exist in the game voice data; or, it can also be used to perform speech recognition on e-sports voice generated in an e-sports scenario to determine whether dirty words or uncivilized terms are contained in the e-sports voice; or, it can also be used to perform speech recognition on short video voice in a short video scenario to determine whether dirty words or uncivilized terms are contained in the short video voice; of course, it can also be applied to other similar scenarios where speech exists and speech recognition is required.
  • A game application can be run on the terminal.
  • During the running of the game application, game voice data is collected, and the voice signal corresponding to the game voice data is obtained as the voice signal to be recognized, so that voice recognition can be performed on the voice signal to be recognized using the method of an embodiment of the present application.
  • FIG. 6 is another optional flow chart of the speech recognition method provided in an embodiment of the present application. As shown in FIG. 6 , the method includes the following steps S601 to S613:
  • Step S601 when the terminal is running the game application, the terminal obtains the game running voice of the game application and collects the user voice of the player.
  • the game running voice of the game application can be obtained, and the game running voice includes but is not limited to: voice when skills are released, special effect voice, voice emitted by virtual heroes, voice generated when any props are used, etc.
  • the game running voice can be directly obtained through the game engine.
  • the terminal can also collect the conversation voice of the player through the voice collection device on the terminal, that is, collect the user voice.
  • the user voice refers to the voice of the players talking and communicating during the game running process.
  • the user voice can include only the current player's own voice, and can also include the voices of all players in the current game scene.
  • Step S602 The terminal superimposes the game running voice and the user voice to form game voice data.
  • Superimposing the game running voice and the user voice can be performed in the time dimension, that is, the game running voice and the user voice are fused on the time axis into fused game voice data, so that the game voice data includes both the game running voice and the user voice.
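  • A minimal sketch of superimposing the two voices on the time axis is shown below, assuming both waveforms share the same sample rate; zero-padding the shorter signal is an illustrative choice rather than a detail taken from the embodiment.

```python
import numpy as np

def mix_voices(game_voice: np.ndarray, user_voice: np.ndarray) -> np.ndarray:
    """Overlay the two waveforms on the time axis; the shorter one is
    zero-padded so that both contribute over the full duration."""
    length = max(len(game_voice), len(user_voice))
    mixed = np.zeros(length, dtype=np.float32)
    mixed[:len(game_voice)] += game_voice
    mixed[:len(user_voice)] += user_voice
    return mixed
```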
  • Step S603 The terminal encapsulates the voice signal corresponding to the game voice data as a voice signal to be recognized into a voice recognition request.
  • Step S604 The terminal sends a speech recognition request to the server.
  • Step S605 The server analyzes the speech recognition request to obtain a speech signal to be recognized.
  • Step S606 The server uses a sliding window with a preset step length to perform frame processing on the speech signal to be recognized to obtain at least two sub-speech signals, wherein at least two sub-speech signals have the same frame length.
  • a sliding window with a preset step size can be used to traverse the speech signal to be recognized, and each time a sub-speech signal with the same step size as the sliding window is intercepted.
  • the original speech signal to be recognized is divided into multiple sub-speech signals of fixed size, where each sub-speech signal can be called a frame, and the frame length is generally 10ms to 30ms. All sub-speech signals are connected to form the original speech signal to be recognized.
  • multiple sub-speech signals are obtained accordingly, and an identification mark can be added to each sub-speech signal according to the order of the sub-speech signals in the speech signal to be recognized.
  • the identification mark is used to distinguish the sub-speech signal from other sub-speech signals, and the identification mark can also identify the relative order of the sub-speech signal and other sub-speech signals in the speech signal to be recognized.
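  • as a purely illustrative sketch of the sliding-window framing described above (the frame and step durations below are assumptions for illustration, not values taken from this application), the following Python snippet frames a waveform and attaches an ordinal identification mark to each sub-speech signal:

```python
# Illustrative sketch only: sliding-window framing with an ordinal identification mark.
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, step_ms: float = 10.0):
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 25 ms frame length
    step_len = int(sample_rate * step_ms / 1000)     # e.g. 10 ms hop (preset step)
    frames = []
    for idx, start in enumerate(range(0, len(signal) - frame_len + 1, step_len)):
        sub_speech = signal[start:start + frame_len]
        # the identification mark here is simply the ordinal index `idx`
        frames.append({"id": idx, "signal": sub_speech})
    return frames

if __name__ == "__main__":
    sr = 16000
    speech = np.random.randn(sr * 2).astype(np.float32)  # 2 s of dummy audio
    sub_signals = frame_signal(speech, sr)
    print(len(sub_signals), sub_signals[0]["signal"].shape)
```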
  • a preset window function can be obtained; and each sub-speech signal is smoothed using the preset window function, corresponding to obtaining at least two smoothed sub-speech signals.
  • smoothing can also be called windowing.
  • windowing is performed with the preset window function in order to achieve a smooth transition between frames and maintain continuity between adjacent frames, that is, to reduce the signal discontinuity that may be caused at both ends of each frame (i.e., spectral leakage). Applying the window function reduces the impact of truncation.
  • the preset window function can include a rectangular window and a Hamming window.
  • the speech feature extraction may be performed on each smoothed sub-speech signal.
  • the subsequent speech recognition step is performed based on the smoothed sub-speech signal.
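  • a minimal sketch of the smoothing (windowing) step, assuming the frame dictionaries produced by the framing sketch above; the Hamming window is one of the preset window functions mentioned here:

```python
# Illustrative sketch: multiply each sub-speech signal by a preset window function.
import numpy as np

def smooth_frames(frames, window_kind: str = "hamming"):
    if not frames:
        return []
    frame_len = len(frames[0]["signal"])
    if window_kind == "hamming":
        window = np.hamming(frame_len)
    else:                       # a rectangular window leaves the frame unchanged
        window = np.ones(frame_len)
    return [{"id": f["id"], "signal": f["signal"] * window} for f in frames]
```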
  • step S607 the server inputs each sub-speech signal into the first-level feature extraction network, and performs first-level embedding feature extraction on the sub-speech signal through the first-level feature extraction network to obtain an embedded representation feature with a first feature extraction accuracy.
  • step S608 the server inputs the embedded representation features with the first feature extraction accuracy into the second-level feature extraction network, and performs second-level embedded feature extraction on the sub-speech signal through the second-level feature extraction network to obtain embedded representation features with the second feature extraction accuracy; the first feature extraction accuracy is less than the second feature extraction accuracy.
  • the embedded feature representation system includes a first-level feature extraction network and a second-level feature extraction network; the first-level feature extraction network is used to extract first-level speech features from the sub-speech signal; the second-level feature extraction network is used to extract second-level speech features from the sub-speech signal based on the first-level speech features obtained during the first-level speech feature extraction, and the feature extraction accuracy of the second-level speech feature extraction is greater than the feature extraction accuracy of the first-level speech feature extraction.
  • the feature extraction accuracy is used to reflect the accuracy of the corresponding sub-speech signal that the extracted embedded representation features can reflect during the speech feature extraction process.
  • the first-level feature extraction network is an unsupervised pre-training model.
  • the first-level feature extraction network will be pre-trained based on large-scale unlabeled speech to obtain the trained first-level feature extraction network.
  • the second-level feature extraction network is obtained after feature extraction based on the trained first-level feature extraction network and then model training.
  • the embedded representation features with the second feature extraction accuracy constitute the sub-speech embedded representation features of the corresponding sub-speech signal.
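  • the following is a conceptual, non-authoritative sketch of chaining the two-level extraction described above; the network classes, layer choices and dimensions are placeholders and not the actual models of this application:

```python
# Conceptual sketch: first-level (coarse) then second-level (finer) feature extraction.
import torch
import torch.nn as nn

class FirstLevelNet(nn.Module):           # stand-in for a wav2vec-style front end
    def __init__(self, dim: int = 512):
        super().__init__()
        self.conv = nn.Conv1d(1, dim, kernel_size=10, stride=5)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:    # wav: (B, 1, T)
        return self.conv(wav)                                 # (B, dim, T')

class SecondLevelNet(nn.Module):          # stand-in for an ecapa-tdnn-style back end
    def __init__(self, dim: int = 512, emb: int = 192):
        super().__init__()
        self.proj = nn.Sequential(nn.Conv1d(dim, emb, 1), nn.AdaptiveAvgPool1d(1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats).squeeze(-1)                   # (B, emb)

def extract_sub_speech_embedding(wav, first_net, second_net):
    coarse = first_net(wav)     # embedded representation with the first accuracy
    return second_net(coarse)   # embedded representation with the second (higher) accuracy
```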
  • Step S609 the server obtains the embedded representation features of each comparison word in the preset comparison word library.
  • the preset comparison word library includes a plurality of comparison words, and the comparison words in the preset comparison word library have specific attribute information, that is, the comparison words in the preset comparison word library belong to a specific type of words.
  • the preset comparison word library includes a comparison word voice signal of each comparison word.
  • the comparison word voice signal of each comparison word can be extracted by a pre-trained embedded feature representation system to obtain an embedded representation feature of each comparison word.
  • Step S610 The server performs speech recognition on each sub-speech signal according to the sub-speech embedding representation feature and the embedding representation feature of each comparison word to obtain a sub-speech recognition result.
  • the speech recognition of each sub-speech signal may be implemented in the following manner:
  • first, the similarity between the sub-speech embedding representation feature and the embedding representation feature of each comparison word is determined; the similarity can be, for example, the cosine similarity.
  • then, when the similarity between the sub-speech embedding representation feature and the embedding representation feature of any comparison word is greater than a similarity threshold, the sub-speech recognition result of the sub-speech signal is determined to be a specific recognition result.
  • the specific recognition result is used to characterize that the sub-speech corresponding to the sub-speech signal contains a specific speech word, where the specific speech word is a speech word with the same attribute as the comparison words in the preset comparison word library.
  • for example, if the comparison words in the preset comparison word library are swear words collected and stored in advance, then when the sub-speech recognition result of the sub-speech signal is a specific recognition result, it indicates that the sub-speech contains a swear word;
  • if the comparison words in the preset comparison word library are praise words collected and stored in advance, then when the sub-speech recognition result of the sub-speech signal is a specific recognition result, it indicates that the sub-speech contains a praise word;
  • if the comparison words in the preset comparison word library are words related to game instructions collected and stored in advance, then when the sub-speech recognition result of the sub-speech signal is a specific recognition result, it indicates that the sub-speech contains a game-instruction word.
  • Step S611 The server determines a speech recognition result corresponding to the speech signal to be recognized based on the sub-speech recognition results of at least two sub-speech signals.
  • for example, when the sub-speech recognition result of any one of the sub-speech signals is a specific recognition result, the speech recognition result corresponding to the speech signal to be recognized is determined to be a specific recognition result.
  • alternatively, when the sub-speech recognition results of a preset number of sub-speech signals are specific recognition results, the speech recognition result corresponding to the speech signal to be recognized is determined to be a specific recognition result, where the preset number is an integer greater than 1.
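  • as a hedged sketch of steps S610 and S611 (the similarity threshold value and the any-frame aggregation rule below are assumptions chosen for illustration):

```python
# Illustrative sketch: match sub-speech embeddings against comparison-word embeddings.
import torch
import torch.nn.functional as F

def recognize(sub_embeddings: torch.Tensor, word_embeddings: torch.Tensor,
              sim_threshold: float = 0.8):
    # sub_embeddings: (N, D) for N sub-speech signals
    # word_embeddings: (M, D) for M comparison words in the preset library
    sub_n = F.normalize(sub_embeddings, dim=-1)
    word_n = F.normalize(word_embeddings, dim=-1)
    sims = sub_n @ word_n.t()                          # cosine similarity matrix (N, M)
    hit_per_frame = (sims > sim_threshold).any(dim=1)  # per-frame specific recognition result
    # here the whole utterance is flagged if any sub-speech signal is flagged
    return bool(hit_per_frame.any()), hit_per_frame
```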
  • Step S612 The server sends the speech recognition result to the terminal.
  • Step S613 the terminal generates reminder information based on the voice recognition result and displays the reminder information.
  • when the speech recognition result indicates that the speech to be recognized contains a speech word having the same attribute as a comparison word in the preset comparison word library, reminder information corresponding to the speech recognition result is generated and displayed to remind the player.
  • the reminder information can be displayed in the form of a pop-up window, or in the current game interface.
  • the reminder information can be presented in the form of text, special effects pictures, special effects videos or specific reminder videos.
  • the reminder information can also be output in the form of voice.
  • a reminder message such as "Please pay attention to civilized language" is sent in the form of a pop-up window, or a special effect picture can be popped up in the current game interface to remind the user to pay attention to civilized language, or a pre-made swear word reminder video can be played in the current game interface to remind players to pay attention to civilized language, or voice reminders can be given to players.
  • a penalty mechanism can be added during the process of generating and displaying the reminder information to further remind the player to use civilized language.
  • the penalty mechanism here includes but is not limited to: during the time period when the reminder information is displayed, the player cannot operate any object in the current game scene, that is, during the time period when the reminder information is displayed, the player is in an inoperable state; after the reminder information is displayed, the player can re-enter the current game scene.
  • the number and intensity of swear words contained in the game voice currently sent by the player can also be determined. If the number is greater than a quantity threshold, or the intensity of the swear words is greater than an intensity threshold, a preset penalty mechanism can be used to punish the player's game progress. For example, the penalty mechanism can be to prohibit the player from sending voice, prohibit the player from continuing the game, prohibit the player from running the game application again within a certain period of time, etc.
  • the total number of swear words contained in all of the player's game voice in the current game match can also be determined, as well as the number of times swear words are detected in the player's game voice during the current game match. If the total number is greater than a total-number threshold, or the number of detections is greater than a detection-count threshold, a preset penalty mechanism can also be used to penalize the player's game progress.
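  • a tiny illustrative sketch of this thresholding logic follows; the threshold values and penalty names are assumptions and not part of this application:

```python
# Illustrative sketch: choose a penalty based on swear-word count and intensity.
def choose_penalty(swear_count: int, swear_intensity: float,
                   count_threshold: int = 3, intensity_threshold: float = 0.7):
    if swear_count > count_threshold or swear_intensity > intensity_threshold:
        return "mute_voice"       # e.g. prohibit the player from sending voice
    if swear_count > 0:
        return "show_reminder"    # e.g. pop-up "please use civilized language"
    return "no_action"
```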
  • the display duration of the reminder information can be set, and can be preset to an initial duration.
  • in some embodiments, the initial duration is adjusted to increase the display duration of the reminder information.
  • the following describes the embedded feature representation system and the training method of the embedded feature representation system.
  • the embedded feature representation system includes a first-level feature extraction network and a second-level feature extraction network; the first-level feature extraction network is used to perform first-level speech feature extraction on the sub-speech signal; the second-level feature extraction network is used to perform second-level speech feature extraction on the sub-speech signal based on the first-level speech features obtained during the first-level speech feature extraction, and the feature extraction accuracy of the second-level speech feature extraction is greater than the feature extraction accuracy of the first-level speech feature extraction.
  • FIG7 is a flow chart of a training method for an embedded feature representation system provided in an embodiment of the present application.
  • the training method for the embedded feature representation system can be implemented by a model training module, where the model training module can be a module in the speech recognition device (i.e., the electronic device), that is, the model training module can be in the server or the terminal; or it can be in another device independent of the speech recognition device, that is, the model training module is in another electronic device, different from the server and the terminal, that implements the speech recognition method.
  • the embedded feature representation system can be trained by iterating the following steps S701 to S706 in a loop until the embedded feature representation system meets the preset convergence condition and reaches convergence:
  • Step S701 input the first speech data in the unlabeled speech data set into the first-level feature extraction network, train the first-level feature extraction network by contrastive learning, and obtain the trained first-level feature extraction network.
  • the unlabeled speech data set includes a plurality of unlabeled speech data items. Since the first-level feature extraction network can be trained by unsupervised learning, the first speech data in the unlabeled speech data set can be used to train the first-level feature extraction network.
  • contrastive learning is a self-supervised learning method that learns general features of an unlabeled speech data set by letting the first-level feature extraction network learn which data points are similar and which are different. Contrastive learning allows the first-level feature extraction network to observe which pairs of data points are "similar" and "different" in order to understand higher-order features of the data before performing tasks such as classification or segmentation. In most practical scenarios, no labels are set for the speech signals; to create labels, professionals would have to spend a lot of time manually listening to the speech in order to classify and segment it. With contrastive learning, even if only a small part of the data set is labeled, the model performance can be significantly improved.
  • the first-level feature extraction network can be implemented as a wav2vec model.
  • a trained wav2vec model is obtained, and the trained wav2vec model is used to distinguish between real data and interference samples, which can help the wav2vec model learn the mathematical representation of audio data.
  • in this way, the wav2vec model learns to distinguish true speech samples from distractor (interference) samples through cropping and comparison.
  • Step S702 input the second speech data in the single-word speech data set into the trained first-level feature extraction network, perform first-level embedding feature extraction on the second speech data through the trained first-level feature extraction network, and obtain sample embedding representation features with third feature extraction accuracy.
  • the third feature extraction accuracy is the feature extraction accuracy corresponding to the trained first-level feature extraction network, that is, the third feature extraction accuracy is the feature extraction accuracy of the sample embedding representation feature extracted by the trained first-level feature extraction network when performing embedding feature extraction on the second speech data.
  • the third feature extraction accuracy corresponds to the above-mentioned first feature extraction accuracy, that is, if the trained first-level feature extraction network is used to perform the first-level embedding feature extraction on the above-mentioned sub-speech signal, the embedding representation feature of the first feature extraction accuracy can be obtained; if the trained first-level feature extraction network is used to perform the first-level embedding feature extraction on the second speech data, the embedding representation feature of the third feature extraction accuracy can be obtained (that is, the sample embedding representation feature with the third feature extraction accuracy).
  • the single-word speech data set includes multiple single-word speech (i.e., second speech data), and each single-word speech is composed of the speech of a single word.
  • for example, a forced alignment method (MFA, Montreal Forced Aligner) based on a hidden Markov model can be used to segment the original speech into single-word speech.
  • in an implementation, the original speech signal corresponding to the original speech can be obtained, and feature extraction can be performed on the original speech by any feature extraction network to obtain multiple speech features of the original speech, where each speech feature is a feature vector corresponding to the speech of one word. Then, the original speech signal is matched against each speech feature one by one (that is, according to each speech feature, the starting position and the ending position in the original speech signal of the single-word speech corresponding to that speech feature are determined), thereby aligning the original speech signal with the speech features. After the alignment is completed, the original speech signal is segmented according to the alignment positions (i.e., the starting positions and ending positions) to form multiple original speech sub-signals, where each original speech sub-signal corresponds to one single-word speech.
  • the implementation process of the MFA technology is to first determine what sentence the user actually reads, and then use the judgment result to perform forced alignment.
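  • the following sketch illustrates only the last part of this process: turning word-level alignment intervals into a single-word speech data set. The (word, start, end) intervals are assumed to come from a forced aligner such as MFA; the aligner invocation itself is not shown:

```python
# Illustrative sketch: slice an original speech signal into single-word clips
# using word-level alignment intervals (assumed to be produced by a forced aligner).
import numpy as np

def split_into_single_words(signal: np.ndarray, sample_rate: int, intervals):
    # intervals: iterable of (word, start_seconds, end_seconds)
    clips = []
    for word, start, end in intervals:
        s, e = int(start * sample_rate), int(end * sample_rate)
        clips.append((word, signal[s:e]))   # one original speech sub-signal per word
    return clips
```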
  • each single-word speech in the single-word speech data set can be input into the trained first-level feature extraction network, and first-level embedding feature extraction is performed on each single-word speech to obtain multiple sample embedding representation features; the second-level feature extraction network is then trained with these multiple sample embedding representation features, that is, the multiple sample embedding representation features are used as training samples for model training of the second-level feature extraction network.
  • Step S703 input the sample embedding representation features with the third feature extraction accuracy into the second-level feature extraction network, perform the second-level embedding feature extraction on the second speech data through the second-level feature extraction network, and obtain the sample embedding representation features with the fourth feature extraction accuracy; the third feature extraction accuracy is less than the fourth feature extraction accuracy.
  • the fourth feature extraction accuracy is the feature extraction accuracy corresponding to the second-level feature extraction network, that is, the fourth feature extraction accuracy is the feature extraction accuracy of the sample embedding representation feature extracted when the second-level feature extraction network performs the second-level embedding feature extraction on the second speech data.
  • the fourth feature extraction accuracy corresponds to the above-mentioned second feature extraction accuracy, that is, if the second-level feature extraction network is used to perform the second-level embedding feature extraction on the above-mentioned sub-speech signal, the embedding representation feature of the second feature extraction accuracy can be obtained; if the second-level feature extraction network is used to perform the second-level embedding feature extraction on the second speech data, the embedding representation feature of the fourth feature extraction accuracy can be obtained (that is, the sample embedding representation feature with the fourth feature extraction accuracy).
  • the third feature extraction accuracy is less than the fourth feature extraction accuracy.
  • Step S704 Perform speech recognition on the second speech data based on the sample embedding representation features with the fourth feature extraction accuracy through a preset classification network to obtain a sample recognition result.
  • the second-level feature extraction network performs second-level embedding feature extraction on each sample embedding representation feature to obtain sample embedding representation features with a fourth feature extraction accuracy.
  • speech recognition is performed on the second speech data based on the extracted sample embedding representation features with a fourth feature extraction accuracy, that is, speech classification processing is performed on the second speech data to obtain a sample recognition result.
  • here, the case where the second voice data contains dirty words is used as an example for explanation.
  • the second voice data can be classified and recognized based on a preset dirty word library, and based on the extracted sample embedding representation features with a fourth feature extraction accuracy, it is determined whether there are dirty words in the second voice data, thereby obtaining a sample recognition result of whether there are dirty words.
  • Step S705 input the sample recognition result and the classification label information of the second voice data into a preset loss model, and output the loss result through the preset loss model.
  • classification label information may be added to each second voice data, and the classification label information is used to identify whether there is a swear word in the single-word voice.
  • sample embedding representation features of the second speech data with a fourth feature extraction accuracy are extracted through the first-level feature extraction network and the second-level feature extraction network, and whether the second speech data contains swear words is identified based on the sample embedding representation features with the fourth feature extraction accuracy.
  • the sample recognition result and the classification label information of the second speech data can be input into a preset loss model, and the loss result is output through the preset loss model.
  • the label similarity between the sample recognition result and the classification label information can be calculated through a preset loss model.
  • when the label similarity is greater than a label similarity threshold, it indicates that the second-level feature extraction network can accurately extract the sample embedding representation features of the second voice data, and that the preset classification network can accurately perform speech recognition on the second voice data based on the sample embedding representation features.
  • in this case, the training of the embedded feature representation system can be stopped, and the embedded feature representation system obtained at this time is determined as the trained embedded feature representation system.
  • when the label similarity is less than or equal to the label similarity threshold, it indicates that the second-level feature extraction network cannot accurately extract the sample embedding representation features of the second voice data, or that the preset classification network cannot accurately perform speech recognition on the second voice data based on the sample embedding representation features.
  • the training is continued until the label similarity is greater than the label similarity threshold.
  • Step S706 based on the loss result, the model parameters in the second-level feature extraction network are modified to obtain a trained embedded feature representation system.
  • the model parameters in the second-level feature extraction network can be corrected based on the correction parameters; when the label similarity is greater than the label similarity threshold, the training process of the embedded feature representation system is stopped.
  • the correction interval of the model parameters can be set in advance, where the model parameters in the second-level feature extraction network include multiple model sub-parameters, and each model sub-parameter corresponds to a correction interval.
  • the correction interval of the model parameter refers to the value interval of the correction parameter that can be selected for change during the current round of training.
  • the selection can be made based on the value of label similarity. If the label similarity is small, a larger correction parameter can be selected in the correction interval as the correction parameter during the current round of training; if the label similarity is large, a smaller correction parameter can be selected in the correction interval as the correction parameter during the current round of training.
  • a correction similarity threshold can be set.
  • when the label similarity is less than or equal to the correction similarity threshold, it indicates that the label similarity is small, and a correction parameter can be randomly selected from the first sub-interval formed by the median value of the correction interval and the maximum value of the interval as the correction parameter in this round of training;
  • when the label similarity is greater than the correction similarity threshold, a correction parameter can be randomly selected from the second sub-interval formed by the minimum value of the correction interval and the median value of the interval as the correction parameter in this round of training, where the correction similarity threshold is less than the above label similarity threshold.
  • for example, if the correction interval is [a, b], the median value of the interval is (a+b)/2, the first sub-interval is [(a+b)/2, b], and the second sub-interval is [a, (a+b)/2]. If the label similarity is less than or equal to the correction similarity threshold, a value is randomly selected from the first sub-interval [(a+b)/2, b] as the correction parameter; if the label similarity is greater than the correction similarity threshold, a value is randomly selected from the second sub-interval [a, (a+b)/2] as the correction parameter.
  • the corresponding model parameter can be adjusted based on the correction parameter. For example, when the correction parameter is a positive number, the model parameter can be increased; when the correction parameter is a negative number, the model parameter can be decreased.
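  • a minimal sketch of this correction-parameter selection, assuming a generic interval [lo, hi] and an illustrative threshold (both are placeholders, not values from this application):

```python
# Illustrative sketch: pick a correction parameter from one of two sub-intervals
# depending on how large the label similarity is.
import random

def pick_correction(label_similarity: float, interval=(0.0, 1.0),
                    correction_sim_threshold: float = 0.5):
    lo, hi = interval
    mid = (lo + hi) / 2.0
    if label_similarity <= correction_sim_threshold:
        # low similarity: take a larger correction from the first sub-interval [mid, hi]
        return random.uniform(mid, hi)
    # higher similarity: take a smaller correction from the second sub-interval [lo, mid]
    return random.uniform(lo, mid)
```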
  • the training method of the embedded feature representation system performs unsupervised training of the first-level feature extraction network with the first speech data in the unlabeled speech data set; it then performs embedding feature extraction on the second speech data in the single-word speech data set through the trained first-level feature extraction network to obtain sample embedded representation features with a third feature extraction accuracy, and uses these sample embedded representation features as sample data for training the second-level feature extraction network.
  • the model parameters in the second-level feature extraction network are learned in combination with the classification label information of the second speech data, so that accurate learning and training of the second-level feature extraction network can be achieved, and an embedded feature representation system whose model parameters have been corrected and which can perform accurate feature extraction is obtained.
  • the first-level feature extraction network includes an encoder network and a context network.
  • FIG8 is a flowchart of a training method for the first-level feature extraction network provided in an embodiment of the present application.
  • the training method for the first-level feature extraction network can also be implemented by a model training module, where the model training module used to train the first-level feature extraction network can be the same model training module in the same electronic device as the one used to train the embedded feature representation system, or a different model training module in the same electronic device, or a model training module in a different electronic device. That is, the model training module used to train the first-level feature extraction network can also be in the server or the terminal; or it can be in another device independent of the speech recognition device.
  • the first-level feature extraction network can be trained by iterating the following steps S801 to S805 in a loop until the first-level feature extraction network meets the preset convergence condition and reaches convergence:
  • Step S801 input the first speech data in the unlabeled speech data set into the first-level feature extraction network.
  • Step S802 Perform a first convolution process on the first speech data through an encoder network to obtain a low-frequency representation feature.
  • the first-level feature extraction network can be implemented as a wav2vec model.
  • the wav2vec model can extract unsupervised speech features of audio through a multi-layer convolutional neural network.
  • wav2vec is a convolutional neural network that takes raw audio as input and calculates a general representation that can be input into a speech recognition system.
  • the wav2vec model is divided into an encoder network (including 5 convolutional processing layers) that encodes the raw audio x into a latent space z, and a context network (including 9 convolutional processing layers) that converts z into a contextualized representation.
  • the final features are 512-dimensional frame-level representations; the training goal is to use the current frame to predict future frames at the feature level.
  • the encoder network includes multiple layers of convolution processing layers, and the first voice data is subjected to multiple convolution processing through the multiple layers of convolution processing layers, thereby encoding the first voice data and obtaining low-frequency representation features.
  • Step S803 Perform a second convolution process on the low-frequency representation features through the context network to obtain embedded representation features with a preset dimension.
  • the context network includes multiple layers of convolution processing layers, through which the low-frequency representation features output by the encoder network are subjected to multiple convolution processing, thereby converting the low-frequency representation features into contextualized representations, that is, obtaining embedded representation features with preset dimensions.
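  • a simplified, non-authoritative sketch of the two convolutional stacks described above (a 5-layer encoder network and a 9-layer context network with 512-dimensional output); the kernel sizes and strides follow the public wav2vec description and are illustrative only:

```python
# Illustrative sketch: wav2vec-style encoder network (raw audio -> z) and
# context network (z -> contextualized 512-d embedded representation features).
import torch
import torch.nn as nn

def conv_block(c_in, c_out, k, s):
    return nn.Sequential(nn.Conv1d(c_in, c_out, k, stride=s), nn.ReLU())

class Wav2VecLike(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        enc_cfg = [(10, 5), (8, 4), (4, 2), (4, 2), (4, 2)]       # 5 conv layers
        layers, c_in = [], 1
        for k, s in enc_cfg:
            layers.append(conv_block(c_in, dim, k, s))
            c_in = dim
        self.encoder = nn.Sequential(*layers)                      # encoder network
        self.context = nn.Sequential(                              # context network
            *[conv_block(dim, dim, 3, 1) for _ in range(9)])       # 9 conv layers

    def forward(self, wav: torch.Tensor):
        z = self.encoder(wav)    # low-frequency representation features
        c = self.context(z)      # contextualized embedded representation features
        return z, c

if __name__ == "__main__":
    z, c = Wav2VecLike()(torch.randn(1, 1, 16000))
    print(z.shape, c.shape)
```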
  • Step S804 input the embedded representation feature with preset dimensions into the first loss model, and determine the first loss result corresponding to the embedded representation feature with preset dimensions through the first loss function in the first loss model.
  • the loss function for model training can be contrastive loss.
  • the distance between positive samples is shortened and the distance between positive and negative samples is increased during training.
  • Step S805 based on the first loss result, the network parameters in the encoder network and the context network are modified to obtain a trained first-level feature extraction network.
  • in the training method of the first-level feature extraction network, the first speech data is encoded by the encoder network to obtain low-frequency representation features; the low-frequency representation features are converted into contextualized representations by the context network to obtain embedded representation features with preset dimensions; then the contrastive loss function is used to compute the contrastive loss, so that the distance between positive samples is shortened and the distance between positive and negative samples is increased. In this way, through the self-supervised learning process, the first-level feature extraction network can be trained quickly and accurately.
  • the second-level feature extraction network includes: a temporal information extraction layer, an attention mechanism layer and a loss calculation layer, wherein the loss calculation layer includes a second loss function.
  • FIG9 is a flow chart of a training method for a second-level feature extraction network provided in an embodiment of the present application.
  • the training method for the second-level feature extraction network can also be implemented by a model training module in a speech recognition device.
  • the training method for the second-level feature extraction network can also be implemented by a model training module, wherein the model training module used to train the second-level feature extraction network can be the same model training module in the same electronic device as the model training module used to train the first-level feature extraction network, or a different model training module in the same electronic device, or a model training module in a different electronic device.
  • the model training module used to train the second-level feature extraction network can also be a server or a terminal; or, it can also be another device independent of the speech recognition device.
  • the second-level feature extraction network can be trained by looping and iterating the following steps S901 to S906 until the second-level feature extraction network meets the preset convergence conditions and reaches convergence:
  • Step S901 input the sample embedding representation features with the third feature extraction accuracy into the second-level feature extraction network.
  • Step S902 extract key timing information of sample embedding representation features in different channels through the timing information extraction layer.
  • the second-level feature extraction network can be implemented as an ecapa-tdnn model.
  • the temporal information extraction layer can be the squeeze-excitation module (SE) part in the ecapa-tdnn model.
  • the SE part considers the attention mechanism on the time axis.
  • the SE part enables the ecapa-tdnn model to learn the key temporal information in the input sample embedding representation feature.
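  • a minimal sketch of a squeeze-excitation (SE) block of the kind referenced here; channel counts and the bottleneck size are assumptions:

```python
# Illustrative sketch: SE block — squeeze channel statistics over the time axis,
# then re-weight channels so that informative temporal channels are emphasized.
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, T)
        s = x.mean(dim=2)                 # squeeze: average over the time axis
        w = self.fc(s).unsqueeze(-1)      # excitation: per-channel weights
        return x * w                      # re-weighted features
```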
  • Step S903 performing accumulation processing on the key timing information under different channels on the time axis through the attention mechanism layer to obtain the accumulation processing result; and performing weighted calculation on the accumulation processing result to obtain the sample embedding representation feature with the fourth feature extraction accuracy.
  • the attention mechanism layer can be the attentive-stat pool part of the ecapa-tdnn model.
  • the attentive-stat pool part can be based on the self-attention mechanism, so that the ecapa-tdnn model focuses on the time dimension and accumulates the information of different channels on the time axis.
  • the learned embedding representation features are more robust and discriminative.
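  • the following is a hedged sketch in the spirit of the attentive-stat pool described above: attention weights over time yield a weighted mean and a weighted standard deviation, which are concatenated into one utterance-level vector (the attention dimension is an assumption):

```python
# Illustrative sketch: attentive statistics pooling over the time axis.
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    def __init__(self, channels: int, attn_dim: int = 128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv1d(channels, attn_dim, 1), nn.Tanh(),
            nn.Conv1d(attn_dim, channels, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, C, T)
        alpha = torch.softmax(self.attn(x), dim=2)          # attention weights over time
        mean = (alpha * x).sum(dim=2)                        # weighted average
        var = (alpha * x * x).sum(dim=2) - mean * mean
        std = torch.sqrt(var.clamp(min=1e-8))                # weighted standard deviation
        return torch.cat([mean, std], dim=1)                 # (B, 2C)
```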
  • Step S904 input the sample embedding representation features with the fourth feature extraction accuracy and the feature label information of the second speech data into the loss calculation layer.
  • the feature label information refers to whether the speech data is a word that the user is interested in, that is, whether it is a label corresponding to the word whose features need to be extracted. For example, for the input speech "I like reading very much", the words that the user is interested in may be “like” and “reading”. Therefore, "like” and “reading” can be identified in the feature label information to indicate that when performing embedded feature extraction of the input speech, the feature data corresponding to the two words "like” and “reading” must be extracted.
  • Step S905 Determine a second loss result corresponding to the sample embedding representation feature with a fourth feature extraction accuracy through a second loss function of the loss calculation layer.
  • a feature vector corresponding to the feature label information can be obtained, and the similarity between the sample embedding representation feature and the feature vector can be calculated to obtain a second loss result.
  • the second loss function may be an Aam-softmax loss function, through which the angle of similar features can be reduced during training, while the angle of different features can be increased, so that the embedded representation features learned by the second-level feature extraction network can be better.
  • here, the cosine similarity between the sample embedded representation features and the feature vector can be calculated by the Aam-softmax loss function. The embedded representation features and the feature vector contain both features belonging to the same category (i.e., similar features) and features belonging to different categories (i.e., different features). The angle of similar features refers to the vector angle between the two feature vectors corresponding to two similar features, and the angle of different features refers to the vector angle between the two feature vectors corresponding to two different features.
  • the cosine similarity is calculated through the Aam-softmax loss function, and the second-level feature extraction network is trained based on the second loss result corresponding to the cosine similarity.
  • in this way, when the trained second-level feature extraction network is used to extract sample embedding representation features, the vector angle between feature vectors corresponding to features of the same type is less than an angle threshold, and the vector angle between feature vectors corresponding to features of different types is greater than or equal to the angle threshold.
  • the similarity between features of the same type can be higher, and the similarity between features of different types can be lower.
  • Step S906 based on the second loss result, the network parameters in the temporal information extraction layer and the attention mechanism layer are modified to obtain a trained second-level feature extraction network.
  • the training method of the second-level feature extraction network extracts the key timing information of the sample embedding representation features in different channels through the timing information extraction layer; the key timing information in different channels is sequentially accumulated and weighted on the time axis through the attention mechanism layer to obtain the sample embedding representation features with the fourth feature extraction accuracy. The loss calculation is then performed through the second loss function, so that the angle between features of the same type is reduced and the angle between features of different types is increased during training. In this way, the second-level feature extraction network can be trained quickly and accurately through a supervised learning process.
  • it should be noted that the training of the above-mentioned embedded feature representation system (including the preset classification network) and the training of the first-level and second-level feature extraction networks in the embedded feature representation system can be carried out in parallel after the first-level feature extraction network is trained, or can be carried out sequentially. That is, the first-level feature extraction network can be trained first, and then the second-level feature extraction network and the entire embedded feature representation system can be trained in parallel; alternatively, the first-level feature extraction network can be trained first, and then the second-level feature extraction network and the entire embedded feature representation system can be trained sequentially.
  • the speech recognition method provided in the embodiment of the present application first uses a contrastive learning method to train a self-supervised pre-training model on a large-scale unlabeled speech, and the model can fully learn the embedded representation features of the speech; then, the forced alignment method (MFA, Montreal Forced Aligner) based on the hidden Markov model is used to segment the Chinese single-word speech, and the embedded representation features are further learned through the Aam-softmax loss function.
  • in this way, the entire speech recognition model (i.e., the embedded feature representation system) is obtained.
  • the embodiment of the present application can greatly improve the generalization ability and anti-interference ability of the speech recognition model, and can effectively distinguish different words, so that the game voice keyword matching can be more accurate.
  • the speech recognition method of the embodiment of the present application is used for secondary verification of civilized speech, as shown in Figure 10, which is a schematic diagram of the speech keyword matching system provided by the embodiment of the present application.
  • the embodiment of the present application adopts an embedded feature representation system 1001 to extract the embedded representation feature x of the speech x1 in the form of a sliding window; secondly, traverse the embedded representation features of the dirty word library (i.e., the preset comparison word library), and obtain the cosine similarity 1002 between the embedded representation feature x of the reported speech x1 and the embedded representation feature y of the dirty word y1 in the dirty word library. If the cosine similarity is greater than the preset similarity threshold, it is determined that the reported speech x1 contains dirty words.
  • the above-mentioned embedded feature representation system includes a first-level feature extraction network and a second-level feature extraction network.
  • the embodiment of the present application is illustrated by taking the first-level feature extraction network as a wav2vec model and the second-level feature extraction network as an ecapa-tdnn model as an example.
  • FIG. 11 is a schematic flow chart of training the wav2vec model provided in an embodiment of the present application.
  • first, contrastive learning is used to train the wav2vec model 1101 on large-scale unlabeled speech; this step is a self-supervised process, and the trained wav2vec model is obtained.
  • FIG. 12 is a schematic flow chart of training the ecapa-tdnn model provided in an embodiment of the present application.
  • based on the single-word speech data set, the parameters of the trained wav2vec model are fixed, and the wav2vec model is used to extract the embedded representation features of the single-word speech.
  • then, the embedded representation features are input into the ecapa-tdnn model 1201, and the ecapa-tdnn model 1201 is trained with the AAM-Softmax loss function.
  • FIG13 is a schematic diagram of the structure of the wav2vec model provided in an embodiment of the present application.
  • the wav2vec model includes an encoder network 1301 and a context network 1302.
  • the encoder network 1301 includes 5 layers of one-dimensional convolutions, the input is an audio waveform, and the output is a low-frequency representation feature;
  • the context network 1302 includes 9 layers of one-dimensional convolutions, the input is a plurality of low-frequency representation features, and the output is a 512-dimensional embedded representation feature.
  • the first loss function used in the wav2vec model training process is shown in the following formula (1):
  • $L_k = -\sum_{i=1}^{T-k}\left(\log\sigma\!\left(z_{i+k}^{\top}h_k(c_i)\right)+\lambda\,\mathbb{E}_{\tilde{z}\sim p_n}\!\left[\log\sigma\!\left(-\tilde{z}^{\top}h_k(c_i)\right)\right]\right) \quad (1)$
  • where $L_k$ is the first loss function for prediction step $k$; $k$ represents the time step; $T$ represents the sequence duration; $z$ represents the encoder network output; $c$ represents the context network output; $h_k$ represents the affine transformation; $\lambda$ represents the number of negative samples; $p_n$ represents the uniform distribution; $\tilde{z}$ represents the encoder network output of a negative sample; and $\sigma$ is the sigmoid function.
  • the loss function means: make the distance between positive samples as small as possible, while increasing the distance between positive samples and negative samples. The final effect is that each embedded representation feature has good representation.
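  • a hedged, simplified sketch of this contrastive objective follows; the tensor shapes, the number of negatives, and `step_transform` (e.g. a `torch.nn.Linear(D, D)` standing in for the affine transformation $h_k$) are assumptions for illustration:

```python
# Illustrative sketch of the contrastive objective of formula (1): the transformed
# context vector should score high against the true future encoder output z_{i+k}
# and low against negative samples drawn from the sequence.
import torch
import torch.nn.functional as F

def wav2vec_contrastive_loss(z, c, step_transform, k: int = 1, n_negatives: int = 10):
    # z, c: (B, T, D) encoder and context outputs; step_transform: h_k (affine map)
    B, T, D = z.shape
    pred = step_transform(c[:, : T - k])                  # h_k(c_i), shape (B, T-k, D)
    target = z[:, k:]                                     # z_{i+k}
    pos = F.logsigmoid((pred * target).sum(-1))           # log sigma(z^T h_k(c))
    neg_idx = torch.randint(0, T, (B, T - k, n_negatives), device=z.device)
    negs = torch.gather(z.unsqueeze(2).expand(B, T, n_negatives, D), 1,
                        neg_idx.unsqueeze(-1).expand(-1, -1, -1, D))
    neg = F.logsigmoid(-(negs * pred.unsqueeze(2)).sum(-1)).sum(-1)  # negative-sample term
    return -(pos + neg).mean()
```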
  • Figure 14 is a schematic diagram of the structure of the ecapa-tdnn model provided in an embodiment of the present application
  • Figure 15 is a schematic diagram of the structure of the SE-ResBlock part in the ecapa-tdnn model provided in an embodiment of the present application.
  • the SE part (i.e., the time series information extraction layer) takes the attention mechanism on the time axis into account, and enables the ecapa-tdnn model to learn the key time series information in the input features.
  • based on the self-attention mechanism, the attentive-stat pool part enables the ecapa-tdnn model to focus on the time dimension and accumulate the information of different channels on the time axis; moreover, by introducing weighted averages and weighted variances, the learned embedded representation features are more robust and discriminative.
  • the Aam-softmax loss function (corresponding to the second loss function mentioned above) can be used for loss calculation, as shown in the following formula (2):
  • $L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}} \quad (2)$
  • where $L_3$ is the second loss function; $N$ is the number of samples; $\theta_j$ is the angle between an embedded representation feature and the weight vector of class $j$, and $y_i$ is the class of the $i$-th sample; $s$ and $m$ are both set constants. The second loss function reduces the angle between features of the same type and increases the angle between features of different types (by adding the margin $m$ to the target angle, i.e., $\theta_{y_i}+m$), so that the learned embedded representation features are better.
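  • a hedged sketch of an additive-angular-margin (AAM) softmax loss of this kind; the values of `s`, `m` and the class-weight initialization are illustrative assumptions:

```python
# Illustrative sketch: AAM-softmax — add margin m to the target angle, scale by s.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmaxLoss(nn.Module):
    def __init__(self, emb_dim: int, n_classes: int, s: float = 30.0, m: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))
        self.s, self.m = s, m

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))  # cos(theta)
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        one_hot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)  # margin only on the target class
        return F.cross_entropy(logits, labels)
```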
  • the speech recognition method provided in the embodiment of the present application can be applied to the field of game speech as a secondary verification part of civilized speech.
  • the speech recognition device 354 includes:
  • the frame interception module 3541 is configured to perform sliding window interception on the speech signal to be recognized to obtain at least two sub-speech signals;
  • the feature extraction module 3542 is configured to extract speech features from each sub-speech signal through a pre-trained embedded feature representation system to obtain sub-speech embedded representation features of the corresponding sub-speech signal;
  • the embedded feature representation system includes a first-level feature extraction network and a second-level feature extraction network;
  • the first-level feature extraction network is used to perform first-level speech feature extraction on the sub-speech signal to obtain first-level speech features;
  • the second-level feature extraction network is used to perform second-level speech feature extraction on the sub-speech signal based on the first-level speech features, and the feature extraction accuracy of the second-level speech feature extraction is greater than the feature extraction accuracy of the first-level speech feature extraction;
  • the acquisition module 3543 is configured to obtain the embedding representation features of each comparison word in the preset comparison word library;
  • the speech recognition module 3544 is configured to perform speech recognition on each of the sub-speech signals according to the sub-speech embedding representation features and the embedding representation features of each of the comparison words to obtain a sub-speech recognition result;
  • the determination module 3545 is configured to determine the speech recognition result corresponding to the speech signal to be recognized according to the sub-speech recognition result of each of the sub-speech signals.
  • the frame capture module is further configured to: use a sliding window with a preset step length to perform frame processing on the speech signal to be recognized to obtain at least two sub-speech signals, and the at least two sub-speech signals have the same frame length.
  • the device also includes: a window function acquisition module, configured to acquire a preset window function; a smoothing processing module, configured to use the preset window function to smooth each of the sub-speech signals, and obtain at least two smoothed sub-speech signals; the feature extraction module is also configured to: perform speech feature extraction on each smoothed sub-speech signal to obtain the sub-speech embedding representation features of the corresponding sub-speech signal.
  • the feature extraction module is further configured to: input each of the sub-speech signals into the first-level feature extraction network, and perform first-level embedding feature extraction on the sub-speech signal through the first-level feature extraction network to obtain an embedded representation feature with a first feature extraction accuracy; input the embedded representation feature with the first feature extraction accuracy into the second-level feature extraction network, and perform second-level embedded feature extraction on the sub-speech signal through the second-level feature extraction network to obtain an embedded representation feature with a second feature extraction accuracy; the first feature extraction accuracy is less than the second feature extraction accuracy, and the embedded representation feature with the second feature extraction accuracy constitutes the sub-speech embedded representation feature of the sub-speech signal.
  • the speech recognition module is further configured to: determine the similarity between the sub-speech embedding representation feature and the embedding representation feature of each comparison word; when the similarity between the sub-speech embedding representation feature and the embedding representation feature of any comparison word is greater than a similarity threshold, determine that the sub-speech recognition result of the sub-speech signal is a specific recognition result; the specific recognition result is used to characterize: the sub-speech corresponding to the sub-speech signal contains a specific speech word, and the specific speech word is a speech word with the same attributes as the comparison word in the preset comparison vocabulary.
  • the determination module is further configured to: when the sub-voice recognition result of any sub-voice signal is the specific recognition result, determine that the speech recognition result corresponding to the speech signal to be recognized is the specific recognition result.
  • the preset comparison word library includes a comparison word speech signal of each of the comparison words; the acquisition module is also configured to: extract speech features of the comparison word speech signal of each of the comparison words through the pre-trained embedded feature representation system to obtain the embedded representation features of each of the comparison words.
  • the device further includes a model training module for training the embedding feature representation system; the model training module is configured to: input the first speech data in the unlabeled speech data set into the first-level feature extraction network, train the first-level feature extraction network by contrastive learning, and obtain the trained first-level feature extraction network; input the second speech data in the single-word speech data set into the trained first-level feature extraction network, perform first-level embedding feature extraction on the second speech data through the trained first-level feature extraction network, and obtain sample embedding representation features with a third feature extraction accuracy; input the sample embedding representation features with the third feature extraction accuracy into the second-level feature extraction network, and perform second-level embedding feature extraction on the second speech data through the second-level feature extraction network to obtain sample embedding representation features with a fourth feature extraction accuracy, the third feature extraction accuracy being less than the fourth feature extraction accuracy; perform speech recognition on the second speech data through a preset classification network based on the sample embedding representation features with the fourth feature extraction accuracy to obtain a sample recognition result; input the sample recognition result and the classification label information of the second speech data into a preset loss model, and output a loss result through the preset loss model; and, based on the loss result, correct the model parameters in the second-level feature extraction network to obtain the trained embedded feature representation system.
  • the first-level feature extraction network includes an encoder network and a context network; the model training module is also configured to: input the first speech data in the unlabeled speech data set into the first-level feature extraction network; perform a first convolution process on the first speech data through the encoder network to obtain a low-frequency representation feature; perform a second convolution process on the low-frequency representation feature through the context network to obtain an embedded representation feature with a preset dimension; input the embedded representation feature with a preset dimension into a first loss model, and determine a first loss result corresponding to the embedded representation feature with a preset dimension through a first loss function in the first loss model; based on the first loss result, modify the network parameters in the encoder network and the context network to obtain the trained first-level feature extraction network.
  • the second-level feature extraction network includes: a timing information extraction layer and an attention mechanism layer; the model training module is further configured to: input the sample embedding representation feature with the third feature extraction accuracy into the second-level feature extraction network; extract the key timing information of the sample embedding representation feature under different channels through the timing information extraction layer; perform accumulation processing on the key timing information under different channels in sequence on the time axis through the attention mechanism layer to obtain the accumulation processing result; and weight the accumulation processing result. Calculate and obtain the sample embedding representation feature with the fourth feature extraction accuracy.
  • the second-level feature extraction network also includes a loss calculation layer, and the loss calculation layer includes a second loss function;
  • the model training module is also configured to: input the sample embedding representation features with the fourth feature extraction accuracy and the feature label information of the second speech data into the loss calculation layer; determine the second loss result corresponding to the sample embedding representation features with the fourth feature extraction accuracy through the second loss function of the loss calculation layer; based on the second loss result, correct the network parameters in the timing information extraction layer and the attention mechanism layer to obtain the trained second-level feature extraction network.
  • an embodiment of the present application provides a computer program product, which includes a computer program or executable instructions (i.e., computer instructions); the computer program or executable instructions are stored in a computer-readable storage medium.
  • when the processor of the speech recognition device reads the executable instructions from the computer-readable storage medium and executes them, the speech recognition device executes the above-mentioned method of the embodiment of the present application.
  • an embodiment of the present application provides a computer-readable storage medium storing executable instructions.
  • when the executable instructions are executed by a processor, the processor will execute the method provided by the embodiment of the present application, for example, the method shown in FIG. 5.
  • the storage medium can be a computer-readable storage medium, for example, a ferroelectric random access memory (FRAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, a magnetic surface memory, an optical disk, or a compact disk read-only memory (CD-ROM), etc.; it can also be various devices including one or any combination of the above memories.
  • executable instructions may be in the form of a program, software, software module, script or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine or other unit suitable for use in a computing environment.
  • executable instructions may, but need not, correspond to a file in a file system, may be stored as part of a file storing other programs or data, such as one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files storing one or more modules, subroutines, or code portions).
  • executable instructions may be deployed to be executed on one electronic device, or on multiple electronic devices located at one location, or on multiple electronic devices distributed at multiple locations and interconnected by a communication network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A speech recognition method and apparatus, an electronic device, a storage medium and a computer program product, applied to the fields of artificial intelligence and games. The method is executed by an electronic device and includes: performing sliding-window interception on a speech signal to be recognized to obtain at least two sub-speech signals (S501); performing speech feature extraction on each sub-speech signal through a pre-trained embedded feature representation system to obtain sub-speech embedded representation features of the corresponding sub-speech signal (S502), wherein the embedded feature representation system includes a first-level feature extraction network and a second-level feature extraction network, the first-level feature extraction network is used to perform first-level speech feature extraction on the sub-speech signal to obtain first-level speech features, the second-level feature extraction network is used to perform second-level speech feature extraction on the sub-speech signal based on the first-level speech features, and the feature extraction accuracy of the second-level speech feature extraction is greater than that of the first-level speech feature extraction; obtaining the embedded representation feature of each comparison word in a preset comparison word library (S503); performing speech recognition on each sub-speech signal according to the sub-speech embedded representation features and the embedded representation feature of each comparison word to obtain sub-speech recognition results (S504); and determining the speech recognition result corresponding to the speech signal to be recognized according to the sub-speech recognition result of each sub-speech signal (S505).

Description

Speech recognition method and apparatus, electronic device, storage medium and computer program product

CROSS-REFERENCE TO RELATED APPLICATION

This application is based on, and claims priority to, Chinese patent application No. 202211373304.3 filed on November 04, 2022, the entire contents of which are incorporated herein by reference.

Technical Field

The embodiments of the present application relate to the field of Internet technology, and relate to, but are not limited to, a speech recognition method, apparatus, electronic device, storage medium and computer program product.

Background

Speech keyword matching technology aims to identify specific words in a segment of speech based on reference speech, and has always been a research hotspot in the field of speech recognition. At present, speech keyword matching technology is mainly divided into traditional methods and deep learning methods.

Traditional methods mainly include dynamic time warping (DTW, Dynamic Time Warping) methods and related methods. Deep learning methods train an embedding feature extractor through supervised or unsupervised learning, extract the Mel frequency cepstrum coefficients (MFCC, Mel Frequency Cepstrum Coefficient) of the audio with the embedding feature extractor, and determine whether the target audio contains a keyword by computing the similarity between the MFCC features of the target audio and those of the annotated audio.

However, the above traditional methods involve a large amount of computation and their accuracy is easily affected by the external environment, so their recognition accuracy is low; deep learning methods suffer from limited expressive power and low recognition accuracy.

Summary

The embodiments of the present application provide a speech recognition method, apparatus, electronic device, storage medium and computer program product, which are applied at least in the fields of artificial intelligence and games, and can accurately extract the sub-speech embedded representation features of sub-speech signals, so that the speech signal to be recognized can be accurately recognized based on the sub-speech embedded representation features.

The technical solutions of the embodiments of the present application are implemented as follows:

An embodiment of the present application provides a speech recognition method, executed by an electronic device, including: performing sliding-window interception on a speech signal to be recognized to obtain at least two sub-speech signals; performing speech feature extraction on each sub-speech signal through a pre-trained embedded feature representation system to obtain sub-speech embedded representation features of the corresponding sub-speech signal, wherein the embedded feature representation system includes a first-level feature extraction network and a second-level feature extraction network, the first-level feature extraction network is used to perform first-level speech feature extraction on the sub-speech signal to obtain first-level speech features, the second-level feature extraction network is used to perform second-level speech feature extraction on the sub-speech signal based on the first-level speech features, and the feature extraction accuracy of the second-level speech feature extraction is greater than that of the first-level speech feature extraction; obtaining the embedded representation feature of each comparison word in a preset comparison word library; performing speech recognition on each sub-speech signal according to the sub-speech embedded representation features and the embedded representation feature of each comparison word to obtain a sub-speech recognition result; and determining the speech recognition result corresponding to the speech signal to be recognized according to the sub-speech recognition result of each sub-speech signal.

An embodiment of the present application provides a speech recognition apparatus, including: a frame interception module configured to perform sliding-window interception on a speech signal to be recognized to obtain at least two sub-speech signals; a feature extraction module configured to perform speech feature extraction on each sub-speech signal through a pre-trained embedded feature representation system to obtain sub-speech embedded representation features of the corresponding sub-speech signal, wherein the embedded feature representation system includes a first-level feature extraction network and a second-level feature extraction network, the first-level feature extraction network is used to perform first-level speech feature extraction on the sub-speech signal to obtain first-level speech features, the second-level feature extraction network is used to perform second-level speech feature extraction on the sub-speech signal based on the first-level speech features, and the feature extraction accuracy of the second-level speech feature extraction is greater than that of the first-level speech feature extraction; an acquisition module configured to obtain the embedded representation feature of each comparison word in a preset comparison word library; a speech recognition module configured to perform speech recognition on each sub-speech signal according to the sub-speech embedded representation features and the embedded representation feature of each comparison word to obtain a sub-speech recognition result; and a determination module configured to determine the speech recognition result corresponding to the speech signal to be recognized according to the sub-speech recognition result of each sub-speech signal.

An embodiment of the present application provides a speech recognition device, including: a memory for storing executable instructions; and a processor for implementing the above speech recognition method when executing the executable instructions stored in the memory.

An embodiment of the present application provides a computer program product or computer program, which includes executable instructions stored in a computer-readable storage medium; when an electronic device reads the executable instructions from the computer-readable storage medium and executes them, the above speech recognition method is implemented.

An embodiment of the present application provides a computer-readable storage medium storing executable instructions, which, when executed by a processor, cause the processor to implement the above speech recognition method.

The embodiments of the present application have the following beneficial effects: speech feature extraction is performed on each sub-speech signal obtained by sliding-window interception through an embedded feature representation system composed of a first-level feature extraction network and a second-level feature extraction network, to obtain sub-speech embedded representation features; speech recognition is performed on each sub-speech signal according to the sub-speech embedded representation features and the embedded representation feature of each comparison word in a preset comparison word library, to obtain sub-speech recognition results; and the speech recognition result corresponding to the speech signal to be recognized is determined according to the sub-speech recognition results of the at least two sub-speech signals. In this way, since the feature extraction accuracy of the second-level feature extraction network during second-level speech feature extraction is greater than the feature extraction accuracy of the first-level feature extraction network during first-level speech feature extraction, the embedded feature representation system can accurately extract the sub-speech embedded representation features of each sub-speech signal, so that accurate speech recognition can be performed on the speech signal to be recognized based on the sub-speech embedded representation features.
附图说明
图1是相关技术中的一种语音关键词匹配方法的流程示意图;
图2是相关技术中的另一种语音关键词匹配方法的流程示意图;
图3是本申请实施例提供的语音识别系统的一个可选的架构示意图;
图4是本申请实施例提供的电子设备的结构示意图;
图5是本申请实施例提供的语音识别方法的一个可选的流程示意图;
图6是本申请实施例提供的语音识别方法的另一个可选的流程示意图;
图7是本申请实施例提供的嵌入特征表示系统的训练方法的流程示意图;
图8是本申请实施例提供的第一级特征提取网络的训练方法的流程示意图;
图9是本申请实施例提供的第二级特征提取网络的训练方法的流程示意图;
图10是本申请实施例提供的语音关键词匹配系统示意图;
图11是本申请实施例提供的训练wav2vec模型的流程示意图;
图12是本申请实施例提供的训练ecapa-tdnn模型的流程示意图;
图13是本申请实施例提供的wav2vec模型的结构示意图;
图14是本申请实施例提供的ecapa-tdnn模型的结构示意图;
图15是本申请实施例提供的ecapa-tdnn模型中SE-ResBlock部分的结构示意图。
具体实施方式
为了使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请作进一步地详细描述,所描述的实施例不应视为对本申请的限制,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。
在以下的描述中,涉及到“一些实施例”,其描述了所有可能实施例的子集,但是可以理解,“一些实施例”可以是所有可能实施例的相同子集或不同子集,并且可以在不冲突的情况下相互结合。除非另有定义,本申请实施例所使用的所有的技术和科学术语与属于本申请实施例的技术领域的技术人员通常理解的含义相同。本申请实施例所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。
在解释本申请实施例的语音识别方法之前,首先对相关技术中的语音识别方法进行说明。
相关技术中的方案主要包括传统方法和深度学习方法。图1是相关技术中的一种语音关键词匹配方法的流程示意图,如图1所示,传统方法主要基于DTW,首先对关键词语音模版样例和待检索语音进行预处理,包括步骤S101中的梅尔特征提取与步骤S102中的语音活动性检测(VAD,Voice Activity Detection);随后,求取模版样例与待检测样例的DTW得分,即通过步骤S103计算关键词语音模版样例的模板平均,并通过步骤S104进行动态时间规整、通过步骤S105进行置信度得分规整,比较待检索语音与所有关键词语音模版样例的得分,从而根据阈值得到最终的关键词检索结果。
图2是相关技术中的另一种语音关键词匹配方法的流程示意图,如图2所示,在深度学习领域,首先在步骤S201,对待识别的输入语音进行分帧得到多个语音帧;然后,在步骤S202,对每个语音帧进行特征提取,得到每个语音帧的梅尔倒谱特征系数MFCC序列;在步骤S203,并行将每个语音帧的MFCC序列输入到预设的深度神经网络模型,分别计算每个语音帧的MFCC序列在预设的深度神经网络模型的输出层的每个神经单元下的后验概率,将输出层的每个神经单元下的后验概率组成多个语音帧对应的后验概率序列,其中,输出层的每个神经单元对应一个关键词;再然后,在步骤S204,监测输出层每个神经单元下的后验概率序列;最后,在步骤S205,根据后验概率序列与预设阈值的概率序列的比较结果确定待识别的输入语音的关键词。也就是说,在深度学习方法中,是提取训练音频数据的MFCC特征,然后构建相应的深度神经网络,最后基于特征数据训练相应的分类模型。
但是,相关技术中的传统方法和深度学习方法提取嵌入特征的过程,其中DTW的缺陷在于计算量大,易受到外界环境影响;深度学习技术的缺陷在于表达能力有限,准确率不高。且相关技术中的方法在面对复杂的游戏语音时,均存在鲁棒性不高的问题。另外,相关技术中的方法均是基于梅尔特征进行提取的,因此特征提取的准确率不高。由此可见,相关技术中的方法均存在语音识别准确率低的问题。
基于相关技术中的方法所存在的至少一个问题，本申请实施例提供一种语音识别方法，该方法是一种基于预训练模型的游戏语音关键词匹配方法。本申请实施例的方法主要包括两个子模块：无监督预训练模型和有监督嵌入特征提取器。其中，无监督预训练模型的作用是通过在大规模语料上进行对比学习，能够让模型基于充分的数据量，在句子的层面学习到一个具有区分性的嵌入表示特征；有监督嵌入特征提取器的作用是具体化语音匹配的子任务，将中文语料切分成单个字，让网络基于之前学习到的句子层面的特征，进一步学习到单个字的嵌入表达。本申请实施例提取的嵌入表达特征，具备优秀的识别率与泛化能力，能够快速完成语音关键词校验和识别任务。
本申请实施例提供的语音识别方法中,首先,对待识别语音信号进行滑动窗截取,得到至少两个子语音信号;然后,通过预先训练的嵌入特征表示系统,对每一子语音信号进行语音特征提取,得到相应子语音信号的子语音嵌入表示特征;其中,嵌入特征表示系统包括第一级特征提取网络和第二级特征提取网络;第一级特征提取网络用于对子语音信号进行第一级语音特征提取,得到第一级语音特征;第二级特征提取网络用于基于第一级语音特征,对子语音信号进行第二级语音特征提取,第二级语音特征提取的特征提取精度大于第一级语音特征提取的特征提取精度;并且,获取预设比对词库中的每一比对词的嵌入表示特征;再然后,根据子语音嵌入表示特征和每一比对词的嵌入表示特征,对每一子语音信号进行语音识别,得到子语音识别结果;最后,根据每一子语音信号子语音识别结果,确定待识别语音信号对应的语音识别结果。如此,通过具有第一级特征提取网络和第二级特征提取网络构成的嵌入特征表示系统对每一子语音信号进行语音特征提取,从而能够准确的提取到子语音信号的子语音嵌入表示特征,进而基于子语音嵌入表示特征能够对待识别语音信号进行准确的识别。
下面说明本申请实施例的电子设备的示例性应用,本申请实施例提供的电子设备可以是语音识别设备,语音识别设备可以实施为终端,也可以实施为服务器。在一种实现方式中,本申请实施例提供的语音识别设备可以实施为笔记本电脑,平板电脑,台式计算机,移动设备(例如,移动电话,便携式音乐播放器,个人数字助理,专用消息设备,便携式游戏设备)、智能机器人、智能家电和智能车载设备等任意的具备语音数据处理功能和游戏应用运行功能的终端;在另一种实现方式中,本申请实施例提供的语音识别设备还可以实施为服务器,其中,服务器可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(CDN,Content Delivery Network)、以及大数据和人工智能平台等基础云计算服务的云服务器。终端以及服务器可以通过有线或无线通信方式进行直接或间接地连接,本申请实施例中不做限制。下面,将说明电子设备实施为服务器时的示例性应用。
参见图3,图3是本申请实施例提供的语音识别系统的一个可选的架构示意图,本申请实施例以语音识别方法应用于游戏应用为例进行说明。为实现支撑任意一个游戏应用,并对游戏应用运行过程中玩家的语音进行检测和识别,本申请实施例的终端上至少安装有游戏应用。本申请实施例中,语音识别系统10中至少包括终端100、网络200和服务器300,其中服务器300是游戏应用的应用服务器。服务器300可以构成本申请实施例的电子设备。终端100通过网络200连接服务器300,网络200可以是广域网或者局域网,又或者是二者的组合。在运行游戏应用时,终端100运行游戏应用并生成游戏语音数据,其中,游戏语音数据中包括游戏运行语音和玩家间说话和沟通的语音,终端100在获取到游戏语音数据后,将游戏语音数据作为待识别语音信号封装至语音识别请求中,通过网络200将语音识别请求发送给服务器300,请求服务器300对游戏语音数据进行语音识别,判断游戏语音数据中是否含有脏话或不文明用语。服务器300在接收到语音识别请求之后,响应于语音识别请求,对待识别语音信号进行滑动窗截取,得到至少两个子语音信号;并通过预先训练的嵌入特征表示系统,对每一子语音信号进行语音特征提取,得到相应子语音信号的子语音嵌入表示特征;同时,获取预设比对词库中的每一比对词的嵌入表示特征;根据子语音嵌入表示特征和每一比对词的嵌入表示特 征,对每一子语音信号进行语音识别,得到子语音识别结果;最后,根据至少两个子语音信号的子语音识别结果,确定待识别语音信号对应的语音识别结果。在得到语音识别结果后,将语音识别结果发送给终端100。终端100可以基于语音识别结果生成相应的提醒信息并显示提醒信息。
在一些实施例中,上述语音识别过程还可以由终端100来实现,即终端在采集到游戏语音数据后,将游戏语音数据作为待识别语音信号进行语音识别,即通过终端对待识别语音信号进行滑动窗截取,得到至少两个子语音信号;以及,由终端实现通过预先训练的嵌入特征表示系统,对每一子语音信号进行语音特征提取,得到子语音嵌入表示特征;然后,终端获取预设比对词库中的每一比对词的嵌入表示特征;并根据子语音嵌入表示特征和每一比对词的嵌入表示特征,对每一子语音信号进行语音识别,得到子语音识别结果;最后,终端根据至少两个子语音信号的子语音识别结果,确定待识别语音信号对应的语音识别结果。
本申请实施例所提供的语音识别方法还可以基于云平台并通过云技术来实现,例如,上述服务器300可以是云端服务器。通过云端服务器对待识别语音信号进行滑动窗截取,或者,通过云端服务器对每一子语音信号进行语音特征提取,得到子语音嵌入表示特征,或者,通过云端服务器获取预设比对词库中的每一比对词的嵌入表示特征,或者,通过云端服务器根据子语音嵌入表示特征和每一比对词的嵌入表示特征,对每一子语音信号进行语音识别,或者,通过云端服务器根据至少两个子语音信号的子语音识别结果,确定待识别语音信号对应的语音识别结果等。
在一些实施例中,还可以具有云端存储器,可以将待识别语音信号存储至云端存储器中,或者,还可以将预先训练的嵌入特征表示系统、该嵌入特征表示系统的参数和预设比对词库存储至云端存储器中,或者,还可以将子语音识别结果和语音识别结果等存储至云端存储器中。这样,在运行游戏应用的过程中,可以直接从云端存储器中获取预先训练的嵌入特征表示系统、该嵌入特征表示系统的参数和预设比对词库,对待识别语音信号进行语音识别,如此,能够极大的提高数据的读取效率,提高语音识别效率。
这里需要说明的是,云技术(Cloud technology)是指在广域网或局域网内将硬件、软件、网络等系列资源统一起来,实现数据的计算、储存、处理和共享的一种托管技术。云技术基于云计算商业模式应用的网络技术、信息技术、整合技术、管理平台技术、应用技术等的总称,可以组成资源池,按需所用,灵活便利。云计算技术将变成重要支撑。技术网络系统的后台服务需要大量的计算、存储资源,如视频网站、图片类网站和更多的门户网站。伴随着互联网行业的高度发展和应用,将来每个物品都有可能存在自己的识别标志,都需要传输到后台系统进行逻辑处理,不同程度级别的数据将会分开处理,各类行业数据皆需要强大的系统后盾支撑,只能通过云计算来实现。
图4是本申请实施例提供的电子设备的结构示意图,图4所示的电子设备可以是语音识别设备,其中,电子设备包括:至少一个处理器310、存储器350、至少一个网络接口320和用户接口330。电子设备中的各个组件通过总线系统340耦合在一起。可理解,总线系统340用于实现这些组件之间的连接通信。总线系统340除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图4中将各种总线都标为总线系统340。
处理器310可以是一种集成电路芯片,具有信号的处理能力,例如通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,其中,通用处理器可以是微处理器或者任何常规的处理器等。
用户接口330包括使得能够呈现媒体内容的一个或多个输出装置331，以及一个或多个输入装置332。
存储器350可以是可移除的,不可移除的或其组合。示例性的硬件设备包括固态存储器,硬盘驱动器,光盘驱动器等。存储器350可选地包括在物理位置上远离处理器310的一个或多个存储设备。存储器350包括易失性存储器或非易失性存储器,也可包括易失性和非易失性存储器两者。非易失性存储器可以是只读存储器(ROM,Read Only Memory),易失性存储器可以是随机存取存储器(RAM,Random Access Memory)。本申请实施例描述的存储器350旨在包括任意适合类型的存储器。在一些实施例中,存储器350能够存储数据以支持各种操作,这些数据的示例包括程序、模块和数据结构或者其子集或超集,下面示例性说明。
操作系统351,包括用于处理各种基本系统服务和执行硬件相关任务的系统程序,例如框架层、核心库层、驱动层等,用于实现各种基础业务以及处理基于硬件的任务;网络通信模块352,用于经由一个或多个(有线或无线)网络接口320到达其他计算设备,示例性的网络接口320包括:蓝牙、无线相容性认证(WiFi)、和通用串行总线(USB,Universal Serial Bus)等;输入处理模块353,用于对一个或多个来自一个或多个输入装置332之一的一个或多个用户输入或互动进行检测以及翻译所检测的输入或互动。
在一些实施例中,本申请实施例提供的装置可采用软件方式实现,图4示出了存储在存储器350中的一种语音识别装置354,该语音识别装置354可以是电子设备中的语音识别装置,其可以是程序和插件等形式的软件,包括以下软件模块:帧截取模块3541、特征提取模块3542、获取模块3543、语音识别模块3544和确定模块3545,这些模块是逻辑上的,因此根据所实现的功能可以进行任意的组合或进一步拆分。将在下文中说明各个模块的功能。
在另一些实施例中,本申请实施例提供的装置可以采用硬件方式实现,作为示例,本申请实施例提供的装置可以是采用硬件译码处理器形式的处理器,其被编程以执行本申请实施例提供的语音识别方法,例如,硬件译码处理器形式的处理器可以采用一个或多个应用专用集成电路(ASIC,Application Specific Integrated Circuit)、DSP、可编程逻辑器件(PLD,Programmable Logic Device)、复杂可编程逻辑器件(CPLD,Complex Programmable Logic Device)、现场可编程门阵列(FPGA,Field-Programmable Gate Array)或其他电子元件。
本申请各实施例提供的语音识别方法可以由电子设备来执行,其中,该电子设备可以是任意一种具备语音数据处理功能的终端,或者也可以是服务器,即本申请各实施例的语音识别方法可以通过终端来执行,也可以通过服务器来执行,或者还可以通过终端与服务器进行交互来执行。
参见图5,图5是本申请实施例提供的语音识别方法的一个可选的流程示意图,下面将结合图5示出的步骤进行说明,需要说明的是,图5中的语音识别方法是通过服务器作为执行主体为例来说明的,包括以下步骤S501至步骤S505:
步骤S501,对待识别语音信号进行滑动窗截取,得到至少两个子语音信号。
这里,待识别语音信号可以是游戏场景下的游戏语音所对应的语音信号,可以在运行游戏应用的过程中采集游戏语音,并对游戏语音进行语音信号提取,得到待识别语音信号。
本申请实施例的方法可以应用于以下游戏语音中特定类型的语音识别场景，其中，特定类型的语音识别场景可以根据实际语音识别任务来确定，也就是说，特定类型的语音识别场景可以是任意一种类型的语音识别场景，例如可以是脏话识别场景、不文明用语识别场景、游戏用语识别场景、游戏激烈程度识别场景等。
这里以脏话识别场景为例对本申请实施例的应用场景进行说明:在运行游戏应用的 过程中,由于玩家之间可以进行语音通话,为了保证游戏能够在一个良性和健康的环境下运行,可以实时的判断玩家在玩游戏过程中的语音中是否存在脏话或者不文明用语,从而及时发现玩家的不文明语言行为,对玩家进行及时的提醒,以保证游戏的良性运行。在进行脏话或者不文明用语识别时,则可以采用本申请实施例提供的语音识别方法来实现,即将玩家之间的语音作为待识别语音,通过本申请实施例提供的语音识别方法对待识别语音进行脏话或者不文明用语识别,确定玩家之间的语音中是否存在脏话或者不文明用语。
本申请实施例中,待识别语音信号中可以不仅包括玩家的对话语音,还可以包括游戏运行场景下的游戏运行语音,这里,游戏运行语音包括但不限于:技能释放时的语音、特效语音、虚拟英雄发出的语音、使用任意道具时生成的语音等。也就是说,可以通过游戏引擎获取玩家游戏运行环境下的游戏运行语音,并通过终端上的语音采集装置采集玩家的对话语音,然后,将游戏运行语音与对话语音叠加之后构成待识别语音。
滑动窗截取是指通过具有预设步长的滑动窗遍历待识别语音信号,每次截取到与滑动窗具有相同步长的一段子语音信号。
在一种实现方式中,可以在每次截取到一段子语音信号之后,采用本申请实施例的后续步骤对该子语音信号进行语音识别,得到子语音识别结果。之后,再通过滑动窗截取得到另一段子语音信号,并继续对该段子语音信号进行语音识别,如此循环往复,直至完成对待识别语音信号中的每一段子语音信号的语音识别过程。
在另一种实现方式中,可以对待识别语音信号执行多次滑动窗截取过程,对应得到多个子语音信号,并按照子语音信号在待识别语音信号中的先后顺序,为每一子语音信号添加识别标识。该识别标识用于区分子语音信号与其他子语音信号,且该识别标识还能够识别出子语音信号与其他子语音信号在待识别语音信号中的相对先后位置。在得到多个子语音信号之后,基于每一子语音信号的识别标识,按照子语音信号在待识别语音信号中的相对先后位置,依次对每一子语音信号进行语音识别,对应得到多个子语音识别结果。
这里需要说明的是,在进行滑动窗截取子语音信号时,相邻两次截取过程中得到的两个子语音信号在待识别语音信号中是相邻的两段信号,也就是说,在进行滑动窗截取子语音信号时,是从待识别语音信号的信号开始位置依次进行截取,且截取的过程中不会丢失待识别语音信号的任意一段信号。
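作为对上述滑动窗截取过程的一个极简示意（并非本申请实施例的正式实现，窗长、采样率等参数均为假设值），可以按与窗长相同的预设步长遍历待识别语音信号，并为每一段子语音附加顺序识别标识：
```python
import numpy as np

def sliding_window_split(signal: np.ndarray, win_len: int):
    """按与窗长相同的预设步长遍历待识别语音信号，依次截取首尾相接的子语音信号。"""
    segments = []
    for idx, start in enumerate(range(0, len(signal), win_len)):
        seg = signal[start:start + win_len]   # 末尾不足一个窗长的片段此处从简保留
        # idx 作为识别标识，记录该子语音信号在待识别语音信号中的相对先后位置
        segments.append((idx, seg))
    return segments

# 示意用法：假设采样率为16kHz、窗长为1秒（均为假设参数）
# sub_signals = sliding_window_split(speech_signal, win_len=16000)
```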
步骤S502,通过预先训练的嵌入特征表示系统,对每一子语音信号进行语音特征提取,得到相应子语音信号的子语音嵌入表示特征。
这里,嵌入特征表示系统包括第一级特征提取网络和第二级特征提取网络;第一级特征提取网络用于对子语音信号进行第一级语音特征提取;第二级特征提取网络用于基于第一级语音特征提取时得到的第一级语音特征,对子语音信号进行第二级语音特征提取,第二级语音特征提取的特征提取精度大于第一级语音特征提取的特征提取精度。
本申请实施例中,可以将每一子语音信号输入至嵌入特征表示系统中,通过嵌入特征表示系统中的第一级特征提取网络和第二级特征提取网络依次对子语音信号进行第一级语音特征提取和第二级语音特征提取,也就是说,依次对子语音信号进行粗精度的语音特征提取和细精度的语音特征提取,得到子语音嵌入表示特征。
这里,子语音嵌入表示特征是指对子语音信号进行数据转换后得到的具有固定大小的特征表示(通常为矢量形式),子语音嵌入表示特征能够便于进行后续的处理和计算。在实现的过程中,可以通过特征嵌入方式来得到子语音嵌入表示特征,特征嵌入即将输入数据转换(例如可以是降维处理)为固定大小的特征表示(矢量形式),以便于处理和计算(例如,用于求距离等)。举例来说,针对用于说话者识别的语音信号训练的模 型,可以允许将语音片段转换为数字向量,使得来自相同说话者的另一语音片段与转换得到的数字向量具有较小的距离(例如,欧几里德距离),例如,来自相同说话者的另一语音片段与转换得到的数字向量之间的距离小于预设的距离阈值。特征嵌入的主要目的是对输入特征进行降维,降维的方式可以是使用一个全连接层进行全连接处理之后,再通过嵌入层进行权重矩阵计算,从而实现降低维度的过程。
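下面给出一个关于“全连接层+嵌入层降维”的示意性草图，仅用于说明特征嵌入的含义；其中的网络结构、维度与变量名均为假设，并非本申请实施例的实际参数：
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleEmbedder(nn.Module):
    """示意：先经全连接层处理，再通过嵌入（投影）层的权重矩阵计算完成降维。"""
    def __init__(self, in_dim: int = 512, hidden_dim: int = 256, emb_dim: int = 128):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden_dim)     # 全连接处理
        self.proj = nn.Linear(hidden_dim, emb_dim)  # 嵌入层：降维到固定大小的矢量

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        emb = self.proj(torch.relu(self.fc(x)))
        return F.normalize(emb, dim=-1)             # 归一化后便于计算距离或余弦相似度

# 来自相同说话者的两段语音，其嵌入向量之间的欧氏距离应小于预设的距离阈值
# distance = torch.dist(embedder(x1), embedder(x2))
```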
这里需要说明的是,第一级特征提取网络可以是一种无监督预训练模型,第一级特征提取网络会预先基于大规模的无标注语音进行自监督预训练,得到训练后的第一级特征提取网络。第二级特征提取网络可以是基于训练后的第一级特征提取网络进行特征提取后,再进行模型训练后得到的模型。在实现的过程中,可以通过训练后的第一级特征提取网络,对单字语音数据集中的单字语音进行上述粗精度(即第一级语音特征提取时的特征提取精度)的语音特征提取,得到单字语音的嵌入表示特征,然后将单字语音的嵌入表示特征作为第二级特征提取网络的输入特征,输入至第二级特征提取网络中,通过第二级特征提取网络对单字语音进行细精度(即第二级语音特征提取时的特征提取精度)的语音特征提取。关于第一级特征提取网络、第二级特征提取网络以及嵌入特征表示系统的训练过程,将在下文中进行详细说明。
本申请实施例中,在对子语音信号进行语音特征提取时,由于可以直接将子语音信号输入至嵌入特征表示系统中进行特征提取,所提取到的是子语音信号的嵌入表示特征,而无需提取子语音信号的梅尔特征。如此,能够极大的降低模型的计算量,且提取的嵌入表示特征能够更加准确的表达子语音信号中的语音信息,因此,能够对子语音信号进行准确的语音特征提取。
本申请实施例中,可以将至少两个子语音信号中的每一子语音信号依次输入至预先训练的嵌入特征表示系统中,通过预先训练的嵌入特征表示系统对每一子语音信号进行语音特征提取,得到多个子语音嵌入表示特征。
需要说明的是,特征提取精度用于反映语音特征提取过程中,所提取的嵌入表示特征所能够反映相应的子语音信号的准确度。对于粗精度的语音特征提取过程,所提取到的嵌入表示特征能够反映相应的子语音信号较少的信息(例如,可以是所提取到的嵌入表示特征能够反映相应的子语音信号小于信息量阈值),从而使得所提取的嵌入表示特征能够反映相应的子语音信号的信息的准确度低于准确度阈值;对于细精度的语音特征提取过程,所提取到的嵌入表示特征能够反映相应的子语音信号较多的信息(例如,可以是所提取到的嵌入表示特征能够反映相应的子语音信号大于或等于信息量阈值),从而使得所提取的嵌入表示特征能够反映相应的子语音信号的信息的准确度高于准确度阈值。
步骤S503,获取预设比对词库中的每一比对词的嵌入表示特征。
这里,预设比对词库中包括多个比对词,预设比对词库中的比对词具有特定的属性信息,即预设比对词库中的比对词是属于特定类型的词。举例来说,当需要对待识别语音信号进行脏话识别时,预设比对词库中的比对词为预先采集和存储的脏话词,即预设比对词库可以是脏词库;当需要对待识别语音信号进行赞美词识别时,预设比对词库中的比对词为预设采集和存储的赞美词,即预设比对词库可以是赞美词库;当需要对待识别语音信号进行游戏指令识别时,预设比对词库中的比对词可以是预先采集和存储的游戏指令相关的词,即预设比对词库可以是游戏指令词库。
在一些实施例中,在预设比对词库中,可以存储有每一比对词的比对词语音或者比对词语音信号,可以对比对词语音进行语音信号识别,得到比对词语音对应的比对词语音信号,进而可以对比对词语音信号进行语音特征提取,得到比对词的嵌入表示特征。
在实现的过程中，可以采用上述预先训练的嵌入特征表示系统对预设比对词库中的每一比对词的比对词语音信号进行语音特征提取，得到每一比对词的嵌入表示特征，也即每一比对词语音信号的嵌入表示特征。
步骤S504,根据子语音嵌入表示特征和每一比对词的嵌入表示特征,对每一子语音信号进行语音识别,得到子语音识别结果。
这里,可以将子语音嵌入表示特征与比对词的嵌入表示特征进行比较,从而得到子语音识别结果。在进行比较时,可以计算子语音嵌入表示特征与比对词的嵌入表示特征之间的余弦相似度,基于余弦相似度确定子语音识别结果。本申请实施例中,可以计算每一子语音信号的子语音嵌入表示特征与每一比对词的嵌入表示特征之间的余弦相似度。
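余弦相似度的计算可以用如下示意代码表示（函数名与数据结构均为假设，仅作说明之用）：
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """余弦相似度：两个嵌入向量夹角的余弦值，取值范围为[-1, 1]。"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def score_against_lexicon(sub_emb: np.ndarray, word_embs: dict) -> dict:
    """返回子语音嵌入表示特征与预设比对词库中每一比对词嵌入表示特征的余弦相似度。"""
    return {word: cosine_similarity(sub_emb, emb) for word, emb in word_embs.items()}
```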
本申请实施例中,在根据子语音嵌入表示特征和每一比对词的嵌入表示特征,对每一子语音信号进行语音识别,得到子语音识别结果时,包括但不限于以下几种实现方式:
在第一种实现方式中,对于每一子语音信号来说,在得到该子语音信号的子语音嵌入表示特征与每一比对词的嵌入表示特征之间的余弦相似度之后,还可以基于余弦相似度对比对词进行排序,形成比对词序列;然后,提取比对词序列中的前N个比对词,N为大于1的整数;最后,比较该子语音信号的子语音嵌入表示特征,与这前N个比对词的嵌入表示特征之间的余弦相似度,如果这N个余弦相似度均大于相似度阈值,则表明该子语音信号对应的子语音中含有与预设比对词库中的比对词具有相同属性的语音词。本申请实施例中,一方面,由于在基于余弦相似度形式比对词序列之后,是筛选出前N个比对词,N远小于预设比对词库中全部比对词的总数量,因此,在与相似度阈值进行比较时,只需要比较N个余弦相似度是否大于相似度阈值,显然会极大的降低数据比较的数据计算量,提高语音识别的效率。另一方面,由于N大于1,因此是在存在多个比对词的余弦相似度均大于相似度阈值的情况下,认定子语音信号含有与预设比对词库中的比对词具有相同属性的语音词,如此,基于多个比对词的余弦相似度的结果进行识别和验证,能够保证语音识别的准确率,避免在计算与个别比对词的余弦相似度存在误差的情况下,对本申请实施例语音识别结果准确性的影响。
在第二种实现方式中,对于每一子语音信号来说,在得到该子语音信号的子语音嵌入表示特征与每一比对词的嵌入表示特征之间的余弦相似度之后,可以获取预设的相似度阈值;然后,筛选出余弦相似度大于相似度阈值的全部比对词,并获取这全部比对词的数量,当全部比对词的数量大于数量阈值时,则表明该子语音信号对应的子语音中含有与预设比对词库中的比对词具有相同属性的语音词。本申请实施例中,通过相似度阈值和数量阈值这两重判断,能够在保证余弦相似度高的情况下,判断出具有较多相似比对词的情况,也就是说,在预设比对词库中,存在大量与子语音信号的子语音嵌入表示特征之间具有较高余弦相似度的比对词。如此,基于这两个阈值的双重判断,能够对子语音信号对应的子语音中是否含有与预设比对词库中的比对词具有相同属性的语音词进行准确的判断,进而提高语音识别的准确率。
在第三种实现方式中,对于每一子语音信号来说,可以依次计算该子语音信号的子语音嵌入表示特征与每一比对词的嵌入表示特征之间的余弦相似度,且在每计算出一个余弦相似度之后,即对该余弦相似度进行判断,判断余弦相似度是否大于相似度阈值;只要判断出该子语音信号的子语音嵌入表示特征与任一比对词的嵌入表示特征之间的余弦相似度,大于相似度阈值时,停止计算该子语音信号的子语音嵌入表示特征与剩余比对词的嵌入表示特征之间的余弦相似度,并且,确定出该子语音信号对应的子语音中含有与预设比对词库中的比对词具有相同属性的语音词。本申请实施例中,可以预先定义只要存在至少一个比对词的嵌入表示特征与子语音嵌入表示特征之间的余弦相似度大于相似度阈值,即认为子语音信号对应的子语音中含有与预设比对词库中的比对词具 有相同属性的语音词,也就是说,只要检测到一个比对词的嵌入表示特征与子语音嵌入表示特征之间的余弦相似度大于相似度阈值,就可以认为子语音信号对应的子语音中含有与预设比对词库中的比对词具有相同属性的语音词。本申请实施例在实现的过程中,通过边计算余弦相似度边进行判断,一旦确定出存在一个计算出的余弦相似度大于相似度阈值,即停止继续对其他比对词的余弦相似度进行计算,如此,能够极大的提高检测的效率,进而提高语音识别的效率。
在第四种实现方式中,对于每一子语音信号来说,首先初始化计数器为0;然后,依次计算该子语音信号的子语音嵌入表示特征与每一比对词的嵌入表示特征之间的余弦相似度,且在每计算出一个余弦相似度之后,即对该余弦相似度进行判断,判断余弦相似度是否大于相似度阈值;只要判断出该子语音信号的子语音嵌入表示特征与任一比对词的嵌入表示特征之间的余弦相似度,大于相似度阈值时,对计数器进行加一。如此循环往复直至计数器的计数值大于等于数值阈值时,停止计算该子语音信号的子语音嵌入表示特征与剩余比对词的嵌入表示特征之间的余弦相似度,并且,确定出该子语音信号对应的子语音中含有与预设比对词库中的比对词具有相同属性的语音词。这里,数值阈值为大于1的整数。本申请实施例中,通过使用计数器对判断结果进行计数,即每计算出一个余弦相似度并对该余弦相似度与相似度阈值进行判断之后,基于判断结果对计数器进行计数更新(即满足余弦相似度大于相似度阈值这一条件时,计数器加一;不满足余弦相似度大于相似度阈值这一条件时,计数器数值不变),如此,至少具有以下有益效果:一方面,实现了通过相似度阈值和数值阈值的两重判断,能够在保证余弦相似度高的情况下,判断出具有较多相似比对词的情况,从而能够在预设比对词库中,对存在大量与子语音信号的子语音嵌入表示特征之间具有较高余弦相似度的比对词的情况进行准确的识别;另一方面,由于每计算一个余弦相似度进行一次判断和计数器计数,一旦计数器的计数值大于等于数值阈值时,停止计算余弦相似度,也就是说,无需计算出子语音嵌入表示特征与预设比对词库中的每一比对词的嵌入表示特征之间的余弦相似度,从而能够极大的降低计算余弦相似度的数据计算量,提高语音识别的效率。
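以上述第四种实现方式为例，下面给出一个“边计算边判断、计数达到数值阈值即提前停止”的示意性草图（沿用前文示意的cosine_similarity函数，相似度阈值与数值阈值的取值均为假设）：
```python
def match_with_counter(sub_emb, word_embs, sim_threshold=0.8, count_threshold=3):
    """边计算边判断：命中（余弦相似度大于相似度阈值）计数达到数值阈值即提前停止。"""
    hit_count = 0                                    # 初始化计数器为0
    for word, emb in word_embs.items():
        if cosine_similarity(sub_emb, emb) > sim_threshold:
            hit_count += 1
            if hit_count >= count_threshold:
                # 认定该子语音中含有与比对词库中比对词具有相同属性的语音词，停止后续计算
                return True
    return False
```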
步骤S505,根据至少两个子语音信号的子语音识别结果,确定待识别语音信号对应的语音识别结果。
这里,在得到每一子语音信号的子语音识别结果之后,对至少两个子语音信号的子语音识别结果进行结果综合处理,得到待识别语音信号对应的语音识别结果。
在进行结果综合处理时,可以是当子语音嵌入表示特征与任一比对词的嵌入表示特征之间的余弦相似度大于相似度阈值时,确定子语音信号的子语音识别结果为特定识别结果,即确定出该子语音信号对应的子语音中含有与预设比对词库中的比对词具有相同属性的语音词。或者,可以是当子语音嵌入表示特征与预设数量的比对词的嵌入表示特征之间的余弦相似度大于相似度阈值时,确定子语音信号的子语音识别结果为特定识别结果,即确定出该子语音信号对应的子语音中含有与预设比对词库中的比对词具有相同属性的语音词。
本申请实施例提供的语音识别方法,通过预先训练的嵌入特征表示系统,对滑动窗截取后得到的每一子语音信号进行语音特征提取,得到子语音嵌入表示特征;并根据子语音嵌入表示特征和预设比对词库中的每一比对词的嵌入表示特征,对每一子语音信号进行语音识别,得到子语音识别结果;从而根据至少两个子语音信号的子语音识别结果,确定待识别语音信号对应的语音识别结果。如此,通过具有第一级特征提取网络和第二级特征提取网络构成的嵌入特征表示系统对每一子语音信号进行语音特征提取,从而能够准确的提取到子语音信号的子语音嵌入表示特征,进而基于子语音嵌入表示特征能够对待识别语音信号进行准确的识别。
在一些实施例中,语音识别系统中至少包括终端和服务器,其中,该语音识别方法可以用于对游戏应用运行过程中生成的游戏语音数据进行语音识别,以确定游戏语音数据中是否存在特定类型的用语(例如脏话和不文明用语)的情况;或者,还可以用于对电竞场景下生成的电竞语音进行语音识别,以确定电竞语音中是否存在脏话或不文明用语的情况;或者,还可以用于在短视频场景下对短视频中的短视频语音进行语音识别,以确定短视频语音中是否存在脏话或不文明用语的情况;当然也可以应用于其他类似的存在语音以及需要进行语音识别的场景。
在实现的过程中，终端上可以运行有游戏应用，在运行游戏应用的过程中采集得到游戏语音数据，并获取游戏语音数据对应的语音信号，得到待识别语音信号，从而采用本申请实施例的方法对待识别语音信号进行语音识别。
图6是本申请实施例提供的语音识别方法的另一个可选的流程示意图,如图6所示,方法包括以下步骤S601至步骤S613:
步骤S601,终端在运行游戏应用的过程中,获取游戏应用的游戏运行语音,以及,采集玩家的用户语音。
这里,终端在运行游戏应用的过程中,可以获取游戏应用的游戏运行语音,游戏运行语音包括但不限于:技能释放时的语音、特效语音、虚拟英雄发出的语音、使用任意道具时生成的语音等。在实现的过程中,可以通过游戏引擎直接获取到游戏运行语音。
本申请实施例中,终端在运行游戏应用的过程中,还可以通过终端上的语音采集装置采集玩家的对话语音,即采集得到用户语音。这里,用户语音是指游戏运行过程中玩家间说话和沟通的语音,用户语音可以仅包括当前玩家自己的语音,还可以包括当前游戏场景下的全部玩家的语音。
步骤S602,终端对游戏运行语音和用户语音进行叠加,形成游戏语音数据。
这里,对游戏运行语音和用户语音进行叠加可以是在时间维度上,将游戏运行语音和用户语音融合成在时间轴上的一段融合后的游戏语音数据,该游戏语音数据中不仅包括游戏运行语音,还包括用户语音。
步骤S603,终端将游戏语音数据对应的语音信号作为待识别语音信号封装至语音识别请求中。
步骤S604,终端将语音识别请求发送给服务器。
步骤S605,服务器解析语音识别请求,得到待识别语音信号。
步骤S606,服务器采用具有预设步长的滑动窗,对待识别语音信号进行分帧处理,得到至少两个子语音信号,其中,至少两个子语音信号具有相同的帧长。
这里,可以采用具有预设步长的滑动窗遍历待识别语音信号,每次截取到与滑动窗具有相同步长的一段子语音信号。也就是说,将原始的待识别语音信号分成大小固定的多段子语音信号,这里每一段子语音信号都可以被称为一帧,帧长一般取10ms到30ms。全部子语音信号连接后构成原始的待识别语音信号。
在一些实施例中,在对待识别语音信号执行多次滑动窗截取的过程中,对应得到了多个子语音信号,还可以按照子语音信号在待识别语音信号中的先后顺序,为每一子语音信号添加识别标识。该识别标识用于区分子语音信号与其他子语音信号,且该识别标识还能够识别出子语音信号与其他子语音信号在待识别语音信号中的相对先后位置。
在一些实施例中，在对待识别语音信号进行分帧处理之后，还可以获取预设窗函数；并采用预设窗函数对每一子语音信号进行平滑处理，对应得到至少两个平滑处理后的子语音信号。这里，平滑处理也可以称为加窗处理，加窗处理是指在对待识别语音信号分帧后，为了使帧与帧之间平滑过渡、保持相邻帧之间的连续性，也就是消除各个帧两端可能会造成的信号不连续性，即谱泄露（spectral leakage），通过预设窗函数来减小谱泄露，预设窗函数可以减少截断带来的影响。
本申请实施例中，可以将每一帧代入预设窗函数，形成加窗语音信号sw(n)=s(n)*w(n)，其中，sw(n)为加窗语音信号，即平滑处理后的子语音信号；s(n)为每一帧，即每一子语音信号；w(n)为预设窗函数。在一些实施例中，预设窗函数可以包括矩形窗和汉明窗。
需要说明的是,在后续对每一子语音信号进行语音特征提取时,可以是对每一平滑处理后的子语音信号进行语音特征提取。也就是说,是基于平滑处理后的子语音信号进行后续的语音识别步骤。
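上文的加窗处理sw(n)=s(n)*w(n)可以用如下极简示意实现（此处以汉明窗与矩形窗为例，并非本申请实施例的限定实现）：
```python
import numpy as np

def apply_window(frame: np.ndarray, kind: str = "hamming") -> np.ndarray:
    """对单帧子语音信号加窗：sw(n) = s(n) * w(n)，以减小分帧截断带来的谱泄露。"""
    if kind == "hamming":
        w = np.hamming(len(frame))   # 汉明窗
    else:
        w = np.ones(len(frame))      # 矩形窗
    return frame * w
```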
步骤S607,服务器将每一子语音信号输入至第一级特征提取网络中,通过第一级特征提取网络,对子语音信号进行第一级嵌入特征提取,得到具有第一特征提取精度的嵌入表示特征。
步骤S608,服务器将具有第一特征提取精度的嵌入表示特征,输入至第二级特征提取网络中,通过第二级特征提取网络,对子语音信号进行第二级嵌入特征提取,得到具有第二特征提取精度的嵌入表示特征;第一特征提取精度小于第二特征提取精度。
这里,嵌入特征表示系统包括第一级特征提取网络和第二级特征提取网络;第一级特征提取网络用于对子语音信号进行第一级语音特征提取;第二级特征提取网络用于基于第一级语音特征提取时得到的第一级语音特征,对子语音信号进行第二级语音特征提取,第二级语音特征提取的特征提取精度大于第一级语音特征提取的特征提取精度。特征提取精度用于反映语音特征提取过程中,所提取的嵌入表示特征所能够反映相应的子语音信号的准确度。
第一级特征提取网络是一种无监督预训练模型,第一级特征提取网络会预先基于大规模的无标注语音进行自监督预训练,得到训练后的第一级特征提取网络。第二级特征提取网络是基于训练后的第一级特征提取网络进行特征提取后,再进行模型训练后得到的。
本申请实施例中,具有第二特征提取精度的嵌入表示特征构成相应子语音信号的子语音嵌入表示特征。
步骤S609,服务器获取预设比对词库中的每一比对词的嵌入表示特征。
在一些实施例中,预设比对词库中包括多个比对词,预设比对词库中的比对词具有特定的属性信息,即预设比对词库中的比对词是属于特定类型的词。预设比对词库中包括每一比对词的比对词语音信号。可以通过预先训练的嵌入特征表示系统,对每一比对词的比对词语音信号进行语音特征提取,得到每一比对词的嵌入表示特征。
步骤S610,服务器根据子语音嵌入表示特征和每一比对词的嵌入表示特征,对每一子语音信号进行语音识别,得到子语音识别结果。
在一些实施例中,对每一所述子语音信号进行语音识别,可以通过以下方式实现:
首先,确定子语音嵌入表示特征与每一比对词的嵌入表示特征之间的相似度(例如可以是余弦相似度);然后,当子语音嵌入表示特征与任一比对词的嵌入表示特征之间的相似度大于相似度阈值时,确定子语音信号的子语音识别结果为特定识别结果;这里,特定识别结果用于表征:子语音信号对应的子语音中含有与预设比对词库中的比对词具有相同属性的语音词。也就是说,特定识别结果用于表征子语音信号对应的子语音中含有特定的语音词,该特定的语音词是与预设比对词库中的比对词具有相同属性的语音词。
举例来说,当预设比对词库中的比对词为预先采集和存储的脏话词时,如果子语音信号的子语音识别结果为特定识别结果,则表明子语音信号对应的子语音中含有脏话词;当预设比对词库中的比对词为预设采集和存储的赞美词时,如果子语音信号的子语音识别结果为特定识别结果,则表明子语音信号对应的子语音中含有赞美词;当预设比对词 库中的比对词可以是预先采集和存储的游戏指令相关的词时,如果子语音信号的子语音识别结果为特定识别结果,则表明子语音信号对应的子语音中含有游戏指令。
步骤S611,服务器根据至少两个子语音信号的子语音识别结果,确定待识别语音信号对应的语音识别结果。
本申请实施例中,当任一子语音信号的子语音识别结果为特定识别结果时,确定待识别语音信号对应的语音识别结果为特定识别结果。或者,当具有预设数量的子语音信号的子语音识别结果为特定识别结果时,确定待识别语音信号对应的语音识别结果为特定识别结果,预设数量为大于1的整数。
步骤S612,服务器将语音识别结果发送给终端。
步骤S613,终端基于语音识别结果生成提醒信息,并显示提醒信息。
这里,当语音识别结果为待识别语音中包含有与预设比对词库中的比对词具有相同属性的语音词时,生成与该语音识别结果对应的提醒信息并显示提醒信息,以提醒玩家。
在实现的过程中,可以以弹窗的形式显示提醒信息,也可以在当前游戏界面中显示提醒信息。提醒信息可以是以文字的形式呈现、以特效图的形式呈现、以特效视频或者特定提醒视频的形式呈现,在一些实施例中,提醒信息也可以以语音的形式输出。
举例来说,当检测到用户的游戏语音(即待识别语音信号)中含有脏话词时,以弹窗的形式发送提醒信息“请您注意文明用语”等文字提醒,或者,还可以在当前游戏界面中弹出特效图片,提醒用户注意文明用语,或者,还可以在当前游戏界面中播放预先制作的脏话提醒视频,以提醒玩家注意文明用语,或者,还可以语音提醒玩家。
在一些实施例中,当检测到玩家的游戏语音中含有脏话词时,在生成并显示提醒信息的过程中,还可以添加惩罚机制,以进一步提醒玩家注意文明用语。这里惩罚机制包括但不限于:在显示提醒信息的时间段内,玩家不能对当前游戏场景下的任一对象进行操作,即在显示提醒信息的时间段内,玩家处于不可操作状态;待提醒信息显示结束后,玩家才能够重新进入当前游戏场景。
在一些实施例中,还可以确定玩家当前所发出的游戏语音中所包含的脏话词的数量和脏话强度,如果数量大于数量阈值,或者脏话强度大于强度阈值,可以采用预设的惩罚机制对玩家的游戏进度进行惩罚。例如,惩罚机制可以是禁止玩家发送语音、禁止玩家继续进行游戏对局、禁止玩家在一定的时长内再次运行该游戏应用等。
在另一些实施例中,还可以确定玩家在当前游戏对局中的整个游戏语音过程中所包含的脏话词的总数量,以及,玩家在当前游戏对局过程中的整个游戏语音过程中被检测到含有脏话词的次数,如果总数量大于总数量阈值,或者次数大于次数阈值,也可以采用预设的惩罚机制对玩家的游戏进度进行惩罚。
这里,可以设置提醒信息的显示时长,可以预先设置提醒信息的显示时长为一初始时长。在本次游戏对局过程中,如果检测到玩家的游戏语音中含有脏话词的次数大于次数阈值时,对初始时长进行调整,以增大提醒信息的显示时长。
下面对嵌入特征表示系统及嵌入特征表示系统的训练方法进行说明。
本申请实施例中,嵌入特征表示系统包括第一级特征提取网络和第二级特征提取网络;第一级特征提取网络用于对子语音信号进行第一级语音特征提取;第二级特征提取网络用于基于第一级语音特征提取时得到的第一级语音特征,对子语音信号进行第二级语音特征提取,第二级语音特征提取的特征提取精度大于第一级语音特征提取的特征提取精度。
图7是本申请实施例提供的嵌入特征表示系统的训练方法的流程示意图,该嵌入特征表示系统的训练方法可以由模型训练模块实现,其中,模型训练模块可以是语音识别设备(即电子设备)中的模块,即模型训练模块可以是服务器也可以是终端;或者,也 可以是独立于语音识别设备的另一设备,即模型训练模块是区别于上述用于实现语音识别方法的服务器和终端之外的其他电子设备。如图7所示,可以通过循环迭代以下步骤S701至步骤S706,对嵌入特征表示系统进行训练,直至嵌入特征表示系统满足预设收敛条件达到收敛为止:
步骤S701,将无标注语音数据集中的第一语音数据输入至第一级特征提取网络中,通过对比学习方式对第一级特征提取网络进行训练,得到训练后的第一级特征提取网络。
这里,无标注语音数据集中包括多个未进行标注的无标签语音数据。由于第一级特征提取网络可以采用无监督学习方式进行训练,因此可以采用无标注语音数据集中的第一语音数据,对第一级特征提取网络进行训练。
这里,对比学习是一种自监督学习方法,对比学习用于在没有标签的情况下,通过让第一级特征提取网络学习哪些数据点相似或不同,进而来学习无标注语音数据集的一般特征。对比学习允许第一级特征提取网络观察哪些数据点对是“相似”和“不同”,以便在执行分类或分割等任务之前了解数据更高阶的特征。在大多数实际场景中,由于没有为两段语音信号设置标签,为了创建标签,专业人士必须花费大量时间人工听取语音以手动分类、分割等。通过对比学习,即使只有一小部分数据集被标记,也可以显著提高模型性能。
在一种实现方式中，第一级特征提取网络可以实现为wav2vec模型。这里，通过训练wav2vec模型，得到训练后的wav2vec模型，并通过训练后的wav2vec模型区分真实数据和干扰项样本，这可以帮助wav2vec模型学习音频数据的数学表示形式。有了这些数据表示形式，wav2vec模型可以通过剪辑和比较，从干扰项中分辨出准确的语音声音。
步骤S702,将单字语音数据集中的第二语音数据输入至训练后的第一级特征提取网络中,通过训练后的第一级特征提取网络对第二语音数据进行第一级嵌入特征提取,得到具有第三特征提取精度的样本嵌入表示特征。
这里,第三特征提取精度是训练后的第一级特征提取网络对应的特征提取精度,即,第三特征提取精度是训练后的第一级特征提取网络在对第二语音数据进行嵌入特征提取时,所提取的样本嵌入表示特征的特征提取精度。本申请实施例中,第三特征提取精度对应于上述第一特征提取精度,也就是说,如果采用训练后的第一级特征提取网络对上述子语音信号进行第一级嵌入特征提取时,则可以得到第一特征提取精度的嵌入表示特征;如果采用训练后的第一级特征提取网络对第二语音数据进行第一级嵌入特征提取,则可以得到第三特征提取精度的嵌入表示特征(即具有第三特征提取精度的样本嵌入表示特征)。
单字语音数据集中包括多个单字语音(即第二语音数据),每一单字语音是由单个字的语音构成。本申请实施例中,可以对一段原始语音采用强制对齐方法(MFA,Montreal Forced Aligner)切分得到单字语音。在实现的过程中,可以提取原始语音对应的原始语音信号,并且,通过任意一种特征提取网络对原始语音进行特征提取,得到原始语音对应的多个语音特征,其中,每一语音特征是一个字的语音对应的特征向量;然后,将原始语音信号与每一语音特征一一对应(即根据每一语音特征,确定该语音特征对应的单个字的语音在原始语音信号中的起始位置和结束位置),实现原始语音信号与语音特征之间的对齐;在完成对齐之后,根据原始语音信号与语音特征之间的对齐位置(即起始位置和结束位置)对原始语音信号进行切分,形成多个原始语音子信号,其中,每一原始语音子信号对应一个单字语音。也就是说,MFA技术的实现过程是,先判断用户真正读的句子是什么,再用该判断结果去进行强制对齐。
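在得到强制对齐给出的每个字的起止位置后，按对齐位置切分原始语音信号得到单字语音的过程，可以用如下示意代码概括（对齐结果的数据结构为假设）：
```python
def split_by_alignment(signal, sample_rate, alignments):
    """alignments 为 [(字, 起始时间秒, 结束时间秒), ...]，按对齐位置切分出单字语音子信号。"""
    char_wavs = []
    for char, start_s, end_s in alignments:
        start = int(start_s * sample_rate)
        end = int(end_s * sample_rate)
        char_wavs.append((char, signal[start:end]))  # 每一原始语音子信号对应一个单字语音
    return char_wavs
```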
本申请实施例中,可以将单字语音数据集中的每一单字语音输入至训练后的第一级特征提取网络中,通过训练后的第一级特征提取网络对每一单字语音进行第一级嵌入特 征提取,得到多个样本嵌入表示特征,通过多个样本嵌入表示特征对第二级特征提取网络进行训练。即将多个样本嵌入表示特征作为第二级特征提取网络的训练样本进行模型训练。
步骤S703,将具有第三特征提取精度的样本嵌入表示特征输入至第二级特征提取网络中,通过第二级特征提取网络对第二语音数据进行第二级嵌入特征提取,得到具有第四特征提取精度的样本嵌入表示特征;第三特征提取精度小于第四特征提取精度。
这里,第四特征提取精度是第二级特征提取网络对应的特征提取精度,即,第四特征提取精度是第二级特征提取网络对第二语音数据进行第二级嵌入特征提取时,所提取的样本嵌入表示特征的特征提取精度。本申请实施例中,第四特征提取精度对应于上述第二特征提取精度,也就是说,如果采用第二级特征提取网络对上述子语音信号进行第二级嵌入特征提取,则可以得到第二特征提取精度的嵌入表示特征;如果采用第二级特征提取网络对第二语音数据进行第二级嵌入特征提取,则可以得到第四特征提取精度的嵌入表示特征(即具有第四特征提取精度的样本嵌入表示特征)。
本申请实施例中,由于第二级语音特征提取的特征提取精度大于第一级语音特征提取的特征提取精度,因此,第三特征提取精度小于第四特征提取精度。
步骤S704,通过预设分类网络基于具有第四特征提取精度的样本嵌入表示特征,对第二语音数据进行语音识别,得到样本识别结果。
这里,第二级特征提取网络对每一个样本嵌入表示特征进行第二级嵌入特征提取,得到具有第四特征提取精度的样本嵌入表示特征。之后,再基于预设分类网络基于提取到的具有第四特征提取精度的样本嵌入表示特征,对第二语音数据进行语音识别,即对第二语音数据进行语音分类处理,得到样本识别结果。
这里以对第二语音数据是否包含脏话词为例进行说明。通过预设分类网络基于具有第四特征提取精度的样本嵌入表示特征,对第二语音数据进行语音识别时,可以是基于预设脏词库对第二语音数据进行分类和识别,基于提取到的具有第四特征提取精度的样本嵌入表示特征,确定第二语音数据中是否存在脏话词,从而得到是否存在脏话词的样本识别结果。
步骤S705,将样本识别结果与第二语音数据的分类标签信息输入至预设损失模型中,通过预设损失模型输出损失结果。
这里,在基于MFA切分得到多个单字语音(即第二语音数据)之后,还可以为每一第二语音数据添加分类标签信息,该分类标签信息用于标识该单字语音中是否存在脏话词。
本申请实施例中,通过第一级特征提取网络和第二级特征提取网络,提取到第二语音数据的具有第四特征提取精度的样本嵌入表示特征,并基于该具有第四特征提取精度的样本嵌入表示特征对第二语音数据是否包含脏话词进行识别,得到样本识别结果之后,可以将样本识别结果与第二语音数据的分类标签信息输入至预设损失模型中,通过预设损失模型输出损失结果。
这里,可以通过预设损失模型计算样本识别结果与分类标签信息之间的标签相似度。
当标签相似度大于标签相似度阈值时,表明第二级特征提取网络能够准确的提取到第二语音数据的样本嵌入表示特征,且,预设分类网络能够基于样本嵌入表示特征,对第二语音数据进行准确的语音识别。则此时可以停止对嵌入特征表示系统的训练,且将此时得到的嵌入特征表示系统确定为训练好的嵌入特征表示系统。
当标签相似度小于或等于标签相似度阈值时,表明第二级特征提取网络不能准确的提取到第二语音数据的样本嵌入表示特征,或者,表明预设分类网络不能基于样本嵌入表示特征,对第二语音数据进行准确的语音识别。则此时可以继续对嵌入特征表示系统 进行训练,直至标签相似度大于标签相似度阈值时停止训练。
步骤S706,基于损失结果对第二级特征提取网络中的模型参数进行修正,得到训练后的嵌入特征表示系统。
这里,当标签相似度小于或等于标签相似度阈值时,则可以基于修正参数对第二级特征提取网络中的模型参数进行修正;当标签相似度大于标签相似度阈值,停止对嵌入特征表示系统的训练过程。在对模型参数进行修正时,可以预先设置模型参数的修正区间,其中,第二级特征提取网络中的模型参数包括多个模型子参数,每一模型子参数均对应一修正区域。
模型参数的修正区间是指该模型参数在本轮训练过程中能够选择进行更改的修正参数的取值区间。在从修正区间中选取修正参数时,可以基于标签相似度的值来进行选择。如果标签相似度较小,则可以在修正区间中选择一个较大的修正参数作为本轮训练过程中的修正参数;如果标签相似度较大,则可以在修正区间中选择一个较小的修正参数作为本轮训练过程中的修正参数。
在实现的过程中，可以设置修正相似度阈值。当标签相似度小于或等于该修正相似度阈值时，表明标签相似度较小，则可以在修正区间的区间中值与区间极大值所形成的第一子区间中，随机选择一个修正参数作为本轮训练过程中的修正参数；当标签相似度大于该修正相似度阈值时，表明标签相似度较大，则可以在修正区间的区间极小值与区间中值所形成的第二子区间中，随机选择一个修正参数作为本轮训练过程中的修正参数，其中，修正相似度阈值小于上述标签相似度阈值。例如，假设修正区间为[a,b]，则区间中值为(a+b)/2，第一子区间为[(a+b)/2, b]，第二子区间为[a, (a+b)/2]。如果标签相似度小于或等于修正相似度阈值，则可以在第一子区间中随机选择一个值作为修正参数；如果标签相似度大于修正相似度阈值，则可以在第二子区间中随机选择一个值作为修正参数。
本申请实施例中,在选择出修正参数之后,可以基于该修正参数对相应的模型参数进行调整。例如,当修正参数为正数时,可以调大模型参数;当修正参数为负数时,可以调小模型参数。
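上述“按标签相似度从修正区间的子区间中随机选取修正参数，并据此调整模型参数”的过程，可以用如下示意代码概括（仅为便于理解的草图，函数名与取值方式均为假设）：
```python
import random

def pick_correction(label_sim, a, b, fix_sim_threshold):
    """标签相似度偏小时从第一子区间[(a+b)/2, b]取值，偏大时从第二子区间[a, (a+b)/2]取值。"""
    mid = (a + b) / 2
    if label_sim <= fix_sim_threshold:
        return random.uniform(mid, b)    # 选择较大的修正参数
    return random.uniform(a, mid)        # 选择较小的修正参数

def apply_correction(param, correction):
    """修正参数为正数时调大模型参数，为负数时调小模型参数。"""
    return param + correction
```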
本申请实施例提供的嵌入特征表示系统的训练方法，通过无标注语音数据集中的第一语音数据，对第一级特征提取网络进行无监督训练；通过训练后的第一级特征提取网络提取单字语音数据集中的第二语音数据的嵌入表示特征，得到具有第三特征提取精度的样本嵌入表示特征，从而将这些具有第三特征提取精度的样本嵌入表示特征作为第二级特征提取网络的样本数据，对第二级特征提取网络进行训练；在训练第二级特征提取网络的过程中，进行有监督的学习，结合第二语音数据的分类标签信息对第二级特征提取网络中的模型参数进行学习，能够实现对第二级特征提取网络进行准确的学习和训练，从而得到能够准确提取嵌入表示特征的嵌入特征表示系统。
下面分别对第一级特征提取网络和第二级特征提取网络的训练过程进行说明。
第一级特征提取网络包括编码器网络和上下文网络,图8是本申请实施例提供的第一级特征提取网络的训练方法的流程示意图,该第一级特征提取网络的训练方法也可以由模型训练模块实现,其中,用于训练第一级特征提取网络的模型训练模块可以与用于训练嵌入特征表示系统的模型训练模块为同一电子设备中的同一模型训练模块,或是同一电子设备中的不同的模型训练模块,也可以是不同电子设备中的模型训练模块。即用于训练第一级特征提取网络的模型训练模块也可以是服务器或者是终端;或者,也可以 是独立于语音识别设备的另一设备。如图8所示,可以通过循环迭代以下步骤S801至步骤S805,对第一级特征提取网络进行训练,直至第一级特征提取网络满足预设收敛条件达到收敛为止:
步骤S801,将无标注语音数据集中的第一语音数据输入至第一级特征提取网络中。
步骤S802,通过编码器网络对第一语音数据进行第一卷积处理,得到低频表示特征。
这里,第一级特征提取网络可以实现为wav2vec模型。wav2vec模型可以通过多层的卷积神经网络来提取音频的无监督语音特征。wav2vec是一个卷积神经网络,wav2vec将原始音频作为输入并计算可以输入到语音识别系统的一般表示。wav2vec模型分为将原始音频x编码为潜在空间z的编码器网络(包括5层卷积处理层),和将z转换为语境化表征(contextualized representation)的上下文网络(包括9层卷积处理层),最终特征维度为512维帧数。目标是在特征层面使用当前帧预测未来帧。
也就是说,编码器网络包括多层卷积处理层,通过多层卷积处理层对第一语音数据进行多次卷积处理,从而实现对第一语音数据的编码,得到低频表示特征。
步骤S803,通过上下文网络对低频表示特征进行第二卷积处理,得到具有预设维度的嵌入表示特征。
这里,上下文网络包括多层卷积处理层,通过多层卷积处理层对编码器网络输出的低频表示特征进行多次卷积处理,从而实现将低频表示特征转换为语境化表征,即得到具有预设维度的嵌入表示特征。
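编码器网络（5层一维卷积）与上下文网络（9层一维卷积）的配合方式可以用如下PyTorch草图示意；其中卷积核大小、步长等超参数均为假设值，并非wav2vec模型的确切配置：
```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel, stride, padding=0):
    return nn.Sequential(nn.Conv1d(in_ch, out_ch, kernel, stride=stride, padding=padding), nn.ReLU())

class Wav2VecSketch(nn.Module):
    """示意：编码器网络把原始波形编码为低频表示z，上下文网络把z转换为512维语境化表征c。"""
    def __init__(self, dim: int = 512):
        super().__init__()
        channels = [1, dim, dim, dim, dim, dim]
        # 编码器网络：5层一维卷积（核大小、步长为假设值）
        self.encoder = nn.Sequential(*[conv_block(channels[i], channels[i + 1], 3, 2) for i in range(5)])
        # 上下文网络：9层一维卷积，聚合上下文信息
        self.context = nn.Sequential(*[conv_block(dim, dim, 3, 1, padding=1) for _ in range(9)])

    def forward(self, wav: torch.Tensor):
        z = self.encoder(wav)   # wav: (batch, 1, 采样点数) -> 低频表示特征 z
        c = self.context(z)     # 语境化表征 c，通道维为512
        return z, c
```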
步骤S804,将具有预设维度的嵌入表示特征输入至第一损失模型中,通过第一损失模型中的第一损失函数,确定具有预设维度的嵌入表示特征对应的第一损失结果。
这里,模型训练时的损失函数可以选取对比损失函数(contrastive loss)。通过对比损失函数,在训练时将正样本间的距离拉近,负样本间的距离拉远。
步骤S805,基于第一损失结果对编码器网络和上下文网络中的网络参数进行修正,得到训练后的第一级特征提取网络。
本申请实施例提供的第一级特征提取网络的训练方法,通过编码器网络实现对将第一语音数据的编码处理,得到低频表示特征;通过上下文网络将低频表示特征转换为语境化表征,具有预设维度的嵌入表示特征。进而通过对比损失函数进行对比损失计算,以实现将正样本间的距离拉近,负样本间的距离拉远。如此,通过自监督的学习过程,能够对第一级特征提取网络进行快速和准确的训练。
第二级特征提取网络包括:时序信息提取层、注意力机制层和损失计算层,其中,损失计算层包括第二损失函数。图9是本申请实施例提供的第二级特征提取网络的训练方法的流程示意图,该第二级特征提取网络的训练方法也可以由语音识别设备中的模型训练模块实现,该第二级特征提取网络的训练方法也可以由模型训练模块实现,其中,用于训练第二级特征提取网络的模型训练模块可以与用于训练第一级特征提取网络的模型训练模块为同一电子设备中的同一模型训练模块,或是同一电子设备中的不同的模型训练模块,也可以是不同电子设备中的模型训练模块。即用于训练第二级特征提取网络的模型训练模块也可以是服务器或者是终端;或者,也可以是独立于语音识别设备的另一设备。如图9所示,可以通过循环迭代以下步骤S901至步骤S906,对第二级特征提取网络进行训练,直至第二级特征提取网络满足预设收敛条件达到收敛为止:
步骤S901,将具有第三特征提取精度的样本嵌入表示特征,输入至第二级特征提取网络中。
步骤S902,通过时序信息提取层,提取样本嵌入表示特征在不同通道下的关键时序信息。
这里,第二级特征提取网络可以实现为ecapa-tdnn模型。时序信息提取层可以是ecapa-tdnn模型中的挤压激励模块(SE,Squeeze-Excitation)部分。SE部分在计算过程中,考虑的是时间轴上的注意力机制,SE部分能够让ecapa-tdnn模型学习到输入的样本嵌入表示特征中关键的时序信息。
步骤S903,通过注意力机制层对不同通道下的关键时序信息,在时间轴上进行累加处理,得到累加处理结果;以及,对累加处理结果进行加权计算,得到具有第四特征提取精度的样本嵌入表示特征。
这里,注意力机制层可以是ecapa-tdnn模型的注意力状态池化(attentive-stat pool)部分,注意力状态池化部分可以基于自注意力机制,使得ecapa-tdnn模型聚焦于时间维度,将不同通道的信息在时间轴上累加,并且,通过引入加权平均与加权方差的形式,使得所学习到的嵌入表示特征更加鲁棒,且具有区分度。
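注意力状态池化“在时间轴上加权累加并引入加权平均与加权方差”的计算过程，可以用如下草图示意（张量形状约定为假设）：
```python
import torch
import torch.nn as nn

class AttentiveStatsPoolSketch(nn.Module):
    """示意：在时间轴上用自注意力权重求加权平均与加权标准差，并拼接为整段语音的表示。"""
    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv1d(channels, channels // 8, 1),
            nn.Tanh(),
            nn.Conv1d(channels // 8, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        w = torch.softmax(self.attn(x), dim=2)          # 时间轴上的注意力权重
        mu = torch.sum(x * w, dim=2)                    # 加权平均
        var = torch.sum((x ** 2) * w, dim=2) - mu ** 2
        std = torch.sqrt(var.clamp(min=1e-8))           # 加权标准差
        return torch.cat([mu, std], dim=1)              # 拼接后得到更鲁棒且具区分度的表示
```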
步骤S904,将具有第四特征提取精度的样本嵌入表示特征和第二语音数据的特征标签信息,输入至损失计算层。
这里,特征标签信息是指该语音数据是否是用户感兴趣的字,即是否是需要提取到特征的字对应的标签。举例来说,对于输入语音“我非常喜欢读书”,则用户感兴趣的字可以是“喜欢”和“读书”,因此,特征标签信息中可以将“喜欢”和“读书”标识出来,以表征在进行该输入语音的嵌入特征提取时,必须要提取到“喜欢”和“读书”这两个词对应的特征数据。
步骤S905,通过损失计算层的第二损失函数,确定具有第四特征提取精度的样本嵌入表示特征对应的第二损失结果。
这里,可以基于特征标签信息,获取与该特征标签信息对应的特征向量,并计算样本嵌入表示特征和特征向量之间的相似度,从而得到第二损失结果。
在一些实施例中,第二损失函数可以是Aam-softmax损失函数,通过Aam-softmax损失函数,在训练时能够减小同类特征的角度,同时增大不同类特征的角度,如此,便可使得第二级特征提取网络学习的嵌入表示特征更优。在实现的过程中,可以通过Aam-softmax损失函数计算样本嵌入表示特征和特征向量之间的余弦相似度,其中,嵌入表示特征和特征向量中不仅具有属于同一类别的特征(即同类特征),还具有属于不同类别的特征(即不同类特征),同类特征的角度是指两个同类特征对应的两个特征向量之间的向量夹角,不同类特征的角度是指两个不同类特征对应的两个特征向量之间的向量夹角。通过Aam-softmax损失函数计算余弦相似度,从而基于余弦相似度对应的第二损失结果对第二级特征提取网络进行训练,能够使得采用训练后的第二级特征提取网络提取样本嵌入表示特征时,提取的样本嵌入表示特征与特征向量的同类特征对应的特征向量之间的向量夹角小于角度阈值,不同类特征对应的特征向量之间的向量夹角大于或等于角度阈值,也就是说,能够使得同类特征之间的相似度更高,不同类特征之间的相似度更低。
步骤S906,基于第二损失结果对时序信息提取层和注意力机制层中的网络参数进行修正,得到训练后的第二级特征提取网络。
本申请实施例提供的第二级特征提取网络的训练方法,通过时序信息提取层,提取样本嵌入表示特征在不同通道下的关键时序信息;通过注意力机制层对不同通道下的关键时序信息,在时间轴上依次进行累加处理和加权计算,得到具有第四特征提取精度的样本嵌入表示特征。进而通过第二损失函数进行损失计算,以实现在训练时减小同类的角度,同时增大不同类的角度。如此,通过有监督的学习过程,能够对第二级特征提取网络进行快速和准确的训练。
需要说明的是,上述针对嵌入特征表示系统(包含有预设分类网络)、嵌入特征表 示系统中的第一级特征提取网络、第二级特征提取网络的训练过程,可以在先训练好第一级特征提取网络之后并行进行,也可以依次进行。也就是说,可以先训练第一级特征提取网络,之后,再并行进行第二级特征提取网络和整个嵌入特征表示系统的训练。或者,也可以先训练第一级特征提取网络,之后再依次训练第二级特征提取网络和整个嵌入特征表示系统。
下面,将说明本申请实施例在一个实际的应用场景中的示例性应用。
本申请实施例提供的语音识别方法,首先在大规模的无标注语音上,使用对比学习的方法训练自监督预训练模型,该模型可以充分学习到语音的嵌入表示特征;然后,使用基于隐马尔可夫模型的强制对齐方法(MFA,Montreal Forced Aligner)切分中文单字语音,通过Aam-softmax损失函数进一步学习嵌入表示特征。通过上述深度学习的方法,整个语音识别模型(即嵌入特征表示系统)首先充分学习到单句话的嵌入表示特征,然后基于单字音频,进一步学习嵌入表示特征。如此以来,本申请实施例在进行语音关键词匹配时,便可极大提升语音识别模型的泛化能力与抗干扰能力,能有效的区分不同的字,从而能更精准的进行游戏语音关键词匹配。
本申请实施例的语音识别方法用于文明语音的二次校验,如图10所示,是本申请实施例提供的语音关键词匹配系统示意图。对于上报的可能包含脏话的语音x1,本申请实施例通过采用嵌入特征表示系统1001,通过滑动窗的形式提取语音x1的嵌入表示特征x;其次,遍历脏词库(即预设比对词库)的嵌入表示特征,求取上报的语音x1嵌入表示特征x与脏词库中脏话y1的嵌入表示特征y之间的余弦相似度1002,如果该余弦相似度大于预设的相似度阈值,则判定该上报语音x1中包含脏词。
上述嵌入特征表示系统包括第一级特征提取网络和第二级特征提取网络,本申请实施例以第一级特征提取网络为wav2vec模型、第二级特征提取网络为ecapa-tdnn模型为例进行说明。
图11是本申请实施例提供的训练wav2vec模型的流程示意图,如图11所示,首先在大规模无标注语音上使用对比学习训练wav2vec模型1101,该步骤为自监督过程,得到训练后的wav2vec模型。图12是本申请实施例提供的训练ecapa-tdnn模型的流程示意图,如图12所示,在wav2vec模型训练完成后,再基于单字语音数据集,固定wav2vec模型,使用wav2vec模型提取单字语音的嵌入表达特征,然后,将嵌入表达特征输入到ecapa-tdnn模型1201中,通过aam-softmax损失函数训练ecapa-tdnn模型1201。
下面分别对wav2vec模型和ecapa-tdnn模型的训练流程进行说明。
图13是本申请实施例提供的wav2vec模型的结构示意图，如图13所示，wav2vec模型包括编码器网络1301和上下文网络1302。编码器网络1301包含5层一维卷积，输入为音频波形，输出为低频表示特征；上下文网络1302包含9层一维卷积，输入为多个低频表示特征，输出为512维的嵌入表示特征。wav2vec模型训练过程中所使用的第一损失函数如以下公式（1）所示：
L_k = -\sum_{i=1}^{T-k}\left(\log\sigma\left(z_{i+k}^{\top}h_k(c_i)\right)+\lambda\,\mathbb{E}_{\tilde{z}\sim p_n}\left[\log\sigma\left(-\tilde{z}^{\top}h_k(c_i)\right)\right]\right)　　（1）
其中，L_k为第一损失函数，k表示时间步，T表示序列时长，z表示编码器网络输出，c表示上下文网络输出，h_k表示仿射变换，λ表示负样本个数，p_n表示均匀分布，\tilde{z}表示负样本的编码器网络输出；σ表示f(x)=1/(1+exp(-x))的函数，值域为(0,1)，x为负无穷到正无穷；σ(z_{i+k}^T h_k(c_i))为正样本相似度，正样本相似度最高为1；σ(-\tilde{z}^T h_k(c_i))为与负样本的相似度，由于函数中有负号，则整体最大值也为1。L_k整体的损失函数意思为：使正样本的距离尽可能小，同时拉大与负样本之间的距离，最终达到的效果是每个嵌入表示特征具备很好的表示性。
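对照公式(1)，下面给出一个简化的对比损失计算草图（负样本的组织方式与张量形状均为示意，并非wav2vec模型的原始实现）：
```python
import torch
import torch.nn.functional as F

def wav2vec_contrastive_loss(z_pos, pred, z_neg):
    """z_pos: 正样本编码 (B, D)；pred: h_k(c_i) (B, D)；z_neg: 负样本编码 (B, K, D)。"""
    pos = F.logsigmoid((z_pos * pred).sum(dim=-1))                  # log σ(zᵀ h_k(c))
    neg = F.logsigmoid(-(z_neg * pred.unsqueeze(1)).sum(dim=-1))    # log σ(-z̃ᵀ h_k(c))
    # 第二项用K个负样本的求和近似 λ·E[log σ(-z̃ᵀ h_k(c))]，λ取为负样本个数K
    return -(pos + neg.sum(dim=1)).mean()
```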
图14是本申请实施例提供的ecapa-tdnn模型的结构示意图,图15是本申请实施例提供的ecapa-tdnn模型中SE-ResBlock部分的结构示意图,请同时参照图14和图15,其中:SE部分(即时序信息提取层),包括图14中的SE层141、SE层142和SE层143。这里,SE部分在计算过程中,考虑的是时间轴上的注意力机制,SE部分能够让ecapa-tdnn模型学习到输入特征中关键的时序信息。注意力机制层144部分,此处可以基于自注意力机制使得ecapa-tdnn模型聚焦于时间维度,将不同通道的信息在时间轴上累加,并且,通过引入加权平均与加权方差的形式,使得所学习到的嵌入表示特征更加鲁棒,且具有区分度。损失计算层145部分,可以采用Aam-softmax损失函数(对应上述第二损失函数)进行损失计算,如以下公式(2)所示:
L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}　　（2）
其中，L_3为第二损失函数；N为样本个数，y_i为第i个样本所属的类别，θ_j为嵌入表示特征与第j类之间的角度；s和m均为设置的常数。该第二损失函数通过在同类特征的角度θ_{y_i}上加入角度间隔m（即θ_{y_i}+m），在训练时减小同类特征之间的角度，同时增大不同类特征之间的角度，如此即可使得学习的嵌入表示特征更优。
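对照公式(2)，下面给出AAM-softmax损失的一个简化草图（s、m的取值仅为常见示例，类别中心以权重矩阵表示）：
```python
import torch
import torch.nn.functional as F

def aam_softmax_loss(embeddings, class_centers, labels, s=30.0, m=0.2):
    """embeddings: (B, D)；class_centers: (C, D)；labels: (B,)；s、m为假设的常见取值。"""
    cos = F.linear(F.normalize(embeddings), F.normalize(class_centers))   # cos θ_j
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, num_classes=cos.size(1)).bool()
    logits = torch.where(target, torch.cos(theta + m), cos) * s           # 仅对同类角度加间隔m
    return F.cross_entropy(logits, labels)
```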
本申请实施例提供的语音识别方法可以应用于游戏语音领域,作为文明语音的二次校验部分,通过求取待识别语音与脏词库中的语音的嵌入表示特征之间的余弦相似度,从而判别待识别语音中是否含有脏词。在测试过程中,能有效精准的定位脏词。
下面继续说明本申请实施例提供的语音识别装置354实施为软件模块的示例性结构,在一些实施例中,如图4所示,语音识别装置354包括:
帧截取模块3541,配置为对待识别语音信号进行滑动窗截取,得到至少两个子语音信号;特征提取模块3542,配置为通过预先训练的嵌入特征表示系统,对每一子语音信号进行语音特征提取,得到相应子语音信号的子语音嵌入表示特征;其中,所述嵌入特征表示系统包括第一级特征提取网络和第二级特征提取网络;所述第一级特征提取网络用于对所述子语音信号进行第一级语音特征提取,得到第一级语音特征;所述第二级特征提取网络用于基于所述第一级语音特征,对所述子语音信号进行第二级语音特征提取,所述第二级语音特征提取的特征提取精度大于所述第一级语音特征提取的特征提取精度;获取模块3543,配置为获取预设比对词库中的每一比对词的嵌入表示特征;语音识别模块3544,配置为根据所述子语音嵌入表示特征和每一所述比对词的嵌入表示特征,对每一所述子语音信号进行语音识别,得到子语音识别结果;确定模块3545,配置为根据每一所述子语音信号的子语音识别结果,确定所述待识别语音信号对应的语音识别结果。
在一些实施例中,所述帧截取模块还配置为:采用具有预设步长的滑动窗,对所述待识别语音信号进行分帧处理,得到至少两个子语音信号,所述至少两个子语音信号具有相同的帧长。
在一些实施例中,所述装置还包括:窗函数获取模块,配置为获取预设窗函数;平滑处理模块,配置为采用所述预设窗函数对每一所述子语音信号进行平滑处理,对应得到至少两个平滑处理后的子语音信号;所述特征提取模块还配置为:对每一平滑处理后的子语音信号进行语音特征提取,得到相应子语音信号的子语音嵌入表示特征。
在一些实施例中,所述特征提取模块还配置为:将每一所述子语音信号输入至所述第一级特征提取网络中,通过所述第一级特征提取网络,对所述子语音信号进行第一级 嵌入特征提取,得到具有第一特征提取精度的嵌入表示特征;将所述具有第一特征提取精度的嵌入表示特征,输入至所述第二级特征提取网络中,通过所述第二级特征提取网络,对所述子语音信号进行第二级嵌入特征提取,得到具有第二特征提取精度的嵌入表示特征;所述第一特征提取精度小于所述第二特征提取精度,所述具有第二特征提取精度的嵌入表示特征构成所述子语音信号的子语音嵌入表示特征。
在一些实施例中,所述语音识别模块还配置为:确定所述子语音嵌入表示特征与所述每一比对词的嵌入表示特征之间的相似度;当所述子语音嵌入表示特征与任一比对词的嵌入表示特征之间的相似度大于相似度阈值时,确定所述子语音信号的子语音识别结果为特定识别结果;所述特定识别结果用于表征:所述子语音信号对应的子语音中含有特定的语音词,所述特定的语音词是与所述预设比对词库中的比对词具有相同属性的语音词。
在一些实施例中,所述确定模块还配置为:当任一子语音信号的子语音识别结果为所述特定识别结果时,确定所述待识别语音信号对应的语音识别结果为所述特定识别结果。
在一些实施例中,所述预设比对词库中包括每一所述比对词的比对词语音信号;所述获取模块还配置为:通过所述预先训练的嵌入特征表示系统,对每一所述比对词的比对词语音信号进行语音特征提取,得到每一所述比对词的嵌入表示特征。
在一些实施例中,所述装置还包括模型训练模块,用于训练所述嵌入特征表示系统;其中,所述模型训练模块,配置为将无标注语音数据集中的第一语音数据输入至所述第一级特征提取网络中,通过对比学习方式对所述第一级特征提取网络进行训练,得到训练后的第一级特征提取网络;将单字语音数据集中的第二语音数据输入至所述训练后的第一级特征提取网络中,通过所述训练后的第一级特征提取网络对所述第二语音数据进行第一级嵌入特征提取,得到具有第三特征提取精度的样本嵌入表示特征;将所述具有第三特征提取精度的样本嵌入表示特征输入至所述第二级特征提取网络中,通过所述第二级特征提取网络对所述第二语音数据进行第二级嵌入特征提取,得到具有第四特征提取精度的样本嵌入表示特征;所述第三特征提取精度小于所述第四特征提取精度;通过预设分类网络基于所述具有第四特征提取精度的样本嵌入表示特征,对所述第二语音数据进行语音识别,得到样本识别结果;将所述样本识别结果与所述第二语音数据的分类标签信息输入至预设损失模型中,通过所述预设损失模型输出损失结果;基于所述损失结果对所述第二级特征提取网络中的模型参数进行修正,得到训练后的嵌入特征表示系统。
在一些实施例中,所述第一级特征提取网络包括编码器网络和上下文网络;所述模型训练模块还配置为:将无标注语音数据集中的第一语音数据输入至所述第一级特征提取网络中;通过所述编码器网络对所述第一语音数据进行第一卷积处理,得到低频表示特征;通过所述上下文网络对所述低频表示特征进行第二卷积处理,得到具有预设维度的嵌入表示特征;将所述具有预设维度的嵌入表示特征输入至第一损失模型中,通过所述第一损失模型中的第一损失函数,确定所述具有预设维度的嵌入表示特征对应的第一损失结果;基于所述第一损失结果对所述编码器网络和所述上下文网络中的网络参数进行修正,得到所述训练后的第一级特征提取网络。
在一些实施例中,所述第二级特征提取网络包括:时序信息提取层和注意力机制层;所述模型训练模块还配置为:将所述具有第三特征提取精度的样本嵌入表示特征,输入至所述第二级特征提取网络中;通过所述时序信息提取层,提取所述样本嵌入表示特征在不同通道下的关键时序信息;通过所述注意力机制层对所述不同通道下的关键时序信息,在时间轴上依次进行累加处理,得到累加处理结果;对所述累加处理结果进行加权 计算,得到所述具有第四特征提取精度的样本嵌入表示特征。
在一些实施例中,所述第二级特征提取网络还包括损失计算层,所述损失计算层包括第二损失函数;所述模型训练模块还配置为:将所述具有第四特征提取精度的样本嵌入表示特征和所述第二语音数据的特征标签信息,输入至所述损失计算层;通过所述损失计算层的第二损失函数,确定所述具有第四特征提取精度的样本嵌入表示特征对应的第二损失结果;基于所述第二损失结果对所述时序信息提取层和所述注意力机制层中的网络参数进行修正,得到训练后的第二级特征提取网络。
需要说明的是,本申请实施例装置的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果,因此不做赘述。对于本装置实施例中未披露的技术细节,请参照本申请方法实施例的描述而理解。
本申请实施例提供了一种计算机程序产品,该计算机程序产品包括计算机程序或可执行指令,该可执行指令是一种计算机指令;该计算机程序或可执行指令存储在计算机可读存储介质中。当语音识别设备的处理器从计算机可读存储介质读取该可执行指令,处理器执行该可执行指令时,使得该语音识别设备执行本申请实施例上述的方法。
本申请实施例提供一种存储有可执行指令的存储介质,其中存储有可执行指令,当可执行指令被处理器执行时,将引起处理器执行本申请实施例提供的方法,例如,如图5示出的方法。
在一些实施例中,存储介质可以是计算机可读存储介质,例如,铁电存储器(FRAM,Ferromagnetic Random Access Memory)、只读存储器(ROM,Read Only Memory)、可编程只读存储器(PROM,Programmable Read Only Memory)、可擦除可编程只读存储器(EPROM,Erasable Programmable Read Only Memory)、带电可擦可编程只读存储器(EEPROM,Electrically Erasable Programmable Read Only Memory)、闪存、磁表面存储器、光盘、或光盘只读存储器(CD-ROM,Compact Disk-Read Only Memory)等存储器;也可以是包括上述存储器之一或任意组合的各种设备。
在一些实施例中,可执行指令可以采用程序、软件、软件模块、脚本或代码的形式,按任意形式的编程语言(包括编译或解释语言,或者声明性或过程性语言)来编写,并且其可按任意形式部署,包括被部署为独立的程序或者被部署为模块、组件、子例程或者适合在计算环境中使用的其它单元。
作为示例,可执行指令可以但不一定对应于文件系统中的文件,可以可被存储在保存其它程序或数据的文件的一部分,例如,存储在超文本标记语言(HTML,Hyper Text Markup Language)文档中的一个或多个脚本中,存储在专用于所讨论的程序的单个文件中,或者,存储在多个协同文件(例如,存储一个或多个模块、子程序或代码部分的文件)中。作为示例,可执行指令可被部署为在一个电子设备上执行,或者在位于一个地点的多个电子设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个电子设备上执行。
以上所述,仅为本申请的实施例而已,并非用于限定本申请的保护范围。凡在本申请的精神和范围之内所作的任何修改、等同替换和改进等,均包含在本申请的保护范围之内。

Claims (15)

  1. 一种语音识别方法,所述方法由电子设备执行,所述方法包括:
    对待识别语音信号进行滑动窗截取,得到至少两个子语音信号;
    通过预先训练的嵌入特征表示系统,对每一子语音信号进行语音特征提取,得到相应子语音信号的子语音嵌入表示特征;其中,所述嵌入特征表示系统包括第一级特征提取网络和第二级特征提取网络;所述第一级特征提取网络用于对所述子语音信号进行第一级语音特征提取,得到第一级语音特征;所述第二级特征提取网络用于基于所述第一级语音特征,对所述子语音信号进行第二级语音特征提取,所述第二级语音特征提取的特征提取精度大于所述第一级语音特征提取的特征提取精度;
    获取预设比对词库中的每一比对词的嵌入表示特征;
    根据所述子语音嵌入表示特征和每一所述比对词的嵌入表示特征,对每一所述子语音信号进行语音识别,得到子语音识别结果;
    根据每一所述子语音信号的子语音识别结果,确定所述待识别语音信号对应的语音识别结果。
  2. 根据权利要求1所述的方法,其中,所述对待识别语音信号进行滑动窗截取,得到至少两个子语音信号,包括:
    采用具有预设步长的滑动窗,对所述待识别语音信号进行分帧处理,得到所述至少两个子语音信号;所述至少两个子语音信号具有相同的帧长。
  3. 根据权利要求1所述的方法,其中,在对每一子语音信号进行语音特征提取,得到相应子语音信号的子语音嵌入表示特征之前,所述方法还包括:
    获取预设窗函数;
    采用所述预设窗函数对每一所述子语音信号进行平滑处理,对应得到至少两个平滑处理后的子语音信号;
    所述对每一子语音信号进行语音特征提取,得到相应子语音信号的子语音嵌入表示特征,包括:
    对每一所述平滑处理后的子语音信号进行语音特征提取,得到相应子语音信号的子语音嵌入表示特征。
  4. 根据权利要求1所述的方法,其中,所述通过预先训练的嵌入特征表示系统,对每一子语音信号进行语音特征提取,得到相应子语音信号的子语音嵌入表示特征,包括:
    将每一所述子语音信号输入至所述第一级特征提取网络中,通过所述第一级特征提取网络,对所述子语音信号进行第一级嵌入特征提取,得到具有第一特征提取精度的嵌入表示特征;
    将所述具有第一特征提取精度的嵌入表示特征,输入至所述第二级特征提取网络中,通过所述第二级特征提取网络,对所述子语音信号进行第二级嵌入特征提取,得到具有第二特征提取精度的嵌入表示特征;所述第一特征提取精度小于所述第二特征提取精度,所述具有第二特征提取精度的嵌入表示特征构成所述子语音信号的子语音嵌入表示特征。
  5. 根据权利要求1所述的方法,其中,所述根据所述子语音嵌入表示特征和每一所述比对词的嵌入表示特征,对每一所述子语音信号进行语音识别,得到子语音识别结果,包括:
    确定所述子语音嵌入表示特征与每一所述比对词的嵌入表示特征之间的相似度;
    当所述子语音嵌入表示特征与任一比对词的嵌入表示特征之间的相似度大于相似 度阈值时,确定所述子语音信号的子语音识别结果为特定识别结果;所述特定识别结果用于表征:所述子语音信号对应的子语音中含有特定的语音词,所述特定的语音词是与所述预设比对词库中的比对词具有相同属性的语音词。
  6. 根据权利要求5所述的方法,其中,所述根据每一所述子语音信号的子语音识别结果,确定所述待识别语音信号对应的语音识别结果,包括:
    当任一子语音信号的子语音识别结果为所述特定识别结果时,确定所述待识别语音信号对应的语音识别结果为所述特定识别结果。
  7. 根据权利要求1所述的方法,其中,所述预设比对词库中包括每一所述比对词的比对词语音信号;所述获取预设比对词库中的每一比对词的嵌入表示特征,包括:
    通过所述预先训练的嵌入特征表示系统,对每一所述比对词的比对词语音信号进行语音特征提取,得到每一所述比对词的嵌入表示特征。
  8. 根据权利要求1至7任一项所述的方法,其中,所述嵌入特征表示系统通过以下方式进行训练:
    将无标注语音数据集中的第一语音数据输入至所述第一级特征提取网络中,通过对比学习方式对所述第一级特征提取网络进行训练,得到训练后的第一级特征提取网络;
    将单字语音数据集中的第二语音数据输入至所述训练后的第一级特征提取网络中,通过所述训练后的第一级特征提取网络对所述第二语音数据进行第一级嵌入特征提取,得到具有第三特征提取精度的样本嵌入表示特征;
    将所述具有第三特征提取精度的样本嵌入表示特征输入至所述第二级特征提取网络中,通过所述第二级特征提取网络对所述第二语音数据进行第二级嵌入特征提取,得到具有第四特征提取精度的样本嵌入表示特征;所述第三特征提取精度小于所述第四特征提取精度;
    通过预设分类网络基于所述具有第四特征提取精度的样本嵌入表示特征,对所述第二语音数据进行语音识别,得到样本识别结果;
    将所述样本识别结果与所述第二语音数据的分类标签信息输入至预设损失模型中,通过所述预设损失模型输出损失结果;
    基于所述损失结果对所述第二级特征提取网络中的模型参数进行修正,得到训练后的嵌入特征表示系统。
  9. 根据权利要求8所述的方法,其中,所述第一级特征提取网络包括编码器网络和上下文网络;
    所述将无标注语音数据集中的第一语音数据输入至所述第一级特征提取网络中,通过对比学习方式对所述第一级特征提取网络进行训练,得到训练后的第一级特征提取网络,包括:
    将所述无标注语音数据集中的第一语音数据输入至所述第一级特征提取网络中;
    通过所述编码器网络对所述第一语音数据进行第一卷积处理,得到低频表示特征;
    通过所述上下文网络对所述低频表示特征进行第二卷积处理,得到具有预设维度的嵌入表示特征;
    将所述具有预设维度的嵌入表示特征输入至第一损失模型中,通过所述第一损失模型中的第一损失函数,确定所述具有预设维度的嵌入表示特征对应的第一损失结果;
    基于所述第一损失结果对所述编码器网络和所述上下文网络中的网络参数进行修正,得到所述训练后的第一级特征提取网络。
  10. 根据权利要求8所述的方法,其中,所述第二级特征提取网络包括:时序信息提取层和注意力机制层;
    所述将所述具有第三特征提取精度的样本嵌入表示特征输入至所述第二级特征提 取网络中,通过所述第二级特征提取网络对所述第二语音数据进行第二级嵌入特征提取,得到具有第四特征提取精度的样本嵌入表示特征,包括:
    将所述具有第三特征提取精度的样本嵌入表示特征,输入至所述第二级特征提取网络中;
    通过所述时序信息提取层,提取所述样本嵌入表示特征在不同通道下的关键时序信息;
    通过所述注意力机制层对所述不同通道下的关键时序信息,在时间轴上进行累加处理,得到累加处理结果;
    对所述累加处理结果进行加权计算,得到所述具有第四特征提取精度的样本嵌入表示特征。
  11. 根据权利要求10所述的方法,其中,所述第二级特征提取网络还包括损失计算层,所述损失计算层包括第二损失函数;所述方法还包括:
    将所述具有第四特征提取精度的样本嵌入表示特征和所述第二语音数据的特征标签信息,输入至所述损失计算层;
    通过所述损失计算层的第二损失函数,确定所述具有第四特征提取精度的样本嵌入表示特征对应的第二损失结果;
    基于所述第二损失结果对所述时序信息提取层和所述注意力机制层中的网络参数进行修正,得到训练后的第二级特征提取网络。
  12. 一种语音识别装置,所述装置包括:
    帧截取模块,配置为对待识别语音信号进行滑动窗截取,得到至少两个子语音信号;
    特征提取模块,配置为通过预先训练的嵌入特征表示系统,对每一子语音信号进行语音特征提取,得到相应子语音信号的子语音嵌入表示特征;其中,所述嵌入特征表示系统包括第一级特征提取网络和第二级特征提取网络;所述第一级特征提取网络用于对所述子语音信号进行第一级语音特征提取,得到第一级语音特征;所述第二级特征提取网络用于基于所述第一级语音特征,对所述子语音信号进行第二级语音特征提取,所述第二级语音特征提取的特征提取精度大于所述第一级语音特征提取的特征提取精度;
    获取模块,配置为获取预设比对词库中的每一比对词的嵌入表示特征;
    语音识别模块,配置为根据所述子语音嵌入表示特征和每一所述比对词的嵌入表示特征,对每一所述子语音信号进行语音识别,得到子语音识别结果;
    确定模块,配置为根据每一所述子语音信号的子语音识别结果,确定所述待识别语音信号对应的语音识别结果。
  13. 一种电子设备,包括:
    存储器,用于存储可执行指令;处理器,用于执行所述存储器中存储的可执行指令时,实现权利要求1至11任一项所述的语音识别方法。
  14. 一种计算机可读存储介质,存储有可执行指令,用于引起处理器执行所述可执行指令时,实现权利要求1至11任一项所述的语音识别方法。
  15. 一种计算机程序产品或计算机程序,所述计算机程序产品或计算机程序包括可执行指令,所述可执行指令存储在计算机可读存储介质上;
    当电子设备从所述计算机可读存储介质读取所述可执行指令,并执行所述可执行指令时,实现权利要求1至11任一项所述的语音识别方法。
PCT/CN2023/121239 2022-11-04 2023-09-25 语音识别方法、装置、电子设备、存储介质及计算机程序产品 WO2024093578A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211373304.3A CN115512692B (zh) 2022-11-04 2022-11-04 语音识别方法、装置、设备及存储介质
CN202211373304.3 2022-11-04

Publications (1)

Publication Number Publication Date
WO2024093578A1 true WO2024093578A1 (zh) 2024-05-10

Family

ID=84512101

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/121239 WO2024093578A1 (zh) 2022-11-04 2023-09-25 语音识别方法、装置、电子设备、存储介质及计算机程序产品

Country Status (2)

Country Link
CN (1) CN115512692B (zh)
WO (1) WO2024093578A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115512692B (zh) * 2022-11-04 2023-02-28 腾讯科技(深圳)有限公司 语音识别方法、装置、设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545186A (zh) * 2018-12-16 2019-03-29 初速度(苏州)科技有限公司 一种语音识别训练系统及方法
JP2019133046A (ja) * 2018-02-01 2019-08-08 日本電信電話株式会社 学習装置、学習方法及び学習プログラム
CN112530408A (zh) * 2020-11-20 2021-03-19 北京有竹居网络技术有限公司 用于识别语音的方法、装置、电子设备和介质
CN113823262A (zh) * 2021-11-16 2021-12-21 腾讯科技(深圳)有限公司 一种语音识别方法、装置、电子设备和存储介质
CN115512692A (zh) * 2022-11-04 2022-12-23 腾讯科技(深圳)有限公司 语音识别方法、装置、设备及存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110322872A (zh) * 2019-06-05 2019-10-11 平安科技(深圳)有限公司 会议语音数据处理方法、装置、计算机设备和存储介质
AU2019101150A4 (en) * 2019-09-30 2019-10-31 Li, Guanchen MR Speaker Identity Recognition System Based on Deep Learning
CN111462735B (zh) * 2020-04-10 2023-11-28 杭州网易智企科技有限公司 语音检测方法、装置、电子设备及存储介质
CN114242066A (zh) * 2021-12-31 2022-03-25 科大讯飞股份有限公司 语音处理方法、语音处理模型的训练方法、设备及介质


Also Published As

Publication number Publication date
CN115512692B (zh) 2023-02-28
CN115512692A (zh) 2022-12-23


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23884510

Country of ref document: EP

Kind code of ref document: A1